• Open

    Toward A Reinforcement-Learning-Based System for Adjusting Medication to Minimize Speech Disfluency. (arXiv:2312.11509v1 [cs.CL])
    We propose a Reinforcement-Learning-based system that would automatically prescribe a hypothetical patient medications that may help the patient with their mental-health-related speech disfluency, and adjust the medication and the dosages in response to data from the patient. We demonstrate the components of the system: a module that detects and evaluates speech disfluency on a large dataset we built, and a Reinforcement Learning algorithm that automatically finds good combinations of medications. To support the two modules, we collect data on the effect of psychiatric medications for speech disfluency from the literature, and build a plausible patient simulation system. We demonstrate that the Reinforcement Learning system is, under some circumstances, able to converge to a good medication regime. We collect and label a dataset of people with possible speech disfluency and demonstrate our methods using that dataset. Our work is a proof of concept: we show that there is promise in the idea of using automatic data collection to address disfluency.  ( 2 min )
    Active Preference Inference using Language Models and Probabilistic Reasoning. (arXiv:2312.12009v1 [cs.CL])
    Actively inferring user preferences, for example by asking good questions, is important for any human-facing decision-making system. Active inference allows such systems to adapt and personalize themselves to nuanced individual preferences. To enable this ability for instruction-tuned large language models (LLMs), one may prompt them to ask users questions to infer their preferences, transforming the language models into more robust, interactive systems. However, out of the box, these models are not efficient at extracting preferences: the questions they generate are not informative, requiring a high number of user interactions and impeding the usability of the downstream system. In this work, we introduce an inference-time algorithm that helps LLMs quickly infer preferences by using more informative questions. Our algorithm uses a probabilistic model whose conditional distributions are defined by prompting an LLM, and returns questions that optimize expected entropy and expected model change. Results in a simplified interactive web shopping setting with real product items show that an LLM equipped with our entropy reduction algorithm outperforms baselines with the same underlying LLM on task performance while using fewer user interactions.  ( 2 min )
    XLand-MiniGrid: Scalable Meta-Reinforcement Learning Environments in JAX. (arXiv:2312.12044v1 [cs.LG])
    We present XLand-MiniGrid, a suite of tools and grid-world environments for meta-reinforcement learning research inspired by the diversity and depth of XLand and the simplicity and minimalism of MiniGrid. XLand-Minigrid is written in JAX, designed to be highly scalable, and can potentially run on GPU or TPU accelerators, democratizing large-scale experimentation with limited resources. To demonstrate the generality of our library, we have implemented some well-known single-task environments as well as new meta-learning environments capable of generating $10^8$ distinct tasks. We have empirically shown that the proposed environments can scale up to $2^{13}$ parallel instances on the GPU, reaching tens of millions of steps per second.  ( 2 min )
    Agglomerative Federated Learning: Empowering Larger Model Training via End-Edge-Cloud Collaboration. (arXiv:2312.11489v1 [cs.DC])
    Federated Learning (FL) enables training Artificial Intelligence (AI) models over end devices without compromising their privacy. As computing tasks are increasingly performed by a combination of cloud, edge, and end devices, FL can benefit from this End-Edge-Cloud Collaboration (EECC) paradigm to achieve collaborative device-scale expansion with real-time access. Although Hierarchical Federated Learning (HFL) supports multi-tier model aggregation suitable for EECC, prior works assume the same model structure on all computing nodes, constraining the model scale by the weakest end devices. To address this issue, we propose Agglomerative Federated Learning (FedAgg), which is a novel EECC-empowered FL framework that allows the trained models from end, edge, to cloud to grow larger in size and stronger in generalization ability. FedAgg recursively organizes computing nodes among all tiers based on Bridge Sample Based Online Distillation Protocol (BSBODP), which enables every pair of parent-child computing nodes to mutually transfer and distill knowledge extracted from generated bridge samples. This design enhances the performance by exploiting the potential of larger models, with privacy constraints of FL and flexibility requirements of EECC both satisfied. Experiments under various settings demonstrate that FedAgg outperforms state-of-the-art methods by an average of 4.53\% accuracy gains and remarkable improvements in convergence rate.  ( 2 min )
    Maatphor: Automated Variant Analysis for Prompt Injection Attacks. (arXiv:2312.11513v1 [cs.CR])
    Prompt injection has emerged as a serious security threat to large language models (LLMs). At present, the current best-practice for defending against newly-discovered prompt injection techniques is to add additional guardrails to the system (e.g., by updating the system prompt or using classifiers on the input and/or output of the model.) However, in the same way that variants of a piece of malware are created to evade anti-virus software, variants of a prompt injection can be created to evade the LLM's guardrails. Ideally, when a new prompt injection technique is discovered, candidate defenses should be tested not only against the successful prompt injection, but also against possible variants. In this work, we present, a tool to assist defenders in performing automated variant analysis of known prompt injection attacks. This involves solving two main challenges: (1) automatically generating variants of a given prompt according, and (2) automatically determining whether a variant was effective based only on the output of the model. This tool can also assist in generating datasets for jailbreak and prompt injection attacks, thus overcoming the scarcity of data in this domain. We evaluate Maatphor on three different types of prompt injection tasks. Starting from an ineffective (0%) seed prompt, Maatphor consistently generates variants that are at least 60% effective within the first 40 iterations.  ( 2 min )
    A Hybrid SOM and K-means Model for Time Series Energy Consumption Clustering. (arXiv:2312.11475v1 [cs.LG])
    Energy consumption analysis plays a pivotal role in addressing the challenges of sustainability and resource management. This paper introduces a novel approach to effectively cluster monthly energy consumption patterns by integrating two powerful techniques: Self-organizing maps and K-means clustering. The proposed method aims to exploit the benefits of both of these algorithms to enhance the accuracy and interpretability of clustering results for a dataset in which finding patterns is difficult. The main focus of this study is on a selection of time series energy consumption data from the Smart meters in London dataset. The data was preprocessed and reduced in dimensionality to capture essential temporal patterns while retaining their underlying structures. The SOM algorithm was utilized to extract the central representatives of the consumption patterns for each one of the houses over the course of each month, effectively reducing the dimensionality of the dataset and making it easier for analysis. Subsequently, the obtained SOM centroids were clustered using K-means, a popular centroid-based clustering technique. The experimental results demonstrated a significant silhouette score of 66%, indicating strong intra-cluster cohesion and inter-cluster separation which confirms the effectiveness of the proposed approach in the clustering task.  ( 2 min )
    3D-LFM: Lifting Foundation Model. (arXiv:2312.11894v1 [cs.CV])
    The lifting of 3D structure and camera from 2D landmarks is at the cornerstone of the entire discipline of computer vision. Traditional methods have been confined to specific rigid objects, such as those in Perspective-n-Point (PnP) problems, but deep learning has expanded our capability to reconstruct a wide range of object classes (e.g. C3PDO and PAUL) with resilience to noise, occlusions, and perspective distortions. All these techniques, however, have been limited by the fundamental need to establish correspondences across the 3D training data -- significantly limiting their utility to applications where one has an abundance of "in-correspondence" 3D data. Our approach harnesses the inherent permutation equivariance of transformers to manage varying number of points per 3D data instance, withstands occlusions, and generalizes to unseen categories. We demonstrate state of the art performance across 2D-3D lifting task benchmarks. Since our approach can be trained across such a broad class of structures we refer to it simply as a 3D Lifting Foundation Model (3D-LFM) -- the first of its kind.  ( 2 min )
    Learning Merton's Strategies in an Incomplete Market: Recursive Entropy Regularization and Biased Gaussian Exploration. (arXiv:2312.11797v1 [q-fin.PM])
    We study Merton's expected utility maximization problem in an incomplete market, characterized by a factor process in addition to the stock price process, where all the model primitives are unknown. We take the reinforcement learning (RL) approach to learn optimal portfolio policies directly by exploring the unknown market, without attempting to estimate the model parameters. Based on the entropy-regularization framework for general continuous-time RL formulated in Wang et al. (2020), we propose a recursive weighting scheme on exploration that endogenously discounts the current exploration reward by the past accumulative amount of exploration. Such a recursive regularization restores the optimality of Gaussian exploration. However, contrary to the existing results, the optimal Gaussian policy turns out to be biased in general, due to the interwinding needs for hedging and for exploration. We present an asymptotic analysis of the resulting errors to show how the level of exploration affects the learned policies. Furthermore, we establish a policy improvement theorem and design several RL algorithms to learn Merton's optimal strategies. At last, we carry out both simulation and empirical studies with a stochastic volatility environment to demonstrate the efficiency and robustness of the RL algorithms in comparison to the conventional plug-in method.  ( 2 min )
    Fast Decision Boundary based Out-of-Distribution Detector. (arXiv:2312.11536v1 [cs.LG])
    Efficient and effective Out-of-Distribution (OOD) detection is essential for the safe deployment of AI in latency-critical applications. Recently, studies have revealed that detecting OOD based on feature space information can be highly effective. Despite their effectiveness, however, exiting feature space OOD methods may incur non-negligible computational overhead, given their reliance on auxiliary models built from training features. In this paper, we aim to obviate auxiliary models to optimize computational efficiency while leveraging the rich information embedded in the feature space. We investigate from the novel perspective of decision boundaries and propose to detect OOD using the feature distance to decision boundaries. To minimize the cost of measuring the distance, we introduce an efficient closed-form estimation, analytically proven to tightly lower bound the distance. We observe that ID features tend to reside further from the decision boundaries than OOD features. Our observation aligns with the intuition that models tend to be more decisive on ID samples, considering that distance to decision boundaries quantifies model uncertainty. From our understanding, we propose a hyperparameter-free, auxiliary model-free OOD detector. Our OOD detector matches or surpasses the effectiveness of state-of-the-art methods across extensive experiments. Meanwhile, our OOD detector incurs practically negligible overhead in inference latency. Overall, we significantly enhance the efficiency-effectiveness trade-off in OOD detection.  ( 2 min )
    Goal Exploration Augmentation via Pre-trained Skills for Sparse-Reward Long-Horizon Goal-Conditioned Reinforcement Learning. (arXiv:2210.16058v2 [cs.LG] UPDATED)
    Reinforcement learning (RL) often struggles to accomplish a sparse-reward long-horizon task in a complex environment. Goal-conditioned reinforcement learning (GCRL) has been employed to tackle this difficult problem via a curriculum of easy-to-reach sub-goals. In GCRL, exploring novel sub-goals is essential for the agent to ultimately find the pathway to the desired goal. How to explore novel sub-goals efficiently is one of the most challenging issues in GCRL. Several goal exploration methods have been proposed to address this issue but still struggle to find the desired goals efficiently. In this paper, we propose a novel learning objective by optimizing the entropy of both achieved and new goals to be explored for more efficient goal exploration in sub-goal selection based GCRL. To optimize this objective, we first explore and exploit the frequently occurring goal-transition patterns mined in the environments similar to the current task to compose skills via skill learning. Then, the pretrained skills are applied in goal exploration. Evaluation on a variety of spare-reward long-horizon benchmark tasks suggests that incorporating our method into several state-of-the-art GCRL baselines significantly boosts their exploration efficiency while improving or maintaining their performance. The source code is available at: https://github.com/GEAPS/GEAPS.  ( 3 min )
    Privacy against Real-Time Speech Emotion Detection via Acoustic Adversarial Evasion of Machine Learning. (arXiv:2211.09273v4 [cs.LG] UPDATED)
    Smart speaker voice assistants (VAs) such as Amazon Echo and Google Home have been widely adopted due to their seamless integration with smart home devices and the Internet of Things (IoT) technologies. These VA services raise privacy concerns, especially due to their access to our speech. This work considers one such use case: the unaccountable and unauthorized surveillance of a user's emotion via speech emotion recognition (SER). This paper presents DARE-GP, a solution that creates additive noise to mask users' emotional information while preserving the transcription-relevant portions of their speech. DARE-GP does this by using a constrained genetic programming approach to learn the spectral frequency traits that depict target users' emotional content, and then generating a universal adversarial audio perturbation that provides this privacy protection. Unlike existing works, DARE-GP provides: a) real-time protection of previously unheard utterances, b) against previously unseen black-box SER classifiers, c) while protecting speech transcription, and d) does so in a realistic, acoustic environment. Further, this evasion is robust against defenses employed by a knowledgeable adversary. The evaluations in this work culminate with acoustic evaluations against two off-the-shelf commercial smart speakers using a small-form-factor (raspberry pi) integrated with a wake-word system to evaluate the efficacy of its real-world, real-time deployment.  ( 3 min )
    Symbolic Learning for Material Discovery. (arXiv:2312.11487v1 [cond-mat.mtrl-sci])
    Discovering new materials is essential to solve challenges in climate change, sustainability and healthcare. A typical task in materials discovery is to search for a material in a database which maximises the value of a function. That function is often expensive to evaluate, and can rely upon a simulation or an experiment. Here, we introduce SyMDis, a sample efficient optimisation method based on symbolic learning, that discovers near-optimal materials in a large database. SyMDis performs comparably to a state-of-the-art optimiser, whilst learning interpretable rules to aid physical and chemical verification. Furthermore, the rules learned by SyMDis generalise to unseen datasets and return high performing candidates in a zero-shot evaluation, which is difficult to achieve with other approaches.  ( 2 min )
    Asymmetric Norms to Approximate the Minimum Action Distance. (arXiv:2312.10276v2 [cs.LG] UPDATED)
    This paper presents a state representation for reward-free Markov decision processes. The idea is to learn, in a self-supervised manner, an embedding space where distances between pairs of embedded states correspond to the minimum number of actions needed to transition between them. Unlike previous methods, our approach incorporates an asymmetric norm parametrization, enabling accurate approximations of minimum action distances in environments with inherent asymmetry. We show how this representation can be leveraged to learn goal-conditioned policies, providing a notion of similarity between states and goals and a useful heuristic distance to guide planning. To validate our approach, we conduct empirical experiments on both symmetric and asymmetric environments. Our results show that our asymmetric norm parametrization performs comparably to symmetric norms in symmetric environments and surpasses symmetric norms in asymmetric environments.  ( 2 min )
    Federated Best Arm Identification with Heterogeneous Clients. (arXiv:2210.07780v3 [cs.LG] UPDATED)
    We study best arm identification in a federated multi-armed bandit setting with a central server and multiple clients, when each client has access to a {\em subset} of arms and each arm yields independent Gaussian observations. The goal is to identify the best arm of each client subject to an upper bound on the error probability; here, the best arm is one that has the largest {\em average} value of the means averaged across all clients having access to the arm. Our interest is in the asymptotics as the error probability vanishes. We provide an asymptotic lower bound on the growth rate of the expected stopping time of any algorithm. Furthermore, we show that for any algorithm whose upper bound on the expected stopping time matches with the lower bound up to a multiplicative constant ({\em almost-optimal} algorithm), the ratio of any two consecutive communication time instants must be {\em bounded}, a result that is of independent interest. We thereby infer that an algorithm can communicate no more sparsely than at exponential time instants in order to be almost-optimal. For the class of almost-optimal algorithms, we present the first-of-its-kind asymptotic lower bound on the expected number of {\em communication rounds} until stoppage. We propose a novel algorithm that communicates at exponential time instants, and demonstrate that it is asymptotically almost-optimal.  ( 3 min )
    Neuro-Symbolic Continual Learning: Knowledge, Reasoning Shortcuts and Concept Rehearsal. (arXiv:2302.01242v2 [cs.LG] UPDATED)
    We introduce Neuro-Symbolic Continual Learning, where a model has to solve a sequence of neuro-symbolic tasks, that is, it has to map sub-symbolic inputs to high-level concepts and compute predictions by reasoning consistently with prior knowledge. Our key observation is that neuro-symbolic tasks, although different, often share concepts whose semantics remains stable over time. Traditional approaches fall short: existing continual strategies ignore knowledge altogether, while stock neuro-symbolic architectures suffer from catastrophic forgetting. We show that leveraging prior knowledge by combining neuro-symbolic architectures with continual strategies does help avoid catastrophic forgetting, but also that doing so can yield models affected by reasoning shortcuts. These undermine the semantics of the acquired concepts, even when detailed prior knowledge is provided upfront and inference is exact, and in turn continual performance. To overcome these issues, we introduce COOL, a COncept-level cOntinual Learning strategy tailored for neuro-symbolic continual problems that acquires high-quality concepts and remembers them over time. Our experiments on three novel benchmarks highlights how COOL attains sustained high performance on neuro-symbolic continual learning tasks in which other strategies fail.  ( 2 min )
    Android Malware Detection with Unbiased Confidence Guarantees. (arXiv:2312.11559v1 [cs.CR])
    The impressive growth of smartphone devices in combination with the rising ubiquity of using mobile platforms for sensitive applications such as Internet banking, have triggered a rapid increase in mobile malware. In recent literature, many studies examine Machine Learning techniques, as the most promising approach for mobile malware detection, without however quantifying the uncertainty involved in their detections. In this paper, we address this problem by proposing a machine learning dynamic analysis approach that provides provably valid confidence guarantees in each malware detection. Moreover the particular guarantees hold for both the malicious and benign classes independently and are unaffected by any bias in the data. The proposed approach is based on a novel machine learning framework, called Conformal Prediction, combined with a random forests classifier. We examine its performance on a large-scale dataset collected by installing 1866 malicious and 4816 benign applications on a real android device. We make this collection of dynamic analysis data available to the research community. The obtained experimental results demonstrate the empirical validity, usefulness and unbiased nature of the outputs produced by the proposed approach.  ( 2 min )
    AI-TA: Towards an Intelligent Question-Answer Teaching Assistant using Open-Source LLMs. (arXiv:2311.02775v3 [cs.LG] UPDATED)
    Responding to the thousands of student questions on online QA platforms each semester has a considerable human cost, particularly in computing courses with rapidly growing enrollments. To address the challenges of scalable and intelligent question-answering (QA), we introduce an innovative solution that leverages open-source Large Language Models (LLMs) from the LLaMA-2 family to ensure data privacy. Our approach combines augmentation techniques such as retrieval augmented generation (RAG), supervised fine-tuning (SFT), and learning from human preferences data using Direct Preference Optimization (DPO). Through extensive experimentation on a Piazza dataset from an introductory CS course, comprising 10,000 QA pairs and 1,500 pairs of preference data, we demonstrate a significant 30% improvement in the quality of answers, with RAG being a particularly impactful addition. Our contributions include the development of a novel architecture for educational QA, extensive evaluations of LLM performance utilizing both human assessments and LLM-based metrics, and insights into the challenges and future directions of educational data processing. This work paves the way for the development of AI-TA, an intelligent QA assistant customizable for courses with an online QA platform  ( 3 min )
    Chain-of-Questions Training with Latent Answers for Robust Multistep Question Answering. (arXiv:2305.14901v2 [cs.CL] UPDATED)
    We train a language model (LM) to robustly answer multistep questions by generating and answering sub-questions. We propose Chain-of-Questions, a framework that trains a model to generate sub-questions and sub-answers one at a time by leveraging human annotated question decomposition meaning representation (QDMR). The key technical challenge is that QDMR only contains sub-questions but not answers to those sub-questions, so we treat sub-answers as latent variables and optimize them using a novel dynamic mixture of Hard-EM and MAPO. Chain-of-Questions greatly outperforms strong neuro-symbolic methods by 9.0 F1 on DROP contrast set, and outperforms GPT-3.5 by 24.3 F1 on HOTPOTQA adversarial set, thus demonstrating the effectiveness and robustness of our framework.  ( 2 min )
    Sign Language Conversation Interpretation Using Wearable Sensors and Machine Learning. (arXiv:2312.11903v1 [eess.SP])
    The count of people suffering from various levels of hearing loss reached 1.57 billion in 2019. This huge number tends to suffer on many personal and professional levels and strictly needs to be included with the rest of society healthily. This paper presents a proof of concept of an automatic sign language recognition system based on data obtained using a wearable device of 3 flex sensors. The system is designed to interpret a selected set of American Sign Language (ASL) dynamic words by collecting data in sequences of the performed signs and using machine learning methods. The built models achieved high-quality performances, such as Random Forest with 99% accuracy, Support Vector Machine (SVM) with 99%, and two K-Nearest Neighbor (KNN) models with 98%. This indicates many possible paths toward the development of a full-scale system.  ( 2 min )
    ACCL+: an FPGA-Based Collective Engine for Distributed Applications. (arXiv:2312.11742v1 [cs.DC])
    FPGAs are increasingly prevalent in cloud deployments, serving as Smart NICs or network-attached accelerators. Despite their potential, developing distributed FPGA-accelerated applications remains cumbersome due to the lack of appropriate infrastructure and communication abstractions. To facilitate the development of distributed applications with FPGAs, in this paper we propose ACCL+, an open-source versatile FPGA-based collective communication library. Portable across different platforms and supporting UDP, TCP, as well as RDMA, ACCL+ empowers FPGA applications to initiate direct FPGA-to-FPGA collective communication. Additionally, it can serve as a collective offload engine for CPU applications, freeing the CPU from networking tasks. It is user-extensible, allowing new collectives to be implemented and deployed without having to re-synthesize the FPGA circuit. We evaluated ACCL+ on an FPGA cluster with 100 Gb/s networking, comparing its performance against software MPI over RDMA. The results demonstrate ACCL+'s significant advantages for FPGA-based distributed applications and highly competitive performance for CPU applications. We showcase ACCL+'s dual role with two use cases: seamlessly integrating as a collective offload engine to distribute CPU-based vector-matrix multiplication, and serving as a crucial and efficient component in designing fully FPGA-based distributed deep-learning recommendation inference.  ( 2 min )
    SkillDiffuser: Interpretable Hierarchical Planning via Skill Abstractions in Diffusion-Based Task Execution. (arXiv:2312.11598v1 [cs.RO])
    Diffusion models have demonstrated strong potential for robotic trajectory planning. However, generating coherent and long-horizon trajectories from high-level instructions remains challenging, especially for complex tasks requiring multiple sequential skills. We propose SkillDiffuser, an end-to-end hierarchical planning framework integrating interpretable skill learning with conditional diffusion planning to address this problem. At the higher level, the skill abstraction module learns discrete, human-understandable skill representations from visual observations and language instructions. These learned skill embeddings are then used to condition the diffusion model to generate customized latent trajectories aligned with the skills. It allows for generating diverse state trajectories that adhere to the learnable skills. By integrating skill learning with conditional trajectory generation, SkillDiffuser produces coherent behavior following abstract instructions across diverse tasks. Experiments on multi-task robotic manipulation benchmarks like Meta-World and LOReL demonstrate state-of-the-art performance and human-interpretable skill representations from SkillDiffuser.  ( 2 min )
    End-to-End Reinforcement Learning for Torque Based Variable Height Hopping. (arXiv:2307.16676v2 [cs.RO] UPDATED)
    Legged locomotion is arguably the most suited and versatile mode to deal with natural or unstructured terrains. Intensive research into dynamic walking and running controllers has recently yielded great advances, both in the optimal control and reinforcement learning (RL) literature. Hopping is a challenging dynamic task involving a flight phase and has the potential to increase the traversability of legged robots. Model based control for hopping typically relies on accurate detection of different jump phases, such as lift-off or touch down, and using different controllers for each phase. In this paper, we present a end-to-end RL based torque controller that learns to implicitly detect the relevant jump phases, removing the need to provide manual heuristics for state detection. We also extend a method for simulation to reality transfer of the learned controller to contact rich dynamic tasks, resulting in successful deployment on the robot after training without parameter tuning.  ( 3 min )
    Polar Encoding: A Simple Baseline Approach for Classification with Missing Values. (arXiv:2210.01905v3 [cs.LG] UPDATED)
    We propose polar encoding, a representation of categorical and numerical $[0,1]$-valued attributes with missing values to be used in a classification context. We argue that this is a good baseline approach, because it can be used with any classification algorithm, preserves missingness information, is very simple to apply and offers good performance. In particular, unlike the existing missing-indicator approach, it does not require imputation, ensures that missing values are equidistant from non-missing values, and lets decision tree algorithms choose how to split missing values, thereby providing a practical realisation of the "missingness incorporated in attributes" (MIA) proposal. Furthermore, we show that categorical and $[0,1]$-valued attributes can be viewed as special cases of a single attribute type, corresponding to the classical concept of barycentric coordinates, and that this offers a natural interpretation of polar encoding as a fuzzified form of one-hot encoding. With an experiment based on twenty real-life datasets with missing values, we show that, in terms of the resulting classification performance, polar encoding performs better than the state-of-the-art strategies \e{multiple imputation by chained equations} (MICE) and \e{multiple imputation with denoising autoencoders} (MIDAS) and -- depending on the classifier -- about as well or better than mean/mode imputation with missing-indicators.  ( 3 min )
    Residual ANODE. (arXiv:2312.11629v1 [hep-ph])
    We present R-ANODE, a new method for data-driven, model-agnostic resonant anomaly detection that raises the bar for both performance and interpretability. The key to R-ANODE is to enhance the inductive bias of the anomaly detection task by fitting a normalizing flow directly to the small and unknown signal component, while holding fixed a background model (also a normalizing flow) learned from sidebands. In doing so, R-ANODE is able to outperform all classifier-based, weakly-supervised approaches, as well as the previous ANODE method which fit a density estimator to all of the data in the signal region instead of just the signal. We show that the method works equally well whether the unknown signal fraction is learned or fixed, and is even robust to signal fraction misspecification. Finally, with the learned signal model we can sample and gain qualitative insights into the underlying anomaly, which greatly enhances the interpretability of resonant anomaly detection and offers the possibility of simultaneously discovering and characterizing the new physics that could be hiding in the data.  ( 2 min )
    Time-Transformer: Integrating Local and Global Features for Better Time Series Generation. (arXiv:2312.11714v1 [cs.LG])
    Generating time series data is a promising approach to address data deficiency problems. However, it is also challenging due to the complex temporal properties of time series data, including local correlations as well as global dependencies. Most existing generative models have failed to effectively learn both the local and global properties of time series data. To address this open problem, we propose a novel time series generative model named 'Time-Transformer AAE', which consists of an adversarial autoencoder (AAE) and a newly designed architecture named 'Time-Transformer' within the decoder. The Time-Transformer first simultaneously learns local and global features in a layer-wise parallel design, combining the abilities of Temporal Convolutional Networks and Transformer in extracting local features and global dependencies respectively. Second, a bidirectional cross attention is proposed to provide complementary guidance across the two branches and achieve proper fusion between local and global features. Experimental results demonstrate that our model can outperform existing state-of-the-art models in 5 out of 6 datasets, specifically on those with data containing both global and local properties. Furthermore, we highlight our model's advantage on handling this kind of data via an artificial dataset. Finally, we show our model's ability to address a real-world problem: data augmentation to support learning with small datasets and imbalanced datasets.  ( 2 min )
    Clustering Mixtures of Bounded Covariance Distributions Under Optimal Separation. (arXiv:2312.11769v1 [cs.LG])
    We study the clustering problem for mixtures of bounded covariance distributions, under a fine-grained separation assumption. Specifically, given samples from a $k$-component mixture distribution $D = \sum_{i =1}^k w_i P_i$, where each $w_i \ge \alpha$ for some known parameter $\alpha$, and each $P_i$ has unknown covariance $\Sigma_i \preceq \sigma^2_i \cdot I_d$ for some unknown $\sigma_i$, the goal is to cluster the samples assuming a pairwise mean separation in the order of $(\sigma_i+\sigma_j)/\sqrt{\alpha}$ between every pair of components $P_i$ and $P_j$. Our contributions are as follows: For the special case of nearly uniform mixtures, we give the first poly-time algorithm for this clustering task. Prior work either required separation scaling with the maximum cluster standard deviation (i.e. $\max_i \sigma_i$) [DKK+22b] or required both additional structural assumptions and mean separation scaling as a large degree polynomial in $1/\alpha$ [BKK22]. For general-weight mixtures, we point out that accurate clustering is information-theoretically impossible under our fine-grained mean separation assumptions. We introduce the notion of a clustering refinement -- a list of not-too-small subsets satisfying a similar separation, and which can be merged into a clustering approximating the ground truth -- and show that it is possible to efficiently compute an accurate clustering refinement of the samples. Furthermore, under a variant of the "no large sub-cluster'' condition from in prior work [BKK22], we show that our algorithm outputs an accurate clustering, not just a refinement, even for general-weight mixtures. As a corollary, we obtain efficient clustering algorithms for mixtures of well-conditioned high-dimensional log-concave distributions. Moreover, our algorithm is robust to $\Omega(\alpha)$-fraction of adversarial outliers.  ( 3 min )
    H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models. (arXiv:2306.14048v3 [cs.LG] UPDATED)
    Large Language Models (LLMs), despite their recent impressive accomplishments, are notably cost-prohibitive to deploy, particularly for applications involving long-content generation, such as dialogue systems and story writing. Often, a large amount of transient state information, referred to as the KV cache, is stored in GPU memory in addition to model parameters, scaling linearly with the sequence length and batch size. In this paper, we introduce a novel approach for implementing the KV cache which significantly reduces its memory footprint. Our approach is based on the noteworthy observation that a small portion of tokens contributes most of the value when computing attention scores. We call these tokens Heavy Hitters (H$_2$). Through a comprehensive investigation, we find that (i) the emergence of H$_2$ is natural and strongly correlates with the frequent co-occurrence of tokens in the text, and (ii) removing them results in significant performance degradation. Based on these insights, we propose Heavy Hitter Oracle (H$_2$O), a KV cache eviction policy that dynamically retains a balance of recent and H$_2$ tokens. We formulate the KV cache eviction as a dynamic submodular problem and prove (under mild assumptions) a theoretical guarantee for our novel eviction algorithm which could help guide future work. We validate the accuracy of our algorithm with OPT, LLaMA, and GPT-NeoX across a wide range of tasks. Our implementation of H$_2$O with 20% heavy hitters improves the throughput over three leading inference systems DeepSpeed Zero-Inference, Hugging Face Accelerate, and FlexGen by up to 29$\times$, 29$\times$, and 3$\times$ on OPT-6.7B and OPT-30B. With the same batch size, H2O can reduce the latency by up to 1.9$\times$. The code is available at https://github.com/FMInference/H2O.  ( 3 min )
    ConsistentEE: A Consistent and Hardness-Guided Early Exiting Method for Accelerating Language Models Inference. (arXiv:2312.11882v1 [cs.CL])
    Early Exiting is one of the most popular methods to achieve efficient inference. Current early exiting methods adopt the (weighted) sum of the cross entropy loss of all internal classifiers during training, imposing all these classifiers to predict all instances correctly. However, during inference, as long as one internal classifier predicts an instance correctly, it can accelerate without losing accuracy. Thus, there is a notable gap between training and inference. We propose ConsistentEE, an early exiting method that is consistent in training and inference. ConsistentEE formulates the early exiting process as a reinforcement learning problem. A policy network is added to decide whether an instance should exit or continue. The training objective of ConsistentEE only require each instance to be predicted correctly by one internal classifier. Additionally, we introduce the concept Memorize Layer to measure the hardness of an instance. We incorporate memorized layer into reward function design, which allows ``easy'' instances to focus more on acceleration while ``hard'' instances to focus more on accuracy. Experimental results show that our method outperforms other baselines on various natural language understanding and generation tasks.  ( 2 min )
    Root Cause Explanation of Outliers under Noisy Mechanisms. (arXiv:2312.11818v1 [cs.AI])
    Identifying root causes of anomalies in causal processes is vital across disciplines. Once identified, one can isolate the root causes and implement necessary measures to restore the normal operation. Causal processes are often modelled as graphs with entities being nodes and their paths/interconnections as edge. Existing work only consider the contribution of nodes in the generative process, thus can not attribute the outlier score to the edges of the mechanism if the anomaly occurs in the connections. In this paper, we consider both individual edge and node of each mechanism when identifying the root causes. We introduce a noisy functional causal model to account for this purpose. Then, we employ Bayesian learning and inference methods to infer the noises of the nodes and edges. We then represent the functional form of a target outlier leaf as a function of the node and edge noises. Finally, we propose an efficient gradient-based attribution method to compute the anomaly attribution scores which scales linearly with the number of nodes and edges. Experiments on simulated datasets and two real-world scenario datasets show better anomaly attribution performance of the proposed method compared to the baselines. Our method scales to larger graphs with more nodes and edges.  ( 2 min )
    LR-XFL: Logical Reasoning-based Explainable Federated Learning. (arXiv:2308.12681v2 [cs.AI] UPDATED)
    Federated learning (FL) is an emerging approach for training machine learning models collaboratively while preserving data privacy. The need for privacy protection makes it difficult for FL models to achieve global transparency and explainability. To address this limitation, we incorporate logic-based explanations into FL by proposing the Logical Reasoning-based eXplainable Federated Learning (LR-XFL) approach. Under LR-XFL, FL clients create local logic rules based on their local data and send them, along with model updates, to the FL server. The FL server connects the local logic rules through a proper logical connector that is derived based on properties of client data, without requiring access to the raw data. In addition, the server also aggregates the local model updates with weight values determined by the quality of the clients' local data as reflected by their uploaded logic rules. The results show that LR-XFL outperforms the most relevant baseline by 1.19%, 5.81% and 5.41% in terms of classification accuracy, rule accuracy and rule fidelity, respectively. The explicit rule evaluation and expression under LR-XFL enable human experts to validate and correct the rules on the server side, hence improving the global FL model's robustness to errors. It has the potential to enhance the transparency of FL models for areas like healthcare and finance where both data privacy and explainability are important.  ( 2 min )
    FP8-LM: Training FP8 Large Language Models. (arXiv:2310.18313v2 [cs.LG] UPDATED)
    In this paper, we explore FP8 low-bit data formats for efficient training of large language models (LLMs). Our key insight is that most variables, such as gradients and optimizer states, in LLM training can employ low-precision data formats without compromising model accuracy and requiring no changes to hyper-parameters. Specifically, we propose a new FP8 automatic mixed-precision framework for training LLMs. This framework offers three levels of FP8 utilization to streamline mixed-precision and distributed parallel training for LLMs. It gradually incorporates 8-bit gradients, optimizer states, and distributed learning in an incremental manner. Experiment results show that, during the training of GPT-175B model on H100 GPU platform, our FP8 mixed-precision training framework not only achieved a remarkable 39% reduction in real memory usage but also ran 75% faster than the widely adopted BF16 framework (i.e., Megatron-LM), surpassing the speed of Nvidia Transformer Engine by 37%. This largely reduces the training costs for large foundation models. Furthermore, our FP8 mixed-precision training methodology is generic. It can be seamlessly applied to other tasks such as LLM instruction tuning and reinforcement learning with human feedback, offering savings in fine-tuning expenses. Our FP8 low-precision training framework is open-sourced at {https://github.com/Azure/MS-AMP}{aka.ms/MS.AMP}.  ( 3 min )
    GroupMixNorm Layer for Learning Fair Models. (arXiv:2312.11969v1 [cs.LG])
    Recent research has identified discriminatory behavior of automated prediction algorithms towards groups identified on specific protected attributes (e.g., gender, ethnicity, age group, etc.). When deployed in real-world scenarios, such techniques may demonstrate biased predictions resulting in unfair outcomes. Recent literature has witnessed algorithms for mitigating such biased behavior mostly by adding convex surrogates of fairness metrics such as demographic parity or equalized odds in the loss function, which are often not easy to estimate. This research proposes a novel in-processing based GroupMixNorm layer for mitigating bias from deep learning models. The GroupMixNorm layer probabilistically mixes group-level feature statistics of samples across different groups based on the protected attribute. The proposed method improves upon several fairness metrics with minimal impact on overall accuracy. Analysis on benchmark tabular and image datasets demonstrates the efficacy of the proposed method in achieving state-of-the-art performance. Further, the experimental analysis also suggests the robustness of the GroupMixNorm layer against new protected attributes during inference and its utility in eliminating bias from a pre-trained network.  ( 2 min )
    MISA: Unveiling the Vulnerabilities in Split Federated Learning. (arXiv:2312.11026v2 [cs.LG] UPDATED)
    \textit{Federated learning} (FL) and \textit{split learning} (SL) are prevailing distributed paradigms in recent years. They both enable shared global model training while keeping data localized on users' devices. The former excels in parallel execution capabilities, while the latter enjoys low dependence on edge computing resources and strong privacy protection. \textit{Split federated learning} (SFL) combines the strengths of both FL and SL, making it one of the most popular distributed architectures. Furthermore, a recent study has claimed that SFL exhibits robustness against poisoning attacks, with a fivefold improvement compared to FL in terms of robustness. In this paper, we present a novel poisoning attack known as MISA. It poisons both the top and bottom models, causing a \textbf{\underline{misa}}lignment in the global model, ultimately leading to a drastic accuracy collapse. This attack unveils the vulnerabilities in SFL, challenging the conventional belief that SFL is robust against poisoning attacks. Extensive experiments demonstrate that our proposed MISA poses a significant threat to the availability of SFL, underscoring the imperative for academia and industry to accord this matter due attention.  ( 2 min )
    In-Context Exemplars as Clues to Retrieving from Large Associative Memory. (arXiv:2311.03498v2 [cs.CL] UPDATED)
    Recently, large language models (LLMs) have made remarkable progress in natural language processing. The most representative ability of LLMs is in-context learning (ICL), which enables LLMs to learn patterns from in-context exemplars without training. The performance of ICL greatly depends on the exemplars used. However, how to choose exemplars remains unclear due to the lack of understanding of how in-context learning works. In this paper, we present a novel perspective on ICL by conceptualizing it as contextual retrieval from a model of associative memory. We establish a theoretical framework of ICL based on Hopfield Networks. Based on our framework, we look into how in-context exemplars influence the performance of ICL and propose more efficient active exemplar selection. Our study sheds new light on the mechanism of ICL by connecting it to memory retrieval, with potential implications for advancing the understanding of LLMs.  ( 2 min )
    Mining Patents with Large Language Models Elucidates the Chemical Function Landscape. (arXiv:2309.08765v2 [q-bio.QM] UPDATED)
    The fundamental goal of small molecule discovery is to generate chemicals with target functionality. While this often proceeds through structure-based methods, we set out to investigate the practicality of orthogonal methods that leverage the extensive corpus of chemical literature. We hypothesize that a sufficiently large text-derived chemical function dataset would mirror the actual landscape of chemical functionality. Such a landscape would implicitly capture complex physical and biological interactions given that chemical function arises from both a molecule's structure and its interacting partners. To evaluate this hypothesis, we built a Chemical Function (CheF) dataset of patent-derived functional labels. This dataset, comprising 631K molecule-function pairs, was created using an LLM- and embedding-based method to obtain functional labels for approximately 100K molecules from their corresponding 188K unique patents. We carry out a series of analyses demonstrating that the CheF dataset contains a semantically coherent textual representation of the functional landscape congruent with chemical structural relationships, thus approximating the actual chemical function landscape. We then demonstrate that this text-based functional landscape can be leveraged to identify drugs with target functionality using a model able to predict functional profiles from structure alone. We believe that functional label-guided molecular discovery may serve as an orthogonal approach to traditional structure-based methods in the pursuit of designing novel functional molecules.  ( 3 min )
    Divide-and-Conquer Dynamics in AI-Driven Disempowerment. (arXiv:2310.06009v2 [cs.CY] UPDATED)
    AI companies are attempting to create AI systems that outperform humans at most economically valuable work. Current AI models are already automating away the livelihoods of some artists, actors, and writers. But there is infighting between those who prioritize current harms and future harms. We construct a game-theoretic model of conflict to study the causes and consequences of this disunity. Our model also helps explain why throughout history, stakeholders sharing a common threat have found it advantageous to unite against it, and why the common threat has in turn found it advantageous to divide and conquer. Under realistic parameter assumptions, our model makes several predictions that find preliminary corroboration in the historical-empirical record. First, current victims of AI-driven disempowerment need the future victims to realize that their interests are also under serious and imminent threat, so that future victims are incentivized to support current victims in solidarity. Second, the movement against AI-driven disempowerment can become more united, and thereby more likely to prevail, if members believe that their efforts will be successful as opposed to futile. Finally, the movement can better unite and prevail if its members are less myopic. Myopic members prioritize their future well-being less than their present well-being, and are thus disinclined to solidarily support current victims today at personal cost, even if this is necessary to counter the shared threat of AI-driven disempowerment.  ( 3 min )
    Time-Series Contrastive Learning against False Negatives and Class Imbalance. (arXiv:2312.11939v1 [cs.LG])
    As an exemplary self-supervised approach for representation learning, time-series contrastive learning has exhibited remarkable advancements in contemporary research. While recent contrastive learning strategies have focused on how to construct appropriate positives and negatives, in this study, we conduct theoretical analysis and find they have overlooked the fundamental issues: false negatives and class imbalance inherent in the InfoNCE loss-based framework. Therefore, we introduce a straightforward modification grounded in the SimCLR framework, universally adaptable to models engaged in the instance discrimination task. By constructing instance graphs to facilitate interactive learning among instances, we emulate supervised contrastive learning via the multiple-instances discrimination task, mitigating the harmful impact of false negatives. Moreover, leveraging the graph structure and few-labeled data, we perform semi-supervised consistency classification and enhance the representative ability of minority classes. We compared our method with the most popular time-series contrastive learning methods on four real-world time-series datasets and demonstrated our significant advantages in overall performance.  ( 2 min )
    STERLING: Synergistic Representation Learning on Bipartite Graphs. (arXiv:2302.05428v2 [cs.LG] UPDATED)
    A fundamental challenge of bipartite graph representation learning is how to extract informative node embeddings. Self-Supervised Learning (SSL) is a promising paradigm to address this challenge. Most recent bipartite graph SSL methods are based on contrastive learning which learns embeddings by discriminating positive and negative node pairs. Contrastive learning usually requires a large number of negative node pairs, which could lead to computational burden and semantic errors. In this paper, we introduce a novel synergistic representation learning model (STERLING) to learn node embeddings without negative node pairs. STERLING preserves the unique local and global synergies in bipartite graphs. The local synergies are captured by maximizing the similarity of the inter-type and intra-type positive node pairs, and the global synergies are captured by maximizing the mutual information of co-clusters. Theoretical analysis demonstrates that STERLING could improve the connectivity between different node types in the embedding space. Extensive empirical evaluation on various benchmark datasets and tasks demonstrates the effectiveness of STERLING for extracting node embeddings.  ( 2 min )
    Improving Lipschitz-Constrained Neural Networks by Learning Activation Functions. (arXiv:2210.16222v2 [cs.LG] UPDATED)
    Lipschitz-constrained neural networks have several advantages over unconstrained ones and can be applied to a variety of problems, making them a topic of attention in the deep learning community. Unfortunately, it has been shown both theoretically and empirically that they perform poorly when equipped with ReLU activation functions. By contrast, neural networks with learnable 1-Lipschitz linear splines are known to be more expressive. In this paper, we show that such networks correspond to global optima of a constrained functional optimization problem that consists of the training of a neural network composed of 1-Lipschitz linear layers and 1-Lipschitz freeform activation functions with second-order total-variation regularization. Further, we propose an efficient method to train these neural networks. Our numerical experiments show that our trained networks compare favorably with existing 1-Lipschitz neural architectures.  ( 2 min )
    Robust Communicative Multi-Agent Reinforcement Learning with Active Defense. (arXiv:2312.11545v1 [cs.MA])
    Communication in multi-agent reinforcement learning (MARL) has been proven to effectively promote cooperation among agents recently. Since communication in real-world scenarios is vulnerable to noises and adversarial attacks, it is crucial to develop robust communicative MARL technique. However, existing research in this domain has predominantly focused on passive defense strategies, where agents receive all messages equally, making it hard to balance performance and robustness. We propose an active defense strategy, where agents automatically reduce the impact of potentially harmful messages on the final decision. There are two challenges to implement this strategy, that are defining unreliable messages and adjusting the unreliable messages' impact on the final decision properly. To address them, we design an Active Defense Multi-Agent Communication framework (ADMAC), which estimates the reliability of received messages and adjusts their impact on the final decision accordingly with the help of a decomposable decision structure. The superiority of ADMAC over existing methods is validated by experiments in three communication-critical tasks under four types of attacks.  ( 2 min )
    SeGA: Preference-Aware Self-Contrastive Learning with Prompts for Anomalous User Detection on Twitter. (arXiv:2312.11553v1 [cs.SI])
    In the dynamic and rapidly evolving world of social media, detecting anomalous users has become a crucial task to address malicious activities such as misinformation and cyberbullying. As the increasing number of anomalous users improves the ability to mimic normal users and evade detection, existing methods only focusing on bot detection are ineffective in terms of capturing subtle distinctions between users. To address these challenges, we proposed SeGA, preference-aware self-contrastive learning for anomalous user detection, which leverages heterogeneous entities and their relations in the Twittersphere to detect anomalous users with different malicious strategies. SeGA utilizes the knowledge of large language models to summarize user preferences via posts. In addition, integrating user preferences with prompts as pseudo-labels for preference-aware self-contrastive learning enables the model to learn multifaceted aspects for describing the behaviors of users. Extensive experiments on the proposed TwBNT benchmark demonstrate that SeGA significantly outperforms the state-of-the-art methods (+3.5\% ~ 27.6\%) and empirically validate the effectiveness of the model design and pre-training strategies. Our code and data are publicly available at https://github.com/ying0409/SeGA.  ( 2 min )
    Sparse is Enough in Fine-tuning Pre-trained Large Language Model. (arXiv:2312.11875v1 [cs.LG])
    With the prevalence of pre-training-fine-tuning paradigm, how to efficiently adapt the pre-trained model to the downstream tasks has been an intriguing issue. Parameter-Efficient Fine-Tuning (PEFT) methods have been proposed for low-cost adaptation, including Adapters, Bia-only, and the recently widely used Low-Rank Adaptation. Although these methods have demonstrated their effectiveness to some extent and have been widely applied, the underlying principles are still unclear. In this paper, we reveal the transition of loss landscape in the downstream domain from random initialization to pre-trained initialization, that is, from low-amplitude oscillation to high-amplitude oscillation. The parameter gradients exhibit a property akin to sparsity, where a small fraction of components dominate the total gradient norm, for instance, 1% of the components account for 99% of the gradient. This property ensures that the pre-trained model can easily find a flat minimizer which guarantees the model's ability to generalize even with a low number of trainable parameters. Based on this, we propose a gradient-based sparse fine-tuning algorithm, named Sparse Increment Fine-Tuning (SIFT), and validate its effectiveness on a range of tasks including the GLUE Benchmark and Instruction-tuning. The code is accessible at https://github.com/song-wx/SIFT/.  ( 2 min )
    Narrowing the Gap between Supervised and Unsupervised Sentence Representation Learning with Large Language Model. (arXiv:2309.06453v2 [cs.CL] UPDATED)
    Sentence Representation Learning (SRL) is a fundamental task in Natural Language Processing (NLP), with the Contrastive Learning of Sentence Embeddings (CSE) being the mainstream technique due to its superior performance. An intriguing phenomenon in CSE is the significant performance gap between supervised and unsupervised methods, with their only difference lying in the training data. Previous works attribute this performance gap to differences in two representation properties (alignment and uniformity). However, since alignment and uniformity only measure the results, they fail to answer "What aspects of the training data contribute to the performance gap?" and "How can the performance gap be narrowed?", In this paper, we conduct empirical experiments to answer these "What" and "How" questions. We first answer the "What" question by thoroughly comparing the behavior of supervised and unsupervised CSE during their respective training processes. From the comparison, we identify the similarity pattern as a key factor to the performance gap, and introduce a metric, called Relative Fitting Difficulty (RFD), to measure the complexity of the similarity pattern. Then, based on the insights gained from the "What" question, we tackle the "How" question by increasing the pattern complexity of the training data. We achieve this by leveraging the In-Context Learning (ICL) capability of the Large Language Model (LLM) to generate data that simulates complex patterns. By utilizing the hierarchical patterns in the LLM-generated data, we effectively narrow the gap between supervised and unsupervised CSE. We release our codes and appendix at https://github.com/BDBC-KG-NLP/NGCSE.  ( 3 min )
    A Study on Transferability of Deep Learning Models for Network Intrusion Detection. (arXiv:2312.11550v1 [cs.CR])
    In this paper, we explore transferability in learning between different attack classes in a network intrusion detection setup. We evaluate transferability of attack classes by training a deep learning model with a specific attack class and testing it on a separate attack class. We observe the effects of real and synthetically generated data augmentation techniques on transferability. We investigate the nature of observed transferability relationships, which can be either symmetric or asymmetric. We also examine explainability of the transferability relationships using the recursive feature elimination algorithm. We study data preprocessing techniques to boost model performance. The code for this work can be found at https://github.com/ghosh64/transferability.  ( 2 min )
    Towards AI-driven Integrative Emissions Monitoring & Management for Nature-Based Climate Solutions. (arXiv:2312.11566v1 [cs.LG])
    AI has been proposed as an important tool to support several efforts related to nature-based climate solutions such as the detection of wildfires that affect forests and vegetation-based offsets. While this and other use-cases provide important demonstrative value of the power of AI in climate change mitigation, such efforts have typically been undertaken in silos, without awareness of the integrative nature of real-world climate policy-making. In this paper, we propose a novel overarching framework for AI-aided integrated and comprehensive decision support for various aspects of nature-based climate decision-making. Focusing on vegetation-based solutions such as forests, we demonstrate how different AI-aided decision support models such as AI-aided wildfire detection, AI-aided vegetation carbon stock assessment, reversal risk mitigation, and disaster response planning can be integrated into a comprehensive framework. Rather than being disparate elements, we posit that the exchange of data and analytical results across elements of the framework, and careful mitigation of uncertainty propagation will provide tremendous value relative to the status-quo for real-world climate policy-making.  ( 2 min )
    Hierarchical and Incremental Structural Entropy Minimization for Unsupervised Social Event Detection. (arXiv:2312.11891v1 [cs.SI])
    As a trending approach for social event detection, graph neural network (GNN)-based methods enable a fusion of natural language semantics and the complex social network structural information, thus showing SOTA performance. However, GNN-based methods can miss useful message correlations. Moreover, they require manual labeling for training and predetermining the number of events for prediction. In this work, we address social event detection via graph structural entropy (SE) minimization. While keeping the merits of the GNN-based methods, the proposed framework, HISEvent, constructs more informative message graphs, is unsupervised, and does not require the number of events given a priori. Specifically, we incrementally explore the graph neighborhoods using 1-dimensional (1D) SE minimization to supplement the existing message graph with edges between semantically related messages. We then detect events from the message graph by hierarchically minimizing 2-dimensional (2D) SE. Our proposed 1D and 2D SE minimization algorithms are customized for social event detection and effectively tackle the efficiency problem of the existing SE minimization algorithms. Extensive experiments show that HISEvent consistently outperforms GNN-based methods and achieves the new SOTA for social event detection under both closed- and open-set settings while being efficient and robust.  ( 2 min )
    Assessing SATNet's Ability to Solve the Symbol Grounding Problem. (arXiv:2312.11522v1 [cs.AI])
    SATNet is an award-winning MAXSAT solver that can be used to infer logical rules and integrated as a differentiable layer in a deep neural network. It had been shown to solve Sudoku puzzles visually from examples of puzzle digit images, and was heralded as an impressive achievement towards the longstanding AI goal of combining pattern recognition with logical reasoning. In this paper, we clarify SATNet's capabilities by showing that in the absence of intermediate labels that identify individual Sudoku digit images with their logical representations, SATNet completely fails at visual Sudoku (0% test accuracy). More generally, the failure can be pinpointed to its inability to learn to assign symbols to perceptual phenomena, also known as the symbol grounding problem, which has long been thought to be a prerequisite for intelligent agents to perform real-world logical reasoning. We propose an MNIST based test as an easy instance of the symbol grounding problem that can serve as a sanity check for differentiable symbolic solvers in general. Naive applications of SATNet on this test lead to performance worse than that of models without logical reasoning capabilities. We report on the causes of SATNet's failure and how to prevent them.  ( 2 min )
    Efficient and Scalable Graph Generation through Iterative Local Expansion. (arXiv:2312.11529v1 [cs.SI])
    In the realm of generative models for graphs, extensive research has been conducted. However, most existing methods struggle with large graphs due to the complexity of representing the entire joint distribution across all node pairs and capturing both global and local graph structures simultaneously. To overcome these issues, we introduce a method that generates a graph by progressively expanding a single node to a target graph. In each step, nodes and edges are added in a localized manner through denoising diffusion, building first the global structure, and then refining the local details. The local generation avoids modeling the entire joint distribution over all node pairs, achieving substantial computational savings with subquadratic runtime relative to node count while maintaining high expressivity through multiscale generation. Our experiments show that our model achieves state-of-the-art performance on well-established benchmark datasets while successfully scaling to graphs with at least 5000 nodes. Our method is also the first to successfully extrapolate to graphs outside of the training distribution, showcasing a much better generalization capability over existing methods.  ( 2 min )
    FAL-CUR: Fair Active Learning using Uncertainty and Representativeness on Fair Clustering. (arXiv:2209.12756v2 [cs.LG] UPDATED)
    Active Learning (AL) techniques have proven to be highly effective in reducing data labeling costs across a range of machine learning tasks. Nevertheless, one known challenge of these methods is their potential to introduce unfairness towards sensitive attributes. Although recent approaches have focused on enhancing fairness in AL, they tend to reduce the model's accuracy. To address this issue, we propose a novel strategy, named Fair Active Learning using fair Clustering, Uncertainty, and Representativeness (FAL-CUR), to improve fairness in AL. FAL-CUR tackles the fairness problem in AL by combining fair clustering with an acquisition function that determines which samples to query based on their uncertainty and representativeness scores. We evaluate the performance of FAL-CUR on four real-world datasets, and the results demonstrate that FAL-CUR achieves a 15% - 20% improvement in fairness compared to the best state-of-the-art method in terms of equalized odds while maintaining stable accuracy scores. Furthermore, an ablation study highlights the crucial roles of fair clustering in preserving fairness and the acquisition function in stabilizing the accuracy performance.  ( 2 min )
    Conductivity Imaging from Internal Measurements with Mixed Least-Squares Deep Neural Networks. (arXiv:2303.16454v3 [math.NA] UPDATED)
    In this work we develop a novel approach using deep neural networks to reconstruct the conductivity distribution in elliptic problems from one measurement of the solution over the whole domain. The approach is based on a mixed reformulation of the governing equation and utilizes the standard least-squares objective, with deep neural networks as ansatz functions to approximate the conductivity and flux simultaneously. We provide a thorough analysis of the deep neural network approximations of the conductivity for both continuous and empirical losses, including rigorous error estimates that are explicit in terms of the noise level, various penalty parameters and neural network architectural parameters (depth, width and parameter bound). We also provide multiple numerical experiments in two- and multi-dimensions to illustrate distinct features of the approach, e.g., excellent stability with respect to data noise and capability of solving high-dimensional problems.  ( 2 min )
    Risk-Sensitive Reinforcement Learning with Exponential Criteria. (arXiv:2212.09010v4 [eess.SY] UPDATED)
    While reinforcement learning has shown experimental success in a number of applications, it is known to be sensitive to noise and perturbations in the parameters of the system, leading to high variance in the total reward amongst different episodes in slightly different environments. To introduce robustness, as well as sample efficiency, risk-sensitive reinforcement learning methods are being thoroughly studied. In this work, we provide a definition of robust reinforcement learning policies and formulate a risk-sensitive reinforcement learning problem to approximate them, by solving an optimization problem with respect to a modified objective based on exponential criteria. In particular, we study a model-free risk-sensitive variation of the widely-used Monte Carlo Policy Gradient algorithm and introduce a novel risk-sensitive online Actor-Critic algorithm based on solving a multiplicative Bellman equation using stochastic approximation updates. Analytical results suggest that the use of exponential criteria generalizes commonly used ad-hoc regularization approaches, improves sample efficiency, and introduces robustness with respect to perturbations in the model parameters and the environment. The implementation, performance, and robustness properties of the proposed methods are evaluated in simulated experiments.  ( 2 min )
    Specious Sites: Tracking the Spread and Sway of Spurious News Stories at Scale. (arXiv:2308.02068v2 [cs.SI] UPDATED)
    Misinformation, propaganda, and outright lies proliferate on the web, with some narratives having dangerous real-world consequences on public health, elections, and individual safety. However, despite the impact of misinformation, the research community largely lacks automated and programmatic approaches for tracking news narratives across online platforms. In this work, utilizing daily scrapes of 1,334 unreliable news websites, the large-language model MPNet, and DP-Means clustering, we introduce a system to automatically identify and track the narratives spread within online ecosystems. Identifying 52,036 narratives on these 1,334 websites, we describe the most prevalent narratives spread in 2022 and identify the most influential websites that originate and amplify narratives. Finally, we show how our system can be utilized to detect new narratives originating from unreliable news websites and to aid fact-checkers in more quickly addressing misinformation. We release code and data at https://github.com/hanshanley/specious-sites.  ( 2 min )
    Finite Element Operator Network for Solving Parametric PDEs. (arXiv:2308.04690v2 [math.NA] UPDATED)
    Partial differential equations (PDEs) underlie our understanding and prediction of natural phenomena across numerous fields, including physics, engineering, and finance. However, solving parametric PDEs is a complex task that necessitates efficient numerical methods. In this paper, we propose a novel approach for solving parametric PDEs using a Finite Element Operator Network (FEONet). Our proposed method leverages the power of deep learning in conjunction with traditional numerical methods, specifically the finite element method, to solve parametric PDEs in the absence of any paired input-output training data. We performed various experiments on several benchmark problems and confirmed that our approach has demonstrated excellent performance across various settings and environments, proving its versatility in terms of accuracy, generalization, and computational flexibility. Our FEONet framework shows potential for application in various fields where PDEs play a crucial role in modeling complex domains with diverse boundary conditions and singular behavior. Furthermore, we provide theoretical convergence analysis to support our approach, utilizing finite element approximation in numerical analysis.  ( 2 min )
    Vertical Federated Alzheimer's Detection on Multimodal Data. (arXiv:2312.10237v2 [cs.LG] UPDATED)
    In the era of rapidly advancing medical technologies, the segmentation of medical data has become inevitable, necessitating the development of privacy preserving machine learning algorithms that can train on distributed data. Consolidating sensitive medical data is not always an option particularly due to the stringent privacy regulations imposed by the Health Insurance Portability and Accountability Act (HIPAA). In this paper, we introduce a HIPAA compliant framework that can train from distributed data. We then propose a multimodal vertical federated model for Alzheimer's Disease (AD) detection, a serious neurodegenerative condition that can cause dementia, severely impairing brain function and hindering simple tasks, especially without preventative care. This vertical federated model offers a distributed architecture that enables collaborative learning across diverse sources of medical data while respecting privacy constraints imposed by HIPAA. It is also able to leverage multiple modalities of data, enhancing the robustness and accuracy of AD detection. Our proposed model not only contributes to the advancement of federated learning techniques but also holds promise for overcoming the hurdles posed by data segmentation in medical research. By using vertical federated learning, this research strives to provide a framework that enables healthcare institutions to harness the collective intelligence embedded in their distributed datasets without compromising patient privacy.  ( 3 min )
    Transformer Network for Multi-Person Tracking and Re-Identification in Unconstrained Environment. (arXiv:2312.11929v1 [cs.CV])
    Multi-object tracking (MOT) has profound applications in a variety of fields, including surveillance, sports analytics, self-driving, and cooperative robotics. Despite considerable advancements, existing MOT methodologies tend to falter when faced with non-uniform movements, occlusions, and appearance-reappearance scenarios of the objects. Recognizing this inadequacy, we put forward an integrated MOT method that not only marries object detection and identity linkage within a singular, end-to-end trainable framework but also equips the model with the ability to maintain object identity links over long periods of time. Our proposed model, named STMMOT, is built around four key modules: 1) candidate proposal generation, which generates object proposals via a vision-transformer encoder-decoder architecture that detects the object from each frame in the video; 2) scale variant pyramid, a progressive pyramid structure to learn the self-scale and cross-scale similarities in multi-scale feature maps; 3) spatio-temporal memory encoder, extracting the essential information from the memory associated with each object under tracking; and 4) spatio-temporal memory decoder, simultaneously resolving the tasks of object detection and identity association for MOT. Our system leverages a robust spatio-temporal memory module that retains extensive historical observations and effectively encodes them using an attention-based aggregator. The uniqueness of STMMOT lies in representing objects as dynamic query embeddings that are updated continuously, which enables the prediction of object states with attention mechanisms and eradicates the need for post-processing.  ( 2 min )
    Relative Policy-Transition Optimization for Fast Policy Transfer. (arXiv:2206.06009v2 [cs.LG] UPDATED)
    We consider the problem of policy transfer between two Markov Decision Processes (MDPs). We introduce a lemma based on existing theoretical results in reinforcement learning to measure the relativity gap between two arbitrary MDPs, that is the difference between any two cumulative expected returns defined on different policies and environment dynamics. Based on this lemma, we propose two new algorithms referred to as Relative Policy Optimization (RPO) and Relative Transition Optimization (RTO), which offer fast policy transfer and dynamics modelling, respectively. RPO transfers the policy evaluated in one environment to maximize the return in another, while RTO updates the parameterized dynamics model to reduce the gap between the dynamics of the two environments. Integrating the two algorithms results in the complete Relative Policy-Transition Optimization (RPTO) algorithm, in which the policy interacts with the two environments simultaneously, such that data collections from two environments, policy and transition updates are completed in one closed loop to form a principled learning framework for policy transfer. We demonstrate the effectiveness of RPTO on a set of MuJoCo continuous control tasks by creating policy transfer problems via variant dynamics.  ( 2 min )
    JaxPruner: A concise library for sparsity research. (arXiv:2304.14082v3 [cs.LG] UPDATED)
    This paper introduces JaxPruner, an open-source JAX-based pruning and sparse training library for machine learning research. JaxPruner aims to accelerate research on sparse neural networks by providing concise implementations of popular pruning and sparse training algorithms with minimal memory and latency overhead. Algorithms implemented in JaxPruner use a common API and work seamlessly with the popular optimization library Optax, which, in turn, enables easy integration with existing JAX based libraries. We demonstrate this ease of integration by providing examples in four different codebases: Scenic, t5x, Dopamine and FedJAX and provide baseline experiments on popular benchmarks.  ( 2 min )
    A Simple and Practical Method for Reducing the Disparate Impact of Differential Privacy. (arXiv:2312.11712v1 [cs.CR])
    Differentially private (DP) mechanisms have been deployed in a variety of high-impact social settings (perhaps most notably by the U.S. Census). Since all DP mechanisms involve adding noise to results of statistical queries, they are expected to impact our ability to accurately analyze and learn from data, in effect trading off privacy with utility. Alarmingly, the impact of DP on utility can vary significantly among different sub-populations. A simple way to reduce this disparity is with stratification. First compute an independent private estimate for each group in the data set (which may be the intersection of several protected classes), then, to compute estimates of global statistics, appropriately recombine these group estimates. Our main observation is that naive stratification often yields high-accuracy estimates of population-level statistics, without the need for additional privacy budget. We support this observation theoretically and empirically. Our theoretical results center on the private mean estimation problem, while our empirical results center on extensive experiments on private data synthesis to demonstrate the effectiveness of stratification on a variety of private mechanisms. Overall, we argue that this straightforward approach provides a strong baseline against which future work on reducing utility disparities of DP mechanisms should be compared.  ( 2 min )
    Efficient Failure Pattern Identification of Predictive Algorithms. (arXiv:2306.00760v1 [cs.LG] CROSS LISTED)
    Given a (machine learning) classifier and a collection of unlabeled data, how can we efficiently identify misclassification patterns presented in this dataset? To address this problem, we propose a human-machine collaborative framework that consists of a team of human annotators and a sequential recommendation algorithm. The recommendation algorithm is conceptualized as a stochastic sampler that, in each round, queries the annotators a subset of samples for their true labels and obtains the feedback information on whether the samples are misclassified. The sampling mechanism needs to balance between discovering new patterns of misclassification (exploration) and confirming the potential patterns of classification (exploitation). We construct a determinantal point process, whose intensity balances the exploration-exploitation trade-off through the weighted update of the posterior at each round to form the generator of the stochastic sampler. The numerical results empirically demonstrate the competitive performance of our framework on multiple datasets at various signal-to-noise ratios.  ( 2 min )
    Adaptive Smooth Activation for Improved Disease Diagnosis and Organ Segmentation from Radiology Scans. (arXiv:2312.11480v1 [cs.NE])
    In this study, we propose a new activation function, called Adaptive Smooth Activation Unit (ASAU), tailored for optimized gradient propagation, thereby enhancing the proficiency of convolutional networks in medical image analysis. We apply this new activation function to two important and commonly used general tasks in medical image analysis: automatic disease diagnosis and organ segmentation in CT and MRI. Our rigorous evaluation on the RadImageNet abdominal/pelvis (CT and MRI) dataset and Liver Tumor Segmentation Benchmark (LiTS) 2017 demonstrates that our ASAU-integrated frameworks not only achieve a substantial (4.80\%) improvement over ReLU in classification accuracy (disease detection) on abdominal CT and MRI but also achieves 1\%-3\% improvement in dice coefficient compared to widely used activations for `healthy liver tissue' segmentation. These improvements offer new baselines for developing a diagnostic tool, particularly for complex, challenging pathologies. The superior performance and adaptability of ASAU highlight its potential for integration into a wide range of image classification and segmentation tasks.  ( 2 min )
    Neural Network Approximation for Pessimistic Offline Reinforcement Learning. (arXiv:2312.11863v1 [cs.LG])
    Deep reinforcement learning (RL) has shown remarkable success in specific offline decision-making scenarios, yet its theoretical guarantees are still under development. Existing works on offline RL theory primarily emphasize a few trivial settings, such as linear MDP or general function approximation with strong assumptions and independent data, which lack guidance for practical use. The coupling of deep learning and Bellman residuals makes this problem challenging, in addition to the difficulty of data dependence. In this paper, we establish a non-asymptotic estimation error of pessimistic offline RL using general neural network approximation with $\mathcal{C}$-mixing data regarding the structure of networks, the dimension of datasets, and the concentrability of data coverage, under mild assumptions. Our result shows that the estimation error consists of two parts: the first converges to zero at a desired rate on the sample size with partially controllable concentrability, and the second becomes negligible if the residual constraint is tight. This result demonstrates the explicit efficiency of deep adversarial offline RL frameworks. We utilize the empirical process tool for $\mathcal{C}$-mixing sequences and the neural network approximation theory for the H\"{o}lder class to achieve this. We also develop methods to bound the Bellman estimation error caused by function approximation with empirical Bellman constraint perturbations. Additionally, we present a result that lessens the curse of dimensionality using data with low intrinsic dimensionality and function classes with low complexity. Our estimation provides valuable insights into the development of deep offline RL and guidance for algorithm model design.  ( 3 min )
    KGLens: A Parameterized Knowledge Graph Solution to Assess What an LLM Does and Doesn't Know. (arXiv:2312.11539v1 [cs.AI])
    Current approaches to evaluating large language models (LLMs) with pre-existing Knowledge Graphs (KG) mostly ignore the structure of the KG and make arbitrary choices of which part of the graph to evaluate. In this paper, we introduce KGLens, a method to evaluate LLMs by generating natural language questions from a KG in a structure aware manner so that we can characterize its performance on a more aggregated level. KGLens uses a parameterized KG, where each edge is augmented with a beta distribution that guides how to sample edges from the KG for QA testing. As the evaluation proceeds, different edges of the parameterized KG are sampled and assessed appropriately, converging to a more global picture of the performance of the LLMs on the KG as a whole. In our experiments, we construct three domain-specific KGs for knowledge assessment, comprising over 19,000 edges, 700 relations, and 21,000 entities. The results demonstrate that KGLens can not only assess overall performance but also provide topic, temporal, and relation analyses of LLMs. This showcases the adaptability and customizability of KGLens, emphasizing its ability to focus the evaluation based on specific criteria.  ( 2 min )
    Bayesian Methods for Media Mix Modelling with shape and funnel effects. (arXiv:2311.05587v4 [cs.LG] UPDATED)
    In recent years, significant progress in generative AI has highlighted the important role of physics-inspired models that utilize advanced mathematical concepts based on fundamental physics principles to enhance artificial intelligence capabilities. Among these models, those based on diffusion equations have greatly improved image quality. This study aims to explore the potential uses of Maxwell-Boltzmann equation, which forms the basis of the kinetic theory of gases, and the Michaelis-Menten model in Marketing Mix Modelling (MMM) applications. We propose incorporating these equations into Hierarchical Bayesian models to analyse consumer behaviour in the context of advertising. These equation sets excel in accurately describing the random dynamics in complex systems like social interactions and consumer-advertising interactions.  ( 2 min )
    Identifying Label Errors in Object Detection Datasets by Loss Inspection. (arXiv:2303.06999v3 [cs.CV] UPDATED)
    Labeling datasets for supervised object detection is a dull and time-consuming task. Errors can be easily introduced during annotation and overlooked during review, yielding inaccurate benchmarks and performance degradation of deep neural networks trained on noisy labels. In this work, we for the first time introduce a benchmark for label error detection methods on object detection datasets as well as a label error detection method and a number of baselines. We simulate four different types of randomly introduced label errors on train and test sets of well-labeled object detection datasets. For our label error detection method we assume a two-stage object detector to be given and consider the sum of both stages' classification and regression losses. The losses are computed with respect to the predictions and the noisy labels including simulated label errors, aiming at detecting the latter. We compare our method to three baselines: a naive one without deep learning, the object detector's score and the entropy of the classification softmax distribution. We outperform all baselines and demonstrate that among the considered methods, ours is the only one that detects label errors of all four types efficiently. Furthermore, we detect real label errors a) on commonly used test datasets in object detection and b) on a proprietary dataset. In both cases we achieve low false positives rates, i.e., we detect label errors with a precision for a) of up to 71.5% and for b) with 97%.  ( 3 min )
    AI-Based Energy Transportation Safety: Pipeline Radial Threat Estimation Using Intelligent Sensing System. (arXiv:2312.11583v1 [cs.LG])
    The application of artificial intelligence technology has greatly enhanced and fortified the safety of energy pipelines, particularly in safeguarding against external threats. The predominant methods involve the integration of intelligent sensors to detect external vibration, enabling the identification of event types and locations, thereby replacing manual detection methods. However, practical implementation has exposed a limitation in current methods - their constrained ability to accurately discern the spatial dimensions of external signals, which complicates the authentication of threat events. Our research endeavors to overcome the above issues by harnessing deep learning techniques to achieve a more fine-grained recognition and localization process. This refinement is crucial in effectively identifying genuine threats to pipelines, thus enhancing the safety of energy transportation. This paper proposes a radial threat estimation method for energy pipelines based on distributed optical fiber sensing technology. Specifically, we introduce a continuous multi-view and multi-domain feature fusion methodology to extract comprehensive signal features and construct a threat estimation and recognition network. The utilization of collected acoustic signal data is optimized, and the underlying principle is elucidated. Moreover, we incorporate the concept of transfer learning through a pre-trained model, enhancing both recognition accuracy and training efficiency. Empirical evidence gathered from real-world scenarios underscores the efficacy of our method, notably in its substantial reduction of false alarms and remarkable gains in recognition accuracy. More generally, our method exhibits versatility and can be extrapolated to a broader spectrum of recognition tasks and scenarios.  ( 3 min )
    Fast Neural Network Inference on FPGAs for Triggering on Long-Lived Particles at Colliders. (arXiv:2307.05152v2 [hep-ex] UPDATED)
    Experimental particle physics demands a sophisticated trigger and acquisition system capable to efficiently retain the collisions of interest for further investigation. Heterogeneous computing with the employment of FPGA cards may emerge as a trending technology for the triggering strategy of the upcoming high-luminosity program of the Large Hadron Collider at CERN. In this context, we present two machine-learning algorithms for selecting events where neutral long-lived particles decay within the detector volume studying their accuracy and inference time when accelerated on commercially available Xilinx FPGA accelerator cards. The inference time is also confronted with a CPU- and GPU-based hardware setup. The proposed new algorithms are proven efficient for the considered benchmark physics scenario and their accuracy is found to not degrade when accelerated on the FPGA cards. The results indicate that all tested architectures fit within the latency requirements of a second-level trigger farm and that exploiting accelerator technologies for real-time processing of particle-physics collisions is a promising research field that deserves additional investigations, in particular with machine-learning models with a large number of trainable parameters.  ( 3 min )
    Automatic Parameter Selection for Non-Redundant Clustering. (arXiv:2312.11952v1 [cs.LG])
    High-dimensional datasets often contain multiple meaningful clusterings in different subspaces. For example, objects can be clustered either by color, weight, or size, revealing different interpretations of the given dataset. A variety of approaches are able to identify such non-redundant clusterings. However, most of these methods require the user to specify the expected number of subspaces and clusters for each subspace. Stating these values is a non-trivial problem and usually requires detailed knowledge of the input dataset. In this paper, we propose a framework that utilizes the Minimum Description Length Principle (MDL) to detect the number of subspaces and clusters per subspace automatically. We describe an efficient procedure that greedily searches the parameter space by splitting and merging subspaces and clusters within subspaces. Additionally, an encoding strategy is introduced that allows us to detect outliers in each subspace. Extensive experiments show that our approach is highly competitive to state-of-the-art methods.  ( 2 min )
    Hierarchical Autoregressive Modeling for Neural Video Compression. (arXiv:2010.10258v3 [eess.IV] UPDATED)
    Recent work by Marino et al. (2020) showed improved performance in sequential density estimation by combining masked autoregressive flows with hierarchical latent variable models. We draw a connection between such autoregressive generative models and the task of lossy video compression. Specifically, we view recent neural video compression methods (Lu et al., 2019; Yang et al., 2020b; Agustssonet al., 2020) as instances of a generalized stochastic temporal autoregressive transform, and propose avenues for enhancement based on this insight. Comprehensive evaluations on large-scale video data show improved rate-distortion performance over both state-of-the-art neural and conventional video compression methods.  ( 2 min )
    An Adaptive Placement and Parallelism Framework for Accelerating RLHF Training. (arXiv:2312.11819v1 [cs.LG])
    Recently, ChatGPT or InstructGPT like large language models (LLM) has made a significant impact in the AI world. These models are incredibly versatile, capable of performing language tasks on par or even exceeding the capabilities of human experts. Many works have attempted to reproduce the complex InstructGPT's RLHF (Reinforcement Learning with Human Feedback) training pipeline. However, the mainstream distributed RLHF training methods typically adopt a fixed model placement strategy, referred to as the Flattening strategy. This strategy treats all four models involved in RLHF as a single entity and places them on all devices, regardless of their differences. Unfortunately, this strategy exacerbates the generation bottlenecks in the RLHF training and degrades the overall training efficiency. To address these issues, we propose an adaptive model placement framework that offers two flexible model placement strategies. These strategies allow for the agile allocation of models across devices in a fine-grained manner. The Interleaving strategy helps reduce memory redundancy and communication costs during RLHF training. On the other hand, the Separation strategy improves the throughput of model training by separating the training and generation stages of the RLHF pipeline. Notably, this framework seamlessly integrates with other mainstream techniques for acceleration and enables automatic hyperparameter search. Extensive experiments have demonstrated that our Interleaving and Separation strategies can achieve notable improvements up to 11x, compared to the current state-of-the-art (SOTA) approaches. These experiments encompassed a wide range of training scenarios, involving models of varying sizes and devices of different scales. The results highlight the effectiveness and superiority of our approaches in accelerating the training of distributed RLHF.  ( 3 min )
    Unified framework for diffusion generative models in SO(3): applications in computer vision and astrophysics. (arXiv:2312.11707v1 [cs.LG])
    Diffusion-based generative models represent the current state-of-the-art for image generation. However, standard diffusion models are based on Euclidean geometry and do not translate directly to manifold-valued data. In this work, we develop extensions of both score-based generative models (SGMs) and Denoising Diffusion Probabilistic Models (DDPMs) to the Lie group of 3D rotations, SO(3). SO(3) is of particular interest in many disciplines such as robotics, biochemistry and astronomy/cosmology science. Contrary to more general Riemannian manifolds, SO(3) admits a tractable solution to heat diffusion, and allows us to implement efficient training of diffusion models. We apply both SO(3) DDPMs and SGMs to synthetic densities on SO(3) and demonstrate state-of-the-art results. Additionally, we demonstrate the practicality of our model on pose estimation tasks and in predicting correlated galaxy orientations for astrophysics/cosmology.  ( 2 min )
    Machine-Made Media: Monitoring the Mobilization of Machine-Generated Articles on Misinformation and Mainstream News Websites. (arXiv:2305.09820v3 [cs.CY] UPDATED)
    As large language models (LLMs) like ChatGPT have gained traction, an increasing number of news websites have begun utilizing them to generate articles. However, not only can these language models produce factually inaccurate articles on reputable websites but disreputable news sites can utilize LLMs to mass produce misinformation. To begin to understand this phenomenon, we present one of the first large-scale studies of the prevalence of synthetic articles within online news media. To do this, we train a DeBERTa-based synthetic news detector and classify over 15.90 million articles from 3,074 misinformation and mainstream news websites. We find that between January 1, 2022, and May 1, 2023, the relative number of synthetic news articles increased by 55.4% on mainstream websites while increasing by 457% on misinformation sites. We find that this increase is largely driven by smaller less popular websites. Analyzing the impact of the release of ChatGPT using an interrupted-time-series, we show that while its release resulted in a marked increase in synthetic articles on small sites as well as misinformation news websites, there was not a corresponding increase on large mainstream news websites.  ( 3 min )
    UFDA: Universal Federated Domain Adaptation with Practical Assumptions. (arXiv:2311.15570v2 [cs.LG] UPDATED)
    Conventional Federated Domain Adaptation (FDA) approaches usually demand an abundance of assumptions, which makes them significantly less feasible for real-world situations and introduces security hazards. This paper relaxes the assumptions from previous FDAs and studies a more practical scenario named Universal Federated Domain Adaptation (UFDA). It only requires the black-box model and the label set information of each source domain, while the label sets of different source domains could be inconsistent, and the target-domain label set is totally blind. Towards a more effective solution for our newly proposed UFDA scenario, we propose a corresponding methodology called Hot-Learning with Contrastive Label Disambiguation (HCLD). It particularly tackles UFDA's domain shifts and category gaps problems by using one-hot outputs from the black-box models of various source domains. Moreover, to better distinguish the shared and unknown classes, we further present a cluster-level strategy named Mutual-Voting Decision (MVD) to extract robust consensus knowledge across peer classes from both source and target domains. Extensive experiments on three benchmark datasets demonstrate that our method achieves comparable performance for our UFDA scenario with much fewer assumptions, compared to previous methodologies with comprehensive additional assumptions.  ( 2 min )
    Fractional Deep Reinforcement Learning for Age-Minimal Mobile Edge Computing. (arXiv:2312.10418v2 [cs.LG] UPDATED)
    Mobile edge computing (MEC) is a promising paradigm for real-time applications with intensive computational needs (e.g., autonomous driving), as it can reduce the processing delay. In this work, we focus on the timeliness of computational-intensive updates, measured by Age-ofInformation (AoI), and study how to jointly optimize the task updating and offloading policies for AoI with fractional form. Specifically, we consider edge load dynamics and formulate a task scheduling problem to minimize the expected time-average AoI. The uncertain edge load dynamics, the nature of the fractional objective, and hybrid continuous-discrete action space (due to the joint optimization) make this problem challenging and existing approaches not directly applicable. To this end, we propose a fractional reinforcement learning(RL) framework and prove its convergence. We further design a model-free fractional deep RL (DRL) algorithm, where each device makes scheduling decisions with the hybrid action space without knowing the system dynamics and decisions of other devices. Experimental results show that our proposed algorithms reduce the average AoI by up to 57.6% compared with several non-fractional benchmarks.  ( 2 min )
    Finding Nash equilibria by minimizing approximate exploitability with learned best responses. (arXiv:2301.08830v2 [cs.GT] UPDATED)
    There has been substantial progress on finding game-theoretic equilibria. Most of that work has focused on games with finite, discrete action spaces. However, many games involving space, time, money, and other fine-grained quantities have continuous action spaces (or are best modeled as such). We study the problem of finding an approximate Nash equilibrium of games with continuous action sets. The standard measure of closeness to Nash equilibrium is exploitability, which measures how much players can benefit from unilaterally changing their strategy. We propose two new methods that minimize an approximation of the exploitability with respect to the strategy profile. The first method uses a learned best-response function, which takes the current strategy profile as input and returns candidate best responses for each player. The strategy profile and best-response functions are trained simultaneously, with the former trying to minimize exploitability while the latter tries to maximize it. The second method maintains an ensemble of candidate best responses for each player. In each iteration, the best-performing elements of each ensemble are used to update the current strategy profile. The strategy profile and best-response ensembles are simultaneously trained to minimize and maximize the approximate exploitability, respectively. We evaluate our methods on various continuous games, showing that they outperform prior methods.  ( 3 min )
    SEPT: Towards Efficient Scene Representation Learning for Motion Prediction. (arXiv:2309.15289v4 [cs.CV] UPDATED)
    Motion prediction is crucial for autonomous vehicles to operate safely in complex traffic environments. Extracting effective spatiotemporal relationships among traffic elements is key to accurate forecasting. Inspired by the successful practice of pretrained large language models, this paper presents SEPT, a modeling framework that leverages self-supervised learning to develop powerful spatiotemporal understanding for complex traffic scenes. Specifically, our approach involves three masking-reconstruction modeling tasks on scene inputs including agents' trajectories and road network, pretraining the scene encoder to capture kinematics within trajectory, spatial structure of road network, and interactions among roads and agents. The pretrained encoder is then finetuned on the downstream forecasting task. Extensive experiments demonstrate that SEPT, without elaborate architectural design or manual feature engineering, achieves state-of-the-art performance on the Argoverse 1 and Argoverse 2 motion forecasting benchmarks, outperforming previous methods on all main metrics by a large margin.  ( 2 min )
    EncryIP: A Practical Encryption-Based Framework for Model Intellectual Property Protection. (arXiv:2312.12049v1 [cs.CR])
    In the rapidly growing digital economy, protecting intellectual property (IP) associated with digital products has become increasingly important. Within this context, machine learning (ML) models, being highly valuable digital assets, have gained significant attention for IP protection. This paper introduces a practical encryption-based framework called \textit{EncryIP}, which seamlessly integrates a public-key encryption scheme into the model learning process. This approach enables the protected model to generate randomized and confused labels, ensuring that only individuals with accurate secret keys, signifying authorized users, can decrypt and reveal authentic labels. Importantly, the proposed framework not only facilitates the protected model to multiple authorized users without requiring repetitive training of the original ML model with IP protection methods but also maintains the model's performance without compromising its accuracy. Compared to existing methods like watermark-based, trigger-based, and passport-based approaches, \textit{EncryIP} demonstrates superior effectiveness in both training protected models and efficiently detecting the unauthorized spread of ML models.  ( 2 min )
    Neural Fuzzy Extractors: A Secure Way to Use Artificial Neural Networks for Biometric User Authentication. (arXiv:2003.08433v2 [cs.CR] UPDATED)
    Powered by new advances in sensor development and artificial intelligence, the decreasing cost of computation, and the pervasiveness of handheld computation devices, biometric user authentication (and identification) is rapidly becoming ubiquitous. Modern approaches to biometric authentication, based on sophisticated machine learning techniques, cannot avoid storing either trained-classifier details or explicit user biometric data, thus exposing users' credentials to falsification. In this paper, we introduce a secure way to handle user-specific information involved with the use of vector-space classifiers or artificial neural networks for biometric authentication. Our proposed architecture, called a Neural Fuzzy Extractor (NFE), allows the coupling of pre-existing classifiers with fuzzy extractors, through a artificial-neural-network-based buffer called an expander, with minimal or no performance degradation. The NFE thus offers all the performance advantages of modern deep-learning-based classifiers, and all the security of standard fuzzy extractors. We demonstrate the NFE retrofit to a classic artificial neural network for a simple scenario of fingerprint-based user authentication.  ( 3 min )
    Robust Stochastic Graph Generator for Counterfactual Explanations. (arXiv:2312.11747v1 [cs.LG])
    Counterfactual Explanation (CE) techniques have garnered attention as a means to provide insights to the users engaging with AI systems. While extensively researched in domains such as medical imaging and autonomous vehicles, Graph Counterfactual Explanation (GCE) methods have been comparatively under-explored. GCEs generate a new graph similar to the original one, with a different outcome grounded on the underlying predictive model. Among these GCE techniques, those rooted in generative mechanisms have received relatively limited investigation despite demonstrating impressive accomplishments in other domains, such as artistic styles and natural language modelling. The preference for generative explainers stems from their capacity to generate counterfactual instances during inference, leveraging autonomously acquired perturbations of the input graph. Motivated by the rationales above, our study introduces RSGG-CE, a novel Robust Stochastic Graph Generator for Counterfactual Explanations able to produce counterfactual examples from the learned latent space considering a partially ordered generation sequence. Furthermore, we undertake quantitative and qualitative analyses to compare RSGG-CE's performance against SoA generative explainers, highlighting its increased ability to engendering plausible counterfactual candidates.  ( 2 min )
    Rapid Artefact Removal and H&E-Stained Tissue Segmentation. (arXiv:2308.13304v2 [eess.IV] UPDATED)
    We present an innovative method for rapidly segmenting hematoxylin and eosin (H&E)-stained tissue in whole-slide images (WSIs) that eliminates a wide range of undesirable artefacts such as pen marks and scanning artefacts. Our method involves taking a single-channel representation of a lowmagnification RGB overview of the WSI in which the pixel values are bimodally distributed such that H&E-stained tissue is easily distinguished from both background and a wide variety of artefacts. We demonstrate our method on 30 WSIs prepared from a wide range of institutions and WSI digital scanners, each containing substantial artefacts, and compare it to segmentations provided by Otsu thresholding and Histolab tissue segmentation and pen filtering tools. We found that our method segmented the tissue and fully removed all artefacts in 29 out of 30 WSIs, whereas Otsu thresholding failed to remove any artefacts, and the Histolab pen filtering tools only partially removed the pen marks. The beauty of our approach lies in its simplicity: manipulating RGB colour space and using Otsu thresholding allows for the segmentation of H&E-stained tissue and the rapid removal of artefacts without the need for machine learning or parameter tuning.  ( 3 min )
    Initializing Services in Interactive ML Systems for Diverse Users. (arXiv:2312.11846v1 [cs.LG])
    This paper studies ML systems that interactively learn from users across multiple subpopulations with heterogeneous data distributions. The primary objective is to provide specialized services for different user groups while also predicting user preferences. Once the users select a service based on how well the service anticipated their preference, the services subsequently adapt and refine themselves based on the user data they accumulate, resulting in an iterative, alternating minimization process between users and services (learning dynamics). Employing such tailored approaches has two main challenges: (i) Unknown user preferences: Typically, data on user preferences are unavailable without interaction, and uniform data collection across a large and diverse user base can be prohibitively expensive. (ii) Suboptimal Local Solutions: The total loss (sum of loss functions across all users and all services) landscape is not convex even if the individual losses on a single service are convex, making it likely for the learning dynamics to get stuck in local minima. The final outcome of the aforementioned learning dynamics is thus strongly influenced by the initial set of services offered to users, and is not guaranteed to be close to the globally optimal outcome. In this work, we propose a randomized algorithm to adaptively select very few users to collect preference data from, while simultaneously initializing a set of services. We prove that under mild assumptions on the loss functions, the expected total loss achieved by the algorithm right after initialization is within a factor of the globally optimal total loss with complete user preference data, and this factor scales only logarithmically in the number of services. Our theory is complemented by experiments on real as well as semi-synthetic datasets.  ( 3 min )
    Who Reviews The Reviewers? A Multi-Level Jury Problem. (arXiv:2211.08494v2 [cs.LG] UPDATED)
    We consider the problem of determining a binary ground truth using advice from a group of independent reviewers (experts) who express their guess about a ground truth correctly with some independent probability (competence). In this setting, when all reviewers are competent (competence greater than one-half), the Condorcet Jury Theorem tells us that adding more reviewers increases the overall accuracy, and if all competences are known, then there exists an optimal weighting of the reviewers. However, in practical settings, reviewers may be noisy or incompetent, i.e., competence below half, and the number of experts may be small, so the asymptotic Condorcet Jury Theorem is not practically relevant. In such cases we explore appointing one or more chairs (judges) who determine the weight of each reviewer for aggregation, creating multiple levels. However, these chairs may be unable to correctly identify the competence of the reviewers they oversee, and therefore unable to compute the optimal weighting. We give conditions when a set of chairs is able to weight the reviewers optimally, and depending on the competence distribution of the agents, give results about when it is better to have more chairs or more reviewers. Through numerical simulations we show that in some cases it is better to have more chairs, but in many cases it is better to have more reviewers.  ( 3 min )
    Generalizing Adam to Manifolds for Efficiently Training Transformers. (arXiv:2305.16901v2 [cs.LG] UPDATED)
    One of the primary reasons behind the success of neural networks has been the emergence of an array of new, highly-successful optimizers, perhaps most importantly the Adam optimizer. It is wiedely used for training neural networks, yet notoriously hard to interpret. Lacking a clear physical intuition, Adam is difficult to generalize to manifolds. Some attempts have been made to directly apply parts of the Adam algorithm to manifolds or to find an underlying structure, but a full generalization has remained elusive. In this work a new approach is presented that leverages the special structure of the manifolds which are relevant for optimization of neural networks, such as the Stiefel manifold, the symplectic Stiefel manifold, the Grassmann manifold and the symplectic Grassmann manifold: all of these are homogeneous spaces and as such admit a global tangent space representation. This global tangent space representation is used to perform all of the steps in the Adam optimizer. The resulting algorithm is then applied to train a transformer for which orthogonality constraints are enforced up to machine precision and we observe significant speed-ups in the training process. Optimization of neural networks where they weights do not lie on a manifold is identified as a special case of the presented framkework. This allows for a flexible implementation in which the learning rate is adapted simultaneously for all parameters, irrespective of whether they are an element of a general manifold or a vector space.  ( 3 min )
    ICML 2023 Topological Deep Learning Challenge : Design and Results. (arXiv:2309.15188v3 [cs.LG] UPDATED)
    This paper presents the computational challenge on topological deep learning that was hosted within the ICML 2023 Workshop on Topology and Geometry in Machine Learning. The competition asked participants to provide open-source implementations of topological neural networks from the literature by contributing to the python packages TopoNetX (data processing) and TopoModelX (deep learning). The challenge attracted twenty-eight qualifying submissions in its two-month duration. This paper describes the design of the challenge and summarizes its main findings.  ( 2 min )
    Hire When You Need to: Gradual Participant Recruitment for Auction-based Federated Learning. (arXiv:2310.02651v2 [cs.LG] UPDATED)
    The success of Federated Learning (FL) depends on the quantity and quality of the data owners (DOs) as well as their motivation to join FL model training. Reputation-based FL participant selection methods have been proposed. However, they still face the challenges of the cold start problem and potential selection bias towards highly reputable DOs. Such a bias can result in lower reputation DOs being prematurely excluded from future FL training rounds, thereby reducing the diversity of training data and the generalizability of the resulting models. To address these challenges, we propose the Gradual Participant Selection scheme for Auction-based Federated Learning (GPS-AFL). Unlike existing AFL incentive mechanisms which generally assume that all DOs required for an FL task must be selected in one go, GPS-AFL gradually selects the required DOs over multiple rounds of training as more information is revealed through repeated interactions. It is designed to strike a balance between cost saving and performance enhancement, while mitigating the drawbacks of selection bias in reputation-based FL. Extensive experiments based on real-world datasets demonstrate the significant advantages of GPS-AFL, which reduces costs by 33.65% and improved total utility by 2.91%, on average compared to the best-performing state-of-the-art approach.  ( 3 min )
    Mind the Gap: Federated Learning Broadens Domain Generalization in Diagnostic AI Models. (arXiv:2310.00757v2 [cs.CV] UPDATED)
    Developing robust artificial intelligence (AI) models that generalize well to unseen datasets is challenging and usually requires large and variable datasets, preferably from multiple institutions. In federated learning (FL), a model is trained collaboratively at numerous sites that hold local datasets without exchanging them. So far, the impact of training strategy, i.e., local versus collaborative, on the diagnostic on-domain and off-domain performance of AI models interpreting chest radiographs has not been assessed. Consequently, using 610,000 chest radiographs from five institutions across the globe, we assessed diagnostic performance as a function of training strategy (i.e., local vs. collaborative), network architecture (i.e., convolutional vs. transformer-based), generalization performance (i.e., on-domain vs. off-domain), imaging finding (i.e., cardiomegaly, pleural effusion, pneumonia, atelectasis, consolidation, pneumothorax, and no abnormality), dataset size (i.e., from n=18,000 to 213,921 radiographs), and dataset diversity. Large datasets not only showed minimal performance gains with FL but, in some instances, even exhibited decreases. In contrast, smaller datasets revealed marked improvements. Thus, on-domain performance was mainly driven by training data size. However, off-domain performance leaned more on training diversity. When trained collaboratively across diverse external institutions, AI models consistently surpassed models trained locally for off-domain tasks, emphasizing FL's potential in leveraging data diversity. In conclusion, FL can bolster diagnostic privacy, reproducibility, and off-domain reliability of AI models and, potentially, optimize healthcare outcomes.  ( 3 min )
    The Validity of a Machine Learning-Based Video Game in the Objective Screening of Attention Deficit Hyperactivity Disorder in Children Aged 5 to 12 Years. (arXiv:2312.11832v1 [cs.LG])
    Objective: Early identification of ADHD is necessary to provide the opportunity for timely treatment. However, screening the symptoms of ADHD on a large scale is not easy. This study aimed to validate a video game (FishFinder) for the screening of ADHD using objective measurement of the core symptoms of this disorder. Method: The FishFinder measures attention and impulsivity through in-game performance and evaluates the child's hyperactivity using smartphone motion sensors. This game was tested on 26 children with ADHD and 26 healthy children aged 5 to 12 years. A Support Vector Machine was employed to detect children with ADHD. results: This system showed 92.3% accuracy, 90% sensitivity, and 93.7% specificity using a combination of in-game and movement features. Conclusions: The FishFinder demonstrated a strong ability to identify ADHD in children. So, this game can be used as an affordable, accessible, and enjoyable method for the objective screening of ADHD.  ( 2 min )
    Counter-Empirical Attacking based on Adversarial Reinforcement Learning for Time-Relevant Scoring System. (arXiv:2311.05144v2 [cs.LG] UPDATED)
    Scoring systems are commonly seen for platforms in the era of big data. From credit scoring systems in financial services to membership scores in E-commerce shopping platforms, platform managers use such systems to guide users towards the encouraged activity pattern, and manage resources more effectively and more efficiently thereby. To establish such scoring systems, several "empirical criteria" are firstly determined, followed by dedicated top-down design for each factor of the score, which usually requires enormous effort to adjust and tune the scoring function in the new application scenario. What's worse, many fresh projects usually have no ground-truth or any experience to evaluate a reasonable scoring system, making the designing even harder. To reduce the effort of manual adjustment of the scoring function in every new scoring system, we innovatively study the scoring system from the preset empirical criteria without any ground truth, and propose a novel framework to improve the system from scratch. In this paper, we propose a "counter-empirical attacking" mechanism that can generate "attacking" behavior traces and try to break the empirical rules of the scoring system. Then an adversarial "enhancer" is applied to evaluate the scoring system and find the improvement strategy. By training the adversarial learning problem, a proper scoring function can be learned to be robust to the attacking activity traces that are trying to violate the empirical criteria. Extensive experiments have been conducted on two scoring systems including a shared computing resource platform and a financial credit system. The experimental results have validated the effectiveness of our proposed framework.  ( 3 min )
    Pseudo Contrastive Learning for Graph-based Semi-supervised Learning. (arXiv:2302.09532v3 [cs.LG] UPDATED)
    Pseudo Labeling is a technique used to improve the performance of semi-supervised Graph Neural Networks (GNNs) by generating additional pseudo-labels based on confident predictions. However, the quality of generated pseudo-labels has been a longstanding concern due to the sensitivity of the classification objective with respect to the given labels. To avoid the untrustworthy classification supervision indicating ``a node belongs to a specific class,'' we favor the fault-tolerant contrasting supervision demonstrating ``two nodes do not belong to the same class.'' Thus, the problem of generating high-quality pseudo-labels is then transformed into a relaxed version, i.e., identifying reliable negative pairs. To achieve this, we propose a general framework for GNNs, termed Pseudo Contrastive Learning (PCL). It separates two nodes whose positive and negative pseudo-labels target the same class. To incorporate topological knowledge into learning, we devise a topologically weighted contrastive loss that spends more effort separating negative pairs with smaller topological distances. Experimentally, we apply PCL to various GNNs, which consistently outperform their counterparts using other popular general techniques on five real-world graphs.  ( 2 min )
    Short-Term Multi-Horizon Line Loss Rate Forecasting of a Distribution Network Using Attention-GCN-LSTM. (arXiv:2312.11898v1 [cs.LG])
    Accurately predicting line loss rates is vital for effective line loss management in distribution networks, especially over short-term multi-horizons ranging from one hour to one week. In this study, we propose Attention-GCN-LSTM, a novel method that combines Graph Convolutional Networks (GCN), Long Short-Term Memory (LSTM), and a three-level attention mechanism to address this challenge. By capturing spatial and temporal dependencies, our model enables accurate forecasting of line loss rates across multiple horizons. Through comprehensive evaluation using real-world data from 10KV feeders, our Attention-GCN-LSTM model consistently outperforms existing algorithms, exhibiting superior performance in terms of prediction accuracy and multi-horizon forecasting. This model holds significant promise for enhancing line loss management in distribution networks.  ( 2 min )
    CaRe-CNN: Cascading Refinement CNN for Myocardial Infarct Segmentation with Microvascular Obstructions. (arXiv:2312.11315v2 [cs.CV] UPDATED)
    Late gadolinium enhanced (LGE) magnetic resonance (MR) imaging is widely established to assess the viability of myocardial tissue of patients after acute myocardial infarction (MI). We propose the Cascading Refinement CNN (CaRe-CNN), which is a fully 3D, end-to-end trained, 3-stage CNN cascade that exploits the hierarchical structure of such labeled cardiac data. Throughout the three stages of the cascade, the label definition changes and CaRe-CNN learns to gradually refine its intermediate predictions accordingly. Furthermore, to obtain more consistent qualitative predictions, we propose a series of post-processing steps that take anatomical constraints into account. Our CaRe-CNN was submitted to the FIMH 2023 MYOSAIQ challenge, where it ranked second out of 18 participating teams. CaRe-CNN showed great improvements most notably when segmenting the difficult but clinically most relevant myocardial infarct tissue (MIT) as well as microvascular obstructions (MVO). When computing the average scores over all labels, our method obtained the best score in eight out of ten metrics. Thus, accurate cardiac segmentation after acute MI via our CaRe-CNN allows generating patient-specific models of the heart serving as an important step towards personalized medicine.  ( 2 min )
    QuadAttack: A Quadratic Programming Approach to Ordered Top-K Attacks. (arXiv:2312.11510v1 [cs.CR])
    The adversarial vulnerability of Deep Neural Networks (DNNs) has been well-known and widely concerned, often under the context of learning top-$1$ attacks (e.g., fooling a DNN to classify a cat image as dog). This paper shows that the concern is much more serious by learning significantly more aggressive ordered top-$K$ clear-box~\footnote{ This is often referred to as white/black-box attacks in the literature. We choose to adopt neutral terminology, clear/opaque-box attacks in this paper, and omit the prefix clear-box for simplicity.} targeted attacks proposed in Adversarial Distillation. We propose a novel and rigorous quadratic programming (QP) method of learning ordered top-$K$ attacks with low computing cost, dubbed as \textbf{QuadAttac$K$}. Our QuadAttac$K$ directly solves the QP to satisfy the attack constraint in the feature embedding space (i.e., the input space to the final linear classifier), which thus exploits the semantics of the feature embedding space (i.e., the principle of class coherence). With the optimized feature embedding vector perturbation, it then computes the adversarial perturbation in the data space via the vanilla one-step back-propagation. In experiments, the proposed QuadAttac$K$ is tested in the ImageNet-1k classification using ResNet-50, DenseNet-121, and Vision Transformers (ViT-B and DEiT-S). It successfully pushes the boundary of successful ordered top-$K$ attacks from $K=10$ up to $K=20$ at a cheap budget ($1\times 60$) and further improves attack success rates for $K=5$ for all tested models, while retaining the performance for $K=1$.  ( 2 min )
    LightGCNet: A Lightweight Geometric Constructive Neural Network for Data-Driven Soft sensors. (arXiv:2312.12022v1 [stat.ML])
    Data-driven soft sensors provide a potentially cost-effective and more accurate modeling approach to measure difficult-to-measure indices in industrial processes compared to mechanistic approaches. Artificial intelligence (AI) techniques, such as deep learning, have become a popular soft sensors modeling approach in the area of machine learning and big data. However, soft sensors models based deep learning potentially lead to complex model structures and excessive training time. In addition, industrial processes often rely on distributed control systems (DCS) characterized by resource constraints. Herein, guided by spatial geometric, a lightweight geometric constructive neural network, namely LightGCNet, is proposed, which utilizes compact angle constraint to assign the hidden parameters from dynamic intervals. At the same time, a node pool strategy and spatial geometric relationships are used to visualize and optimize the process of assigning hidden parameters, enhancing interpretability. In addition, the universal approximation property of LightGCNet is proved by spatial geometric analysis. Two versions algorithmic implementations of LightGCNet are presented in this article. Simulation results concerning both benchmark datasets and the ore grinding process indicate remarkable merits of LightGCNet in terms of small network size, fast learning speed, and sound generalization.  ( 2 min )
    EyePreserve: Identity-Preserving Iris Synthesis. (arXiv:2312.12028v1 [cs.CV])
    Synthesis of same-identity biometric iris images, both for existing and non-existing identities while preserving the identity across a wide range of pupil sizes, is complex due to intricate iris muscle constriction mechanism, requiring a precise model of iris non-linear texture deformations to be embedded into the synthesis pipeline. This paper presents the first method of fully data-driven, identity-preserving, pupil size-varying s ynthesis of iris images. This approach is capable of synthesizing images of irises with different pupil sizes representing non-existing identities as well as non-linearly deforming the texture of iris images of existing subjects given the segmentation mask of the target iris image. Iris recognition experiments suggest that the proposed deformation model not only preserves the identity when changing the pupil size but offers better similarity between same-identity iris samples with significant differences in pupil size, compared to state-of-the-art linear and non-linear (bio-mechanical-based) iris deformation models. Two immediate applications of the proposed approach are: (a) synthesis of, or enhancement of the existing biometric datasets for iris recognition, mimicking those acquired with iris sensors, and (b) helping forensic human experts in examining iris image pairs with significant differences in pupil dilation. Source codes and weights of the models are made available with the paper.  ( 2 min )
    Label Denoising through Cross-Model Agreement. (arXiv:2308.13976v3 [cs.LG] UPDATED)
    Learning from corrupted labels is very common in real-world machine-learning applications. Memorizing such noisy labels could affect the learning of the model, leading to sub-optimal performances. In this work, we propose a novel framework to learn robust machine-learning models from noisy labels. Through an empirical study, we find that different models make relatively similar predictions on clean examples, while the predictions on noisy examples vary much more across different models. Motivated by this observation, we propose \em denoising with cross-model agreement \em (DeCA) which aims to minimize the KL-divergence between the true label distributions parameterized by two machine learning models while maximizing the likelihood of data observation. We employ the proposed DeCA on both the binary label scenario and the multiple label scenario. For the binary label scenario, we select implicit feedback recommendation as the downstream task and conduct experiments with four state-of-the-art recommendation models on four datasets. For the multiple-label scenario, the downstream application is image classification on two benchmark datasets. Experimental results demonstrate that the proposed methods significantly improve the model performance compared with normal training and other denoising methods on both binary and multiple-label scenarios.  ( 2 min )
    GDP nowcasting with artificial neural networks: How much does long-term memory matter?. (arXiv:2304.05805v2 [econ.EM] UPDATED)
    In our study, we apply artificial neural networks (ANNs) to nowcast quarterly GDP growth for the U.S. economy. Using the monthly FRED-MD database, we compare the nowcasting performance of five different ANN architectures: the multilayer perceptron (MLP), the one-dimensional convolutional neural network (1D CNN), the Elman recurrent neural network (RNN), the long short-term memory network (LSTM), and the gated recurrent unit (GRU). The empirical analysis presents the results from two distinctively different evaluation periods. The first (2012:Q1 -- 2019:Q4) is characterized by balanced economic growth, while the second (2012:Q1 -- 2022:Q4) also includes periods of the COVID-19 recession. According to our results, longer input sequences result in more accurate nowcasts in periods of balanced economic growth. However, this effect ceases above a relatively low threshold value of around six quarters (eighteen months). During periods of economic turbulence (e.g., during the COVID-19 recession), longer input sequences do not help the models' predictive performance; instead, they seem to weaken their generalization capability. Combined results from the two evaluation periods indicate that architectural features enabling for long-term memory do not result in more accurate nowcasts. On the other hand, the 1D CNN has proved to be a highly suitable model for GDP nowcasting. The network has shown good nowcasting performance among the competitors during the first evaluation period and achieved the overall best accuracy during the second evaluation period. Consequently, first in the literature, we propose the application of the 1D CNN for economic nowcasting.  ( 3 min )
    Human-Machine Teaming for UAVs: An Experimentation Platform. (arXiv:2312.11718v1 [cs.AI])
    Full automation is often not achievable or desirable in critical systems with high-stakes decisions. Instead, human-AI teams can achieve better results. To research, develop, evaluate, and validate algorithms suited for such teaming, lightweight experimentation platforms that enable interactions between humans and multiple AI agents are necessary. However, there are limited examples of such platforms for defense environments. To address this gap, we present the Cogment human-machine teaming experimentation platform, which implements human-machine teaming (HMT) use cases that features heterogeneous multi-agent systems and can involve learning AI agents, static AI agents, and humans. It is built on the Cogment platform and has been used for academic research, including work presented at the ALA workshop at AAMAS this year [1]. With this platform, we hope to facilitate further research on human-machine teaming in critical systems and defense environments.  ( 2 min )
    Ghost Noise for Regularizing Deep Neural Networks. (arXiv:2305.17205v2 [cs.LG] UPDATED)
    Batch Normalization (BN) is widely used to stabilize the optimization process and improve the test performance of deep neural networks. The regularization effect of BN depends on the batch size and explicitly using smaller batch sizes with Batch Normalization, a method known as Ghost Batch Normalization (GBN), has been found to improve generalization in many settings. We investigate the effectiveness of GBN by disentangling the induced ``Ghost Noise'' from normalization and quantitatively analyzing the distribution of noise as well as its impact on model performance. Inspired by our analysis, we propose a new regularization technique called Ghost Noise Injection (GNI) that imitates the noise in GBN without incurring the detrimental train-test discrepancy effects of small batch training. We experimentally show that GNI can provide a greater generalization benefit than GBN. Ghost Noise Injection can also be beneficial in otherwise non-noisy settings such as layer-normalized networks, providing additional evidence of the usefulness of Ghost Noise in Batch Normalization as a regularizer.  ( 2 min )
    Lifting Architectural Constraints of Injective Flows. (arXiv:2306.01843v3 [cs.LG] UPDATED)
    Normalizing Flows explicitly maximize a full-dimensional likelihood on the training data. However, real data is typically only supported on a lower-dimensional manifold leading the model to expend significant compute on modeling noise. Injective Flows fix this by jointly learning a manifold and the distribution on it. So far, they have been limited by restrictive architectures and/or high computational cost. We lift both constraints by a new efficient estimator for the maximum likelihood loss, compatible with free-form bottleneck architectures. We further show that naively learning both the data manifold and the distribution on it can lead to divergent solutions, and use this insight to motivate a stable maximum likelihood training objective. We perform extensive experiments on toy, tabular and image data, demonstrating the competitive performance of the resulting model.  ( 2 min )
    Ad-load Balancing via Off-policy Learning in a Content Marketplace. (arXiv:2309.11518v2 [cs.IR] UPDATED)
    Ad-load balancing is a critical challenge in online advertising systems, particularly in the context of social media platforms, where the goal is to maximize user engagement and revenue while maintaining a satisfactory user experience. This requires the optimization of conflicting objectives, such as user satisfaction and ads revenue. Traditional approaches to ad-load balancing rely on static allocation policies, which fail to adapt to changing user preferences and contextual factors. In this paper, we present an approach that leverages off-policy learning and evaluation from logged bandit feedback. We start by presenting a motivating analysis of the ad-load balancing problem, highlighting the conflicting objectives between user satisfaction and ads revenue. We emphasize the nuances that arise due to user heterogeneity and the dependence on the user's position within a session. Based on this analysis, we define the problem as determining the optimal ad-load for a particular feed fetch. To tackle this problem, we propose an off-policy learning framework that leverages unbiased estimators such as Inverse Propensity Scoring (IPS) and Doubly Robust (DR) to learn and estimate the policy values using offline collected stochastic data. We present insights from online A/B experiments deployed at scale across over 80 million users generating over 200 million sessions, where we find statistically significant improvements in both user satisfaction metrics and ads revenue for the platform.  ( 3 min )
    HypLL: The Hyperbolic Learning Library. (arXiv:2306.06154v3 [cs.LG] UPDATED)
    Deep learning in hyperbolic space is quickly gaining traction in the fields of machine learning, multimedia, and computer vision. Deep networks commonly operate in Euclidean space, implicitly assuming that data lies on regular grids. Recent advances have shown that hyperbolic geometry provides a viable alternative foundation for deep learning, especially when data is hierarchical in nature and when working with few embedding dimensions. Currently however, no accessible open-source library exists to build hyperbolic network modules akin to well-known deep learning libraries. We present HypLL, the Hyperbolic Learning Library to bring the progress on hyperbolic deep learning together. HypLL is built on top of PyTorch, with an emphasis in its design for ease-of-use, in order to attract a broad audience towards this new and open-ended research direction. The code is available at: https://github.com/maxvanspengler/hyperbolic_learning_library.  ( 2 min )
    Augmentation-Aware Self-Supervision for Data-Efficient GAN Training. (arXiv:2205.15677v4 [cs.LG] UPDATED)
    Training generative adversarial networks (GANs) with limited data is challenging because the discriminator is prone to overfitting. Previously proposed differentiable augmentation demonstrates improved data efficiency of training GANs. However, the augmentation implicitly introduces undesired invariance to augmentation for the discriminator since it ignores the change of semantics in the label space caused by data transformation, which may limit the representation learning ability of the discriminator and ultimately affect the generative modeling performance of the generator. To mitigate the negative impact of invariance while inheriting the benefits of data augmentation, we propose a novel augmentation-aware self-supervised discriminator that predicts the augmentation parameter of the augmented data. Particularly, the prediction targets of real data and generated data are required to be distinguished since they are different during training. We further encourage the generator to adversarially learn from the self-supervised discriminator by generating augmentation-predictable real and not fake data. This formulation connects the learning objective of the generator and the arithmetic $-$ harmonic mean divergence under certain assumptions. We compare our method with state-of-the-art (SOTA) methods using the class-conditional BigGAN and unconditional StyleGAN2 architectures on data-limited CIFAR-10, CIFAR-100, FFHQ, LSUN-Cat, and five low-shot datasets. Experimental results demonstrate significant improvements of our method over SOTA methods in training data-efficient GANs.  ( 3 min )
    Multi-Agent Reinforcement Learning with Action Masking for UAV-enabled Mobile Communications. (arXiv:2303.16737v2 [cs.MA] UPDATED)
    Unmanned Aerial Vehicles (UAVs) are increasingly used as aerial base stations to provide ad hoc communications infrastructure. Building upon prior research efforts which consider either static nodes, 2D trajectories or single UAV systems, this paper focuses on the use of multiple UAVs for providing wireless communication to mobile users in the absence of terrestrial communications infrastructure. In particular, we jointly optimize UAV 3D trajectory and NOMA power allocation to maximize system throughput. Firstly, a weighted K-means-based clustering algorithm establishes UAV-user associations at regular intervals. The efficacy of training a novel Shared Deep Q-Network (SDQN) with action masking is then explored. Unlike training each UAV separately using DQN, the SDQN reduces training time by using the experiences of multiple UAVs instead of a single agent. We also show that SDQN can be used to train a multi-agent system with differing action spaces. Simulation results confirm that: 1) training a shared DQN outperforms a conventional DQN in terms of maximum system throughput (+20%) and training time (-10%); 2) it can converge for agents with different action spaces, yielding a 9% increase in throughput compared to mutual learning algorithms; and 3) combining NOMA with an SDQN architecture enables the network to achieve a better sum rate compared with existing baseline schemes.  ( 2 min )
    The performance of multiple language models in identifying offensive language on social media. (arXiv:2312.11504v1 [cs.CL])
    Text classification is an important topic in the field of natural language processing. It has been preliminarily applied in information retrieval, digital library, automatic abstracting, text filtering, word semantic discrimination and many other fields. The aim of this research is to use a variety of algorithms to test the ability to identify offensive posts and evaluate their performance against a variety of assessment methods. The motivation for this project is to reduce the harm of these languages to human censors by automating the screening of offending posts. The field is a new one, and despite much interest in the past two years, there has been no focus on the object of the offence. Through the experiment of this project, it should inspire future research on identification methods as well as identification content.  ( 2 min )
    Identification of Causal Structure with Latent Variables Based on Higher Order Cumulants. (arXiv:2312.11934v1 [cs.LG])
    Causal discovery with latent variables is a crucial but challenging task. Despite the emergence of numerous methods aimed at addressing this challenge, they are not fully identified to the structure that two observed variables are influenced by one latent variable and there might be a directed edge in between. Interestingly, we notice that this structure can be identified through the utilization of higher-order cumulants. By leveraging the higher-order cumulants of non-Gaussian data, we provide an analytical solution for estimating the causal coefficients or their ratios. With the estimated (ratios of) causal coefficients, we propose a novel approach to identify the existence of a causal edge between two observed variables subject to latent variable influence. In case when such a causal edge exits, we introduce an asymmetry criterion to determine the causal direction. The experimental results demonstrate the effectiveness of our proposed method.  ( 2 min )
    A Survey of Reasoning with Foundation Models. (arXiv:2312.11562v1 [cs.AI])
    Reasoning, a crucial ability for complex problem-solving, plays a pivotal role in various real-world settings such as negotiation, medical diagnosis, and criminal investigation. It serves as a fundamental methodology in the field of Artificial General Intelligence (AGI). With the ongoing development of foundation models, there is a growing interest in exploring their abilities in reasoning tasks. In this paper, we introduce seminal foundation models proposed or adaptable for reasoning, highlighting the latest advancements in various reasoning tasks, methods, and benchmarks. We then delve into the potential future directions behind the emergence of reasoning abilities within foundation models. We also discuss the relevance of multimodal learning, autonomous agents, and super alignment in the context of reasoning. By discussing these future research directions, we hope to inspire researchers in their exploration of this field, stimulate further advancements in reasoning with foundation models, and contribute to the development of AGI.  ( 2 min )
    Mithridates: Auditing and Boosting Backdoor Resistance of Machine Learning Pipelines. (arXiv:2302.04977v3 [cs.CR] UPDATED)
    Machine learning (ML) models trained on data from potentially untrusted sources are vulnerable to poisoning. A small, maliciously crafted subset of the training inputs can cause the model to learn a "backdoor" task (e.g., misclassify inputs with a certain feature) in addition to its main task. Recent research proposed many hypothetical backdoor attacks whose efficacy heavily depends on the configuration and training hyperparameters of the target model. Given the variety of potential backdoor attacks, ML engineers who are not security experts have no way to measure how vulnerable their current training pipelines are, nor do they have a practical way to compare training configurations so as to pick the more resistant ones. Deploying a defense requires evaluating and choosing from among dozens of research papers and re-engineering the training pipeline. In this paper, we aim to provide ML engineers with pragmatic tools to audit the backdoor resistance of their training pipelines and to compare different training configurations, to help choose one that best balances accuracy and security. First, we propose a universal, attack-agnostic resistance metric based on the minimum number of training inputs that must be compromised before the model learns any backdoor. Second, we design, implement, and evaluate Mithridates a multi-stage approach that integrates backdoor resistance into the training-configuration search. ML developers already rely on hyperparameter search to find configurations that maximize the model's accuracy. Mithridates extends this standard tool to balance accuracy and resistance without disruptive changes to the training pipeline. We show that hyperparameters found by Mithridates increase resistance to multiple types of backdoor attacks by 3-5x with only a slight impact on accuracy. We also discuss extensions to AutoML and federated learning.  ( 3 min )
    Prediction and Control in Continual Reinforcement Learning. (arXiv:2312.11669v1 [cs.LG])
    Temporal difference (TD) learning is often used to update the estimate of the value function which is used by RL agents to extract useful policies. In this paper, we focus on value function estimation in continual reinforcement learning. We propose to decompose the value function into two components which update at different timescales: a permanent value function, which holds general knowledge that persists over time, and a transient value function, which allows quick adaptation to new situations. We establish theoretical results showing that our approach is well suited for continual learning and draw connections to the complementary learning systems (CLS) theory from neuroscience. Empirically, this approach improves performance significantly on both prediction and control problems.  ( 2 min )
    MG-Skip: Random Multi-Gossip Skipping Method for Nonsmooth Distributed Optimization. (arXiv:2312.11861v1 [math.OC])
    Distributed optimization methods with probabilistic local updates have recently gained attention for their provable ability to communication acceleration. Nevertheless, this capability is effective only when the loss function is smooth and the network is sufficiently well-connected. In this paper, we propose the first linear convergent method MG-Skip with probabilistic local updates for nonsmooth distributed optimization. Without any extra condition for the network connectivity, MG-Skip allows for the multiple-round gossip communication to be skipped in most iterations, while its iteration complexity is $\mathcal{O}\left(\kappa \log \frac{1}{\epsilon}\right)$ and communication complexity is only $\mathcal{O}\left(\sqrt{\frac{\kappa}{(1-\rho)}} \log \frac{1}{\epsilon}\right)$, where $\kappa$ is the condition number of the loss function and $\rho$ reflects the connectivity of the network topology. To the best of our knowledge, MG-Skip achieves the best communication complexity when the loss function has the smooth (strongly convex)+nonsmooth (convex) composite form.  ( 2 min )
    ContraNovo: A Contrastive Learning Approach to Enhance De Novo Peptide Sequencing. (arXiv:2312.11584v1 [q-bio.QM])
    De novo peptide sequencing from mass spectrometry (MS) data is a critical task in proteomics research. Traditional de novo algorithms have encountered a bottleneck in accuracy due to the inherent complexity of proteomics data. While deep learning-based methods have shown progress, they reduce the problem to a translation task, potentially overlooking critical nuances between spectra and peptides. In our research, we present ContraNovo, a pioneering algorithm that leverages contrastive learning to extract the relationship between spectra and peptides and incorporates the mass information into peptide decoding, aiming to address these intricacies more efficiently. Through rigorous evaluations on two benchmark datasets, ContraNovo consistently outshines contemporary state-of-the-art solutions, underscoring its promising potential in enhancing de novo peptide sequencing. The source code is available at https://github.com/BEAM-Labs/ContraNovo.  ( 2 min )
    Submodularity, pairwise independence and correlation gap. (arXiv:2209.08563v2 [math.OC] UPDATED)
    In this paper, we provide a characterization of the expected value of monotone submodular set functions with $n$ pairwise independent random inputs. Inspired by the notion of ``correlation gap'', we study the ratio of the maximum expected value of a function with arbitrary dependence among the random inputs with given marginal probabilities to the maximum expected value of the function with pairwise independent random inputs and the same marginal probabilities. Our results show that the ratio is upper bounded by: (a) $4/3$ for $n = 3$ with general marginal probabilities and any monotone submodular set function (b) $4/3$ for general $n$ with small and large marginal probabilities and any monotone submodular set function and (c) $4k/(4k-1)$ for general $n$, general identical probabilities and rank functions of $k$-uniform matroids. The bound is tight in all three cases. This contrasts with the $e/(e-1)$ bound on the correlation gap ratio for monotone submodular set functions with mutually independent random inputs (which is known to be tight in case (b)), and illustrates a fundamental difference in the behavior of submodular functions with weaker notions of independence. These results can be immediately extended beyond pairwise independence to correlated random inputs. We discuss applications in distributionally robust optimization and mechanism design and end the paper with a conjecture.  ( 2 min )
    On the Trade-off between the Number of Nodes and the Number of Trees in a Random Forest. (arXiv:2312.11540v1 [cs.LG])
    In this paper, we focus on the prediction phase of a random forest and study the problem of representing a bag of decision trees using a smaller bag of decision trees, where we only consider binary decision problems on the binary domain and simple decision trees in which an internal node is limited to querying the Boolean value of a single variable. As a main result, we show that the majority function of $n$ variables can be represented by a bag of $T$ ($< n$) decision trees each with polynomial size if $n-T$ is a constant, where $n$ and $T$ must be odd (in order to avoid the tie break). We also show that a bag of $n$ decision trees can be represented by a bag of $T$ decision trees each with polynomial size if $n-T$ is a constant and a small classification error is allowed. A related result on the $k$-out-of-$n$ functions is presented too.  ( 2 min )
    Learned ISTA with Error-based Thresholding for Adaptive Sparse Coding. (arXiv:2112.10985v2 [cs.LG] UPDATED)
    Drawing on theoretical insights, we advocate an error-based thresholding (EBT) mechanism for learned ISTA (LISTA), which utilizes a function of the layer-wise reconstruction error to suggest a specific threshold for each observation in the shrinkage function of each layer. We show that the proposed EBT mechanism well disentangles the learnable parameters in the shrinkage functions from the reconstruction errors, endowing the obtained models with improved adaptivity to possible data variations. With rigorous analyses, we further show that the proposed EBT also leads to a faster convergence on the basis of LISTA or its variants, in addition to its higher adaptivity. Extensive experimental results confirm our theoretical analyses and verify the effectiveness of our methods.  ( 2 min )
    Shapley-PC: Constraint-based Causal Structure Learning with Shapley Values. (arXiv:2312.11582v1 [cs.LG])
    Causal Structure Learning (CSL), amounting to extracting causal relations among the variables in a dataset, is widely perceived as an important step towards robust and transparent models. Constraint-based CSL leverages conditional independence tests to perform causal discovery. We propose Shapley-PC, a novel method to improve constraint-based CSL algorithms by using Shapley values over the possible conditioning sets to decide which variables are responsible for the observed conditional (in)dependences. We prove soundness and asymptotic consistency and demonstrate that it can outperform state-of-the-art constraint-based, search-based and functional causal model-based methods, according to standard metrics in CSL.  ( 2 min )
    Provably Convergent Federated Trilevel Learning. (arXiv:2312.11835v1 [cs.LG])
    Trilevel learning, also called trilevel optimization (TLO), has been recognized as a powerful modelling tool for hierarchical decision process and widely applied in many machine learning applications, such as robust neural architecture search, hyperparameter optimization, and domain adaptation. Tackling TLO problems has presented a great challenge due to their nested decision-making structure. In addition, existing works on TLO face the following key challenges: 1) they all focus on the non-distributed setting, which may lead to privacy breach; 2) they do not offer any non-asymptotic convergence analysis which characterizes how fast an algorithm converges. To address the aforementioned challenges, this paper proposes an asynchronous federated trilevel optimization method to solve TLO problems. The proposed method utilizes $\mu$-cuts to construct a hyper-polyhedral approximation for the TLO problem and solve it in an asynchronous manner. We demonstrate that the proposed $\mu$-cuts are applicable to not only convex functions but also a wide range of non-convex functions that meet the $\mu$-weakly convex assumption. Furthermore, we theoretically analyze the non-asymptotic convergence rate for the proposed method by showing its iteration complexity to obtain $\epsilon$-stationary point is upper bounded by $\mathcal{O}(\frac{1}{\epsilon^2})$. Extensive experiments on real-world datasets have been conducted to elucidate the superiority of the proposed method, e.g., it has a faster convergence rate with a maximum acceleration of approximately 80$\%$.  ( 2 min )
    Dynamic Frequency Domain Graph Convolutional Network for Traffic Forecasting. (arXiv:2312.11933v1 [cs.LG])
    Complex spatial dependencies in transportation networks make traffic prediction extremely challenging. Much existing work is devoted to learning dynamic graph structures among sensors, and the strategy of mining spatial dependencies from traffic data, known as data-driven, tends to be an intuitive and effective approach. However, Time-Shift of traffic patterns and noise induced by random factors hinder data-driven spatial dependence modeling. In this paper, we propose a novel dynamic frequency domain graph convolution network (DFDGCN) to capture spatial dependencies. Specifically, we mitigate the effects of time-shift by Fourier transform, and introduce the identity embedding of sensors and time embedding when capturing data for graph learning since traffic data with noise is not entirely reliable. The graph is combined with static predefined and self-adaptive graphs during graph convolution to predict future traffic data through classical causal convolutions. Extensive experiments on four real-world datasets demonstrate that our model is effective and outperforms the baselines.  ( 2 min )
    Continual Learning: Forget-free Winning Subnetworks for Video Representations. (arXiv:2312.11973v1 [cs.CV])
    Inspired by the Regularized Lottery Ticket Hypothesis (RLTH), which highlights the presence of competitive subnetworks within dense networks for continual learning tasks, we introduce Winning Subnetworks (WSN). This approach utilizes reused weights in dense networks to enhance learning in Task Incremental Learning (TIL) scenarios. To mitigate overfitting in Few-Shot Class Incremental Learning (FSCIL), we have developed WSN variants referred to as the Soft subnetwork (SoftNet). Furthermore, addressing WSN's limitation of sparse reused weights in Video Incremental Learning (VIL), we propose the Fourier Subneural Operator (FSO). The FSO, operating in Fourier space, adaptively and compactly encodes videos, discovering reusable subnetworks with diverse bandwidths. We have applied FSO's Fourier representations to various continual learning contexts, including VIL, TIL, and FSCIL. Our extensive experiments across these scenarios demonstrate FSO's remarkable efficacy in continual learning, significantly enhancing task performance at various convolutional representational levels: it boosts performance in the higher layers for TIL and FSCIL and the lower layers for VIL.  ( 2 min )
    Label-Free Multivariate Time Series Anomaly Detection. (arXiv:2312.11549v1 [cs.LG])
    Anomaly detection in multivariate time series (MTS) has been widely studied in one-class classification (OCC) setting. The training samples in OCC are assumed to be normal, which is difficult to guarantee in practical situations. Such a case may degrade the performance of OCC-based anomaly detection methods which fit the training distribution as the normal distribution. In this paper, we propose MTGFlow, an unsupervised anomaly detection approach for MTS anomaly detection via dynamic Graph and entity-aware normalizing Flow. MTGFlow first estimates the density of the entire training samples and then identifies anomalous instances based on the density of the test samples within the fitted distribution. This relies on a widely accepted assumption that anomalous instances exhibit more sparse densities than normal ones, with no reliance on the clean training dataset. However, it is intractable to directly estimate the density due to complex dependencies among entities and their diverse inherent characteristics. To mitigate this, we utilize the graph structure learning model to learn interdependent and evolving relations among entities, which effectively captures complex and accurate distribution patterns of MTS. In addition, our approach incorporates the unique characteristics of individual entities by employing an entity-aware normalizing flow. This enables us to represent each entity as a parameterized normal distribution. Furthermore, considering that some entities present similar characteristics, we propose a cluster strategy that capitalizes on the commonalities of entities with similar characteristics, resulting in more precise and detailed density estimation. We refer to this cluster-aware extension as MTGFlow_cluster. Extensive experiments are conducted on six widely used benchmark datasets, in which MTGFlow and MTGFlow cluster demonstrate their superior detection performance.  ( 3 min )
    Are you talking to ['xem'] or ['x', 'em']? On Tokenization and Addressing Misgendering in LLMs with Pronoun Tokenization Parity. (arXiv:2312.11779v1 [cs.CL])
    A large body of NLP research has documented the ways gender biases manifest and amplify within large language models (LLMs), though this research has predominantly operated within a gender binary-centric context. A growing body of work has identified the harmful limitations of this gender-exclusive framing; many LLMs cannot correctly and consistently refer to persons outside the gender binary, especially if they use neopronouns. While data scarcity has been identified as a possible culprit, the precise mechanisms through which it influences LLM misgendering remain underexplored. Our work addresses this gap by studying data scarcity's role in subword tokenization and, consequently, the formation of LLM word representations. We uncover how the Byte-Pair Encoding (BPE) tokenizer, a backbone for many popular LLMs, contributes to neopronoun misgendering through out-of-vocabulary behavior. We introduce pronoun tokenization parity (PTP), a novel approach to reduce LLM neopronoun misgendering by preserving a token's functional structure. We evaluate PTP's efficacy using pronoun consistency-based metrics and a novel syntax-based metric. Through several controlled experiments, finetuning LLMs with PTP improves neopronoun consistency from 14.5% to 58.4%, highlighting the significant role tokenization plays in LLM pronoun consistency.  ( 3 min )
    A Bayesian Spatial Model to Correct Under-Reporting in Urban Crowdsourcing. (arXiv:2312.11754v1 [cs.CY])
    Decision-makers often observe the occurrence of events through a reporting process. City governments, for example, rely on resident reports to find and then resolve urban infrastructural problems such as fallen street trees, flooded basements, or rat infestations. Without additional assumptions, there is no way to distinguish events that occur but are not reported from events that truly did not occur--a fundamental problem in settings with positive-unlabeled data. Because disparities in reporting rates correlate with resident demographics, addressing incidents only on the basis of reports leads to systematic neglect in neighborhoods that are less likely to report events. We show how to overcome this challenge by leveraging the fact that events are spatially correlated. Our framework uses a Bayesian spatial latent variable model to infer event occurrence probabilities and applies it to storm-induced flooding reports in New York City, further pooling results across multiple storms. We show that a model accounting for under-reporting and spatial correlation predicts future reports more accurately than other models, and further induces a more equitable set of inspections: its allocations better reflect the population and provide equitable service to non-white, less traditionally educated, and lower-income residents. This finding reflects heterogeneous reporting behavior learned by the model: reporting rates are higher in Census tracts with higher populations, proportions of white residents, and proportions of owner-occupied households. Our work lays the groundwork for more equitable proactive government services, even with disparate reporting behavior.  ( 3 min )
    Multiple Hypothesis Dropout: Estimating the Parameters of Multi-Modal Output Distributions. (arXiv:2312.11735v1 [cs.LG])
    In many real-world applications, from robotics to pedestrian trajectory prediction, there is a need to predict multiple real-valued outputs to represent several potential scenarios. Current deep learning techniques to address multiple-output problems are based on two main methodologies: (1) mixture density networks, which suffer from poor stability at high dimensions, or (2) multiple choice learning (MCL), an approach that uses $M$ single-output functions, each only producing a point estimate hypothesis. This paper presents a Mixture of Multiple-Output functions (MoM) approach using a novel variant of dropout, Multiple Hypothesis Dropout. Unlike traditional MCL-based approaches, each multiple-output function not only estimates the mean but also the variance for its hypothesis. This is achieved through a novel stochastic winner-take-all loss which allows each multiple-output function to estimate variance through the spread of its subnetwork predictions. Experiments on supervised learning problems illustrate that our approach outperforms existing solutions for reconstructing multimodal output distributions. Additional studies on unsupervised learning problems show that estimating the parameters of latent posterior distributions within a discrete autoencoder significantly improves codebook efficiency, sample quality, precision and recall.  ( 2 min )
    Locally-Minimal Probabilistic Explanations. (arXiv:2312.11831v1 [cs.LG])
    Formal abductive explanations offer crucial guarantees of rigor and so are of interest in high-stakes uses of machine learning (ML). One drawback of abductive explanations is explanation size, justified by the cognitive limits of human decision-makers. Probabilistic abductive explanations (PAXps) address this limitation, but their theoretical and practical complexity makes their exact computation most often unrealistic. This paper proposes novel efficient algorithms for the computation of locally-minimal PXAps, which offer high-quality approximations of PXAps in practice. The experimental results demonstrate the practical efficiency of the proposed algorithms.  ( 2 min )
    Convergence Visualizer of Decentralized Federated Distillation with Reduced Communication Costs. (arXiv:2312.11905v1 [cs.NI])
    Federated learning (FL) achieves collaborative learning without the need for data sharing, thus preventing privacy leakage. To extend FL into a fully decentralized algorithm, researchers have applied distributed optimization algorithms to FL by considering machine learning (ML) tasks as parameter optimization problems. Conversely, the consensus-based multi-hop federated distillation (CMFD) proposed in the authors' previous work makes neural network (NN) models get close with others in a function space rather than in a parameter space. Hence, this study solves two unresolved challenges of CMFD: (1) communication cost reduction and (2) visualization of model convergence. Based on a proposed dynamic communication cost reduction method (DCCR), the amount of data transferred in a network is reduced; however, with a slight degradation in the prediction accuracy. In addition, a technique for visualizing the distance between the NN models in a function space is also proposed. The technique applies a dimensionality reduction technique by approximating infinite-dimensional functions as numerical vectors to visualize the trajectory of how the models change by the distributed learning algorithm.  ( 3 min )
    COPD-FlowNet: Elevating Non-invasive COPD Diagnosis with CFD Simulations. (arXiv:2312.11561v1 [cs.LG])
    Chronic Obstructive Pulmonary Disorder (COPD) is a prevalent respiratory disease that significantly impacts the quality of life of affected individuals. This paper presents COPDFlowNet, a novel deep-learning framework that leverages a custom Generative Adversarial Network (GAN) to generate synthetic Computational Fluid Dynamics (CFD) velocity flow field images specific to the trachea of COPD patients. These synthetic images serve as a valuable resource for data augmentation and model training. Additionally, COPDFlowNet incorporates a custom Convolutional Neural Network (CNN) architecture to predict the location of the obstruction site.  ( 2 min )
    Multi-agent reinforcement learning using echo-state network and its application to pedestrian dynamics. (arXiv:2312.11834v1 [cs.MA])
    In recent years, simulations of pedestrians using the multi-agent reinforcement learning (MARL) have been studied. This study considered the roads on a grid-world environment, and implemented pedestrians as MARL agents using an echo-state network and the least squares policy iteration method. Under this environment, the ability of these agents to learn to move forward by avoiding other agents was investigated. Specifically, we considered two types of tasks: the choice between a narrow direct route and a broad detour, and the bidirectional pedestrian flow in a corridor. The simulations results indicated that the learning was successful when the density of the agents was not that high.  ( 2 min )
    Curriculum Learning for Cooperation in Multi-Agent Reinforcement Learning. (arXiv:2312.11768v1 [cs.AI])
    While there has been significant progress in curriculum learning and continuous learning for training agents to generalize across a wide variety of environments in the context of single-agent reinforcement learning, it is unclear if these algorithms would still be valid in a multi-agent setting. In a competitive setting, a learning agent can be trained by making it compete with a curriculum of increasingly skilled opponents. However, a general intelligent agent should also be able to learn to act around other agents and cooperate with them to achieve common goals. When cooperating with other agents, the learning agent must (a) learn how to perform the task (or subtask), and (b) increase the overall team reward. In this paper, we aim to answer the question of what kind of cooperative teammate, and a curriculum of teammates should a learning agent be trained with to achieve these two objectives. Our results on the game Overcooked show that a pre-trained teammate who is less skilled is the best teammate for overall team reward but the worst for the learning of the agent. Moreover, somewhat surprisingly, a curriculum of teammates with decreasing skill levels performs better than other types of curricula.  ( 2 min )
    Empowering Dual-Level Graph Self-Supervised Pretraining with Motif Discovery. (arXiv:2312.11927v1 [cs.LG])
    While self-supervised graph pretraining techniques have shown promising results in various domains, their application still experiences challenges of limited topology learning, human knowledge dependency, and incompetent multi-level interactions. To address these issues, we propose a novel solution, Dual-level Graph self-supervised Pretraining with Motif discovery (DGPM), which introduces a unique dual-level pretraining structure that orchestrates node-level and subgraph-level pretext tasks. Unlike prior approaches, DGPM autonomously uncovers significant graph motifs through an edge pooling module, aligning learned motif similarities with graph kernel-based similarities. A cross-matching task enables sophisticated node-motif interactions and novel representation learning. Extensive experiments on 15 datasets validate DGPM's effectiveness and generalizability, outperforming state-of-the-art methods in unsupervised representation learning and transfer learning settings. The autonomously discovered motifs demonstrate the potential of DGPM to enhance robustness and interpretability.  ( 2 min )
    Big Learning Expectation Maximization. (arXiv:2312.11926v1 [cs.LG])
    Mixture models serve as one fundamental tool with versatile applications. However, their training techniques, like the popular Expectation Maximization (EM) algorithm, are notoriously sensitive to parameter initialization and often suffer from bad local optima that could be arbitrarily worse than the optimal. To address the long-lasting bad-local-optima challenge, we draw inspiration from the recent ground-breaking foundation models and propose to leverage their underlying big learning principle to upgrade the EM. Specifically, we present the Big Learning EM (BigLearn-EM), an EM upgrade that simultaneously performs joint, marginal, and orthogonally transformed marginal matchings between data and model distributions. Through simulated experiments, we empirically show that the BigLearn-EM is capable of delivering the optimal with high probability; comparisons on benchmark clustering datasets further demonstrate its effectiveness and advantages over existing techniques. The code is available at https://github.com/YulaiCong/Big-Learning-Expectation-Maximization.  ( 2 min )
    Regularized Conditional Alignment for Multi-Domain Text Classification. (arXiv:2312.11572v1 [cs.CL])
    The most successful multi-domain text classification (MDTC) approaches employ the shared-private paradigm to facilitate the enhancement of domain-invariant features through domain-specific attributes. Additionally, they employ adversarial training to align marginal feature distributions. Nevertheless, these methodologies encounter two primary challenges: (1) Neglecting class-aware information during adversarial alignment poses a risk of misalignment; (2) The limited availability of labeled data across multiple domains fails to ensure adequate discriminative capacity for the model. To tackle these issues, we propose a method called Regularized Conditional Alignment (RCA) to align the joint distributions of domains and classes, thus matching features within the same category and amplifying the discriminative qualities of acquired features. Moreover, we employ entropy minimization and virtual adversarial training to constrain the uncertainty of predictions pertaining to unlabeled data and enhance the model's robustness. Empirical results on two benchmark datasets demonstrate that our RCA approach outperforms state-of-the-art MDTC techniques.  ( 2 min )
    Point Cloud Segmentation Using Transfer Learning with RandLA-Net: A Case Study on Urban Areas. (arXiv:2312.11880v1 [cs.CV])
    Urban environments are characterized by complex structures and diverse features, making accurate segmentation of point cloud data a challenging task. This paper presents a comprehensive study on the application of RandLA-Net, a state-of-the-art neural network architecture, for the 3D segmentation of large-scale point cloud data in urban areas. The study focuses on three major Chinese cities, namely Chengdu, Jiaoda, and Shenzhen, leveraging their unique characteristics to enhance segmentation performance. To address the limited availability of labeled data for these specific urban areas, we employed transfer learning techniques. We transferred the learned weights from the Sensat Urban and Toronto 3D datasets to initialize our RandLA-Net model. Additionally, we performed class remapping to adapt the model to the target urban areas, ensuring accurate segmentation results. The experimental results demonstrate the effectiveness of the proposed approach achieving over 80\% F1 score for each areas in 3D point cloud segmentation. The transfer learning strategy proves to be crucial in overcoming data scarcity issues, providing a robust solution for urban point cloud analysis. The findings contribute to the advancement of point cloud segmentation methods, especially in the context of rapidly evolving Chinese urban areas.  ( 2 min )
    Shaping Political Discourse using multi-source News Summarization. (arXiv:2312.11703v1 [cs.CL])
    Multi-document summarization is the process of automatically generating a concise summary of multiple documents related to the same topic. This summary can help users quickly understand the key information from a large collection of documents. Multi-document summarization systems are more complex than single-document summarization systems due to the need to identify and combine information from multiple sources. In this paper, we have developed a machine learning model that generates a concise summary of a topic from multiple news documents. The model is designed to be unbiased by sampling its input equally from all the different aspects of the topic, even if the majority of the news sources lean one way.  ( 2 min )
    ComplexityNet: Increasing LLM Inference Efficiency by Learning Task Complexity. (arXiv:2312.11511v1 [cs.CL])
    We present ComplexityNet, a streamlined language model designed for assessing task complexity. This model predicts the likelihood of accurate output by various language models, each with different capabilities. Our initial application of ComplexityNet involves the Mostly Basic Python Problems (MBPP) dataset. We pioneered the creation of the first set of labels to define task complexity. ComplexityNet achieved a notable 79% accuracy in determining task complexity, a significant improvement over the 34% accuracy of the original, non fine-tuned model. Furthermore, ComplexityNet effectively reduces computational resource usage by 90% compared to using the highest complexity model, while maintaining a high code generation accuracy of 86.7%. This study demonstrates that fine-tuning smaller models to categorize tasks based on their complexity can lead to a more balanced trade-off between accuracy and efficiency in the use of Large Language Models. Our findings suggest a promising direction for optimizing LLM applications, especially in resource-constrained environments.  ( 2 min )
    Protect Your Score: Contact Tracing With Differential Privacy Guarantees. (arXiv:2312.11581v1 [cs.CR])
    The pandemic in 2020 and 2021 had enormous economic and societal consequences, and studies show that contact tracing algorithms can be key in the early containment of the virus. While large strides have been made towards more effective contact tracing algorithms, we argue that privacy concerns currently hold deployment back. The essence of a contact tracing algorithm constitutes the communication of a risk score. Yet, it is precisely the communication and release of this score to a user that an adversary can leverage to gauge the private health status of an individual. We pinpoint a realistic attack scenario and propose a contact tracing algorithm with differential privacy guarantees against this attack. The algorithm is tested on the two most widely used agent-based COVID19 simulators and demonstrates superior performance in a wide range of settings. Especially for realistic test scenarios and while releasing each risk score with epsilon=1 differential privacy, we achieve a two to ten-fold reduction in the infection rate of the virus. To the best of our knowledge, this presents the first contact tracing algorithm with differential privacy guarantees when revealing risk scores for COVID19.  ( 3 min )
    A Unified Pre-training and Adaptation Framework for Combinatorial Optimization on Graphs. (arXiv:2312.11547v1 [cs.AI])
    Combinatorial optimization (CO) on graphs is a classic topic that has been extensively studied across many scientific and industrial fields. Recently, solving CO problems on graphs through learning methods has attracted great attention. Advanced deep learning methods, e.g., graph neural networks (GNNs), have been used to effectively assist the process of solving COs. However, current frameworks based on GNNs are mainly designed for certain CO problems, thereby failing to consider their transferable and generalizable abilities among different COs on graphs. Moreover, simply using original graphs to model COs only captures the direct correlations among objects, which does not consider the mathematical logicality and properties of COs. In this paper, we propose a unified pre-training and adaptation framework for COs on graphs with the help of the maximum satisfiability (Max-SAT) problem. We first use Max-SAT to bridge different COs on graphs since they can be converted to Max-SAT problems represented by standard formulas and clauses with logical information. Then, we further design a pre-training and domain adaptation framework to extract the transferable and generalizable features so that different COs can benefit from them. In the pre-training stage, Max-SAT instances are generated to initialize the parameters of the model. In the fine-tuning stage, instances from CO and Max-SAT problems are used for adaptation so that the transferable ability can be further improved. Numerical experiments on several datasets show that features extracted by our framework exhibit superior transferability and Max-SAT can boost the ability to solve COs on graphs.  ( 3 min )
    Topo-MLP : A Simplicial Network Without Message Passing. (arXiv:2312.11862v1 [cs.LG])
    Due to their ability to model meaningful higher order relations among a set of entities, higher order network models have emerged recently as a powerful alternative for graph-based network models which are only capable of modeling binary relationships. Message passing paradigm is still dominantly used to learn representations even for higher order network models. While powerful, message passing can have disadvantages during inference, particularly when the higher order connectivity information is missing or corrupted. To overcome such limitations, we propose Topo-MLP, a purely MLP-based simplicial neural network algorithm to learn the representation of elements in a simplicial complex without explicitly relying on message passing. Our framework utilizes a novel Higher Order Neighborhood Contrastive (HONC) loss which implicitly incorporates the simplicial structure into representation learning. Our proposed model's simplicity makes it faster during inference. Moreover, we show that our model is robust when faced with missing or corrupted connectivity structure.  ( 2 min )
    Stronger Graph Transformer with Regularized Attention Scores. (arXiv:2312.11730v1 [cs.LG])
    Graph Neural Networks are notorious for its memory consumption. A recent Transformer based GNN called Graph Transformer are shown to obtain superior performances when long range dependencies exist. However, combining graph data and Transformer architecture led to a combinationally worse memory issue. We propose a novel version of "edge regularization technique" that alleviates the need for Positional Encoding and ultimately alleviate GT's out of memory issue. We observe that it is not clear whether having an edge regularization on top of positional encoding is helpful. However, it seems evident when no positional encoding is applied, edge regularization technique indeed stably improves GT's performance.  ( 2 min )
    Learning a Diffusion Model Policy from Rewards via Q-Score Matching. (arXiv:2312.11752v1 [cs.LG])
    Diffusion models have become a popular choice for representing actor policies in behavior cloning and offline reinforcement learning. This is due to their natural ability to optimize an expressive class of distributions over a continuous space. However, previous works fail to exploit the score-based structure of diffusion models, and instead utilize a simple behavior cloning term to train the actor, limiting their ability in the actor-critic setting. In this paper, we focus on off-policy reinforcement learning and propose a new method for learning a diffusion model policy that exploits the linked structure between the score of the policy and the action gradient of the Q-function. We denote this method Q-score matching and provide theoretical justification for this approach. We conduct experiments in simulated environments to demonstrate the effectiveness of our proposed method and compare to popular baselines.  ( 2 min )
    Faster Convergence with Multiway Preferences. (arXiv:2312.11788v1 [cs.LG])
    We address the problem of convex optimization with preference feedback, where the goal is to minimize a convex function given a weaker form of comparison queries. Each query consists of two points and the dueling feedback returns a (noisy) single-bit binary comparison of the function values of the two queried points. Here we consider the sign-function-based comparison feedback model and analyze the convergence rates with batched and multiway (argmin of a set queried points) comparisons. Our main goal is to understand the improved convergence rates owing to parallelization in sign-feedback-based optimization problems. Our work is the first to study the problem of convex optimization with multiway preferences and analyze the optimal convergence rates. Our first contribution lies in designing efficient algorithms with a convergence rate of $\smash{\widetilde O}(\frac{d}{\min\{m,d\} \epsilon})$ for $m$-batched preference feedback where the learner can query $m$-pairs in parallel. We next study a $m$-multiway comparison (`battling') feedback, where the learner can get to see the argmin feedback of $m$-subset of queried points and show a convergence rate of $\smash{\widetilde O}(\frac{d}{ \min\{\log m,d\}\epsilon })$. We show further improved convergence rates with an additional assumption of strong convexity. Finally, we also study the convergence lower bounds for batched preferences and multiway feedback optimization showing the optimality of our convergence rates w.r.t. $m$.  ( 2 min )
    Twitter Permeability to financial events: an experiment towards a model for sensing irregularities. (arXiv:2312.11530v1 [q-fin.ST])
    There is a general consensus of the good sensing and novelty characteristics of Twitter as an information media for the complex financial market. This paper investigates the permeability of Twittersphere, the total universe of Twitter users and their habits, towards relevant events in the financial market. Analysis shows that a general purpose social media is permeable to financial-specific events and establishes Twitter as a relevant feeder for taking decisions regarding the financial market and event fraudulent activities in that market. However, the provenance of contributions, their different levels of credibility and quality and even the purpose or intention behind them should to be considered and carefully contemplated if Twitter is used as a single source for decision taking. With the overall aim of this research, to deploy an architecture for real-time monitoring of irregularities in the financial market, this paper conducts a series of experiments on the level of permeability and the permeable features of Twitter in the event of one of these irregularities. To be precise, Twitter data is collected concerning an event comprising of a specific financial action on the 27th January 2017:{~ }the announcement about the merge of two companies Tesco PLC and Booker Group PLC, listed in the main market of the London Stock Exchange (LSE), to create the UK's Leading Food Business. The experiment attempts to answer five key research questions which aim to characterize the features of Twitter permeability to the financial market. The experimental results confirm that a far-impacting financial event, such as the merger considered, caused apparent disturbances in all the features considered, that is, information volume, content and sentiment as well as geographical provenance. Analysis shows that despite, Twitter not being a specific financial forum, it is permeable to financial events.  ( 3 min )
    Estimation of individual causal effects in network setup for multiple treatments. (arXiv:2312.11573v1 [cs.LG])
    We study the problem of estimation of Individual Treatment Effects (ITE) in the context of multiple treatments and networked observational data. Leveraging the network information, we aim to utilize hidden confounders that may not be directly accessible in the observed data, thereby enhancing the practical applicability of the strong ignorability assumption. To achieve this, we first employ Graph Convolutional Networks (GCN) to learn a shared representation of the confounders. Then, our approach utilizes separate neural networks to infer potential outcomes for each treatment. We design a loss function as a weighted combination of two components: representation loss and Mean Squared Error (MSE) loss on the factual outcomes. To measure the representation loss, we extend existing metrics such as Wasserstein and Maximum Mean Discrepancy (MMD) from the binary treatment setting to the multiple treatments scenario. To validate the effectiveness of our proposed methodology, we conduct a series of experiments on the benchmark datasets such as BlogCatalog and Flickr. The experimental results consistently demonstrate the superior performance of our models when compared to baseline methods.  ( 2 min )
    The irruption of cryptocurrencies into Twitter cashtags: a classifying solution. (arXiv:2312.11531v1 [q-fin.ST])
    There is a consensus about the good sensing characteristics of Twitter to mine and uncover knowledge in financial markets, being considered a relevant feeder for taking decisions about buying or holding stock shares and even for detecting stock manipulation. Although Twitter hashtags allow to aggregate topic-related content, a specific mechanism for financial information also exists: Cashtag. However, the irruption of cryptocurrencies has resulted in a significant degradation on the cashtag-based aggregation of posts. Unfortunately, Twitter' users may use homonym tickers to refer to cryptocurrencies and to companies in stock markets, which means that filtering by cashtag may result on both posts referring to stock companies and cryptocurrencies. This research proposes automated classifiers to distinguish conflicting cashtags and, so, their container tweets by analyzing the distinctive features of tweets referring to stock companies and cryptocurrencies. As experiment, this paper analyses the interference between cryptocurrencies and company tickers in the London Stock Exchange (LSE), specifically, companies in the main and alternative market indices FTSE-100 and AIM-100. Heuristic-based as well as supervised classifiers are proposed and their advantages and drawbacks, including their ability to self-adapt to Twitter usage changes, are discussed. The experiment confirms a significant distortion in collected data when colliding or homonym cashtags exist, i.e., the same \$ acronym to refer to company tickers and cryptocurrencies. According to our results, the distinctive features of posts including cryptocurrencies or company tickers support accurate classification of colliding tweets (homonym cashtags) and Independent Models, as the most detached classifiers from training data, have the potential to be trans-applicability (in different stock markets) while retaining performance.  ( 3 min )
    Eliciting Kemeny Rankings. (arXiv:2312.11663v1 [cs.LG])
    We formulate the problem of eliciting agents' preferences with the goal of finding a Kemeny ranking as a Dueling Bandits problem. Here the bandits' arms correspond to alternatives that need to be ranked and the feedback corresponds to a pairwise comparison between alternatives by a randomly sampled agent. We consider both sampling with and without replacement, i.e., the possibility to ask the same agent about some comparison multiple times or not. We find approximation bounds for Kemeny rankings dependant on confidence intervals over estimated winning probabilities of arms. Based on these we state algorithms to find Probably Approximately Correct (PAC) solutions and elaborate on their sample complexity for sampling with or without replacement. Furthermore, if all agents' preferences are strict rankings over the alternatives, we provide means to prune confidence intervals and thereby guide a more efficient elicitation. We formulate several adaptive sampling methods that use look-aheads to estimate how much confidence intervals (and thus approximation guarantees) might be tightened. All described methods are compared on synthetic data.  ( 2 min )
    Model Stealing Attack against Recommender System. (arXiv:2312.11571v1 [cs.CR])
    Recent studies have demonstrated the vulnerability of recommender systems to data privacy attacks. However, research on the threat to model privacy in recommender systems, such as model stealing attacks, is still in its infancy. Some adversarial attacks have achieved model stealing attacks against recommender systems, to some extent, by collecting abundant training data of the target model (target data) or making a mass of queries. In this paper, we constrain the volume of available target data and queries and utilize auxiliary data, which shares the item set with the target data, to promote model stealing attacks. Although the target model treats target and auxiliary data differently, their similar behavior patterns allow them to be fused using an attention mechanism to assist attacks. Besides, we design stealing functions to effectively extract the recommendation list obtained by querying the target model. Experimental results show that the proposed methods are applicable to most recommender systems and various scenarios and exhibit excellent attack performance on multiple datasets.  ( 2 min )
    Probabilistic Offline Policy Ranking with Approximate Bayesian Computation. (arXiv:2312.11551v1 [cs.LG])
    In practice, it is essential to compare and rank candidate policies offline before real-world deployment for safety and reliability. Prior work seeks to solve this offline policy ranking (OPR) problem through value-based methods, such as Off-policy evaluation (OPE). However, they fail to analyze special cases performance (e.g., worst or best cases), due to the lack of holistic characterization of policies performance. It is even more difficult to estimate precise policy values when the reward is not fully accessible under sparse settings. In this paper, we present Probabilistic Offline Policy Ranking (POPR), a framework to address OPR problems by leveraging expert data to characterize the probability of a candidate policy behaving like experts, and approximating its entire performance posterior distribution to help with ranking. POPR does not rely on value estimation, and the derived performance posterior can be used to distinguish candidates in worst, best, and average-cases. To estimate the posterior, we propose POPR-EABC, an Energy-based Approximate Bayesian Computation (ABC) method conducting likelihood-free inference. POPR-EABC reduces the heuristic nature of ABC by a smooth energy function, and improves the sampling efficiency by a pseudo-likelihood. We empirically demonstrate that POPR-EABC is adequate for evaluating policies in both discrete and continuous action spaces across various experiment environments, and facilitates probabilistic comparisons of candidate policies before deployment.  ( 2 min )
    Synthetic Shifts to Initial Seed Vector Exposes the Brittle Nature of Latent-Based Diffusion Models. (arXiv:2312.11473v1 [cs.CV])
    Recent advances in Conditional Diffusion Models have led to substantial capabilities in various domains. However, understanding the impact of variations in the initial seed vector remains an underexplored area of concern. Particularly, latent-based diffusion models display inconsistencies in image generation under standard conditions when initialized with suboptimal initial seed vectors. To understand the impact of the initial seed vector on generated samples, we propose a reliability evaluation framework that evaluates the generated samples of a diffusion model when the initial seed vector is subjected to various synthetic shifts. Our results indicate that slight manipulations to the initial seed vector of the state-of-the-art Stable Diffusion (Rombach et al., 2022) can lead to significant disturbances in the generated samples, consequently creating images without the effect of conditioning variables. In contrast, GLIDE (Nichol et al., 2022) stands out in generating reliable samples even when the initial seed vector is transformed. Thus, our study sheds light on the importance of the selection and the impact of the initial seed vector in the latent-based diffusion model.  ( 2 min )
    The geometry of flow: Advancing predictions of river geometry with multi-model machine learning. (arXiv:2312.11476v1 [physics.geo-ph])
    Hydraulic geometry parameters describing river hydrogeomorphic is important for flood forecasting. Although well-established, power-law hydraulic geometry curves have been widely used to understand riverine systems and mapping flooding inundation worldwide for the past 70 years, we have become increasingly aware of the limitations of these approaches. In the present study, we have moved beyond these traditional power-law relationships for river geometry, testing the ability of machine-learning models to provide improved predictions of river width and depth. For this work, we have used an unprecedentedly large river measurement dataset (HYDRoSWOT) as well as a suite of watershed predictor data to develop novel data-driven approaches to better estimate river geometries over the contiguous United States (CONUS). Our Random Forest, XGBoost, and neural network models out-performed the traditional, regionalized power law-based hydraulic geometry equations for both width and depth, providing R-squared values of as high as 0.75 for width and as high as 0.67 for depth, compared with R-squared values of 0.57 for width and 0.18 for depth from the regional hydraulic geometry equations. Our results also show diverse performance outcomes across stream orders and geographical regions for the different machine-learning models, demonstrating the value of using multi-model approaches to maximize the predictability of river geometry. The developed models have been used to create the newly publicly available STREAM-geo dataset, which provides river width, depth, width/depth ratio, and river and stream surface area (%RSSA) for nearly 2.7 million NHDPlus stream reaches across the rivers and streams across the contiguous US.  ( 3 min )
    LLM in a flash: Efficient Large Language Model Inference with Limited Memory. (arXiv:2312.11514v1 [cs.CL])
    Large language models (LLMs) are central to modern natural language processing, delivering exceptional performance in various tasks. However, their intensive computational and memory requirements present challenges, especially for devices with limited DRAM capacity. This paper tackles the challenge of efficiently running LLMs that exceed the available DRAM capacity by storing the model parameters on flash memory but bringing them on demand to DRAM. Our method involves constructing an inference cost model that harmonizes with the flash memory behavior, guiding us to optimize in two critical areas: reducing the volume of data transferred from flash and reading data in larger, more contiguous chunks. Within this flash memory-informed framework, we introduce two principal techniques. First, "windowing'" strategically reduces data transfer by reusing previously activated neurons, and second, "row-column bundling", tailored to the sequential data access strengths of flash memory, increases the size of data chunks read from flash memory. These methods collectively enable running models up to twice the size of the available DRAM, with a 4-5x and 20-25x increase in inference speed compared to naive loading approaches in CPU and GPU, respectively. Our integration of sparsity awareness, context-adaptive loading, and a hardware-oriented design paves the way for effective inference of LLMs on devices with limited memory.  ( 2 min )
    Labrador: Exploring the Limits of Masked Language Modeling for Laboratory Data. (arXiv:2312.11502v1 [cs.CL])
    In this work we introduce Labrador, a pre-trained Transformer model for laboratory data. Labrador and BERT were pre-trained on a corpus of 100 million lab test results from electronic health records (EHRs) and evaluated on various downstream outcome prediction tasks. Both models demonstrate mastery of the pre-training task but neither consistently outperform XGBoost on downstream supervised tasks. Our ablation studies reveal that transfer learning shows limited effectiveness for BERT and achieves marginal success with Labrador. We explore the reasons for the failure of transfer learning and suggest that the data generating process underlying each patient cannot be characterized sufficiently using labs alone, among other factors. We encourage future work to focus on joint modeling of multiple EHR data categories and to include tree-based baselines in their evaluations.  ( 2 min )
    Preference and Concurrence Aware Bayesian Graph Neural Networks for Recommender Systems. (arXiv:2312.11486v1 [cs.IR])
    Graph-based collaborative filtering methods have prevailing performance for recommender systems since they can capture high-order information between users and items, in which the graphs are constructed from the observed user-item interactions that might miss links or contain spurious positive interactions in industrial scenarios. The Bayesian Graph Neural Network framework approaches this issue with generative models for the interaction graphs. The critical problem is to devise a proper family of graph generative models tailored to recommender systems. We propose an efficient generative model that jointly considers the preferences of users, the concurrence of items and some important graph structure information. Experiments on four popular benchmark datasets demonstrate the effectiveness of our proposed graph generative methods for recommender systems.  ( 2 min )
    Extracting Interpretable Local and Global Representations from Attention on Time Series. (arXiv:2312.11466v1 [cs.LG])
    This paper targets two transformer attention based interpretability methods working with local abstraction and global representation, in the context of time series data. We distinguish local and global contexts, and provide a comprehensive framework for both general interpretation options. We discuss their specific instantiation via different methods in detail, also outlining their respective computational implementation and abstraction variants. Furthermore, we provide extensive experimentation demonstrating the efficacy of the presented approaches. In particular, we perform our experiments using a selection of univariate datasets from the UCR UEA time series repository where we both assess the performance of the proposed approaches, as well as their impact on explainability and interpretability/complexity. Here, with an extensive analysis of hyperparameters, the presented approaches demonstrate an significant improvement in interpretability/complexity, while capturing many core decisions of and maintaining a similar performance to the baseline model. Finally, we draw general conclusions outlining and guiding the application of the presented methods.  ( 2 min )
    Investigating the Impact of Weight Sharing Decisions on Knowledge Transfer in Continual Learning. (arXiv:2311.09506v3 [cs.LG] UPDATED)
    Continual Learning (CL) has generated attention as a method of avoiding Catastrophic Forgetting (CF) in the sequential training of neural networks, improving network efficiency and adaptability to different tasks. Additionally, CL serves as an ideal setting for studying network behavior and Forward Knowledge Transfer (FKT) between tasks. Pruning methods for CL train subnetworks to handle the sequential tasks which allows us to take a structured approach to investigating FKT. Sharing prior subnetworks' weights leverages past knowledge for the current task through FKT. Understanding which weights to share is important as sharing all weights can yield sub-optimal accuracy. This paper investigates how different sharing decisions affect the FKT between tasks. Through this lens we demonstrate how task complexity and similarity influence the optimal weight sharing decisions, giving insights into the relationships between tasks and helping inform decision making in similar CL methods. We implement three sequential datasets designed to emphasize variation in task complexity and similarity, reporting results for both ResNet-18 and VGG-16. By sharing in accordance with the decisions supported by our findings, we show that we can improve task accuracy compared to other sharing decisions.  ( 3 min )
    Drift Control of High-Dimensional RBM: A Computational Method Based on Neural Networks. (arXiv:2309.11651v2 [eess.SY] UPDATED)
    Motivated by applications in queueing theory, we consider a stochastic control problem whose state space is the $d$-dimensional positive orthant. The controlled process $Z$ evolves as a reflected Brownian motion whose covariance matrix is exogenously specified, as are its directions of reflection from the orthant's boundary surfaces. A system manager chooses a drift vector $\theta(t)$ at each time $t$ based on the history of $Z$, and the cost rate at time $t$ depends on both $Z(t)$ and $\theta(t)$. In our initial problem formulation, the objective is to minimize expected discounted cost over an infinite planning horizon, after which we treat the corresponding ergodic control problem. Extending earlier work by Han et al. (Proceedings of the National Academy of Sciences, 2018, 8505-8510), we develop and illustrate a simulation-based computational method that relies heavily on deep neural network technology. For test problems studied thus far, our method is accurate to within a fraction of one percent, and is computationally feasible in dimensions up to at least $d=30$.  ( 2 min )
    Physics-Informed Neural Network Lyapunov Functions: PDE Characterization, Learning, and Verification. (arXiv:2312.09131v2 [math.OC] UPDATED)
    We provide a systematic investigation of using physics-informed neural networks to compute Lyapunov functions. We encode Lyapunov conditions as a partial differential equation (PDE) and use this for training neural network Lyapunov functions. We analyze the analytical properties of the solutions to the Lyapunov and Zubov PDEs. In particular, we show that employing the Zubov equation in training neural Lyapunov functions can lead to approximate regions of attraction close to the true domain of attraction. We also examine approximation errors and the convergence of neural approximations to the unique solution of Zubov's equation. We then provide sufficient conditions for the learned neural Lyapunov functions that can be readily verified by satisfiability modulo theories (SMT) solvers, enabling formal verification of both local stability analysis and region-of-attraction estimates in the large. Through a number of nonlinear examples, ranging from low to high dimensions, we demonstrate that the proposed framework can outperform traditional sums-of-squares (SOS) Lyapunov functions obtained using semidefinite programming (SDP).  ( 2 min )
    Physics-informed State-space Neural Networks for Transport Phenomena. (arXiv:2309.12211v2 [cs.LG] UPDATED)
    This work introduces Physics-informed State-space neural network Models (PSMs), a novel solution to achieving real-time optimization, flexibility, and fault tolerance in autonomous systems, particularly in transport-dominated systems such as chemical, biomedical, and power plants. Traditional data-driven methods fall short due to a lack of physical constraints like mass conservation; PSMs address this issue by training deep neural networks with sensor data and physics-informing using components' Partial Differential Equations (PDEs), resulting in a physics-constrained, end-to-end differentiable forward dynamics model. Through two in silico experiments -- a heated channel and a cooling system loop -- we demonstrate that PSMs offer a more accurate approach than a purely data-driven model. In the former experiment, PSMs demonstrated significantly lower average root-mean-square errors across test datasets compared to a purely data-driven neural network, with reductions of 44 %, 48 %, and 94 % in predicting pressure, velocity, and temperature, respectively. Beyond accuracy, PSMs demonstrate a compelling multitask capability, making them highly versatile. In this work, we showcase two: supervisory control of a nonlinear system through a sequentially updated state-space representation and the proposal of a diagnostic algorithm using residuals from each of the PDEs. The former demonstrates PSMs' ability to handle constant and time-dependent constraints, while the latter illustrates their value in system diagnostics and fault detection. We further posit that PSMs could serve as a foundation for Digital Twins, constantly updated digital representations of physical systems.  ( 3 min )
    One for All: Towards Training One Graph Model for All Classification Tasks. (arXiv:2310.00149v2 [cs.LG] UPDATED)
    Designing a single model to address multiple tasks has been a long-standing objective in artificial intelligence. Recently, large language models have demonstrated exceptional capability in solving different tasks within the language domain. However, a unified model for various graph tasks remains underexplored, primarily due to the challenges unique to the graph learning domain. First, graph data from different areas carry distinct attributes and follow different distributions. Such discrepancy makes it hard to represent graphs in a single representation space. Second, tasks on graphs diversify into node, link, and graph tasks, requiring distinct embedding strategies. Finally, an appropriate graph prompting paradigm for in-context learning is unclear. We propose \textbf{One for All (OFA)}, the first general framework that can use a single graph model to address the above challenges. Specifically, OFA proposes text-attributed graphs to unify different graph data by describing nodes and edges with natural language and uses language models to encode the diverse and possibly cross-domain text attributes to feature vectors in the same embedding space. Furthermore, OFA introduces the concept of nodes-of-interest to standardize different tasks with a single task representation. For in-context learning on graphs, OFA introduces a novel graph prompting paradigm that appends prompting substructures to the input graph, which enables it to address varied tasks without fine-tuning. We train the OFA model using graph data from multiple domains (including citation networks, molecular graphs, knowledge graphs, etc.) simultaneously and evaluate its ability in supervised, few-shot, and zero-shot learning scenarios. OFA performs well across different tasks, making it the first general-purpose across-domains classification model on graphs.  ( 3 min )
    Efficient Parallelization of a Ubiquitous Sequential Computation. (arXiv:2311.06281v3 [cs.DS] UPDATED)
    We find a succinct expression for computing the sequence $x_t = a_t x_{t-1} + b_t$ in parallel with two prefix sums, given $t = (1, 2, \dots, n)$, $a_t \in \mathbb{R}^n$, $b_t \in \mathbb{R}^n$, and initial value $x_0 \in \mathbb{R}$. On $n$ parallel processors, the computation of $n$ elements incurs $\mathcal{O}(\log n)$ time and $\mathcal{O}(n)$ space. Sequences of this form are ubiquitous in science and engineering, making efficient parallelization useful for a vast number of applications. We implement our expression in software, test it on parallel hardware, and verify that it executes faster than sequential computation by a factor of $\frac{n}{\log n}$.  ( 2 min )
    On the Efficacy of Differentially Private Few-shot Image Classification. (arXiv:2302.01190v3 [stat.ML] UPDATED)
    There has been significant recent progress in training differentially private (DP) models which achieve accuracy that approaches the best non-private models. These DP models are typically pretrained on large public datasets and then fine-tuned on private downstream datasets that are relatively large and similar in distribution to the pretraining data. However, in many applications including personalization and federated learning, it is crucial to perform well (i) in the few-shot setting, as obtaining large amounts of labeled data may be problematic; and (ii) on datasets from a wide variety of domains for use in various specialist settings. To understand under which conditions few-shot DP can be effective, we perform an exhaustive set of experiments that reveals how the accuracy and vulnerability to attack of few-shot DP image classification models are affected as the number of shots per class, privacy level, model architecture, downstream dataset, and subset of learnable parameters in the model vary. We show that to achieve DP accuracy on par with non-private models, the shots per class must be increased as the privacy level increases. We also show that learning parameter-efficient FiLM adapters under DP is competitive with learning just the final classifier layer or learning all of the network parameters. Finally, we evaluate DP federated learning systems and establish state-of-the-art performance on the challenging FLAIR benchmark.  ( 3 min )
    Modelling and characterization of fine Particulate Matter dynamics in Bujumbura using low cost sensors. (arXiv:2312.12003v1 [stat.ML])
    Air pollution is a result of multiple sources including both natural and anthropogenic activities. The rapid urbanization of the cities such as Bujumbura economic capital of Burundi, is one of these factors. The very first characterization of the spatio-temporal variability of PM2.5 in Bujumbura and the forecasting of PM2.5 concentration have been conducted in this paper using data collected during a year, from august 2022 to august 2023, by low cost sensors installed in Bujumbura city. For each commune, an hourly, daily and seasonal analysis were carried out and the results showed that the mass concentrations of PM2.5 in the three municipalities differ from one commune to another. The average hourly and annual PM2.5 concentrations exceed the World Health Organization standards. The range is between 28.3 and 35.0 microgram/m3 . In order to make prediction of PM2.5 concentration, an investigation of RNN with Long Short Term Memory (LSTM) has been undertaken.  ( 2 min )
    AI without networks. (arXiv:2106.03354v2 [cs.LG] UPDATED)
    Contemporary Artificial Intelligence (AI) stands on two legs: large training data corpora and many-parameter artificial neural networks (ANNs). The data corpora are needed to represent the complexity and heterogeneity of the world. The role of the networks is less transparent due to the obscure dependence of the network parameters and outputs on the training data and inputs. This raises problems, ranging from technical-scientific to legal-ethical. We hypothesize that a transparent approach to machine learning is possible without using networks at all. By generalizing a parameter-free, statistically consistent data interpolation method, which we analyze theoretically in detail, we develop a framework for generative modeling. Given the growing usage of machine learning techniques in science, we demonstrate this framework with an example from the field of animal behavior. We applied this generative Hilbert framework to the trajectories of small groups of swimming fish. The framework outperforms previously developed state-of-the-art traditional mathematical behavioral models and contemporary ANN-based models in reproducing naturalistic behaviors. We do not suggest that the proposed framework will outperform networks in all applications, as over-parameterized networks can interpolate. However, our framework is theoretically sound, transparent, deterministic and parameter free: it does not require any compute-expensive training, does not involve optimization, has no model selection, and is easily reproduced and ported. We also propose an easily computed method of credit assignment based on this framework that could help address ethical-legal challenges raised by generative AI.  ( 3 min )
    Extension of the Dip-test Repertoire -- Efficient and Differentiable p-value Calculation for Clustering. (arXiv:2312.12050v1 [cs.LG])
    Over the last decade, the Dip-test of unimodality has gained increasing interest in the data mining community as it is a parameter-free statistical test that reliably rates the modality in one-dimensional samples. It returns a so called Dip-value and a corresponding probability for the sample's unimodality (Dip-p-value). These two values share a sigmoidal relationship. However, the specific transformation is dependent on the sample size. Many Dip-based clustering algorithms use bootstrapped look-up tables translating Dip- to Dip-p-values for a certain limited amount of sample sizes. We propose a specifically designed sigmoid function as a substitute for these state-of-the-art look-up tables. This accelerates computation and provides an approximation of the Dip- to Dip-p-value transformation for every single sample size. Further, it is differentiable and can therefore easily be integrated in learning schemes using gradient descent. We showcase this by exploiting our function in a novel subspace clustering algorithm called Dip'n'Sub. We highlight in extensive experiments the various benefits of our proposal.  ( 3 min )
    Futures Quantitative Investment with Heterogeneous Continual Graph Neural Network. (arXiv:2303.16532v2 [cs.LG] UPDATED)
    This study aims to address the challenges of futures price prediction in high-frequency trading (HFT) by proposing a continuous learning factor predictor based on graph neural networks. The model integrates multi-factor pricing theories with real-time market dynamics, effectively bypassing the limitations of existing methods that lack financial theory guidance and ignore various trend signals and their interactions. We propose three heterogeneous tasks, including price moving average regression, price gap regression and change-point detection to trace the short-, intermediate-, and long-term trend factors present in the data. In addition, this study also considers the cross-sectional correlation characteristics of future contracts, where prices of different futures often show strong dynamic correlations. Each variable (future contract) depends not only on its historical values (temporal) but also on the observation of other variables (cross-sectional). To capture these dynamic relationships more accurately, we resort to the spatio-temporal graph neural network (STGNN) to enhance the predictive power of the model. The model employs a continuous learning strategy to simultaneously consider these tasks (factors). Additionally, due to the heterogeneity of the tasks, we propose to calculate parameter importance with mutual information between original observations and the extracted features to mitigate the catastrophic forgetting (CF) problem. Empirical tests on 49 commodity futures in China's futures market demonstrate that the proposed model outperforms other state-of-the-art models in terms of prediction accuracy. Not only does this research promote the integration of financial theory and deep learning, but it also provides a scientific basis for actual trading decisions.  ( 3 min )
    Probabilistic Exponential Integrators. (arXiv:2305.14978v2 [math.NA] UPDATED)
    Probabilistic solvers provide a flexible and efficient framework for simulation, uncertainty quantification, and inference in dynamical systems. However, like standard solvers, they suffer performance penalties for certain stiff systems, where small steps are required not for reasons of numerical accuracy but for the sake of stability. This issue is greatly alleviated in semi-linear problems by the probabilistic exponential integrators developed in this paper. By including the fast, linear dynamics in the prior, we arrive at a class of probabilistic integrators with favorable properties. Namely, they are proven to be L-stable, and in a certain case reduce to a classic exponential integrator -- with the added benefit of providing a probabilistic account of the numerical error. The method is also generalized to arbitrary non-linear systems by imposing piece-wise semi-linearity on the prior via Jacobians of the vector field at the previous estimates, resulting in probabilistic exponential Rosenbrock methods. We evaluate the proposed methods on multiple stiff differential equations and demonstrate their improved stability and efficiency over established probabilistic solvers. The present contribution thus expands the range of problems that can be effectively tackled within probabilistic numerics.  ( 2 min )
    Supervision Interpolation via LossMix: Generalizing Mixup for Object Detection and Beyond. (arXiv:2303.10343v2 [cs.CV] UPDATED)
    The success of data mixing augmentations in image classification tasks has been well-received. However, these techniques cannot be readily applied to object detection due to challenges such as spatial misalignment, foreground/background distinction, and plurality of instances. To tackle these issues, we first introduce a novel conceptual framework called Supervision Interpolation (SI), which offers a fresh perspective on interpolation-based augmentations by relaxing and generalizing Mixup. Based on SI, we propose LossMix, a simple yet versatile and effective regularization that enhances the performance and robustness of object detectors and more. Our key insight is that we can effectively regularize the training on mixed data by interpolating their loss errors instead of ground truth labels. Empirical results on the PASCAL VOC and MS COCO datasets demonstrate that LossMix can consistently outperform state-of-the-art methods widely adopted for detection. Furthermore, by jointly leveraging LossMix with unsupervised domain adaptation, we successfully improve existing approaches and set a new state of the art for cross-domain object detection.  ( 2 min )
    Emergence Learning: A Rising Direction from Emergent Abilities and a Monosemanticity-Based Study. (arXiv:2312.11560v1 [cs.LG])
    In the past 20 years, artificial neural networks have become dominant in various areas, continually growing in scale. However, the current analysis of large models has mainly focused on functionality, overlooking the influence of scale differences on their properties. To address this, we propose the concept of Emergence Learning, which emphasizes the significance of scale. By studying models of different scales, we have identified a key factor in achieving higher performance in large models: the decrease of monosemantic neurons. Building on this insight, we propose a proactive approach to inhibit monosemanticity for improved performance. Our solution involves a two-phase process that includes monosemantic neuron detection and inhibition, supported by theoretical analysis. Experimental results on various tasks and neural networks demonstrate the effectiveness of our proposed method. Following the idea of Emergence Learning, though drawing inspiration from scaling phenomena, the applicability of our method is not restricted to large scale alone. Therefore, the experiment is self-contained. However, extending this research to very large-scale datasets is appealing yet impossible for research departments due to limited resources. We are delighted to share the first co-authorship and eagerly await collaboration from any AI company before submission.  ( 2 min )
    Unlocking Musculoskeletal Disorder Risk Factors: NLP-Based Classification and Mode-Based Ranking. (arXiv:2312.11517v1 [cs.CL])
    This research delves into the intricate landscape of Musculoskeletal Disorder (MSD) risk factors, employing a novel fusion of Natural Language Processing (NLP) techniques and mode-based ranking methodologies. The primary objective is to advance the comprehension of MSD risk factors, their classification, and their relative severity, facilitating more targeted preventive and management interventions. The study utilizes eight diverse models, integrating pre-trained transformers, cosine similarity, and various distance metrics to classify risk factors into personal, biomechanical, workplace, psychological, and organizational classes. Key findings reveal that the BERT model with cosine similarity attains an overall accuracy of 28\%, while the sentence transformer, coupled with Euclidean, Bray-Curtis, and Minkowski distances, achieves a flawless accuracy score of 100\%. In tandem with the classification efforts, the research employs a mode-based ranking approach on survey data to discern the severity hierarchy of MSD risk factors. Intriguingly, the rankings align precisely with the previous literature, reaffirming the consistency and reliability of the approach. ``Working posture" emerges as the most severe risk factor, emphasizing the critical role of proper posture in preventing MSDs. The collective perceptions of survey participants underscore the significance of factors like ``Job insecurity," ``Effort reward imbalance," and ``Poor employee facility" in contributing to MSD risks. The convergence of rankings provides actionable insights for organizations aiming to reduce the prevalence of MSDs. The study concludes with implications for targeted interventions, recommendations for improving workplace conditions, and avenues for future research.  ( 2 min )
    Efficient Conditionally Invariant Representation Learning. (arXiv:2212.08645v2 [cs.LG] UPDATED)
    We introduce the Conditional Independence Regression CovariancE (CIRCE), a measure of conditional independence for multivariate continuous-valued variables. CIRCE applies as a regularizer in settings where we wish to learn neural features $\varphi(X)$ of data $X$ to estimate a target $Y$, while being conditionally independent of a distractor $Z$ given $Y$. Both $Z$ and $Y$ are assumed to be continuous-valued but relatively low dimensional, whereas $X$ and its features may be complex and high dimensional. Relevant settings include domain-invariant learning, fairness, and causal learning. The procedure requires just a single ridge regression from $Y$ to kernelized features of $Z$, which can be done in advance. It is then only necessary to enforce independence of $\varphi(X)$ from residuals of this regression, which is possible with attractive estimation properties and consistency guarantees. By contrast, earlier measures of conditional feature dependence require multiple regressions for each step of feature learning, resulting in more severe bias and variance, and greater computational cost. When sufficiently rich features are used, we establish that CIRCE is zero if and only if $\varphi(X) \perp \!\!\! \perp Z \mid Y$. In experiments, we show superior performance to previous methods on challenging benchmarks, including learning conditionally invariant image features.  ( 2 min )
    CausalVAE: Structured Causal Disentanglement in Variational Autoencoder. (arXiv:2004.08697v7 [cs.LG] UPDATED)
    Learning disentanglement aims at finding a low dimensional representation which consists of multiple explanatory and generative factors of the observational data. The framework of variational autoencoder (VAE) is commonly used to disentangle independent factors from observations. However, in real scenarios, factors with semantics are not necessarily independent. Instead, there might be an underlying causal structure which renders these factors dependent. We thus propose a new VAE based framework named CausalVAE, which includes a Causal Layer to transform independent exogenous factors into causal endogenous ones that correspond to causally related concepts in data. We further analyze the model identifiabitily, showing that the proposed model learned from observations recovers the true one up to a certain degree by providing supervision signals (e.g. feature labels). Experiments are conducted on various datasets, including synthetic and real word benchmark CelebA. Results show that the causal representations learned by CausalVAE are semantically interpretable, and their causal relationship as a Directed Acyclic Graph (DAG) is identified with good accuracy. Furthermore, we demonstrate that the proposed CausalVAE model is able to generate counterfactual data through "do-operation" to the causal factors.  ( 3 min )
    Maximum Reward Formulation In Reinforcement Learning. (arXiv:2010.03744v2 [cs.LG] UPDATED)
    Reinforcement learning (RL) algorithms typically deal with maximizing the expected cumulative return (discounted or undiscounted, finite or infinite horizon). However, several crucial applications in the real world, such as drug discovery, do not fit within this framework because an RL agent only needs to identify states (molecules) that achieve the highest reward within a trajectory and does not need to optimize for the expected cumulative return. In this work, we formulate an objective function to maximize the expected maximum reward along a trajectory, derive a novel functional form of the Bellman equation, introduce the corresponding Bellman operators, and provide a proof of convergence. Using this formulation, we achieve state-of-the-art results on the task of molecule generation that mimics a real-world drug discovery pipeline.  ( 2 min )
    When Model Meets New Normals: Test-time Adaptation for Unsupervised Time-series Anomaly Detection. (arXiv:2312.11976v1 [cs.LG])
    Time-series anomaly detection deals with the problem of detecting anomalous timesteps by learning normality from the sequence of observations. However, the concept of normality evolves over time, leading to a "new normal problem", where the distribution of normality can be changed due to the distribution shifts between training and test data. This paper highlights the prevalence of the new normal problem in unsupervised time-series anomaly detection studies. To tackle this issue, we propose a simple yet effective test-time adaptation strategy based on trend estimation and a self-supervised approach to learning new normalities during inference. Extensive experiments on real-world benchmarks demonstrate that incorporating the proposed strategy into the anomaly detector consistently improves the model's performance compared to the baselines, leading to robustness to the distribution shifts.  ( 2 min )
    A Case Study in CUDA Kernel Fusion: Implementing FlashAttention-2 on NVIDIA Hopper Architecture using the CUTLASS Library. (arXiv:2312.11918v1 [cs.LG])
    We provide an optimized implementation of the forward pass of FlashAttention-2, a popular memory-aware scaled dot-product attention algorithm, as a custom fused CUDA kernel targeting NVIDIA Hopper architecture and written using the open-source CUTLASS library. In doing so, we explain the challenges and techniques involved in fusing online-softmax with back-to-back GEMM kernels, utilizing the Hopper-specific Tensor Memory Accelerator (TMA) and Warpgroup Matrix-Multiply-Accumulate (WGMMA) instructions, defining and transforming CUTLASS Layouts and Tensors, overlapping copy and GEMM operations, and choosing optimal tile sizes for the Q, K and V attention matrices while balancing the register pressure and shared memory utilization. In head-to-head benchmarks on a single H100 PCIe GPU for some common choices of hyperparameters, we observe 20-50% higher FLOPs/s over a version of FlashAttention-2 optimized for last-generation NVIDIA Ampere architecture.  ( 2 min )
    SimCalib: Graph Neural Network Calibration based on Similarity between Nodes. (arXiv:2312.11858v1 [cs.LG])
    Graph neural networks (GNNs) have exhibited impressive performance in modeling graph data as exemplified in various applications. Recently, the GNN calibration problem has attracted increasing attention, especially in cost-sensitive scenarios. Previous work has gained empirical insights on the issue, and devised effective approaches for it, but theoretical supports still fall short. In this work, we shed light on the relationship between GNN calibration and nodewise similarity via theoretical analysis. A novel calibration framework, named SimCalib, is accordingly proposed to consider similarity between nodes at global and local levels. At the global level, the Mahalanobis distance between the current node and class prototypes is integrated to implicitly consider similarity between the current node and all nodes in the same class. At the local level, the similarity of node representation movement dynamics, quantified by nodewise homophily and relative degree, is considered. Informed about the application of nodewise movement patterns in analyzing nodewise behavior on the over-smoothing problem, we empirically present a possible relationship between over-smoothing and GNN calibration problem. Experimentally, we discover a correlation between nodewise similarity and model calibration improvement, in alignment with our theoretical results. Additionally, we conduct extensive experiments investigating different design factors and demonstrate the effectiveness of our proposed SimCalib framework for GNN calibration by achieving state-of-the-art performance on 14 out of 16 benchmarks.  ( 2 min )
    Classification of complex local environments in systems of particle shapes through shape-symmetry encoded data augmentation. (arXiv:2312.11822v1 [cond-mat.soft])
    Detecting and analyzing the local environment is crucial for investigating the dynamical processes of crystal nucleation and shape colloidal particle self-assembly. Recent developments in machine learning provide a promising avenue for better order parameters in complex systems that are challenging to study using traditional approaches. However, the application of machine learning to self-assembly on systems of particle shapes is still underexplored. To address this gap, we propose a simple, physics-agnostic, yet powerful approach that involves training a multilayer perceptron (MLP) as a local environment classifier for systems of particle shapes, using input features such as particle distances and orientations. Our MLP classifier is trained in a supervised manner with a shape symmetry-encoded data augmentation technique without the need for any conventional roto-translations invariant symmetry functions. We evaluate the performance of our classifiers on four different scenarios involving self-assembly of cubic structures, 2-dimensional and 3-dimensional patchy particle shape systems, hexagonal bipyramids with varying aspect ratios, and truncated shapes with different degrees of truncation. The proposed training process and data augmentation technique are both straightforward and flexible, enabling easy application of the classifier to other processes involving particle orientations. Our work thus presents a valuable tool for investigating self-assembly processes on systems of particle shapes, with potential applications in structure identification of any particle-based or molecular system where orientations can be defined.  ( 3 min )
    Fast, Scalable, Warm-Start Semidefinite Programming with Spectral Bundling and Sketching. (arXiv:2312.11801v1 [math.OC])
    While semidefinite programming (SDP) has traditionally been limited to moderate-sized problems, recent algorithms augmented with matrix sketching techniques have enabled solving larger SDPs. However, these methods achieve scalability at the cost of an increase in the number of necessary iterations, resulting in slower convergence as the problem size grows. Furthermore, they require iteration-dependent parameter schedules that prohibit effective utilization of warm-start initializations important in practical applications with incrementally-arriving data or mixed-integer programming. We present SpecBM, a provably correct, fast and scalable algorithm for solving massive SDPs that can leverage a warm-start initialization to further accelerate convergence. Our proposed algorithm is a spectral bundle method for solving general SDPs containing both equality and inequality constraints. Moveover, when augmented with an optional matrix sketching technique, our algorithm achieves the dramatically improved scalability of previous work while sustaining convergence speed. We empirically demonstrate the effectiveness of our method, both with and without warm-starting, across multiple applications with large instances. For example, on a problem with 600 million decision variables, SpecBM achieved a solution of standard accuracy in less than 7 minutes, where the previous state-of-the-art scalable SDP solver requires more than 16 hours. Our method solves an SDP with more than 10^13 decision variables on a single machine with 16 cores and no more than 128GB RAM; the previous state-of-the-art method had not achieved an accurate solution after 72 hours on the same instance. We make our implementation in pure JAX publicly available.  ( 2 min )
    Accelerating the prediction of inorganic surfaces with machine learning interatomic potentials. (arXiv:2312.11708v1 [cond-mat.mtrl-sci])
    The surface properties of solid-state materials often dictate their functionality, especially for applications where nanoscale effects become important. The relevant surface(s) and their properties are determined, in large part, by the materials synthesis or operating conditions. These conditions dictate thermodynamic driving forces and kinetic rates responsible for yielding the observed surface structure and morphology. Computational surface science methods have long been applied to connect thermochemical conditions to surface phase stability, particularly in the heterogeneous catalysis and thin film growth communities. This review provides a brief introduction to first-principles approaches to compute surface phase diagrams before introducing emerging data-driven approaches. The remainder of the review focuses on the application of machine learning, predominantly in the form of learned interatomic potentials, to study complex surfaces. As machine learning algorithms and large datasets on which to train them become more commonplace in materials science, computational methods are poised to become even more predictive and powerful for modeling the complexities of inorganic surfaces at the nanoscale.  ( 2 min )
    Evaluating Language-Model Agents on Realistic Autonomous Tasks. (arXiv:2312.11671v1 [cs.CL])
    In this report, we explore the ability of language model agents to acquire resources, create copies of themselves, and adapt to novel challenges they encounter in the wild. We refer to this cluster of capabilities as "autonomous replication and adaptation" or ARA. We believe that systems capable of ARA could have wide-reaching and hard-to-anticipate consequences, and that measuring and forecasting ARA may be useful for informing measures around security, monitoring, and alignment. Additionally, once a system is capable of ARA, placing bounds on a system's capabilities may become significantly more difficult. We construct four simple example agents that combine language models with tools that allow them to take actions in the world. We then evaluate these agents on 12 tasks relevant to ARA. We find that these language model agents can only complete the easiest tasks from this list, although they make some progress on the more challenging tasks. Unfortunately, these evaluations are not adequate to rule out the possibility that near-future agents will be capable of ARA. In particular, we do not think that these evaluations provide good assurance that the ``next generation'' of language models (e.g. 100x effective compute scaleup on existing models) will not yield agents capable of ARA, unless intermediate evaluations are performed during pretraining. Relatedly, we expect that fine-tuning of the existing models could produce substantially more competent agents, even if the fine-tuning is not directly targeted at ARA.  ( 3 min )
    A review-based study on different Text-to-Speech technologies. (arXiv:2312.11563v1 [cs.SD])
    This research paper presents a comprehensive review-based study on various Text-to-Speech (TTS) technologies. TTS technology is an important aspect of human-computer interaction, enabling machines to convert written text into audible speech. The paper examines the different TTS technologies available, including concatenative TTS, formant synthesis TTS, and statistical parametric TTS. The study focuses on comparing the advantages and limitations of these technologies in terms of their naturalness of voice, the level of complexity of the system, and their suitability for different applications. In addition, the paper explores the latest advancements in TTS technology, including neural TTS and hybrid TTS. The findings of this research will provide valuable insights for researchers, developers, and users who want to understand the different TTS technologies and their suitability for specific applications.  ( 2 min )
    Topic-VQ-VAE: Leveraging Latent Codebooks for Flexible Topic-Guided Document Generation. (arXiv:2312.11532v1 [cs.CL])
    This paper introduces a novel approach for topic modeling utilizing latent codebooks from Vector-Quantized Variational Auto-Encoder~(VQ-VAE), discretely encapsulating the rich information of the pre-trained embeddings such as the pre-trained language model. From the novel interpretation of the latent codebooks and embeddings as conceptual bag-of-words, we propose a new generative topic model called Topic-VQ-VAE~(TVQ-VAE) which inversely generates the original documents related to the respective latent codebook. The TVQ-VAE can visualize the topics with various generative distributions including the traditional BoW distribution and the autoregressive image generation. Our experimental results on document analysis and image generation demonstrate that TVQ-VAE effectively captures the topic context which reveals the underlying structures of the dataset and supports flexible forms of document generation. Official implementation of the proposed TVQ-VAE is available at https://github.com/clovaai/TVQ-VAE.  ( 2 min )
    Improved Differentially Private and Lazy Online Convex Optimization. (arXiv:2312.11534v1 [cs.CR])
    We study the task of $(\epsilon, \delta)$-differentially private online convex optimization (OCO). In the online setting, the release of each distinct decision or iterate carries with it the potential for privacy loss. This problem has a long history of research starting with Jain et al. [2012] and the best known results for the regime of {\epsilon} being very small are presented in Agarwal et al. [2023]. In this paper we improve upon the results of Agarwal et al. [2023] in terms of the dimension factors as well as removing the requirement of smoothness. Our results are now the best known rates for DP-OCO in this regime. Our algorithms builds upon the work of [Asi et al., 2023] which introduced the idea of explicitly limiting the number of switches via rejection sampling. The main innovation in our algorithm is the use of sampling from a strongly log-concave density which allows us to trade-off the dimension factors better leading to improved results.  ( 2 min )
    Explain To Decide: A Human-Centric Review on the Role of Explainable Artificial Intelligence in AI-assisted Decision Making. (arXiv:2312.11507v1 [cs.HC])
    The unprecedented performance of machine learning models in recent years, particularly Deep Learning and transformer models, has resulted in their application in various domains such as finance, healthcare, and education. However, the models are error-prone and cannot be used autonomously, especially in decision-making scenarios where, technically or ethically, the cost of error is high. Moreover, because of the black-box nature of these models, it is frequently difficult for the end user to comprehend the models' outcomes and underlying processes to trust and use the model outcome to make a decision. Explainable Artificial Intelligence (XAI) aids end-user understanding of the model by utilizing approaches, including visualization techniques, to explain and interpret the inner workings of the model and how it arrives at a result. Although numerous research studies have been conducted recently focusing on the performance of models and the XAI approaches, less work has been done on the impact of explanations on human-AI team performance. This paper surveyed the recent empirical studies on XAI's impact on human-AI decision-making, identified the challenges, and proposed future research directions.  ( 2 min )
    Glioblastoma Tumor Segmentation using an Ensemble of Vision Transformers. (arXiv:2312.11467v1 [eess.IV])
    Glioblastoma is one of the most aggressive and deadliest types of brain cancer, with low survival rates compared to other types of cancer. Analysis of Magnetic Resonance Imaging (MRI) scans is one of the most effective methods for the diagnosis and treatment of brain cancers such as glioblastoma. Accurate tumor segmentation in MRI images is often required for treatment planning and risk assessment of treatment methods. Here, we propose a novel pipeline, Brain Radiology Aided by Intelligent Neural NETworks (BRAINNET), which leverages MaskFormer, a vision transformer model, and generates robust tumor segmentation maks. We use an ensemble of nine predictions from three models separately trained on each of the three orthogonal 2D slice directions (axial, sagittal, and coronal) of a 3D brain MRI volume. We train and test our models on the publicly available UPenn-GBM dataset, consisting of 3D multi-parametric MRI (mpMRI) scans from 611 subjects. Using Dice coefficient (DC) and 95% Hausdorff distance (HD) for evaluation, our models achieved state-of-the-art results in segmenting all three different tumor regions -- tumor core (DC = 0.894, HD = 2.308), whole tumor (DC = 0.891, HD = 3.552), and enhancing tumor (DC = 0.812, HD = 1.608).  ( 2 min )
  • Open

    On the Efficacy of Differentially Private Few-shot Image Classification. (arXiv:2302.01190v3 [stat.ML] UPDATED)
    There has been significant recent progress in training differentially private (DP) models which achieve accuracy that approaches the best non-private models. These DP models are typically pretrained on large public datasets and then fine-tuned on private downstream datasets that are relatively large and similar in distribution to the pretraining data. However, in many applications including personalization and federated learning, it is crucial to perform well (i) in the few-shot setting, as obtaining large amounts of labeled data may be problematic; and (ii) on datasets from a wide variety of domains for use in various specialist settings. To understand under which conditions few-shot DP can be effective, we perform an exhaustive set of experiments that reveals how the accuracy and vulnerability to attack of few-shot DP image classification models are affected as the number of shots per class, privacy level, model architecture, downstream dataset, and subset of learnable parameters in the model vary. We show that to achieve DP accuracy on par with non-private models, the shots per class must be increased as the privacy level increases. We also show that learning parameter-efficient FiLM adapters under DP is competitive with learning just the final classifier layer or learning all of the network parameters. Finally, we evaluate DP federated learning systems and establish state-of-the-art performance on the challenging FLAIR benchmark.  ( 3 min )
    Probabilistic Exponential Integrators. (arXiv:2305.14978v2 [math.NA] UPDATED)
    Probabilistic solvers provide a flexible and efficient framework for simulation, uncertainty quantification, and inference in dynamical systems. However, like standard solvers, they suffer performance penalties for certain stiff systems, where small steps are required not for reasons of numerical accuracy but for the sake of stability. This issue is greatly alleviated in semi-linear problems by the probabilistic exponential integrators developed in this paper. By including the fast, linear dynamics in the prior, we arrive at a class of probabilistic integrators with favorable properties. Namely, they are proven to be L-stable, and in a certain case reduce to a classic exponential integrator -- with the added benefit of providing a probabilistic account of the numerical error. The method is also generalized to arbitrary non-linear systems by imposing piece-wise semi-linearity on the prior via Jacobians of the vector field at the previous estimates, resulting in probabilistic exponential Rosenbrock methods. We evaluate the proposed methods on multiple stiff differential equations and demonstrate their improved stability and efficiency over established probabilistic solvers. The present contribution thus expands the range of problems that can be effectively tackled within probabilistic numerics.  ( 2 min )
    Leveraging the Urysohn Lemma of Topology for an Enhanced Binary Classifier. (arXiv:2312.11948v1 [physics.data-an])
    In this article we offer a comprehensive analysis of the Urysohn's classifier in a binary classification context. It utilizes Urysohn's Lemma of Topology to construct separating functions, providing rigorous and adaptable solutions. Numerical experiments demonstrated exceptional performance, with scores ranging from 95% to 100%. Notably, the Urysohn's classifier outperformed CatBoost and KNN in various scenarios. Despite sensitivity to the p-metric parameter, it proved robust and adaptable. The Urysohn's classifier's mathematical rigor and adaptability make it promising for binary classification, with applications in medical diagnosis, fraud detection and cyber security. Future research includes parameter optimization and combining the Urysohn's classifier with other techniques. It offers an elegant and principled approach to classification, ensuring integrity and valuable data insights.  ( 2 min )
    Topological complexity of spiked random polynomials and finite-rank spherical integrals. (arXiv:2312.12323v1 [math.PR])
    We study the annealed complexity of a random Gaussian homogeneous polynomial on the $N$-dimensional unit sphere in the presence of deterministic polynomials that depend on fixed unit vectors and external parameters. In particular, we establish variational formulas for the exponential asymptotics of the average number of total critical points and of local maxima. This is obtained through the Kac-Rice formula and the determinant asymptotics of a finite-rank perturbation of a Gaussian Wigner matrix. More precisely, the determinant analysis is based on recent advances on finite-rank spherical integrals by [Guionnet, Husson 2022] to study the large deviations of multi-rank spiked Gaussian Wigner matrices. The analysis of the variational problem identifies a topological phase transition. There is an exact threshold for the external parameters such that, once exceeded, the complexity function vanishes into new regions in which the critical points are close to the given vectors. Interestingly, these regions also include those where critical points are close to multiple vectors.  ( 2 min )
    Extension of the Dip-test Repertoire -- Efficient and Differentiable p-value Calculation for Clustering. (arXiv:2312.12050v1 [cs.LG])
    Over the last decade, the Dip-test of unimodality has gained increasing interest in the data mining community as it is a parameter-free statistical test that reliably rates the modality in one-dimensional samples. It returns a so called Dip-value and a corresponding probability for the sample's unimodality (Dip-p-value). These two values share a sigmoidal relationship. However, the specific transformation is dependent on the sample size. Many Dip-based clustering algorithms use bootstrapped look-up tables translating Dip- to Dip-p-values for a certain limited amount of sample sizes. We propose a specifically designed sigmoid function as a substitute for these state-of-the-art look-up tables. This accelerates computation and provides an approximation of the Dip- to Dip-p-value transformation for every single sample size. Further, it is differentiable and can therefore easily be integrated in learning schemes using gradient descent. We showcase this by exploiting our function in a novel subspace clustering algorithm called Dip'n'Sub. We highlight in extensive experiments the various benefits of our proposal.  ( 3 min )
    Modelling and characterization of fine Particulate Matter dynamics in Bujumbura using low cost sensors. (arXiv:2312.12003v1 [stat.ML])
    Air pollution is a result of multiple sources including both natural and anthropogenic activities. The rapid urbanization of the cities such as Bujumbura economic capital of Burundi, is one of these factors. The very first characterization of the spatio-temporal variability of PM2.5 in Bujumbura and the forecasting of PM2.5 concentration have been conducted in this paper using data collected during a year, from august 2022 to august 2023, by low cost sensors installed in Bujumbura city. For each commune, an hourly, daily and seasonal analysis were carried out and the results showed that the mass concentrations of PM2.5 in the three municipalities differ from one commune to another. The average hourly and annual PM2.5 concentrations exceed the World Health Organization standards. The range is between 28.3 and 35.0 microgram/m3 . In order to make prediction of PM2.5 concentration, an investigation of RNN with Long Short Term Memory (LSTM) has been undertaken.  ( 2 min )
    Generalized Causal Tree for Uplift Modeling. (arXiv:2202.02416v2 [stat.ME] UPDATED)
    Uplift modeling is crucial in various applications ranging from marketing and policy-making to personalized recommendations. The main objective is to learn optimal treatment allocations for a heterogeneous population. A primary line of existing work modifies the loss function of the decision tree algorithm to identify cohorts with heterogeneous treatment effects. Another line of work estimates the individual treatment effects separately for the treatment group and the control group using off-the-shelf supervised learning algorithms. The former approach that directly models the heterogeneous treatment effect is known to outperform the latter in practice. However, the existing tree-based methods are mostly limited to a single treatment and a single control use case, except for a handful of extensions to multiple discrete treatments. In this paper, we propose a generalization of tree-based approaches to tackle multiple discrete and continuous-valued treatments. We focus on a generalization of the well-known causal tree algorithm due to its desirable statistical properties, but our generalization technique can be applied to other tree-based approaches as well. The efficacy of our proposed method is demonstrated using experiments and real data examples.  ( 2 min )
    New classes of the greedy-applicable arm feature distributions in the sparse linear bandit problem. (arXiv:2312.12400v1 [cs.LG])
    We consider the sparse contextual bandit problem where arm feature affects reward through the inner product of sparse parameters. Recent studies have developed sparsity-agnostic algorithms based on the greedy arm selection policy. However, the analysis of these algorithms requires strong assumptions on the arm feature distribution to ensure that the greedily selected samples are sufficiently diverse; One of the most common assumptions, relaxed symmetry, imposes approximate origin-symmetry on the distribution, which cannot allow distributions that has origin-asymmetric support. In this paper, we show that the greedy algorithm is applicable to a wider range of the arm feature distributions from two aspects. Firstly, we show that a mixture distribution that has a greedy-applicable component is also greedy-applicable. Second, we propose new distribution classes, related to Gaussian mixture, discrete, and radial distribution, for which the sample diversity is guaranteed. The proposed classes can describe distributions with origin-asymmetric support and, in conjunction with the first claim, provide theoretical guarantees of the greedy policy for a very wide range of the arm feature distributions.  ( 2 min )
    CausalVAE: Structured Causal Disentanglement in Variational Autoencoder. (arXiv:2004.08697v7 [cs.LG] UPDATED)
    Learning disentanglement aims at finding a low dimensional representation which consists of multiple explanatory and generative factors of the observational data. The framework of variational autoencoder (VAE) is commonly used to disentangle independent factors from observations. However, in real scenarios, factors with semantics are not necessarily independent. Instead, there might be an underlying causal structure which renders these factors dependent. We thus propose a new VAE based framework named CausalVAE, which includes a Causal Layer to transform independent exogenous factors into causal endogenous ones that correspond to causally related concepts in data. We further analyze the model identifiabitily, showing that the proposed model learned from observations recovers the true one up to a certain degree by providing supervision signals (e.g. feature labels). Experiments are conducted on various datasets, including synthetic and real word benchmark CelebA. Results show that the causal representations learned by CausalVAE are semantically interpretable, and their causal relationship as a Directed Acyclic Graph (DAG) is identified with good accuracy. Furthermore, we demonstrate that the proposed CausalVAE model is able to generate counterfactual data through "do-operation" to the causal factors.  ( 3 min )
    Efficient Conditionally Invariant Representation Learning. (arXiv:2212.08645v2 [cs.LG] UPDATED)
    We introduce the Conditional Independence Regression CovariancE (CIRCE), a measure of conditional independence for multivariate continuous-valued variables. CIRCE applies as a regularizer in settings where we wish to learn neural features $\varphi(X)$ of data $X$ to estimate a target $Y$, while being conditionally independent of a distractor $Z$ given $Y$. Both $Z$ and $Y$ are assumed to be continuous-valued but relatively low dimensional, whereas $X$ and its features may be complex and high dimensional. Relevant settings include domain-invariant learning, fairness, and causal learning. The procedure requires just a single ridge regression from $Y$ to kernelized features of $Z$, which can be done in advance. It is then only necessary to enforce independence of $\varphi(X)$ from residuals of this regression, which is possible with attractive estimation properties and consistency guarantees. By contrast, earlier measures of conditional feature dependence require multiple regressions for each step of feature learning, resulting in more severe bias and variance, and greater computational cost. When sufficiently rich features are used, we establish that CIRCE is zero if and only if $\varphi(X) \perp \!\!\! \perp Z \mid Y$. In experiments, we show superior performance to previous methods on challenging benchmarks, including learning conditionally invariant image features.  ( 2 min )
    Learning from Mistakes: Self-Regularizing Hierarchical Representations in Point Cloud Semantic Segmentation. (arXiv:2301.11145v2 [cs.CV] UPDATED)
    Recent advances in autonomous robotic technologies have highlighted the growing need for precise environmental analysis. LiDAR semantic segmentation has gained attention to accomplish fine-grained scene understanding by acting directly on raw content provided by sensors. Recent solutions showed how different learning techniques can be used to improve the performance of the model, without any architectural or dataset change. Following this trend, we present a coarse-to-fine setup that LEArns from classification mistaKes (LEAK) derived from a standard model. First, classes are clustered into macro groups according to mutual prediction errors; then, the learning process is regularized by: (1) aligning class-conditional prototypical feature representation for both fine and coarse classes, (2) weighting instances with a per-class fairness index. Our LEAK approach is very general and can be seamlessly applied on top of any segmentation architecture; indeed, experimental results showed that it enables state-of-the-art performances on different architectures, datasets and tasks, while ensuring more balanced class-wise results and faster convergence.  ( 2 min )
    LightGCNet: A Lightweight Geometric Constructive Neural Network for Data-Driven Soft sensors. (arXiv:2312.12022v1 [stat.ML])
    Data-driven soft sensors provide a potentially cost-effective and more accurate modeling approach to measure difficult-to-measure indices in industrial processes compared to mechanistic approaches. Artificial intelligence (AI) techniques, such as deep learning, have become a popular soft sensors modeling approach in the area of machine learning and big data. However, soft sensors models based deep learning potentially lead to complex model structures and excessive training time. In addition, industrial processes often rely on distributed control systems (DCS) characterized by resource constraints. Herein, guided by spatial geometric, a lightweight geometric constructive neural network, namely LightGCNet, is proposed, which utilizes compact angle constraint to assign the hidden parameters from dynamic intervals. At the same time, a node pool strategy and spatial geometric relationships are used to visualize and optimize the process of assigning hidden parameters, enhancing interpretability. In addition, the universal approximation property of LightGCNet is proved by spatial geometric analysis. Two versions algorithmic implementations of LightGCNet are presented in this article. Simulation results concerning both benchmark datasets and the ore grinding process indicate remarkable merits of LightGCNet in terms of small network size, fast learning speed, and sound generalization.  ( 2 min )
    Modeling non-linear Effects with Neural Networks in Relational Event Models. (arXiv:2312.12357v1 [stat.ML])
    Dynamic networks offer an insight of how relational systems evolve. However, modeling these networks efficiently remains a challenge, primarily due to computational constraints, especially as the number of observed events grows. This paper addresses this issue by introducing the Deep Relational Event Additive Model (DREAM) as a solution to the computational challenges presented by modeling non-linear effects in Relational Event Models (REMs). DREAM relies on Neural Additive Models to model non-linear effects, allowing each effect to be captured by an independent neural network. By strategically trading computational complexity for improved memory management and leveraging the computational capabilities of Graphic Processor Units (GPUs), DREAM efficiently captures complex non-linear relationships within data. This approach demonstrates the capability of DREAM in modeling dynamic networks and scaling to larger networks. Comparisons with traditional REM approaches showcase DREAM superior computational efficiency. The model potential is further demonstrated by an examination of the patent citation network, which contains nearly 8 million nodes and 100 million events.  ( 2 min )
    Neural Network Approximation for Pessimistic Offline Reinforcement Learning. (arXiv:2312.11863v1 [cs.LG])
    Deep reinforcement learning (RL) has shown remarkable success in specific offline decision-making scenarios, yet its theoretical guarantees are still under development. Existing works on offline RL theory primarily emphasize a few trivial settings, such as linear MDP or general function approximation with strong assumptions and independent data, which lack guidance for practical use. The coupling of deep learning and Bellman residuals makes this problem challenging, in addition to the difficulty of data dependence. In this paper, we establish a non-asymptotic estimation error of pessimistic offline RL using general neural network approximation with $\mathcal{C}$-mixing data regarding the structure of networks, the dimension of datasets, and the concentrability of data coverage, under mild assumptions. Our result shows that the estimation error consists of two parts: the first converges to zero at a desired rate on the sample size with partially controllable concentrability, and the second becomes negligible if the residual constraint is tight. This result demonstrates the explicit efficiency of deep adversarial offline RL frameworks. We utilize the empirical process tool for $\mathcal{C}$-mixing sequences and the neural network approximation theory for the H\"{o}lder class to achieve this. We also develop methods to bound the Bellman estimation error caused by function approximation with empirical Bellman constraint perturbations. Additionally, we present a result that lessens the curse of dimensionality using data with low intrinsic dimensionality and function classes with low complexity. Our estimation provides valuable insights into the development of deep offline RL and guidance for algorithm model design.  ( 3 min )
    Topo-MLP : A Simplicial Network Without Message Passing. (arXiv:2312.11862v1 [cs.LG])
    Due to their ability to model meaningful higher order relations among a set of entities, higher order network models have emerged recently as a powerful alternative for graph-based network models which are only capable of modeling binary relationships. Message passing paradigm is still dominantly used to learn representations even for higher order network models. While powerful, message passing can have disadvantages during inference, particularly when the higher order connectivity information is missing or corrupted. To overcome such limitations, we propose Topo-MLP, a purely MLP-based simplicial neural network algorithm to learn the representation of elements in a simplicial complex without explicitly relying on message passing. Our framework utilizes a novel Higher Order Neighborhood Contrastive (HONC) loss which implicitly incorporates the simplicial structure into representation learning. Our proposed model's simplicity makes it faster during inference. Moreover, we show that our model is robust when faced with missing or corrupted connectivity structure.  ( 2 min )
    Big Learning Expectation Maximization. (arXiv:2312.11926v1 [cs.LG])
    Mixture models serve as one fundamental tool with versatile applications. However, their training techniques, like the popular Expectation Maximization (EM) algorithm, are notoriously sensitive to parameter initialization and often suffer from bad local optima that could be arbitrarily worse than the optimal. To address the long-lasting bad-local-optima challenge, we draw inspiration from the recent ground-breaking foundation models and propose to leverage their underlying big learning principle to upgrade the EM. Specifically, we present the Big Learning EM (BigLearn-EM), an EM upgrade that simultaneously performs joint, marginal, and orthogonally transformed marginal matchings between data and model distributions. Through simulated experiments, we empirically show that the BigLearn-EM is capable of delivering the optimal with high probability; comparisons on benchmark clustering datasets further demonstrate its effectiveness and advantages over existing techniques. The code is available at https://github.com/YulaiCong/Big-Learning-Expectation-Maximization.  ( 2 min )
    Best Arm Identification with Fixed Budget: A Large Deviation Perspective. (arXiv:2312.12137v1 [cs.LG])
    We consider the problem of identifying the best arm in stochastic Multi-Armed Bandits (MABs) using a fixed sampling budget. Characterizing the minimal instance-specific error probability for this problem constitutes one of the important remaining open problems in MABs. When arms are selected using a static sampling strategy, the error probability decays exponentially with the number of samples at a rate that can be explicitly derived via Large Deviation techniques. Analyzing the performance of algorithms with adaptive sampling strategies is however much more challenging. In this paper, we establish a connection between the Large Deviation Principle (LDP) satisfied by the empirical proportions of arm draws and that satisfied by the empirical arm rewards. This connection holds for any adaptive algorithm, and is leveraged (i) to improve error probability upper bounds of some existing algorithms, such as the celebrated \sr (Successive Rejects) algorithm \citep{audibert2010best}, and (ii) to devise and analyze new algorithms. In particular, we present \sred (Continuous Rejects), a truly adaptive algorithm that can reject arms in {\it any} round based on the observed empirical gaps between the rewards of various arms. Applying our Large Deviation results, we prove that \sred enjoys better performance guarantees than existing algorithms, including \sr. Extensive numerical experiments confirm this observation.  ( 2 min )
    AI without networks. (arXiv:2106.03354v2 [cs.LG] UPDATED)
    Contemporary Artificial Intelligence (AI) stands on two legs: large training data corpora and many-parameter artificial neural networks (ANNs). The data corpora are needed to represent the complexity and heterogeneity of the world. The role of the networks is less transparent due to the obscure dependence of the network parameters and outputs on the training data and inputs. This raises problems, ranging from technical-scientific to legal-ethical. We hypothesize that a transparent approach to machine learning is possible without using networks at all. By generalizing a parameter-free, statistically consistent data interpolation method, which we analyze theoretically in detail, we develop a framework for generative modeling. Given the growing usage of machine learning techniques in science, we demonstrate this framework with an example from the field of animal behavior. We applied this generative Hilbert framework to the trajectories of small groups of swimming fish. The framework outperforms previously developed state-of-the-art traditional mathematical behavioral models and contemporary ANN-based models in reproducing naturalistic behaviors. We do not suggest that the proposed framework will outperform networks in all applications, as over-parameterized networks can interpolate. However, our framework is theoretically sound, transparent, deterministic and parameter free: it does not require any compute-expensive training, does not involve optimization, has no model selection, and is easily reproduced and ported. We also propose an easily computed method of credit assignment based on this framework that could help address ethical-legal challenges raised by generative AI.  ( 3 min )
    CUDC: A Curiosity-Driven Unsupervised Data Collection Method with Adaptive Temporal Distances for Offline Reinforcement Learning. (arXiv:2312.12191v1 [cs.LG])
    Offline reinforcement learning (RL) aims to learn an effective policy from a pre-collected dataset. Most existing works are to develop sophisticated learning algorithms, with less emphasis on improving the data collection process. Moreover, it is even challenging to extend the single-task setting and collect a task-agnostic dataset that allows an agent to perform multiple downstream tasks. In this paper, we propose a Curiosity-driven Unsupervised Data Collection (CUDC) method to expand feature space using adaptive temporal distances for task-agnostic data collection and ultimately improve learning efficiency and capabilities for multi-task offline RL. To achieve this, CUDC estimates the probability of the k-step future states being reachable from the current states, and adapts how many steps into the future that the dynamics model should predict. With this adaptive reachability mechanism in place, the feature representation can be diversified, and the agent can navigate itself to collect higher-quality data with curiosity. Empirically, CUDC surpasses existing unsupervised methods in efficiency and learning performance in various downstream offline RL tasks of the DeepMind control suite.  ( 2 min )
    Maximum Reward Formulation In Reinforcement Learning. (arXiv:2010.03744v2 [cs.LG] UPDATED)
    Reinforcement learning (RL) algorithms typically deal with maximizing the expected cumulative return (discounted or undiscounted, finite or infinite horizon). However, several crucial applications in the real world, such as drug discovery, do not fit within this framework because an RL agent only needs to identify states (molecules) that achieve the highest reward within a trajectory and does not need to optimize for the expected cumulative return. In this work, we formulate an objective function to maximize the expected maximum reward along a trajectory, derive a novel functional form of the Bellman equation, introduce the corresponding Bellman operators, and provide a proof of convergence. Using this formulation, we achieve state-of-the-art results on the task of molecule generation that mimics a real-world drug discovery pipeline.  ( 2 min )
    Confidence and Uncertainty Assessment for Distributional Random Forests. (arXiv:2302.05761v3 [math.ST] UPDATED)
    The Distributional Random Forest (DRF) is a recently introduced Random Forest algorithm to estimate multivariate conditional distributions. Due to its general estimation procedure, it can be employed to estimate a wide range of targets such as conditional average treatment effects, conditional quantiles, and conditional correlations. However, only results about the consistency and convergence rate of the DRF prediction are available so far. We characterize the asymptotic distribution of DRF and develop a bootstrap approximation of it. This allows us to derive inferential tools for quantifying standard errors and the construction of confidence regions that have asymptotic coverage guarantees. In simulation studies, we empirically validate the developed theory for inference of low-dimensional targets and for testing distributional differences between two populations.  ( 2 min )
    Root Cause Explanation of Outliers under Noisy Mechanisms. (arXiv:2312.11818v1 [cs.AI])
    Identifying root causes of anomalies in causal processes is vital across disciplines. Once identified, one can isolate the root causes and implement necessary measures to restore the normal operation. Causal processes are often modelled as graphs with entities being nodes and their paths/interconnections as edge. Existing work only consider the contribution of nodes in the generative process, thus can not attribute the outlier score to the edges of the mechanism if the anomaly occurs in the connections. In this paper, we consider both individual edge and node of each mechanism when identifying the root causes. We introduce a noisy functional causal model to account for this purpose. Then, we employ Bayesian learning and inference methods to infer the noises of the nodes and edges. We then represent the functional form of a target outlier leaf as a function of the node and edge noises. Finally, we propose an efficient gradient-based attribution method to compute the anomaly attribution scores which scales linearly with the number of nodes and edges. Experiments on simulated datasets and two real-world scenario datasets show better anomaly attribution performance of the proposed method compared to the baselines. Our method scales to larger graphs with more nodes and edges.  ( 2 min )
    Clustering Mixtures of Bounded Covariance Distributions Under Optimal Separation. (arXiv:2312.11769v1 [cs.LG])
    We study the clustering problem for mixtures of bounded covariance distributions, under a fine-grained separation assumption. Specifically, given samples from a $k$-component mixture distribution $D = \sum_{i =1}^k w_i P_i$, where each $w_i \ge \alpha$ for some known parameter $\alpha$, and each $P_i$ has unknown covariance $\Sigma_i \preceq \sigma^2_i \cdot I_d$ for some unknown $\sigma_i$, the goal is to cluster the samples assuming a pairwise mean separation in the order of $(\sigma_i+\sigma_j)/\sqrt{\alpha}$ between every pair of components $P_i$ and $P_j$. Our contributions are as follows: For the special case of nearly uniform mixtures, we give the first poly-time algorithm for this clustering task. Prior work either required separation scaling with the maximum cluster standard deviation (i.e. $\max_i \sigma_i$) [DKK+22b] or required both additional structural assumptions and mean separation scaling as a large degree polynomial in $1/\alpha$ [BKK22]. For general-weight mixtures, we point out that accurate clustering is information-theoretically impossible under our fine-grained mean separation assumptions. We introduce the notion of a clustering refinement -- a list of not-too-small subsets satisfying a similar separation, and which can be merged into a clustering approximating the ground truth -- and show that it is possible to efficiently compute an accurate clustering refinement of the samples. Furthermore, under a variant of the "no large sub-cluster'' condition from in prior work [BKK22], we show that our algorithm outputs an accurate clustering, not just a refinement, even for general-weight mixtures. As a corollary, we obtain efficient clustering algorithms for mixtures of well-conditioned high-dimensional log-concave distributions. Moreover, our algorithm is robust to $\Omega(\alpha)$-fraction of adversarial outliers.  ( 3 min )
    ADMM-MM Algorithm for General Tensor Decomposition. (arXiv:2312.11763v1 [cs.CV])
    In this paper, we propose a new unified optimization algorithm for general tensor decomposition which is formulated as an inverse problem for low-rank tensors in the general linear observation models. The proposed algorithm supports three basic loss functions ($\ell_2$-loss, $\ell_1$-loss and KL divergence) and various low-rank tensor decomposition models (CP, Tucker, TT, and TR decompositions). We derive the optimization algorithm based on hierarchical combination of the alternating direction method of multiplier (ADMM) and majorization-minimization (MM). We show that wide-range applications can be solved by the proposed algorithm, and can be easily extended to any established tensor decomposition models in a {plug-and-play} manner.  ( 2 min )
    Eliciting Kemeny Rankings. (arXiv:2312.11663v1 [cs.LG])
    We formulate the problem of eliciting agents' preferences with the goal of finding a Kemeny ranking as a Dueling Bandits problem. Here the bandits' arms correspond to alternatives that need to be ranked and the feedback corresponds to a pairwise comparison between alternatives by a randomly sampled agent. We consider both sampling with and without replacement, i.e., the possibility to ask the same agent about some comparison multiple times or not. We find approximation bounds for Kemeny rankings dependant on confidence intervals over estimated winning probabilities of arms. Based on these we state algorithms to find Probably Approximately Correct (PAC) solutions and elaborate on their sample complexity for sampling with or without replacement. Furthermore, if all agents' preferences are strict rankings over the alternatives, we provide means to prune confidence intervals and thereby guide a more efficient elicitation. We formulate several adaptive sampling methods that use look-aheads to estimate how much confidence intervals (and thus approximation guarantees) might be tightened. All described methods are compared on synthetic data.  ( 2 min )
    Wide Deep Neural Networks with Gaussian Weights are Very Close to Gaussian Processes. (arXiv:2312.11737v1 [math.ST])
    We establish novel rates for the Gaussian approximation of random deep neural networks with Gaussian parameters (weights and biases) and Lipschitz activation functions, in the wide limit. Our bounds apply for the joint output of a network evaluated any finite input set, provided a certain non-degeneracy condition of the infinite-width covariances holds. We demonstrate that the distance between the network output and the corresponding Gaussian approximation scales inversely with the width of the network, exhibiting faster convergence than the naive heuristic suggested by the central limit theorem. We also apply our bounds to obtain theoretical approximations for the exact Bayesian posterior distribution of the network, when the likelihood is a bounded Lipschitz function of the network output evaluated on a (finite) training set. This includes popular cases such as the Gaussian likelihood, i.e. exponential of minus the mean squared error.  ( 2 min )
    Improved Differentially Private and Lazy Online Convex Optimization. (arXiv:2312.11534v1 [cs.CR])
    We study the task of $(\epsilon, \delta)$-differentially private online convex optimization (OCO). In the online setting, the release of each distinct decision or iterate carries with it the potential for privacy loss. This problem has a long history of research starting with Jain et al. [2012] and the best known results for the regime of {\epsilon} being very small are presented in Agarwal et al. [2023]. In this paper we improve upon the results of Agarwal et al. [2023] in terms of the dimension factors as well as removing the requirement of smoothness. Our results are now the best known rates for DP-OCO in this regime. Our algorithms builds upon the work of [Asi et al., 2023] which introduced the idea of explicitly limiting the number of switches via rejection sampling. The main innovation in our algorithm is the use of sampling from a strongly log-concave density which allows us to trade-off the dimension factors better leading to improved results.  ( 2 min )

  • Open

    The challenges cloud migration and modernization solve for enterprises
    With legacy systems, traditional data storage, and integration methods phasing out for many organizations, businesses are rapidly transitioning to cloud-based data to improve operational workflow and ensure a trustworthy data foundation. Cloud migration and modernization resolve numerous challenges commonly faced by modern companies. As more organizations adopt cloud initiatives for their business infrastructures, it is… Read More »The challenges cloud migration and modernization solve for enterprises The post The challenges cloud migration and modernization solve for enterprises appeared first on Data Science Central.  ( 25 min )
  • Open

    Llama Guard is now available in Amazon SageMaker JumpStart
    Today we are excited to announce that the Llama Guard model is now available for customers using Amazon SageMaker JumpStart. Llama Guard provides input and output safeguards in large language model (LLM) deployment. It’s one of the components under Purple Llama, Meta’s initiative featuring open trust and safety tools and evaluations to help developers build […]  ( 15 min )
    Identify cybersecurity anomalies in your Amazon Security Lake data using Amazon SageMaker
    In this post, you learn how to prepare data sourced from Amazon Security Lake, and then train and deploy an ML model using an IP Insights algorithm in SageMaker. This model identifies anomalous network traffic or behavior which can then be composed as part of a larger end-to-end security solution.  ( 13 min )
  • Open

    Research Focus: Week of December 18, 2023
    In this issue of Research Focus: Optimized exit-augmented models for scalable efficient inference; NeurIPS LLM Efficiency Challenge; LLM-empowered automated data exploration; Boosting cloud efficiency with data-driven decision-making and optimization. The post Research Focus: Week of December 18, 2023 appeared first on Microsoft Research.  ( 9 min )
  • Open

    Cool Robots of 2023: Meet the Autonomous Movers and Shakers
    Outside the glare of the klieg lights that ChatGPT commanded this year, a troupe of autonomous machines nudged the frontiers of robotics forward. Here are six that showed special prowess — swimming, diving, gripping, seeing, strolling and flying through 2023. A Media Darling at CES Ella — a smart stroller from startup Glüxkind Technologies, of Read article >  ( 7 min )
    Thomson Reuters Taps Generative AI to Power Legal Offerings
    Thomson Reuters, the global content and technology company, is transforming the legal industry with generative AI. In the latest episode of NVIDIA’s AI Podcast, host Noah Kravitz spoke with Thomson Reuters Chief Product Officer David Wong about its potential — and implications. Many of Thomson Reuters offerings for the legal industry either address an information Read article >  ( 6 min )
    Into the Omniverse: Foundry Nuke’s OpenUSD Enhancements Ring in a 3D Renaissance
    The latest OpenUSD updates enable users to tackle larger, more complex scenes with enhanced geometry control and streamlined asset management.  ( 7 min )
  • Open

    Using AI, MIT researchers identify a new class of antibiotic candidates
    These compounds can kill methicillin-resistant Staphylococcus aureus (MRSA), a bacterium that causes deadly infections.  ( 10 min )
    A flexible solution to help artists improve animation
    This new method draws on 200-year-old geometric foundations to give artists control over the appearance of animated characters.  ( 10 min )
  • Open

    Addition theorems for Dixon functions
    The last couple blog posts have been about Dixon elliptic functions, functions which are analogous in some ways to sine and cosine functions. Whereas sine and cosine satisfy a Pythagorean identity the Dixon functions sm and cm satisfy what you might call a Fermat identity alluding to Fermat’s last theorem. The functions sm and cm […] Addition theorems for Dixon functions first appeared on John D. Cook.  ( 5 min )
  • Open

    Is Learning in Games Good for the Learners?. (arXiv:2305.19496v2 [cs.GT] UPDATED)
    We consider a number of questions related to tradeoffs between reward and regret in repeated gameplay between two agents. To facilitate this, we introduce a notion of $\textit{generalized equilibrium}$ which allows for asymmetric regret constraints, and yields polytopes of feasible values for each agent and pair of regret constraints, where we show that any such equilibrium is reachable by a pair of algorithms which maintain their regret guarantees against arbitrary opponents. As a central example, we highlight the case one agent is no-swap and the other's regret is unconstrained. We show that this captures an extension of $\textit{Stackelberg}$ equilibria with a matching optimal value, and that there exists a wide class of games where a player can significantly increase their utility by deviating from a no-swap-regret algorithm against a no-swap learner (in fact, almost any game without pure Nash equilibria is of this form). Additionally, we make use of generalized equilibria to consider tradeoffs in terms of the opponent's algorithm choice. We give a tight characterization for the maximal reward obtainable against $\textit{some}$ no-regret learner, yet we also show a class of games in which this is bounded away from the value obtainable against the class of common "mean-based" no-regret algorithms. Finally, we consider the question of learning reward-optimal strategies via repeated play with a no-regret agent when the game is initially unknown. Again we show tradeoffs depending on the opponent's learning algorithm: the Stackelberg strategy is learnable in exponential time with any no-regret agent (and in polynomial time with any no-$\textit{adaptive}$-regret agent) for any game where it is learnable via queries, and there are games where it is learnable in polynomial time against any no-swap-regret agent but requires exponential time against a mean-based no-regret agent.  ( 3 min )
    Meta-Referential Games to Learn Compositional Learning Behaviours. (arXiv:2207.08012v4 [cs.CL] UPDATED)
    Human beings use compositionality to generalise from past experiences to novel experiences. We assume a separation of our experiences into fundamental atomic components that can be recombined in novel ways to support our ability to engage with novel experiences. We frame this as the ability to learn to generalise compositionally, and we will refer to behaviours making use of this ability as compositional learning behaviours (CLBs). A central problem to learning CLBs is the resolution of a binding problem (BP). While it is another feat of intelligence that human beings perform with ease, it is not the case for state-of-the-art artificial agents. Thus, in order to build artificial agents able to collaborate with human beings, we propose to develop a novel benchmark to investigate agents' abilities to exhibit CLBs by solving a domain-agnostic version of the BP. We take inspiration from the language emergence and grounding framework of referential games and propose a meta-learning extension of referential games, entitled Meta-Referential Games, and use this framework to build our benchmark, the Symbolic Behaviour Benchmark (S2B). We provide baseline results and error analysis showing that our benchmark is a compelling challenge that we hope will spur the research community towards developing more capable artificial agents.  ( 3 min )
    Price-Discrimination Game for Distributed Resource Management in Federated Learning. (arXiv:2308.13838v2 [cs.LG] UPDATED)
    In vanilla federated learning (FL) such as FedAvg, the parameter server (PS) and multiple distributed clients can form a typical buyer's market, where the number of PS/buyers of FL services is far less than the number of clients/sellers. In order to improve the performance of FL and reduce the cost of motivating clients to participate in FL, this paper proposes to differentiate the pricing for services provided by different clients rather than simply providing the same service pricing for different clients. The price is differentiated based on the performance improvements brought to FL and their heterogeneity in computing and communication capabilities. To this end, a price-discrimination game (PDG) is formulated to comprehensively address the distributed resource management problems in FL, including multi-objective trade-off, client selection, and incentive mechanism. As the PDG is a mixed-integer nonlinear programming (MINLP) problem, a distributed semi-heuristic algorithm with low computational complexity and low communication overhead is designed to solve it. The simulation result verifies the effectiveness of the proposed approach.  ( 2 min )
    Data Banzhaf: A Robust Data Valuation Framework for Machine Learning. (arXiv:2205.15466v7 [cs.LG] UPDATED)
    Data valuation has wide use cases in machine learning, including improving data quality and creating economic incentives for data sharing. This paper studies the robustness of data valuation to noisy model performance scores. Particularly, we find that the inherent randomness of the widely used stochastic gradient descent can cause existing data value notions (e.g., the Shapley value and the Leave-one-out error) to produce inconsistent data value rankings across different runs. To address this challenge, we introduce the concept of safety margin, which measures the robustness of a data value notion. We show that the Banzhaf value, a famous value notion that originated from cooperative game theory literature, achieves the largest safety margin among all semivalues (a class of value notions that satisfy crucial properties entailed by ML applications and include the famous Shapley value and Leave-one-out error). We propose an algorithm to efficiently estimate the Banzhaf value based on the Maximum Sample Reuse (MSR) principle. Our evaluation demonstrates that the Banzhaf value outperforms the existing semivalue-based data value notions on several ML tasks such as learning with weighted samples and noisy label detection. Overall, our study suggests that when the underlying ML algorithm is stochastic, the Banzhaf value is a promising alternative to the other semivalue-based data value schemes given its computational advantage and ability to robustly differentiate data quality.  ( 3 min )
    SAMP: A Model Inference Toolkit of Post-Training Quantization for Text Processing via Self-Adaptive Mixed-Precision. (arXiv:2209.09130v2 [cs.LG] UPDATED)
    The latest industrial inference engines, such as FasterTransformer and TurboTransformers, have verified that half-precision floating point (FP16) and 8-bit integer (INT8) quantization can greatly improve model inference speed. However, the existing INT8 quantization methods are too complicated, and improper usage will lead to model performance damage greatly. In this paper, we develop a toolkit for users to easily quantize their models for inference, in which Self-Adaptive Mixed-Precision (SAMP) is proposed to automatically control quantization rate by a mixed-precision architecture to balance model accuracy and efficiency. Experimental results show that our SAMP toolkit has a higher speedup than PyTorch and FasterTransformer while ensuring the required accuracy. In addition, SAMP is based on a modular design, decoupling the tokenizer, embedding, encoder and target layers, which allows users to handle various downstream tasks and can be seamlessly integrated into PyTorch.  ( 2 min )
    Partial Matrix Completion. (arXiv:2208.12063v2 [cs.LG] UPDATED)
    The matrix completion problem aims to reconstruct a low-rank matrix based on a revealed set of possibly noisy entries. Prior works consider completing the entire matrix with generalization error guarantees. However, the completion accuracy can be drastically different over different entries. This work establishes a new framework of partial matrix completion, where the goal is to identify a large subset of the entries that can be completed with high confidence. We propose an efficient algorithm with the following provable guarantees. Given access to samples from an unknown and arbitrary distribution, it guarantees: (a) high accuracy over completed entries, and (b) high coverage of the underlying distribution. We also consider an online learning variant of this problem, where we propose a low-regret algorithm based on iterative gradient updates. Preliminary empirical evaluations are included.  ( 2 min )
    Understanding or Manipulation: Rethinking Online Performance Gains of Modern Recommender Systems. (arXiv:2210.05662v2 [cs.IR] UPDATED)
    Recommender systems are expected to be assistants that help human users find relevant information automatically without explicit queries. As recommender systems evolve, increasingly sophisticated learning techniques are applied and have achieved better performance in terms of user engagement metrics such as clicks and browsing time. The increase in the measured performance, however, can have two possible attributions: a better understanding of user preferences, and a more proactive ability to utilize human bounded rationality to seduce user over-consumption. A natural following question is whether current recommendation algorithms are manipulating user preferences. If so, can we measure the manipulation level? In this paper, we present a general framework for benchmarking the degree of manipulations of recommendation algorithms, in both slate recommendation and sequential recommendation scenarios. The framework consists of four stages, initial preference calculation, training data collection, algorithm training and interaction, and metrics calculation that involves two proposed metrics. We benchmark some representative recommendation algorithms in both synthetic and real-world datasets under the proposed framework. We have observed that a high online click-through rate does not necessarily mean a better understanding of user initial preference, but ends in prompting users to choose more documents they initially did not favor. Moreover, we find that the training data have notable impacts on the manipulation degrees, and algorithms with more powerful modeling abilities are more sensitive to such impacts. The experiments also verified the usefulness of the proposed metrics for measuring the degree of manipulations. We advocate that future recommendation algorithm studies should be treated as an optimization problem with constrained user preference manipulations.  ( 3 min )
    Adversarial Graph Contrastive Learning with Information Regularization. (arXiv:2202.06491v5 [cs.LG] UPDATED)
    Contrastive learning is an effective unsupervised method in graph representation learning. Recently, the data augmentation based contrastive learning method has been extended from images to graphs. However, most prior works are directly adapted from the models designed for images. Unlike the data augmentation on images, the data augmentation on graphs is far less intuitive and much harder to provide high-quality contrastive samples, which are the key to the performance of contrastive learning models. This leaves much space for improvement over the existing graph contrastive learning frameworks. In this work, by introducing an adversarial graph view and an information regularizer, we propose a simple but effective method, Adversarial Graph Contrastive Learning (ARIEL), to extract informative contrastive samples within a reasonable constraint. It consistently outperforms the current graph contrastive learning methods in the node classification task over various real-world datasets and further improves the robustness of graph contrastive learning. The code is at https://github.com/Shengyu-Feng/ARIEL.  ( 2 min )
    Uniform Sequence Better: Time Interval Aware Data Augmentation for Sequential Recommendation. (arXiv:2212.08262v2 [cs.IR] UPDATED)
    Sequential recommendation is an important task to predict the next-item to access based on a sequence of interacted items. Most existing works learn user preference as the transition pattern from the previous item to the next one, ignoring the time interval between these two items. However, we observe that the time interval in a sequence may vary significantly different, and thus result in the ineffectiveness of user modeling due to the issue of \emph{preference drift}. In fact, we conducted an empirical study to validate this observation, and found that a sequence with uniformly distributed time interval (denoted as uniform sequence) is more beneficial for performance improvement than that with greatly varying time interval. Therefore, we propose to augment sequence data from the perspective of time interval, which is not studied in the literature. Specifically, we design five operators (Ti-Crop, Ti-Reorder, Ti-Mask, Ti-Substitute, Ti-Insert) to transform the original non-uniform sequence to uniform sequence with the consideration of variance of time intervals. Then, we devise a control strategy to execute data augmentation on item sequences in different lengths. Finally, we implement these improvements on a state-of-the-art model CoSeRec and validate our approach on four real datasets. The experimental results show that our approach reaches significantly better performance than the other 11 competing methods. Our implementation is available: https://github.com/KingGugu/TiCoSeRec.  ( 3 min )
    Rotting Infinitely Many-armed Bandits. (arXiv:2201.12975v3 [cs.LG] UPDATED)
    We consider the infinitely many-armed bandit problem with rotting rewards, where the mean reward of an arm decreases at each pull of the arm according to an arbitrary trend with maximum rotting rate $\varrho=o(1)$. We show that this learning problem has an $\Omega(\max\{\varrho^{1/3}T,\sqrt{T}\})$ worst-case regret lower bound where $T$ is the horizon time. We show that a matching upper bound $\tilde{O}(\max\{\varrho^{1/3}T,\sqrt{T}\})$, up to a poly-logarithmic factor, can be achieved by an algorithm that uses a UCB index for each arm and a threshold value to decide whether to continue pulling an arm or remove the arm from further consideration, when the algorithm knows the value of the maximum rotting rate $\varrho$. We also show that an $\tilde{O}(\max\{\varrho^{1/3}T,T^{3/4}\})$ regret upper bound can be achieved by an algorithm that does not know the value of $\varrho$, by using an adaptive UCB index along with an adaptive threshold value.  ( 2 min )
    Proximal Mean Field Learning in Shallow Neural Networks. (arXiv:2210.13879v3 [cs.LG] UPDATED)
    We propose a custom learning algorithm for shallow over-parameterized neural networks, i.e., networks with single hidden layer having infinite width. The infinite width of the hidden layer serves as an abstraction for the over-parameterization. Building on the recent mean field interpretations of learning dynamics in shallow neural networks, we realize mean field learning as a computational algorithm, rather than as an analytical tool. Specifically, we design a Sinkhorn regularized proximal algorithm to approximate the distributional flow for the learning dynamics over weighted point clouds. In this setting, a contractive fixed point recursion computes the time-varying weights, numerically realizing the interacting Wasserstein gradient flow of the parameter distribution supported over the neuronal ensemble. An appealing aspect of the proposed algorithm is that the measure-valued recursions allow meshless computation. We demonstrate the proposed computational framework of interacting weighted particle evolution on binary and multi-class classification. Our algorithm performs gradient descent of the free energy associated with the risk functional.  ( 2 min )
    A novel multi-layer modular approach for real-time fuzzy-identification of gravitational-wave signals. (arXiv:2206.06004v4 [gr-qc] UPDATED)
    Advanced LIGO and Advanced Virgo ground-based interferometers are instruments capable to detect gravitational wave signals exploiting advanced laser interferometry techniques. The underlying data analysis task consists in identifying specific patterns in noisy timeseries, but it is made extremely complex by the incredibly small amplitude of the target signals. In this scenario, the development of effective gravitational wave detection algorithms is crucial. We propose a novel layered framework for real-time detection of gravitational waves inspired by speech processing techniques and, in the present implementation, based on a state-of-the-art machine learning approach involving a hybridization of genetic programming and neural networks. The key aspects of the newly proposed framework are: the well structured, layered approach, and the low computational complexity. The paper describes the basic concepts of the framework and the derivation of the first three layers. Even if the layers are based on models derived using a machine learning approach, the proposed layered structure has a universal nature. Compared to more complex approaches, such as convolutional neural networks, which comprise a parameter set of several tens of MB and were tested exclusively for fixed length data samples, our framework has lower accuracy (e.g., it identifies 45% of low signal-to-noise-ration gravitational wave signals, against 65% of the state-of-the-art, at a false alarm probability of $10^{-2}$), but has a much lower computational complexity and a higher degree of modularity. Furthermore, the exploitation of short-term features makes the results of the new framework virtually independent against time-position of gravitational wave signals, simplifying its future exploitation in real-time multi-layer pipelines for gravitational-wave detection with new generation interferometers.  ( 3 min )
    FO-PINNs: A First-Order formulation for Physics Informed Neural Networks. (arXiv:2210.14320v2 [cs.LG] UPDATED)
    Physics-Informed Neural Networks (PINNs) are a class of deep learning neural networks that learn the response of a physical system without any simulation data, and only by incorporating the governing partial differential equations (PDEs) in their loss function. While PINNs are successfully used for solving forward and inverse problems, their accuracy decreases significantly for parameterized systems. PINNs also have a soft implementation of boundary conditions resulting in boundary conditions not being exactly imposed everywhere on the boundary. With these challenges at hand, we present first-order physics-informed neural networks (FO-PINNs). These are PINNs that are trained using a first-order formulation of the PDE loss function. We show that, compared to standard PINNs, FO-PINNs offer significantly higher accuracy in solving parameterized systems, and reduce time-per-iteration by removing the extra backpropagations needed to compute the second or higher-order derivatives. Additionally, FO-PINNs can enable exact imposition of boundary conditions using approximate distance functions, which pose challenges when applied on high-order PDEs. Through three examples, we demonstrate the advantages of FO-PINNs over standard PINNs in terms of accuracy and training speedup.  ( 2 min )
    POLTER: Policy Trajectory Ensemble Regularization for Unsupervised Reinforcement Learning. (arXiv:2205.11357v3 [cs.LG] UPDATED)
    The goal of Unsupervised Reinforcement Learning (URL) is to find a reward-agnostic prior policy on a task domain, such that the sample-efficiency on supervised downstream tasks is improved. Although agents initialized with such a prior policy can achieve a significantly higher reward with fewer samples when finetuned on the downstream task, it is still an open question how an optimal pretrained prior policy can be achieved in practice. In this work, we present POLTER (Policy Trajectory Ensemble Regularization) - a general method to regularize the pretraining that can be applied to any URL algorithm and is especially useful on data- and knowledge-based URL algorithms. It utilizes an ensemble of policies that are discovered during pretraining and moves the policy of the URL algorithm closer to its optimal prior. Our method is based on a theoretical framework, and we analyze its practical effects on a white-box benchmark, allowing us to study POLTER with full control. In our main experiments, we evaluate POLTER on the Unsupervised Reinforcement Learning Benchmark (URLB), which consists of 12 tasks in 3 domains. We demonstrate the generality of our approach by improving the performance of a diverse set of data- and knowledge-based URL algorithms by 19% on average and up to 40% in the best case. Under a fair comparison with tuned baselines and tuned POLTER, we establish a new state-of-the-art for model-free methods on the URLB.  ( 3 min )
    Impartial Games: A Challenge for Reinforcement Learning. (arXiv:2205.12787v3 [cs.LG] UPDATED)
    While AlphaZero-style reinforcement learning (RL) algorithms excel in various board games, in this paper we show that they face challenges on impartial games where players share pieces. We present a concrete example of a game - namely the children's game of Nim - and other impartial games that seem to be a stumbling block for AlphaZero-style and similar self-play reinforcement learning algorithms. Our work is built on the challenges posed by the intricacies of data distribution on the ability of neural networks to learn parity functions, exacerbated by the noisy labels issue. Our findings are consistent with recent studies showing that AlphaZero-style algorithms are vulnerable to adversarial attacks and adversarial perturbations, showing the difficulty of learning to master the games in all legal states. We show that Nim can be learned on small boards, but the learning progress of AlphaZero-style algorithms dramatically slows down when the board size increases. Intuitively, the difference between impartial games like Nim and partisan games like Chess and Go can be explained by the fact that if a small part of the board is covered for impartial games it is typically not possible to predict whether the position is won or lost as there is often zero correlation between the visible part of a partly blanked-out position and its correct evaluation. This situation starkly contrasts partisan games where a partly blanked-out board position typically provides abundant or at least non-trifle information about the value of the fully uncovered position.  ( 3 min )
    FedGCN: Convergence-Communication Tradeoffs in Federated Training of Graph Convolutional Networks. (arXiv:2201.12433v7 [cs.LG] UPDATED)
    Methods for training models on graphs distributed across multiple clients have recently grown in popularity, due to the size of these graphs as well as regulations on keeping data where it is generated. However, the cross-client edges naturally exist among clients. Thus, distributed methods for training a model on a single graph incur either significant communication overhead between clients or a loss of available information to the training. We introduce the Federated Graph Convolutional Network (FedGCN) algorithm, which uses federated learning to train GCN models for semi-supervised node classification with fast convergence and little communication. Compared to prior methods that require extra communication among clients at each training round, FedGCN clients only communicate with the central server in one pre-training step, greatly reducing communication costs and allowing the use of homomorphic encryption to further enhance privacy. We theoretically analyze the tradeoff between FedGCN's convergence rate and communication cost under different data distributions. Experimental results show that our FedGCN algorithm achieves better model accuracy with 51.7% faster convergence on average and at least 100X less communication compared to prior work.  ( 3 min )
    Detecting fake accounts through Generative Adversarial Network in online social media. (arXiv:2210.15657v2 [cs.SI] UPDATED)
    Online social media is integral to human life, facilitating messaging, information sharing, and confidential communication while preserving privacy. Platforms like Twitter, Instagram, and Facebook exemplify this phenomenon. However, users face challenges due to network anomalies, often stemming from malicious activities such as identity theft for financial gain or harm. This paper proposes a novel method using user similarity measures and the Generative Adversarial Network (GAN) algorithm to identify fake user accounts in the Twitter dataset. Despite the problem's complexity, the method achieves an AUC rate of 80\% in classifying and detecting fake accounts. Notably, the study builds on previous research, highlighting advancements and insights into the evolving landscape of anomaly detection in online social networks.  ( 2 min )
    Fake detection in imbalance dataset by Semi-supervised learning with GAN. (arXiv:2212.01071v3 [cs.LG] UPDATED)
    As social media continues to grow rapidly, the prevalence of harassment on these platforms has also increased. This has piqued the interest of researchers in the field of fake detection. Social media data, often forms complex graphs with numerous nodes, posing several challenges. These challenges and limitations include dealing with a significant amount of irrelevant features in matrices and addressing issues such as high data dispersion and an imbalanced class distribution within the dataset. To overcome these challenges and limitations, researchers have employed auto-encoders and a combination of semi-supervised learning with a GAN algorithm, referred to as SGAN. Our proposed method utilizes auto-encoders for feature extraction and incorporates SGAN. By leveraging an unlabeled dataset, the unsupervised layer of SGAN compensates for the limited availability of labeled data, making efficient use of the limited number of labeled instances. Multiple evaluation metrics were employed, including the Confusion Matrix and the ROC curve. The dataset was divided into training and testing sets, with 100 labeled samples for training and 1,000 samples for testing. The novelty of our research lies in applying SGAN to address the issue of imbalanced datasets in fake account detection. By optimizing the use of a smaller number of labeled instances and reducing the need for extensive computational power, our method offers a more efficient solution. Additionally, our study contributes to the field by achieving an 81% accuracy in detecting fake accounts using only 100 labeled samples. This demonstrates the potential of SGAN as a powerful tool for handling minority classes and addressing big data challenges in fake account detection.  ( 3 min )
    Disentangled Representation with Causal Constraints for Counterfactual Fairness. (arXiv:2208.09147v2 [cs.LG] UPDATED)
    Much research has been devoted to the problem of learning fair representations; however, they do not explicitly the relationship between latent representations. In many real-world applications, there may be causal relationships between latent representations. Furthermore, most fair representation learning methods focus on group-level fairness and are based on correlations, ignoring the causal relationships underlying the data. In this work, we theoretically demonstrate that using the structured representations enable downstream predictive models to achieve counterfactual fairness, and then we propose the Counterfactual Fairness Variational AutoEncoder (CF-VAE) to obtain structured representations with respect to domain knowledge. The experimental results show that the proposed method achieves better fairness and accuracy performance than the benchmark fairness methods.  ( 2 min )
    Training Adaptive Reconstruction Networks for Blind Inverse Problems. (arXiv:2202.11342v3 [cs.LG] UPDATED)
    Neural networks allow solving many ill-posed inverse problems with unprecedented performance. Physics informed approaches already progressively replace carefully hand-crafted reconstruction algorithms in real applications. However, these networks suffer from a major defect: when trained on a given forward operator, they do not generalize well to a different one. The aim of this paper is twofold. First, we show through various applications that training the network with a family of forward operators allows solving the adaptivity problem without compromising the reconstruction quality significantly.Second, we illustrate that this training procedure allows tackling challenging blind inverse problems.Our experiments include partial Fourier sampling problems arising in magnetic resonance imaging (MRI) with sensitivity estimation and off-resonance effects, computerized tomography (CT) with a tilted geometry and image deblurring with Fresnel diffraction kernels.  ( 2 min )
    Commutativity and Disentanglement from the Manifold Perspective. (arXiv:2210.07857v4 [stat.ML] UPDATED)
    In this paper, we interpret disentanglement as the discovery of local charts of the data manifold and trace how this definition naturally leads to an equivalent condition for disentanglement: commutativity between factors of variation. We study the impact of this manifold framework to two classes of problems: learning matrix exponential operators and compressing data-generating models. In each problem, the manifold perspective yields interesting results about the feasibility and fruitful approaches their solutions. We also link our manifold framework to two other common disentanglement paradigms: group theoretic and probabilistic approaches to disentanglement. In each case, we show how these frameworks can be merged with our manifold perspective. Importantly, we recover commutativity as a central property in both alternative frameworks, further highlighting its importance in disentanglement.  ( 2 min )
    Marginal Post Processing of Bayesian Inference Products with Normalizing Flows and Kernel Density Estimators. (arXiv:2205.12841v5 [astro-ph.IM] UPDATED)
    Bayesian analysis has become an indispensable tool across many different cosmological fields including the study of gravitational waves, the Cosmic Microwave Background and the 21-cm signal from the Cosmic Dawn among other phenomena. The method provides a way to fit complex models to data describing key cosmological and astrophysical signals and a whole host of contaminating signals and instrumental effects modelled with `nuisance parameters'. In this paper, we summarise a method that uses Masked Autoregressive Flows and Kernel Density Estimators to learn marginal posterior densities corresponding to core science parameters. We find that the marginal or 'nuisance-free' posteriors and the associated likelihoods have an abundance of applications including; the calculation of previously intractable marginal Kullback-Leibler divergences and marginal Bayesian Model Dimensionalities, likelihood emulation and prior emulation. We demonstrate each application using toy examples, examples from the field of 21-cm cosmology and samples from the Dark Energy Survey. We discuss how marginal summary statistics like the Kullback-Leibler divergences and Bayesian Model Dimensionalities can be used to examine the constraining power of different experiments and how we can perform efficient joint analysis by taking advantage of marginal prior and likelihood emulators. We package our multipurpose code up in the pip-installable code margarine for use in the wider scientific community.  ( 3 min )
    Using Model-Based Trees with Boosting to Fit Low-Order Functional ANOVA Models. (arXiv:2207.06950v5 [stat.ML] UPDATED)
    Low-order functional ANOVA (fANOVA) models have been rediscovered in the machine learning (ML) community under the guise of inherently interpretable machine learning. Explainable Boosting Machines or EBM (Lou et al. 2013) and GAMI-Net (Yang et al. 2021) are two recently proposed ML algorithms for fitting functional main effects and second-order interactions. We propose a new algorithm, called GAMI-Tree, that is similar to EBM, but has a number of features that lead to better performance. It uses model-based trees as base learners and incorporates a new interaction filtering method that is better at capturing the underlying interactions. In addition, our iterative training method converges to a model with better predictive performance, and the embedded purification ensures that interactions are hierarchically orthogonal to main effects. The algorithm does not need extensive tuning, and our implementation is fast and efficient. We use simulated and real datasets to compare the performance and interpretability of GAMI-Tree with EBM and GAMI-Net.  ( 2 min )
    IoTGAN: GAN Powered Camouflage Against Machine Learning Based IoT Device Identification. (arXiv:2201.03281v2 [cs.CR] UPDATED)
    With the proliferation of IoT devices, researchers have developed a variety of IoT device identification methods with the assistance of machine learning. Nevertheless, the security of these identification methods mostly depends on collected training data. In this research, we propose a novel attack strategy named IoTGAN to manipulate an IoT device's traffic such that it can evade machine learning based IoT device identification. In the development of IoTGAN, we have two major technical challenges: (i) How to obtain the discriminative model in a black-box setting, and (ii) How to add perturbations to IoT traffic through the manipulative model, so as to evade the identification while not influencing the functionality of IoT devices. To address these challenges, a neural network based substitute model is used to fit the target model in black-box settings, it works as a discriminative model in IoTGAN. A manipulative model is trained to add adversarial perturbations into the IoT device's traffic to evade the substitute model. Experimental results show that IoTGAN can successfully achieve the attack goals. We also develop efficient countermeasures to protect machine learning based IoT device identification from been undermined by IoTGAN.  ( 3 min )
    Policy Learning with Competing Agents. (arXiv:2204.01884v3 [stat.ML] UPDATED)
    Decision makers often aim to learn a treatment assignment policy under a capacity constraint on the number of agents that they can treat. When agents can respond strategically to such policies, competition arises, complicating estimation of the optimal policy. In this paper, we study capacity-constrained treatment assignment in the presence of such interference. We consider a dynamic model where the decision maker allocates treatments at each time step and heterogeneous agents myopically best respond to the previous treatment assignment policy. When the number of agents is large but finite, we show that the threshold for receiving treatment under a given policy converges to the policy's mean-field equilibrium threshold. Based on this result, we develop a consistent estimator for the policy gradient. In simulations and a semi-synthetic experiment with data from the National Education Longitudinal Study of 1988, we demonstrate that this estimator can be used for learning capacity-constrained policies in the presence of strategic behavior.  ( 2 min )
    Active Learning Guided by Efficient Surrogate Learners. (arXiv:2301.02761v2 [cs.LG] UPDATED)
    Re-training a deep learning model each time a single data point receives a new label is impractical due to the inherent complexity of the training process. Consequently, existing active learning (AL) algorithms tend to adopt a batch-based approach where, during each AL iteration, a set of data points is collectively chosen for annotation. However, this strategy frequently leads to redundant sampling, ultimately eroding the efficacy of the labeling procedure. In this paper, we introduce a new AL algorithm that harnesses the power of a Gaussian process surrogate in conjunction with the neural network principal learner. Our proposed model adeptly updates the surrogate learner for every new data instance, enabling it to emulate and capitalize on the continuous learning dynamics of the neural network without necessitating a complete retraining of the principal model for each individual label. Experiments on four benchmark datasets demonstrate that this approach yields significant enhancements, either rivaling or aligning with the performance of state-of-the-art techniques.  ( 2 min )
    Testing Relative Fairness in Human Decisions With Machine Learning. (arXiv:2112.11279v2 [cs.LG] UPDATED)
    Fairness in decision-making has been a long-standing issue in our society. Compared to algorithmic fairness, fairness in human decisions is even more important since there are processes where humans make the final decisions and that machine learning models inherit bias from the human decisions they were trained on. However, the standard for fairness in human decisions are highly subjective and contextual. This leads to the difficulty for testing "absolute" fairness in human decisions. To bypass this issue, this work aims to test relative fairness in human decisions. That is, instead of defining what are "absolute" fair decisions, we check the relative fairness of one decision set against another. An example outcome can be: Decision Set A favors female over male more than Decision Set B. Such relative fairness has the following benefits: (1) it avoids the ambiguous and contradictory definition of "absolute" fair decisions; (2) it reveals the relative preference and bias between different human decisions; (3) if a reference set of decisions is provided, relative fairness of other decision sets against this reference set can reflect whether those decision sets are fair by the standard of that reference set. We define the relative fairness with statistical tests (null hypothesis and effect size tests) of the decision differences across each sensitive group. Furthermore, we show that a machine learning model trained on the human decisions can inherit the bias/preference and therefore can be utilized to estimate the relative fairness between two decision sets made on different data.  ( 3 min )
    Mesogeos: A multi-purpose dataset for data-driven wildfire modeling in the Mediterranean. (arXiv:2306.05144v2 [cs.CV] UPDATED)
    We introduce Mesogeos, a large-scale multi-purpose dataset for wildfire modeling in the Mediterranean. Mesogeos integrates variables representing wildfire drivers (meteorology, vegetation, human activity) and historical records of wildfire ignitions and burned areas for 17 years (2006-2022). It is designed as a cloud-friendly spatio-temporal dataset, namely a datacube, harmonizing all variables in a grid of 1km x 1km x 1-day resolution. The datacube structure offers opportunities to assess machine learning (ML) usage in various wildfire modeling tasks. We extract two ML-ready datasets that establish distinct tracks to demonstrate this potential: (1) short-term wildfire danger forecasting and (2) final burned area estimation given the point of ignition. We define appropriate metrics and baselines to evaluate the performance of models in each track. By publishing the datacube, along with the code to create the ML datasets and models, we encourage the community to foster the implementation of additional tracks for mitigating the increasing threat of wildfires in the Mediterranean.  ( 2 min )
    Mitigating Backdoors in Federated Learning with FLD. (arXiv:2303.00302v2 [cs.LG] UPDATED)
    Federated learning allows clients to collaboratively train a global model without uploading raw data for privacy preservation. This feature, i.e., the inability to review participants' datasets, has recently been found responsible for federated learning's vulnerability in the face of backdoor attacks. Existing defense methods fall short from two perspectives: 1) they consider only very specific and limited attacker models and unable to cope with advanced backdoor attacks, such as distributed backdoor attacks, which break down the global trigger into multiple distributed triggers. 2) they conduct detection based on model granularity thus the performance gets impacted by the model dimension. To address these challenges, we propose Federated Layer Detection (FLD), a novel model filtering approach for effectively defending against backdoor attacks. FLD examines the models based on layer granularity to capture the complete model details and effectively detect potential backdoor models regardless of model dimension. We provide theoretical analysis and proof for the convergence of FLD. Extensive experiments demonstrate that FLD effectively mitigates state-of-the-art backdoor attacks with negligible impact on the accuracy of the primary task.  ( 2 min )
    Stochastic Latent Transformer: Efficient Modelling of Stochastically Forced Zonal Jets. (arXiv:2310.16741v2 [cs.LG] UPDATED)
    We present a novel probabilistic deep learning approach, the 'Stochastic Latent Transformer' (SLT), designed for the efficient reduced-order modelling of stochastic partial differential equations. Stochastically driven flow models are pertinent to a diverse range of natural phenomena, including jets on giant planets, ocean circulation, and the variability of midlatitude weather. However, much of the recent progress in deep learning has predominantly focused on deterministic systems. The SLT comprises a stochastically-forced transformer paired with a translation-equivariant autoencoder, trained towards the Continuous Ranked Probability Score. We showcase its effectiveness by applying it to a well-researched zonal jet system, where the interaction between stochastically forced eddies and the zonal mean flow results in a rich low-frequency variability. The SLT accurately reproduces system dynamics across various integration periods, validated through quantitative diagnostics that include spectral properties and the rate of transitions between distinct states. The SLT achieves a five-order-of-magnitude speedup in emulating the zonally-averaged flow compared to direct numerical simulations. This acceleration facilitates the cost-effective generation of large ensembles, enabling the exploration of statistical questions concerning the probabilities of spontaneous transition events.  ( 2 min )
    All in One: Multi-task Prompting for Graph Neural Networks. (arXiv:2307.01504v2 [cs.SI] UPDATED)
    Recently, ''pre-training and fine-tuning'' has been adopted as a standard workflow for many graph tasks since it can take general graph knowledge to relieve the lack of graph annotations from each application. However, graph tasks with node level, edge level, and graph level are far diversified, making the pre-training pretext often incompatible with these multiple tasks. This gap may even cause a ''negative transfer'' to the specific application, leading to poor results. Inspired by the prompt learning in natural language processing (NLP), which has presented significant effectiveness in leveraging prior knowledge for various NLP tasks, we study the prompting topic for graphs with the motivation of filling the gap between pre-trained models and various graph tasks. In this paper, we propose a novel multi-task prompting method for graph models. Specifically, we first unify the format of graph prompts and language prompts with the prompt token, token structure, and inserting pattern. In this way, the prompting idea from NLP can be seamlessly introduced to the graph area. Then, to further narrow the gap between various graph tasks and state-of-the-art pre-training strategies, we further study the task space of various graph applications and reformulate downstream problems to the graph-level task. Afterward, we introduce meta-learning to efficiently learn a better initialization for the multi-task prompt of graphs so that our prompting framework can be more reliable and general for different tasks. We conduct extensive experiments, results from which demonstrate the superiority of our method.  ( 3 min )
    Communication-constrained hypothesis testing: Optimality, robustness, and reverse data processing inequalities. (arXiv:2206.02765v2 [math.ST] UPDATED)
    We study hypothesis testing under communication constraints, where each sample is quantized before being revealed to a statistician. Without communication constraints, it is well known that the sample complexity of simple binary hypothesis testing is characterized by the Hellinger distance between the distributions. We show that the sample complexity of simple binary hypothesis testing under communication constraints is at most a logarithmic factor larger than in the unconstrained setting and this bound is tight. We develop a polynomial-time algorithm that achieves the aforementioned sample complexity. Our framework extends to robust hypothesis testing, where the distributions are corrupted in the total variation distance. Our proofs rely on a new reverse data processing inequality and a reverse Markov inequality, which may be of independent interest. For simple $M$-ary hypothesis testing, the sample complexity in the absence of communication constraints has a logarithmic dependence on $M$. We show that communication constraints can cause an exponential blow-up leading to $\Omega(M)$ sample complexity even for adaptive algorithms.  ( 2 min )
    Deep Feature Screening: Feature Selection for Ultra High-Dimensional Data via Deep Neural Networks. (arXiv:2204.01682v3 [stat.ML] UPDATED)
    The applications of traditional statistical feature selection methods to high-dimension, low sample-size data often struggle and encounter challenging problems, such as overfitting, curse of dimensionality, computational infeasibility, and strong model assumption. In this paper, we propose a novel two-step nonparametric approach called Deep Feature Screening (DeepFS) that can overcome these problems and identify significant features with high precision for ultra high-dimensional, low-sample-size data. This approach first extracts a low-dimensional representation of input data and then applies feature screening based on multivariate rank distance correlation recently developed by Deb and Sen (2021). This approach combines the strengths of both deep neural networks and feature screening, and thereby has the following appealing features in addition to its ability of handling ultra high-dimensional data with small number of samples: (1) it is model free and distribution free; (2) it can be used for both supervised and unsupervised feature selection; and (3) it is capable of recovering the original input data. The superiority of DeepFS is demonstrated via extensive simulation studies and real data analyses.  ( 2 min )
    Geometric structure of Deep Learning networks and construction of global ${\mathcal L}^2$ minimizers. (arXiv:2309.10639v3 [cs.LG] UPDATED)
    In this paper, we provide a geometric interpretation of the structure of Deep Learning (DL) networks, characterized by $L$ hidden layers, a ReLU ramp activation function, an $\mathcal{L}^2$ Schatten class (or Hilbert-Schmidt) cost function, and input and output spaces $\mathbb{R}^Q$ with equal dimension $Q\geq1$. The hidden layers are also defined on $\mathbb{R}^{Q}$; the training input size $N$ can be arbitrarily large - thus, we are considering the underparametrized regime. We apply our recent results on shallow neural networks to construct an explicit family of minimizers for the global minimum of the cost function in the case $L\geq Q$, which we show to be degenerate. In the context presented here, the hidden layers of the DL network "curate" the training inputs by recursive application of a truncation map that minimizes the noise to signal ratio of the training inputs. Moreover, we determine a set of $2^Q-1$ distinct degenerate local minima of the cost function. Our constructions make no use of gradient descent algorithms at all.  ( 3 min )
    RayDF: Neural Ray-surface Distance Fields with Multi-view Consistency. (arXiv:2310.19629v2 [cs.CV] UPDATED)
    In this paper, we study the problem of continuous 3D shape representations. The majority of existing successful methods are coordinate-based implicit neural representations. However, they are inefficient to render novel views or recover explicit surface points. A few works start to formulate 3D shapes as ray-based neural functions, but the learned structures are inferior due to the lack of multi-view geometry consistency. To tackle these challenges, we propose a new framework called RayDF. It consists of three major components: 1) the simple ray-surface distance field, 2) the novel dual-ray visibility classifier, and 3) a multi-view consistency optimization module to drive the learned ray-surface distances to be multi-view geometry consistent. We extensively evaluate our method on three public datasets, demonstrating remarkable performance in 3D surface point reconstruction on both synthetic and challenging real-world 3D scenes, clearly surpassing existing coordinate-based and ray-based baselines. Most notably, our method achieves a 1000x faster speed than coordinate-based methods to render an 800x800 depth image, showing the superiority of our method for 3D shape representation. Our code and data are available at https://github.com/vLAR-group/RayDF  ( 2 min )
    NetGPT: A Native-AI Network Architecture Beyond Provisioning Personalized Generative Services. (arXiv:2307.06148v3 [cs.LG] UPDATED)
    Large language models (LLMs) have triggered tremendous success to empower our daily life by generative information. The personalization of LLMs could further contribute to their applications due to better alignment with human intents. Towards personalized generative services, a collaborative cloud-edge methodology is promising, as it facilitates the effective orchestration of heterogeneous distributed communication and computing resources. In this article, we put forward NetGPT to capably synergize appropriate LLMs at the edge and the cloud based on their computing capacity. In addition, edge LLMs could efficiently leverage location-based information for personalized prompt completion, thus benefiting the interaction with the cloud LLM. In particular, we present the feasibility of NetGPT by leveraging low-rank adaptation-based fine-tuning of open-source LLMs (i.e., GPT-2-base model and LLaMA model), and conduct comprehensive numerical comparisons with alternative cloud-edge collaboration or cloud-only techniques, so as to demonstrate the superiority of NetGPT. Subsequently, we highlight the essential changes required for an artificial intelligence (AI)-native network architecture towards NetGPT, with emphasis on deeper integration of communications and computing resources and careful calibration of logical AI workflow. Furthermore, we demonstrate several benefits of NetGPT, which come as by-products, as the edge LLMs' capability to predict trends and infer intents promises a unified solution for intelligent network management & orchestration. We argue that NetGPT is a promising AI-native network architecture for provisioning beyond personalized generative services.  ( 3 min )
    Provably Personalized and Robust Federated Learning. (arXiv:2306.08393v2 [cs.LG] UPDATED)
    Identifying clients with similar objectives and learning a model-per-cluster is an intuitive and interpretable approach to personalization in federated learning. However, doing so with provable and optimal guarantees has remained an open challenge. We formalize this problem as a stochastic optimization problem, achieving optimal convergence rates for a large class of loss functions. We propose simple iterative algorithms which identify clusters of similar clients and train a personalized model-per-cluster, using local client gradients and flexible constraints on the clusters. The convergence rates of our algorithms asymptotically match those obtained if we knew the true underlying clustering of the clients and are provably robust in the Byzantine setting where some fraction of the clients are malicious.  ( 2 min )
    DePT: Decomposed Prompt Tuning for Parameter-Efficient Fine-tuning. (arXiv:2309.05173v3 [cs.CL] UPDATED)
    Prompt tuning (PT), where a small amount of trainable soft (continuous) prompt vectors is affixed to the input of language models (LM), has shown promising results across various tasks and models for parameter-efficient fine-tuning (PEFT). PT stands out from other PEFT approaches because it maintains competitive performance with fewer trainable parameters and does not drastically scale up its parameters as the model size expands. However, PT introduces additional soft prompt tokens, leading to longer input sequences, which significantly impacts training and inference time and memory usage due to the Transformer's quadratic complexity. Particularly concerning for Large Language Models (LLMs) that face heavy daily querying. To address this issue, we propose Decomposed Prompt Tuning (DePT), which decomposes the soft prompt into a shorter soft prompt and a pair of low-rank matrices that are then optimised with two different learning rates. This allows DePT to achieve better performance while saving over 20% memory and time costs compared to vanilla PT and its variants, without changing trainable parameter sizes. Through extensive experiments on 23 natural language processing (NLP) and vision-language (VL) tasks, we demonstrate that DePT outperforms state-of-the-art PEFT approaches, including the full fine-tuning baseline in some scenarios. Additionally, we empirically show that DEPT grows more efficient as the model size increases. Our further study reveals that DePT integrates seamlessly with parameter-efficient transfer learning in the few-shot learning setting and highlights its adaptability to various model architectures and sizes.  ( 3 min )
    Generalization Analogies: A Testbed for Generalizing AI Oversight to Hard-To-Measure Domains. (arXiv:2311.07723v3 [cs.AI] UPDATED)
    As AI systems become more intelligent and their behavior becomes more challenging to assess, they may learn to game the flaws of human feedback instead of genuinely striving to follow instructions; however, this risk can be mitigated by controlling how LLMs generalize human feedback to situations where it is unreliable. To better understand how reward models generalize, we craft 69 distribution shifts spanning 8 categories. We find that reward models do not learn to evaluate `instruction-following' by default and instead favor personas that resemble internet text. Techniques for interpreting reward models' internal representations achieve better generalization than standard fine-tuning, but still frequently fail to distinguish instruction-following from conflated behaviors. We consolidate the 15 most challenging distribution shifts into the GENeralization analogIES (GENIES) benchmark, which we hope will enable progress toward controlling reward model generalization.  ( 2 min )
    Audio Generation with Multiple Conditional Diffusion Model. (arXiv:2308.11940v3 [cs.SD] UPDATED)
    Text-based audio generation models have limitations as they cannot encompass all the information in audio, leading to restricted controllability when relying solely on text. To address this issue, we propose a novel model that enhances the controllability of existing pre-trained text-to-audio models by incorporating additional conditions including content (timestamp) and style (pitch contour and energy contour) as supplements to the text. This approach achieves fine-grained control over the temporal order, pitch, and energy of generated audio. To preserve the diversity of generation, we employ a trainable control condition encoder that is enhanced by a large language model and a trainable Fusion-Net to encode and fuse the additional conditions while keeping the weights of the pre-trained text-to-audio model frozen. Due to the lack of suitable datasets and evaluation metrics, we consolidate existing datasets into a new dataset comprising the audio and corresponding conditions and use a series of evaluation metrics to evaluate the controllability performance. Experimental results demonstrate that our model successfully achieves fine-grained control to accomplish controllable audio generation. Audio samples and our dataset are publicly available at https://conditionaudiogen.github.io/conditionaudiogen/  ( 2 min )
    PharmacoNet: Accelerating Large-Scale Virtual Screening by Deep Pharmacophore Modeling. (arXiv:2310.00681v3 [q-bio.BM] UPDATED)
    As the size of accessible compound libraries expands to over 10 billion, the need for more efficient structure-based virtual screening methods is emerging. Different pre-screening methods have been developed for rapid screening, but there is still a lack of structure-based methods applicable to various proteins that perform protein-ligand binding conformation prediction and scoring in an extremely short time. Here, we describe for the first time a deep-learning framework for structure-based pharmacophore modeling to address this challenge. We frame pharmacophore modeling as an instance segmentation problem to determine each protein hotspot and the location of corresponding pharmacophores, and protein-ligand binding pose prediction as a graph-matching problem. PharmacoNet is significantly faster than state-of-the-art structure-based approaches, yet reasonably accurate with a simple scoring function. Furthermore, we show the promising result that PharmacoNet effectively retains hit candidates even under the high pre-screening filtration rates. Overall, our study uncovers the hitherto untapped potential of a pharmacophore modeling approach in deep learning-based drug discovery.  ( 2 min )
    Stochastic Gradient Descent outperforms Gradient Descent in recovering a high-dimensional signal in a glassy energy landscape. (arXiv:2309.04788v2 [cs.LG] UPDATED)
    Stochastic Gradient Descent (SGD) is an out-of-equilibrium algorithm used extensively to train artificial neural networks. However very little is known on to what extent SGD is crucial for to the success of this technology and, in particular, how much it is effective in optimizing high-dimensional non-convex cost functions as compared to other optimization algorithms such as Gradient Descent (GD). In this work we leverage dynamical mean field theory to benchmark its performances in the high-dimensional limit. To do that, we consider the problem of recovering a hidden high-dimensional non-linearly encrypted signal, a prototype high-dimensional non-convex hard optimization problem. We compare the performances of SGD to GD and we show that SGD largely outperforms GD for sufficiently small batch sizes. In particular, a power law fit of the relaxation time of these algorithms shows that the recovery threshold for SGD with small batch size is smaller than the corresponding one of GD.  ( 2 min )
    Maximum diffusion reinforcement learning. (arXiv:2309.15293v3 [cs.LG] UPDATED)
    The assumption that data are independent and identically distributed underpins all machine learning. When data are collected sequentially from agent experiences this assumption does not generally hold, as in reinforcement learning. Here, we derive a method that overcomes these limitations by exploiting the statistical mechanics of ergodic processes, which we term maximum diffusion reinforcement learning. By decorrelating agent experiences, our approach provably enables single-shot learning in continuous deployments over the course of individual task attempts. Moreover, we prove our approach generalizes well-known maximum entropy techniques, and robustly exceeds state-of-the-art performance across popular benchmarks. Our results at the nexus of physics, learning, and control pave the way towards more transparent and reliable decision-making in reinforcement learning agents, such as locomoting robots and self-driving cars.  ( 2 min )
    Neural oscillators for generalization of physics-informed machine learning. (arXiv:2308.08989v2 [cs.LG] UPDATED)
    A primary challenge of physics-informed machine learning (PIML) is its generalization beyond the training domain, especially when dealing with complex physical problems represented by partial differential equations (PDEs). This paper aims to enhance the generalization capabilities of PIML, facilitating practical, real-world applications where accurate predictions in unexplored regions are crucial. We leverage the inherent causality and temporal sequential characteristics of PDE solutions to fuse PIML models with recurrent neural architectures based on systems of ordinary differential equations, referred to as neural oscillators. Through effectively capturing long-time dependencies and mitigating the exploding and vanishing gradient problem, neural oscillators foster improved generalization in PIML tasks. Extensive experimentation involving time-dependent nonlinear PDEs and biharmonic beam equations demonstrates the efficacy of the proposed approach. Incorporating neural oscillators outperforms existing state-of-the-art methods on benchmark problems across various metrics. Consequently, the proposed method improves the generalization capabilities of PIML, providing accurate solutions for extrapolation and prediction beyond the training data.  ( 2 min )
    Exploiting Label Skews in Federated Learning with Model Concatenation. (arXiv:2312.06290v2 [cs.LG] UPDATED)
    Federated Learning (FL) has emerged as a promising solution to perform deep learning on different data owners without exchanging raw data. However, non-IID data has been a key challenge in FL, which could significantly degrade the accuracy of the final model. Among different non-IID types, label skews have been challenging and common in image classification and other tasks. Instead of averaging the local models in most previous studies, we propose FedConcat, a simple and effective approach that concatenates these local models as the base of the global model to effectively aggregate the local knowledge. To reduce the size of the global model, we adopt the clustering technique to group the clients by their label distributions and collaboratively train a model inside each cluster. We theoretically analyze the advantage of concatenation over averaging by analyzing the information bottleneck of deep neural networks. Experimental results demonstrate that FedConcat achieves significantly higher accuracy than previous state-of-the-art FL methods in various heterogeneous label skew distribution settings and meanwhile has lower communication costs. Our code is publicly available at https://github.com/sjtudyq/FedConcat.  ( 2 min )
    Information-Theoretic Generalization Analysis for Topology-aware Heterogeneous Federated Edge Learning over Noisy Channels. (arXiv:2310.16407v2 [cs.IT] UPDATED)
    With the rapid growth of edge intelligence, the deployment of federated learning (FL) over wireless networks has garnered increasing attention, which is called Federated Edge Learning (FEEL). In FEEL, both mobile devices transmitting model parameters over noisy channels and collecting data in diverse environments pose challenges to the generalization of trained models. Moreover, devices can engage in decentralized FL via Device-to-Device communication while the communication topology of connected devices also impacts the generalization of models. Most recent theoretical studies overlook the incorporation of all these effects into FEEL when developing generalization analyses. In contrast, our work presents an information-theoretic generalization analysis for topology-aware FEEL in the presence of data heterogeneity and noisy channels. Additionally, we propose a novel regularization method called Federated Global Mutual Information Reduction (FedGMIR) to enhance the performance of models based on our analysis. Numerical results validate our theoretical findings and provide evidence for the effectiveness of the proposed method.  ( 2 min )
    Is Channel Independent strategy optimal for Time Series Forecasting?. (arXiv:2310.17658v3 [cs.LG] UPDATED)
    There has been an emergence of various models for long-term time series forecasting. Recent studies have demonstrated that a single linear layer, using Channel Dependent (CD) or Channel Independent (CI) modeling, can even outperform a large number of sophisticated models. However, current research primarily considers CD and CI as two complementary yet mutually exclusive approaches, unable to harness these two extremes simultaneously. And it is also a challenging issue that both CD and CI are static strategies that cannot be determined to be optimal for a specific dataset without extensive experiments. In this paper, we reconsider whether the current CI strategy is the best solution for time series forecasting. First, we propose a simple yet effective strategy called CSC, which stands for $\mathbf{C}$hannel $\mathbf{S}$elf-$\mathbf{C}$lustering strategy, for linear models. Our Channel Self-Clustering (CSC) enhances CI strategy's performance improvements while reducing parameter size, for exmpale by over 10 times on electricity dataset, and significantly cutting training time. Second, we further propose Channel Rearrangement (CR), a method for deep models inspired by the self-clustering. CR attains competitive performance against baselines. Finally, we also discuss whether it is best to forecast the future values using the historical values of the same channel as inputs. We hope our findings and methods could inspire new solutions beyond CD/CI.  ( 3 min )
    MimiC: Combating Client Dropouts in Federated Learning by Mimicking Central Updates. (arXiv:2306.12212v3 [cs.LG] UPDATED)
    Federated learning (FL) is a promising framework for privacy-preserving collaborative learning, where model training tasks are distributed to clients and only the model updates need to be collected at a server. However, when being deployed at mobile edge networks, clients may have unpredictable availability and drop out of the training process, which hinders the convergence of FL. This paper tackles such a critical challenge. Specifically, we first investigate the convergence of the classical FedAvg algorithm with arbitrary client dropouts. We find that with the common choice of a decaying learning rate, FedAvg oscillates around a stationary point of the global loss function, which is caused by the divergence between the aggregated and desired central update. Motivated by this new observation, we then design a novel training algorithm named MimiC, where the server modifies each received model update based on the previous ones. The proposed modification of the received model updates mimics the imaginary central update irrespective of dropout clients. The theoretical analysis of MimiC shows that divergence between the aggregated and central update diminishes with proper learning rates, leading to its convergence. Simulation results further demonstrate that MimiC maintains stable convergence performance and learns better models than the baseline methods.  ( 3 min )
    On the Unexpected Abilities of Large Language Models. (arXiv:2308.09720v2 [cs.AI] UPDATED)
    Large Language Models (LLMs) are capable of displaying a wide range of abilities that are not directly connected with the task for which they are trained: predicting the next words of human-written texts. In this article, I review recent research investigating the cognitive abilities developed by LLMs and their relation to human cognition. I discuss the nature of the indirect process that leads to the acquisition of these cognitive abilities, their relation to other indirect processes, and the implications for the acquisition of integrated abilities. Moreover, I propose the factors that enable the development of abilities that are related only very indirectly to the proximal objective of the training task. Finally, I discuss whether the full set of capabilities that LLMs could possibly develop is predictable.  ( 2 min )
    A Competitive Algorithm for Agnostic Active Learning. (arXiv:2310.18786v2 [cs.LG] UPDATED)
    For some hypothesis classes and input distributions, active agnostic learning needs exponentially fewer samples than passive learning; for other classes and distributions, it offers little to no improvement. The most popular algorithms for agnostic active learning express their performance in terms of a parameter called the disagreement coefficient, but it is known that these algorithms are inefficient on some inputs. We take a different approach to agnostic active learning, getting an algorithm that is competitive with the optimal algorithm for any binary hypothesis class $H$ and distribution $D_X$ over $X$. In particular, if any algorithm can use $m^*$ queries to get $O(\eta)$ error, then our algorithm uses $O(m^* \log |H|)$ queries to get $O(\eta)$ error. Our algorithm lies in the vein of the splitting-based approach of Dasgupta [2004], which gets a similar result for the realizable ($\eta = 0$) setting. We also show that it is NP-hard to do better than our algorithm's $O(\log |H|)$ overhead in general.  ( 2 min )
    Union Subgraph Neural Networks. (arXiv:2305.15747v2 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) are widely used for graph representation learning in many application domains. The expressiveness of vanilla GNNs is upper-bounded by 1-dimensional Weisfeiler-Leman (1-WL) test as they operate on rooted subtrees through iterative message passing. In this paper, we empower GNNs by injecting neighbor-connectivity information extracted from a new type of substructure. We first investigate different kinds of connectivities existing in a local neighborhood and identify a substructure called union subgraph, which is able to capture the complete picture of the 1-hop neighborhood of an edge. We then design a shortest-path-based substructure descriptor that possesses three nice properties and can effectively encode the high-order connectivities in union subgraphs. By infusing the encoded neighbor connectivities, we propose a novel model, namely Union Subgraph Neural Network (UnionSNN), which is proven to be strictly more powerful than 1-WL in distinguishing non-isomorphic graphs. Additionally, the local encoding from union subgraphs can also be injected into arbitrary message-passing neural networks (MPNNs) and Transformer-based models as a plugin. Extensive experiments on 18 benchmarks of both graph-level and node-level tasks demonstrate that UnionSNN outperforms state-of-the-art baseline models, with competitive computational efficiency. The injection of our local encoding to existing models is able to boost the performance by up to 11.09\%. Our code is available at https://github.com/AngusMonroe/UnionSNN.  ( 2 min )
    GALAXY: Graph-based Active Learning at the Extreme. (arXiv:2202.01402v2 [cs.LG] CROSS LISTED)
    Active learning is a label-efficient approach to train highly effective models while interactively selecting only small subsets of unlabelled data for labelling and training. In "open world" settings, the classes of interest can make up a small fraction of the overall dataset -- most of the data may be viewed as an out-of-distribution or irrelevant class. This leads to extreme class-imbalance, and our theory and methods focus on this core issue. We propose a new strategy for active learning called GALAXY (Graph-based Active Learning At the eXtrEme), which blends ideas from graph-based active learning and deep learning. GALAXY automatically and adaptively selects more class-balanced examples for labeling than most other methods for active learning. Our theory shows that GALAXY performs a refined form of uncertainty sampling that gathers a much more class-balanced dataset than vanilla uncertainty sampling. Experimentally, we demonstrate GALAXY's superiority over existing state-of-art deep active learning algorithms in unbalanced vision classification settings generated from popular datasets.  ( 2 min )
    Using Property Elicitation to Understand the Impacts of Fairness Regularizers. (arXiv:2309.11343v2 [cs.LG] UPDATED)
    Predictive algorithms are often trained by optimizing some loss function, to which regularization functions are added to impose a penalty for violating constraints. As expected, the addition of such regularization functions can change the minimizer of the objective. It is not well-understood which regularizers change the minimizer of the loss, and, when the minimizer does change, how it changes. We use property elicitation to take first steps towards understanding the joint relationship between the loss and regularization functions and the optimal decision for a given problem instance. In particular, we give a necessary and sufficient condition on loss and regularizer pairs for when a property changes with the addition of the regularizer, and examine some regularizers satisfying this condition standard in the fair machine learning literature. We empirically demonstrate how algorithmic decision-making changes as a function of both data distribution changes and hardness of the constraints.  ( 2 min )
    ReMax: A Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models. (arXiv:2310.10505v3 [cs.LG] UPDATED)
    Alignment is crucial for training large language models. The predominant strategy is Reinforcement Learning from Human Feedback (RLHF), with Proximal Policy Optimization (PPO) as the de-facto algorithm. Yet, PPO is known to struggle with computational inefficiency, a challenge that this paper aims to address. We identify three important properties of RLHF tasks: fast simulation, deterministic transitions, and trajectory-level rewards, which are not leveraged in PPO. Based on these properties, we develop ReMax, a new algorithm tailored for RLHF. The design of ReMax builds on the celebrated algorithm REINFORCE but is enhanced with a new variance-reduction technique. ReMax offers threefold advantages over PPO: first, it is simple to implement with just 6 lines of code. It further eliminates more than 4 hyper-parameters in PPO, which are laborious to tune. Second, ReMax reduces memory usage by about 50%. To illustrate, PPO runs out of memory when fine-tuning a Llama2-7B model on A100-80GB GPUs, whereas ReMax can support the training. Even though memory-efficient techniques (e.g., ZeRO and offload) are employed for PPO to afford training, ReMax can utilize a larger batch size to increase throughput. Third, in terms of wall-clock time, PPO is about twice as slow as ReMax per iteration. Importantly, these improvements do not sacrifice task performance. We hypothesize that these advantages can be maintained in larger-scale models.  ( 3 min )
    Mitigating Data Injection Attacks on Federated Learning. (arXiv:2312.02102v2 [cs.LG] UPDATED)
    Federated learning is a technique that allows multiple entities to collaboratively train models using their data without compromising data privacy. However, despite its advantages, federated learning can be susceptible to false data injection attacks. In these scenarios, a malicious entity with control over specific agents in the network can manipulate the learning process, leading to a suboptimal model. Consequently, addressing these data injection attacks presents a significant research challenge in federated learning systems. In this paper, we propose a novel technique to detect and mitigate data injection attacks on federated learning systems. Our mitigation method is a local scheme, performed during a single instance of training by the coordinating node, allowing the mitigation during the convergence of the algorithm. Whenever an agent is suspected to be an attacker, its data will be ignored for a certain period, this decision will often be re-evaluated. We prove that with probability 1, after a finite time, all attackers will be ignored while the probability of ignoring a trustful agent becomes 0, provided that there is a majority of truthful agents. Simulations show that when the coordinating node detects and isolates all the attackers, the model recovers and converges to the truthful model.  ( 2 min )
    Using Machine Learning to generate an open-access cropland map from satellite images time series in the Indian Himalayan Region. (arXiv:2203.14673v1 [cs.CV] CROSS LISTED)
    Crop maps are crucial for agricultural monitoring and food management and can additionally support domain-specific applications, such as setting cold supply chain infrastructure in developing countries. Machine learning (ML) models, combined with freely-available satellite imagery, can be used to produce cost-effective and high spatial-resolution crop maps. However, accessing ground truth data for supervised learning is especially challenging in developing countries due to factors such as smallholding and fragmented geography, which often results in a lack of crop type maps or even reliable cropland maps. Our area of interest for this study lies in Himachal Pradesh, India, where we aim at producing an open-access binary cropland map at 10-meter resolution for the Kullu, Shimla, and Mandi districts. To this end, we developed an ML pipeline that relies on Sentinel-2 satellite images time series. We investigated two pixel-based supervised classifiers, support vector machines (SVM) and random forest (RF), which are used to classify per-pixel time series for binary cropland mapping. The ground truth data used for training, validation and testing was manually annotated from a combination of field survey reference points and visual interpretation of very high resolution (VHR) imagery. We trained and validated the models via spatial cross-validation to account for local spatial autocorrelation and selected the RF model due to overall robustness and lower computational cost. We tested the generalization capability of the chosen model at the pixel level by computing the accuracy, recall, precision, and F1-score on hold-out test sets of each district, achieving an average accuracy for the RF (our best model) of 87%. We used this model to generate a cropland map for three districts of Himachal Pradesh, spanning 14,600 km2, which improves the resolution and quality of existing public maps.  ( 3 min )
    STS-CCL: Spatial-Temporal Synchronous Contextual Contrastive Learning for Urban Traffic Forecasting. (arXiv:2307.02507v2 [cs.LG] UPDATED)
    Efficiently capturing the complex spatiotemporal representations from large-scale unlabeled traffic data remains to be a challenging task. In considering of the dilemma, this work employs the advanced contrastive learning and proposes a novel Spatial-Temporal Synchronous Contextual Contrastive Learning (STS-CCL) model. First, we elaborate the basic and strong augmentation methods for spatiotemporal graph data, which not only perturb the data in terms of graph structure and temporal characteristics, but also employ a learning-based dynamic graph view generator for adaptive augmentation. Second, we introduce a Spatial-Temporal Synchronous Contrastive Module (STS-CM) to simultaneously capture the decent spatial-temporal dependencies and realize graph-level contrasting. To further discriminate node individuals in negative filtering, a Semantic Contextual Contrastive method is designed based on semantic features and spatial heterogeneity, achieving node-level contrastive learning along with negative filtering. Finally, we present a hard mutual-view contrastive training scheme and extend the classic contrastive loss to an integrated objective function, yielding better performance. Extensive experiments and evaluations demonstrate that building a predictor upon STS-CCL contrastive learning model gains superior performance than existing traffic forecasting benchmarks. The proposed STS-CCL is highly suitable for large datasets with only a few labeled data and other spatiotemporal tasks with data scarcity issue.  ( 3 min )
    ReConTab: Regularized Contrastive Representation Learning for Tabular Data. (arXiv:2310.18541v2 [cs.LG] UPDATED)
    Representation learning stands as one of the critical machine learning techniques across various domains. Through the acquisition of high-quality features, pre-trained embeddings significantly reduce input space redundancy, benefiting downstream pattern recognition tasks such as classification, regression, or detection. Nonetheless, in the domain of tabular data, feature engineering and selection still heavily rely on manual intervention, leading to time-consuming processes and necessitating domain expertise. In response to this challenge, we introduce ReConTab, a deep automatic representation learning framework with regularized contrastive learning. Agnostic to any type of modeling task, ReConTab constructs an asymmetric autoencoder based on the same raw features from model inputs, producing low-dimensional representative embeddings. Specifically, regularization techniques are applied for raw feature selection. Meanwhile, ReConTab leverages contrastive learning to distill the most pertinent information for downstream tasks. Experiments conducted on extensive real-world datasets substantiate the framework's capacity to yield substantial and robust performance improvements. Furthermore, we empirically demonstrate that pre-trained embeddings can seamlessly integrate as easily adaptable features, enhancing the performance of various traditional methods such as XGBoost and Random Forest.  ( 2 min )
    Efficient Enumeration of Markov Equivalent DAGs. (arXiv:2301.12212v2 [cs.AI] UPDATED)
    Enumerating the directed acyclic graphs (DAGs) of a Markov equivalence class (MEC) is an important primitive in causal analysis. The central resource from the perspective of computational complexity is the delay, that is, the time an algorithm that lists all members of the class requires between two consecutive outputs. Commonly used algorithms for this task utilize the rules proposed by Meek (1995) or the transformational characterization by Chickering (1995), both resulting in superlinear delay. In this paper, we present the first linear-time delay algorithm. On the theoretical side, we show that our algorithm can be generalized to enumerate DAGs represented by models that incorporate background knowledge, such as MPDAGs; on the practical side, we provide an efficient implementation and evaluate it in a series of experiments. Complementary to the linear-time delay algorithm, we also provide intriguing insights into Markov equivalence itself: All members of an MEC can be enumerated such that two successive DAGs have structural Hamming distance at most three.  ( 2 min )
    Southern Ocean Dynamics Under Climate Change: New Knowledge Through Physics-Guided Machine Learning. (arXiv:2310.13916v2 [physics.ao-ph] UPDATED)
    Complex ocean systems such as the Antarctic Circumpolar Current play key roles in the climate, and current models predict shifts in their strength and area under climate change. However, the physical processes underlying these changes are not well understood, in part due to the difficulty of characterizing and tracking changes in ocean physics in complex models. Using the Antarctic Circumpolar Current as a case study, we extend the method Tracking global Heating with Ocean Regimes (THOR) to a mesoscale eddy permitting climate model and identify regions of the ocean characterized by similar physics, called dynamical regimes, using readily accessible fields from climate models. To this end, we cluster grid cells into dynamical regimes and train an ensemble of neural networks, allowing uncertainty quantification, to predict these regimes and track them under climate change. Finally, we leverage this new knowledge to elucidate the dynamical drivers of the identified regime shifts as noted by the neural network using the 'explainability' methods SHAP and Layer-wise Relevance Propagation. A region undergoing a profound shift is where the Antarctic Circumpolar Current intersects the Pacific-Antarctic Ridge, an area important for carbon draw-down and fisheries. In this region, THOR specifically reveals a shift in dynamical regime under climate change driven by changes in wind stress and interactions with bathymetry. Using this knowledge to guide further exploration, we find that as the Antarctic Circumpolar Current shifts north under intensifying wind stress, the dominant dynamical role of bathymetry weakens and the flow intensifies.  ( 3 min )
    Flow Dynamics Correction for Action Recognition. (arXiv:2310.10059v2 [cs.CV] UPDATED)
    Various research studies indicate that action recognition performance highly depends on the types of motions being extracted and how accurate the human actions are represented. In this paper, we investigate different optical flow, and features extracted from these optical flow that capturing both short-term and long-term motion dynamics. We perform power normalization on the magnitude component of optical flow for flow dynamics correction to boost subtle or dampen sudden motions. We show that existing action recognition models which rely on optical flow are able to get performance boosted with our corrected optical flow. To further improve performance, we integrate our corrected flow dynamics into popular models through a simple hallucination step by selecting only the best performing optical flow features, and we show that by 'translating' the CNN feature maps into these optical flow features with different scales of motions leads to the new state-of-the-art performance on several benchmarks including HMDB-51, YUP++, fine-grained action recognition on MPII Cooking Activities, and large-scale Charades.  ( 2 min )
    Pitfalls in Link Prediction with Graph Neural Networks: Understanding the Impact of Target-link Inclusion & Better Practices. (arXiv:2306.00899v2 [cs.LG] UPDATED)
    While Graph Neural Networks (GNNs) are remarkably successful in a variety of high-impact applications, we demonstrate that, in link prediction, the common practices of including the edges being predicted in the graph at training and/or test have outsized impact on the performance of low-degree nodes. We theoretically and empirically investigate how these practices impact node-level performance across different degrees. Specifically, we explore three issues that arise: (I1) overfitting; (I2) distribution shift; and (I3) implicit test leakage. The former two issues lead to poor generalizability to the test data, while the latter leads to overestimation of the model's performance and directly impacts the deployment of GNNs. To address these issues in a systematic way, we introduce an effective and efficient GNN training framework, SpotTarget, which leverages our insight on low-degree nodes: (1) at training time, it excludes a (training) edge to be predicted if it is incident to at least one low-degree node; and (2) at test time, it excludes all test edges to be predicted (thus, mimicking real scenarios of using GNNs, where the test data is not included in the graph). SpotTarget helps researchers and practitioners adhere to best practices for learning from graph data, which are frequently overlooked even by the most widely-used frameworks. Our experiments on various real-world datasets show that SpotTarget makes GNNs up to 15x more accurate in sparse graphs, and significantly improves their performance for low-degree nodes in dense graphs.  ( 3 min )
    Human Voice Pitch Estimation: A Convolutional Network with Auto-Labeled and Synthetic Data. (arXiv:2308.07170v2 [cs.SD] UPDATED)
    In the domain of music and sound processing, pitch extraction plays a pivotal role. Our research presents a specialized convolutional neural network designed for pitch extraction, particularly from the human singing voice in acapella performances. Notably, our approach combines synthetic data with auto-labeled acapella sung audio, creating a robust training environment. Evaluation across datasets comprising synthetic sounds, opera recordings, and time-stretched vowels demonstrates its efficacy. This work paves the way for enhanced pitch extraction in both music and voice settings.  ( 2 min )
    SUREL+: Moving from Walks to Sets for Scalable Subgraph-based Graph Representation Learning. (arXiv:2303.03379v2 [cs.LG] UPDATED)
    Subgraph-based graph representation learning (SGRL) has recently emerged as a powerful tool in many prediction tasks on graphs due to its advantages in model expressiveness and generalization ability. Most previous SGRL models face computational issues associated with the high cost of subgraph extraction for each training or test query. Recently, SUREL was proposed to accelerate SGRL, which samples random walks offline and joins these walks online as a proxy of subgraphs for representation learning. Thanks to the reusability of sampled walks across different queries, SUREL achieves state-of-the-art performance in terms of scalability and prediction accuracy. However, SUREL still suffers from high computational overhead caused by node redundancy in sampled walks. In this work, we propose a novel framework SUREL+ that upgrades SUREL by using node sets instead of walks to represent subgraphs. This set-based representation avoids repeated nodes by definition, but node sets can be irregular in size. To address this issue, we design a customized sparse data structure to efficiently store and index node sets, and provide a specialized operator to join them in parallel batches. SUREL+ is modularized to support multiple types of set samplers, structural features, and neural encoders to complement the structure information loss after the reduction from walks to sets. Extensive experiments have been performed to validate SUREL+ in the prediction tasks of links, relation types, and higher-order patterns. SUREL+ achieves 3-11$\times$ speedups of SUREL while maintaining comparable or even better prediction performance; compared to other SGRL baselines, SUREL+ achieves $\sim$20$\times$ speedups and significantly improves the prediction accuracy.  ( 3 min )
    Neural Bradley-Terry Rating: Quantifying Properties from Comparisons. (arXiv:2307.13709v4 [cs.LG] UPDATED)
    Many properties in the real world doesn't have metrics and can't be numerically observed, making them difficult to learn. To deal with this challenging problem, prior works have primarily focused on estimating those properties by using graded human scores as the target label in the training. Meanwhile, rating algorithms based on the Bradley-Terry model are extensively studied to evaluate the competitiveness of players based on their match history. In this paper, we introduce the Neural Bradley-Terry Rating (NBTR), a novel machine learning framework designed to quantify and evaluate properties of unknown items. Our method seamlessly integrates the Bradley-Terry model into the neural network structure. Moreover, we generalize this architecture further to asymmetric environments with unfairness, a condition more commonly encountered in real-world settings. Through experimental analysis, we demonstrate that NBTR successfully learns to quantify and estimate desired properties.  ( 2 min )
    Knowledge Graph Prompting for Multi-Document Question Answering. (arXiv:2308.11730v2 [cs.CL] UPDATED)
    The `pre-train, prompt, predict' paradigm of large language models (LLMs) has achieved remarkable success in open-domain question answering (OD-QA). However, few works explore this paradigm in the scenario of multi-document question answering (MD-QA), a task demanding a thorough understanding of the logical associations among the contents and structures of different documents. To fill this crucial gap, we propose a Knowledge Graph Prompting (KGP) method to formulate the right context in prompting LLMs for MD-QA, which consists of a graph construction module and a graph traversal module. For graph construction, we create a knowledge graph (KG) over multiple documents with nodes symbolizing passages or document structures (e.g., pages/tables), and edges denoting the semantic/lexical similarity between passages or intra-document structural relations. For graph traversal, we design an LLM-based graph traversal agent that navigates across nodes and gathers supporting passages assisting LLMs in MD-QA. The constructed graph serves as the global ruler that regulates the transitional space among passages and reduces retrieval latency. Concurrently, the graph traversal agent acts as a local navigator that gathers pertinent context to progressively approach the question and guarantee retrieval quality. Extensive experiments underscore the efficacy of KGP for MD-QA, signifying the potential of leveraging graphs in enhancing the prompt design for LLMs. Our code: https://github.com/YuWVandy/KG-LLM-MDQA.  ( 3 min )
    Algorithm Selection for Deep Active Learning with Imbalanced Datasets. (arXiv:2302.07317v3 [cs.LG] CROSS LISTED)
    Label efficiency has become an increasingly important objective in deep learning applications. Active learning aims to reduce the number of labeled examples needed to train deep networks, but the empirical performance of active learning algorithms can vary dramatically across datasets and applications. It is difficult to know in advance which active learning strategy will perform well or best in a given application. To address this, we propose the first adaptive algorithm selection strategy for deep active learning. For any unlabeled dataset, our (meta) algorithm TAILOR (Thompson ActIve Learning algORithm selection) iteratively and adaptively chooses among a set of candidate active learning algorithms. TAILOR uses novel reward functions aimed at gathering class-balanced examples. Extensive experiments in multi-class and multi-label applications demonstrate TAILOR's effectiveness in achieving accuracy comparable or better than that of the best of the candidate algorithms. Our implementation of TAILOR is open-sourced at https://github.com/jifanz/TAILOR.  ( 2 min )
    Efficiently Representing Finite-state Automata With Recurrent Neural Networks. (arXiv:2310.05161v3 [cs.CL] UPDATED)
    Understanding neural network architectures with formal models of computation promises to spark a better understanding of the network's capabilities and limitations. A long line of work has described recurrent neural networks (RNN) in terms of their connection to the well-understood finite-state automata (FSAs), whose sequential nature provides a useful analogy to how RNNs function. Minsky's [1954] construction first showed how RNNs can simulate FSAs and provided a way of understanding RNNs as FSAs. This paper presents a comprehensive review of this construction along with two additional classical results showcasing the relationship between RNNs and FSAs: The constructions due to Dewdney [1977] and Indyk [1995]. We are not only interested in \emph{whether} an RNN can simulate an FSA, but also in the space requirements to do so: Whereas Minsky [1954] shows that an RNN can simulate an FSA with $N$ states using $\mathcal{O}\left(N\right)$ neurons, Dewdney [1977] improved this to $\mathcal{O}\left(N^\frac{3}{4}\right)$ and Indyk [1995] further to $\mathcal{O}\left(\sqrt{N}\right)$, which he also showed to be optimal. We discuss the constructions, emphasizing their commonalities, and put them into the context of more modern research, focusing on the representational capacity of neural language models.  ( 2 min )
    Optimality of Message-Passing Architectures for Sparse Graphs. (arXiv:2305.10391v2 [cs.LG] UPDATED)
    We study the node classification problem on feature-decorated graphs in the sparse setting, i.e., when the expected degree of a node is $O(1)$ in the number of nodes, in the fixed-dimensional asymptotic regime, i.e., the dimension of the feature data is fixed while the number of nodes is large. Such graphs are typically known to be locally tree-like. We introduce a notion of Bayes optimality for node classification tasks, called asymptotic local Bayes optimality, and compute the optimal classifier according to this criterion for a fairly general statistical data model with arbitrary distributions of the node features and edge connectivity. The optimal classifier is implementable using a message-passing graph neural network architecture. We then compute the generalization error of this classifier and compare its performance against existing learning methods theoretically on a well-studied statistical model with naturally identifiable signal-to-noise ratios (SNRs) in the data. We find that the optimal message-passing architecture interpolates between a standard MLP in the regime of low graph signal and a typical convolution in the regime of high graph signal. Furthermore, we prove a corresponding non-asymptotic result.  ( 2 min )
    A Comprehensive Python Library for Deep Learning-Based Event Detection in Multivariate Time Series Data and Information Retrieval in NLP. (arXiv:2310.16485v2 [cs.LG] UPDATED)
    Event detection in time series data is crucial in various domains, including finance, healthcare, cybersecurity, and science. Accurately identifying events in time series data is vital for making informed decisions, detecting anomalies, and predicting future trends. Despite extensive research exploring diverse methods for event detection in time series, with deep learning approaches being among the most advanced, there is still room for improvement and innovation in this field. In this paper, we present a new deep learning supervised method for detecting events in multivariate time series data. Our method combines four distinct novelties compared to existing deep-learning supervised methods. Firstly, it is based on regression instead of binary classification. Secondly, it does not require labeled datasets where each point is labeled; instead, it only requires reference events defined as time points or intervals of time. Thirdly, it is designed to be robust by using a stacked ensemble learning meta-model that combines deep learning models, ranging from classic feed-forward neural networks (FFNs) to state-of-the-art architectures like transformers. This ensemble approach can mitigate individual model weaknesses and biases, resulting in more robust predictions. Finally, to facilitate practical implementation, we have developed a Python package to accompany our proposed method. The package, called eventdetector-ts, can be installed through the Python Package Index (PyPI). In this paper, we present our method and provide a comprehensive guide on the usage of the package. We showcase its versatility and effectiveness through different real-world use cases from natural language processing (NLP) to financial security domains.  ( 3 min )
    ExpeL: LLM Agents Are Experiential Learners. (arXiv:2308.10144v2 [cs.LG] UPDATED)
    The recent surge in research interest in applying large language models (LLMs) to decision-making tasks has flourished by leveraging the extensive world knowledge embedded in LLMs. While there is a growing demand to tailor LLMs for custom decision-making tasks, finetuning them for specific tasks is resource-intensive and may diminish the model's generalization capabilities. Moreover, state-of-the-art language models like GPT-4 and Claude are primarily accessible through API calls, with their parametric weights remaining proprietary and unavailable to the public. This scenario emphasizes the growing need for new methodologies that allow learning from agent experiences without requiring parametric updates. To address these problems, we introduce the Experiential Learning (ExpeL) agent. Our agent autonomously gathers experiences and extracts knowledge using natural language from a collection of training tasks. At inference, the agent recalls its extracted insights and past experiences to make informed decisions. Our empirical results highlight the robust learning efficacy of the ExpeL agent, indicating a consistent enhancement in its performance as it accumulates experiences. We further explore the emerging capabilities and transfer learning potential of the ExpeL agent through qualitative observations and additional experiments.  ( 2 min )
    On the Computational Benefit of Multimodal Learning. (arXiv:2309.13782v2 [cs.LG] UPDATED)
    Human perception inherently operates in a multimodal manner. Similarly, as machines interpret the empirical world, their learning processes ought to be multimodal. The recent, remarkable successes in empirical multimodal learning underscore the significance of understanding this paradigm. Yet, a solid theoretical foundation for multimodal learning has eluded the field for some time. While a recent study by Lu (2023) has shown the superior sample complexity of multimodal learning compared to its unimodal counterpart, another basic question remains: does multimodal learning also offer computational advantages over unimodal learning? This work initiates a study on the computational benefit of multimodal learning. We demonstrate that, under certain conditions, multimodal learning can outpace unimodal learning exponentially in terms of computation. Specifically, we present a learning task that is NP-hard for unimodal learning but is solvable in polynomial time by a multimodal algorithm. Our construction is based on a novel modification to the intersection of two half-spaces problem.  ( 2 min )
    Mixing predictions for online metric algorithms. (arXiv:2304.01781v2 [cs.LG] UPDATED)
    A major technique in learning-augmented online algorithms is combining multiple algorithms or predictors. Since the performance of each predictor may vary over time, it is desirable to use not the single best predictor as a benchmark, but rather a dynamic combination which follows different predictors at different times. We design algorithms that combine predictions and are competitive against such dynamic combinations for a wide class of online problems, namely, metrical task systems. Against the best (in hindsight) unconstrained combination of $\ell$ predictors, we obtain a competitive ratio of $O(\ell^2)$, and show that this is best possible. However, for a benchmark with slightly constrained number of switches between different predictors, we can get a $(1+\epsilon)$-competitive algorithm. Moreover, our algorithms can be adapted to access predictors in a bandit-like fashion, querying only one predictor at a time. An unexpected implication of one of our lower bounds is a new structural insight about covering formulations for the $k$-server problem.  ( 2 min )
    Domain-Aware Fine-Tuning: Enhancing Neural Network Adaptability. (arXiv:2308.07728v2 [cs.LG] UPDATED)
    Fine-tuning pre-trained neural network models has become a widely adopted approach across various domains. However, it can lead to the distortion of pre-trained feature extractors that already possess strong generalization capabilities. Mitigating feature distortion during adaptation to new target domains is crucial. Recent studies have shown promising results in handling feature distortion by aligning the head layer on in-distribution datasets before performing fine-tuning. Nonetheless, a significant limitation arises from the treatment of batch normalization layers during fine-tuning, leading to suboptimal performance. In this paper, we propose Domain-Aware Fine-Tuning (DAFT), a novel approach that incorporates batch normalization conversion and the integration of linear probing and fine-tuning. Our batch normalization conversion method effectively mitigates feature distortion by reducing modifications to the neural network during fine-tuning. Additionally, we introduce the integration of linear probing and fine-tuning to optimize the head layer with gradual adaptation of the feature extractor. By leveraging batch normalization layers and integrating linear probing and fine-tuning, our DAFT significantly mitigates feature distortion and achieves improved model performance on both in-distribution and out-of-distribution datasets. Extensive experiments demonstrate that our method outperforms other baseline methods, demonstrating its effectiveness in not only improving performance but also mitigating feature distortion.  ( 2 min )
    DIRECT: Deep Active Learning under Imbalance and Label Noise. (arXiv:2312.09196v1 [cs.LG] CROSS LISTED)
    Class imbalance is a prevalent issue in real world machine learning applications, often leading to poor performance in rare and minority classes. With an abundance of wild unlabeled data, active learning is perhaps the most effective technique in solving the problem at its root -- collecting a more balanced and informative set of labeled examples during annotation. In this work, we propose a novel algorithm that first identifies the class separation threshold and then annotate the most uncertain examples from the minority classes, close to the separation threshold. Through a novel reduction to one-dimensional active learning, our algorithm DIRECT is able to leverage the classic active learning literature to address issues such as batch labeling and tolerance towards label noise. Compared to existing algorithms, our algorithm saves more than 15\% of the annotation budget compared to state-of-art active learning algorithm and more than 90\% of annotation budget compared to random sampling.  ( 2 min )
    Structured Inverse-Free Natural Gradient: Memory-Efficient & Numerically-Stable KFAC for Large Neural Nets. (arXiv:2312.05705v2 [cs.LG] UPDATED)
    Second-order methods for deep learning -- such as KFAC -- can be useful for neural net training. However, they are often memory-inefficient and numerically unstable for low-precision training since their preconditioning Kronecker factors are dense, and require high-precision matrix inversion or decomposition. Consequently, such methods are not widely used for training large neural networks such as transformer-based models. We address these two issues by (i) formulating an inverse-free update of KFAC and (ii) imposing structures in each of the Kronecker factors, resulting in a method we term structured inverse-free natural gradient descent (SINGD). On large modern neural networks, we show that, in contrast to KFAC, SINGD is memory efficient and numerically robust, and often outperforms AdamW even in half precision. Hence, our work closes a gap between first-order and second-order methods in modern low precision training for large neural nets.  ( 2 min )
    Point-PEFT: Parameter-Efficient Fine-Tuning for 3D Pre-trained Models. (arXiv:2310.03059v5 [cs.CV] UPDATED)
    The popularity of pre-trained large models has revolutionized downstream tasks across diverse fields, such as language, vision, and multi-modality. To minimize the adaption cost for downstream tasks, many Parameter-Efficient Fine-Tuning (PEFT) techniques are proposed for language and 2D image pre-trained models. However, the specialized PEFT method for 3D pre-trained models is still under-explored. To this end, we introduce Point-PEFT, a novel framework for adapting point cloud pre-trained models with minimal learnable parameters. Specifically, for a pre-trained 3D model, we freeze most of its parameters, and only tune the newly added PEFT modules on downstream tasks, which consist of a Point-prior Prompt and a Geometry-aware Adapter. The Point-prior Prompt adopts a set of learnable prompt tokens, for which we propose to construct a memory bank with domain-specific knowledge, and utilize a parameter-free attention to enhance the prompt tokens. The Geometry-aware Adapter aims to aggregate point cloud features within spatial neighborhoods to capture fine-grained geometric information through local interactions. Extensive experiments indicate that our Point-PEFT can achieve better performance than the full fine-tuning on various downstream tasks, while using only 5% of the trainable parameters, demonstrating the efficiency and effectiveness of our approach. Code is released at https://github.com/Ivan-Tang-3D/PEFT-3D.  ( 3 min )
    Learning the Causal Structure of Networked Dynamical Systems under Latent Nodes and Structured Noise. (arXiv:2312.05974v2 [cs.LG] UPDATED)
    This paper considers learning the hidden causal network of a linear networked dynamical system (NDS) from the time series data at some of its nodes -- partial observability. The dynamics of the NDS are driven by colored noise that generates spurious associations across pairs of nodes, rendering the problem much harder. To address the challenge of noise correlation and partial observability, we assign to each pair of nodes a feature vector computed from the time series data of observed nodes. The feature embedding is engineered to yield structural consistency: there exists an affine hyperplane that consistently partitions the set of features, separating the feature vectors corresponding to connected pairs of nodes from those corresponding to disconnected pairs. The causal inference problem is thus addressed via clustering the designed features. We demonstrate with simple baseline supervised methods the competitive performance of the proposed causal inference mechanism under broad connectivity regimes and noise correlation levels, including a real world network. Further, we devise novel technical guarantees of structural consistency for linear NDS under the considered regime.  ( 3 min )
    A Theory of Multimodal Learning. (arXiv:2309.12458v2 [cs.LG] UPDATED)
    Human perception of the empirical world involves recognizing the diverse appearances, or 'modalities', of underlying objects. Despite the longstanding consideration of this perspective in philosophy and cognitive science, the study of multimodality remains relatively under-explored within the field of machine learning. Nevertheless, current studies of multimodal machine learning are limited to empirical practices, lacking theoretical foundations beyond heuristic arguments. An intriguing finding from the practice of multimodal learning is that a model trained on multiple modalities can outperform a finely-tuned unimodal model, even on unimodal tasks. This paper provides a theoretical framework that explains this phenomenon, by studying generalization properties of multimodal learning algorithms. We demonstrate that multimodal learning allows for a superior generalization bound compared to unimodal learning, up to a factor of $O(\sqrt{n})$, where $n$ represents the sample size. Such advantage occurs when both connection and heterogeneity exist between the modalities.  ( 2 min )
    Integrated Decision Gradients: Compute Your Attributions Where the Model Makes Its Decision. (arXiv:2305.20052v2 [cs.LG] UPDATED)
    Attribution algorithms are frequently employed to explain the decisions of neural network models. Integrated Gradients (IG) is an influential attribution method due to its strong axiomatic foundation. The algorithm is based on integrating the gradients along a path from a reference image to the input image. Unfortunately, it can be observed that gradients computed from regions where the output logit changes minimally along the path provide poor explanations for the model decision, which is called the saturation effect problem. In this paper, we propose an attribution algorithm called integrated decision gradients (IDG). The algorithm focuses on integrating gradients from the region of the path where the model makes its decision, i.e., the portion of the path where the output logit rapidly transitions from zero to its final value. This is practically realized by scaling each gradient by the derivative of the output logit with respect to the path. The algorithm thereby provides a principled solution to the saturation problem. Additionally, we minimize the errors within the Riemann sum approximation of the path integral by utilizing non-uniform subdivisions determined by adaptive sampling. In the evaluation on ImageNet, it is demonstrated that IDG outperforms IG, Left-IG, Guided IG, and adversarial gradient integration both qualitatively and quantitatively using standard insertion and deletion metrics across three common models.  ( 3 min )
    ZeroSCROLLS: A Zero-Shot Benchmark for Long Text Understanding. (arXiv:2305.14196v3 [cs.CL] UPDATED)
    We introduce ZeroSCROLLS, a zero-shot benchmark for natural language understanding over long texts, which contains only test and small validation sets, without training data. We adapt six tasks from the SCROLLS benchmark, and add four new datasets, including two novel information fusing tasks, such as aggregating the percentage of positive reviews. Using ZeroSCROLLS, we conduct a comprehensive evaluation of both open-source and closed large language models, finding that Claude outperforms ChatGPT, and that GPT-4 achieves the highest average score. However, there is still room for improvement on multiple open challenges in ZeroSCROLLS, such as aggregation tasks, where models struggle to pass the naive baseline. As the state of the art is a moving target, we invite researchers to evaluate their ideas on the live ZeroSCROLLS leaderboard.  ( 2 min )
    Latent Graph Inference with Limited Supervision. (arXiv:2310.04314v2 [cs.LG] UPDATED)
    Latent graph inference (LGI) aims to jointly learn the underlying graph structure and node representations from data features. However, existing LGI methods commonly suffer from the issue of supervision starvation, where massive edge weights are learned without semantic supervision and do not contribute to the training loss. Consequently, these supervision-starved weights, which may determine the predictions of testing samples, cannot be semantically optimal, resulting in poor generalization. In this paper, we observe that this issue is actually caused by the graph sparsification operation, which severely destroys the important connections established between pivotal nodes and labeled ones. To address this, we propose to restore the corrupted affinities and replenish the missed supervision for better LGI. The key challenge then lies in identifying the critical nodes and recovering the corrupted affinities. We begin by defining the pivotal nodes as $k$-hop starved nodes, which can be identified based on a given adjacency matrix. Considering the high computational burden, we further present a more efficient alternative inspired by CUR matrix decomposition. Subsequently, we eliminate the starved nodes by reconstructing the destroyed connections. Extensive experiments on representative benchmarks demonstrate that reducing the starved nodes consistently improves the performance of state-of-the-art LGI methods, especially under extremely limited supervision (6.12% improvement on Pubmed with a labeling rate of only 0.3%).  ( 2 min )
    When Do Program-of-Thoughts Work for Reasoning?. (arXiv:2308.15452v6 [cs.CL] UPDATED)
    In the realm of embodied artificial intelligence, the reasoning capabilities of Large Language Models (LLMs) play a pivotal role. Although there are effective methods like program-of-thought prompting for LLMs which uses programming language to tackle complex reasoning tasks, the specific impact of code data on the improvement of reasoning capabilities remains under-explored. To address this gap, we propose complexity-impacted reasoning score (CIRS), which combines structural and logical attributes, to measure the correlation between code and reasoning abilities. Specifically, we use the abstract syntax tree to encode the structural information and calculate logical complexity by considering the difficulty and the cyclomatic complexity. Through an empirical analysis, we find not all code data of complexity can be learned or understood by LLMs. Optimal level of complexity is critical to the improvement of reasoning abilities by program-aided prompting. Then we design an auto-synthesizing and stratifying algorithm, and apply it to instruction generation for mathematical reasoning and code data filtering for code generation tasks. Extensive results demonstrates the effectiveness of our proposed approach. Code will be integrated into the EasyInstruct framework at https://github.com/zjunlp/EasyInstruct.  ( 3 min )
    Elephants and Algorithms: A Review of the Current and Future Role of AI in Elephant Monitoring. (arXiv:2306.13803v2 [cs.AI] UPDATED)
    Artificial intelligence (AI) and machine learning (ML) present revolutionary opportunities to enhance our understanding of animal behavior and conservation strategies. Using elephants, a crucial species in Africa's protected areas, as our focal point, we delve into the role of AI and ML in their conservation. Given the increasing amounts of data gathered from a variety of sensors like cameras, microphones, geophones, drones, and satellites, the challenge lies in managing and interpreting this vast data. New AI and ML techniques offer solutions to streamline this process, helping us extract vital information that might otherwise be overlooked. This paper focuses on the different AI-driven monitoring methods and their potential for improving elephant conservation. Collaborative efforts between AI experts and ecological researchers are essential in leveraging these innovative technologies for enhanced wildlife conservation, setting a precedent for numerous other species.  ( 2 min )
    ULTRA-DP: Unifying Graph Pre-training with Multi-task Graph Dual Prompt. (arXiv:2310.14845v2 [cs.LG] UPDATED)
    Recent research has demonstrated the efficacy of pre-training graph neural networks (GNNs) to capture the transferable graph semantics and enhance the performance of various downstream tasks. However, the semantic knowledge learned from pretext tasks might be unrelated to the downstream task, leading to a semantic gap that limits the application of graph pre-training. To reduce this gap, traditional approaches propose hybrid pre-training to combine various pretext tasks together in a multi-task learning fashion and learn multi-grained knowledge, which, however, cannot distinguish tasks and results in some transferable task-specific knowledge distortion by each other. Moreover, most GNNs cannot distinguish nodes located in different parts of the graph, making them fail to learn position-specific knowledge and lead to suboptimal performance. In this work, inspired by the prompt-based tuning in natural language processing, we propose a unified framework for graph hybrid pre-training which injects the task identification and position identification into GNNs through a prompt mechanism, namely multi-task graph dual prompt (ULTRA-DP). Based on this framework, we propose a prompt-based transferability test to find the most relevant pretext task in order to reduce the semantic gap. To implement the hybrid pre-training tasks, beyond the classical edge prediction task (node-node level), we further propose a novel pre-training paradigm based on a group of $k$-nearest neighbors (node-group level). The combination of them across different scales is able to comprehensively express more structural semantics and derive richer multi-grained knowledge. Extensive experiments show that our proposed ULTRA-DP can significantly enhance the performance of hybrid pre-training methods and show the generalizability to other pre-training tasks and backbone architectures.  ( 3 min )
    A General Search-based Framework for Generating Textual Counterfactual Explanations. (arXiv:2211.00369v2 [cs.LG] UPDATED)
    One of the prominent methods for explaining the decision of a machine-learning classifier is by a counterfactual example. Most current algorithms for generating such examples in the textual domain are based on generative language models. Generative models, however, are trained to minimize a specific loss function in order to fulfill certain requirements for the generated texts. Any change in the requirements may necessitate costly retraining, thus potentially limiting their applicability. In this paper, we present a general search-based framework for generating counterfactual explanations in the textual domain. Our framework is model-agnostic, domain-agnostic, anytime, and does not require retraining in order to adapt to changes in the user requirements. We model the task as a search problem in a space where the initial state is the classified text, and the goal state is a text in a given target class. Our framework includes domain-independent modification operators, but can also exploit domain-specific knowledge through specialized operators. The search algorithm attempts to find a text from the target class with minimal user-specified distance from the original classified object.  ( 2 min )
    Persistent Homological State-Space Estimation of Functional Human Brain Networks at Rest. (arXiv:2201.00087v4 [math.AT] UPDATED)
    We present a new data driven topological data analysis (TDA) approach for estimating state spaces in dynamically changing human functional brain networks of human. Our approach penalizes the topological distance between networks and clusters dynamically changing brain networks into topologically distinct states. Our method takes into account the temporal dimension of the data through the Wasserstein distance between networks. Our method is shown to outperform the widely used k-means clustering often used in estimating the state space in brain networks. The method is applied to more accurately determine the state spaces of dynamically changing functional brain networks. Subsequently, we address the question of whether the overall topology of brain networks is a heritable feature using the twin study design. MATLAB code for the method is available at https://github.com/laplcebeltrami/PH-STAT.  ( 2 min )
    A conditional gradient homotopy method with applications to Semidefinite Programming. (arXiv:2207.03101v2 [math.OC] UPDATED)
    We propose a new homotopy-based conditional gradient method for solving convex optimization problems with a large number of simple conic constraints. Instances of this template naturally appear in semidefinite programming problems arising as convex relaxations of combinatorial optimization problems. Our method is a double-loop algorithm in which the conic constraint is treated via a self-concordant barrier, and the inner loop employs a conditional gradient algorithm to approximate the analytic central path, while the outer loop updates the accuracy imposed on the temporal solution and the homotopy parameter. Our theoretical iteration complexity is competitive when confronted to state-of-the-art SDP solvers, with the decisive advantage of cheap projection-free subroutines. Preliminary numerical experiments are provided for illustrating the practical performance of the method.  ( 2 min )
    HetGPT: Harnessing the Power of Prompt Tuning in Pre-Trained Heterogeneous Graph Neural Networks. (arXiv:2310.15318v2 [cs.LG] UPDATED)
    Graphs have emerged as a natural choice to represent and analyze the intricate patterns and rich information of the Web, enabling applications such as online page classification and social recommendation. The prevailing "pre-train, fine-tune" paradigm has been widely adopted in graph machine learning tasks, particularly in scenarios with limited labeled nodes. However, this approach often exhibits a misalignment between the training objectives of pretext tasks and those of downstream tasks. This gap can result in the "negative transfer" problem, wherein the knowledge gained from pre-training adversely affects performance in the downstream tasks. The surge in prompt-based learning within Natural Language Processing (NLP) suggests the potential of adapting a "pre-train, prompt" paradigm to graphs as an alternative. However, existing graph prompting techniques are tailored to homogeneous graphs, neglecting the inherent heterogeneity of Web graphs. To bridge this gap, we propose HetGPT, a general post-training prompting framework to improve the predictive performance of pre-trained heterogeneous graph neural networks (HGNNs). The key is the design of a novel prompting function that integrates a virtual class prompt and a heterogeneous feature prompt, with the aim to reformulate downstream tasks to mirror pretext tasks. Moreover, HetGPT introduces a multi-view neighborhood aggregation mechanism, capturing the complex neighborhood structure in heterogeneous graphs. Extensive experiments on three benchmark datasets demonstrate HetGPT's capability to enhance the performance of state-of-the-art HGNNs on semi-supervised node classification.  ( 3 min )
    A predict-and-optimize approach to profit-driven churn prevention. (arXiv:2310.07047v2 [cs.LG] UPDATED)
    In this paper, we introduce a novel predict-and-optimize method for profit-driven churn prevention. We frame the task of targeting customers for a retention campaign as a regret minimization problem. The main objective is to leverage individual customer lifetime values (CLVs) to ensure that only the most valuable customers are targeted. In contrast, many profit-driven strategies focus on churn probabilities while considering average CLVs. This often results in significant information loss due to data aggregation. Our proposed model aligns with the guidelines of Predict-and-Optimize (PnO) frameworks and can be efficiently solved using stochastic gradient descent methods. Results from 12 churn prediction datasets underscore the effectiveness of our approach, which achieves the best average performance compared to other well-established strategies in terms of average profit.  ( 2 min )
    Taming Binarized Neural Networks and Mixed-Integer Programs. (arXiv:2310.04469v2 [cs.LG] UPDATED)
    There has been a great deal of recent interest in binarized neural networks, especially because of their explainability. At the same time, automatic differentiation algorithms such as backpropagation fail for binarized neural networks, which limits their applicability. By reformulating the problem of training binarized neural networks as a subadditive dual of a mixed-integer program, we show that binarized neural networks admit a tame representation. This, in turn, makes it possible to use the framework of Bolte et al. for implicit differentiation, which offers the possibility for practical implementation of backpropagation in the context of binarized neural networks. This approach could also be used for a broader class of mixed-integer programs, beyond the training of binarized neural networks, as encountered in symbolic approaches to AI and beyond.  ( 2 min )
    Channel Estimation in RIS-Enabled mmWave Wireless Systems: A Variational Inference Approach. (arXiv:2308.13616v2 [eess.SP] UPDATED)
    Channel estimation in reconfigurable intelligent surfaces (RIS)-aided systems is crucial for optimal configuration of the RIS and various downstream tasks such as user localization. In RIS-aided systems, channel estimation involves estimating two channels for the user-RIS (UE-RIS) and RIS-base station (RIS-BS) links. In the literature, two approaches are proposed: (i) cascaded channel estimation where the two channels are collapsed into a single one and estimated using training signals at the BS, and (ii) separate channel estimation that estimates each channel separately either in a passive or semi-passive RIS setting. In this work, we study the separate channel estimation problem in a fully passive RIS-aided millimeter-wave (mmWave) single-user single-input multiple-output (SIMO) communication system. First, we adopt a variational-inference (VI) approach to jointly estimate the UE-RIS and RIS-BS instantaneous channel state information (I-CSI). In particular, auxiliary posterior distributions of the I-CSI are learned through the maximization of the evidence lower bound. However, estimating the I-CSI for both links in every coherence block results in a high signaling overhead to control the RIS in scenarios with highly mobile users. Thus, we extend our first approach to estimate the slow-varying statistical CSI of the UE-RIS link overcoming the highly variant I-CSI. Precisely, our second method estimates the I-CSI of RIS-BS channel and the UE-RIS channel covariance matrix (CCM) directly from the uplink training signals in a fully passive RIS-aided system. The simulation results demonstrate that using maximum a posteriori channel estimation using the auxiliary posteriors can provide a capacity that approaches the capacity with perfect CSI.  ( 3 min )
    From Hope to Safety: Unlearning Biases of Deep Models via Gradient Penalization in Latent Space. (arXiv:2308.09437v3 [cs.LG] UPDATED)
    Deep Neural Networks are prone to learning spurious correlations embedded in the training data, leading to potentially biased predictions. This poses risks when deploying these models for high-stake decision-making, such as in medical applications. Current methods for post-hoc model correction either require input-level annotations which are only possible for spatially localized biases, or augment the latent feature space, thereby hoping to enforce the right reasons. We present a novel method for model correction on the concept level that explicitly reduces model sensitivity towards biases via gradient penalization. When modeling biases via Concept Activation Vectors, we highlight the importance of choosing robust directions, as traditional regression-based approaches such as Support Vector Machines tend to result in diverging directions. We effectively mitigate biases in controlled and real-world settings on the ISIC, Bone Age, ImageNet and CelebA datasets using VGG, ResNet and EfficientNet architectures. Code is available on https://github.com/frederikpahde/rrclarc.  ( 2 min )
    Towards Understanding the Generalizability of Delayed Stochastic Gradient Descent. (arXiv:2308.09430v2 [cs.LG] UPDATED)
    Stochastic gradient descent (SGD) performed in an asynchronous manner plays a crucial role in training large-scale machine learning models. However, the generalization performance of asynchronous delayed SGD, which is an essential metric for assessing machine learning algorithms, has rarely been explored. Existing generalization error bounds are rather pessimistic and cannot reveal the correlation between asynchronous delays and generalization. In this paper, we investigate sharper generalization error bound for SGD with asynchronous delay $\tau$. Leveraging the generating function analysis tool, we first establish the average stability of the delayed gradient algorithm. Based on this algorithmic stability, we provide upper bounds on the generalization error of $\tilde{\mathcal{O}}(\frac{T-\tau}{n\tau})$ and $\tilde{\mathcal{O}}(\frac{1}{n})$ for quadratic convex and strongly convex problems, respectively, where $T$ refers to the iteration number and $n$ is the amount of training data. Our theoretical results indicate that asynchronous delays reduce the generalization error of the delayed SGD algorithm. Analogous analysis can be generalized to the random delay setting, and the experimental results validate our theoretical findings.  ( 2 min )
    Can Transformers Learn Optimal Filtering for Unknown Systems?. (arXiv:2308.08536v2 [eess.SY] UPDATED)
    Transformer models have shown great success in natural language processing; however, their potential remains mostly unexplored for dynamical systems. In this work, we investigate the optimal output estimation problem using transformers, which generate output predictions using all the past ones. Particularly, we train the transformer using various distinct systems and then evaluate the performance on unseen systems with unknown dynamics. Empirically, the trained transformer adapts exceedingly well to different unseen systems and even matches the optimal performance given by the Kalman filter for linear systems. In more complex settings with non-i.i.d. noise, time-varying dynamics, and nonlinear dynamics like a quadrotor system with unknown parameters, transformers also demonstrate promising results. To support our experimental findings, we provide statistical guarantees that quantify the amount of training data required for the transformer to achieve a desired excess risk. Finally, we point out some limitations by identifying two classes of problems that lead to degraded performance, highlighting the need for caution when using transformers for control and estimation.  ( 2 min )
    Few-shot Class-incremental Learning: A Survey. (arXiv:2308.06764v2 [cs.LG] UPDATED)
    Few-shot Class-Incremental Learning (FSCIL) presents a unique challenge in Machine Learning (ML), as it necessitates the Incremental Learning (IL) of new classes from sparsely labeled training samples without forgetting previous knowledge. While this field has seen recent progress, it remains an active exploration area. This paper aims to provide a comprehensive and systematic review of FSCIL. In our in-depth examination, we delve into various facets of FSCIL, encompassing the problem definition, the discussion of the primary challenges of unreliable empirical risk minimization and the stability-plasticity dilemma, general schemes, and relevant problems of IL and Few-shot Learning (FSL). Besides, we offer an overview of benchmark datasets and evaluation metrics. Furthermore, we introduce the Few-shot Class-incremental Classification (FSCIC) methods from data-based, structure-based, and optimization-based approaches and the Few-shot Class-incremental Object Detection (FSCIOD) methods from anchor-free and anchor-based approaches. Beyond these, we present several promising research directions within FSCIL that merit further investigation.  ( 2 min )
    Does Visual Pretraining Help End-to-End Reasoning?. (arXiv:2307.08506v2 [cs.CV] UPDATED)
    We aim to investigate whether end-to-end learning of visual reasoning can be achieved with general-purpose neural networks, with the help of visual pretraining. A positive result would refute the common belief that explicit visual abstraction (e.g. object detection) is essential for compositional generalization on visual reasoning, and confirm the feasibility of a neural network "generalist" to solve visual recognition and reasoning tasks. We propose a simple and general self-supervised framework which "compresses" each video frame into a small set of tokens with a transformer network, and reconstructs the remaining frames based on the compressed temporal context. To minimize the reconstruction loss, the network must learn a compact representation for each image, as well as capture temporal dynamics and object permanence from temporal context. We perform evaluation on two visual reasoning benchmarks, CATER and ACRE. We observe that pretraining is essential to achieve compositional generalization for end-to-end visual reasoning. Our proposed framework outperforms traditional supervised pretraining, including image classification and explicit object detection, by large margins.  ( 2 min )
    A Survey on Blood Pressure Measurement Technologies: Addressing Potential Sources of Bias. (arXiv:2306.08451v3 [physics.med-ph] UPDATED)
    Regular blood pressure (BP) monitoring in clinical and ambulatory settings plays a crucial role in the prevention, diagnosis, treatment, and management of cardiovascular diseases. Recently, the widespread adoption of ambulatory BP measurement devices has been driven predominantly by the increased prevalence of hypertension and its associated risks and clinical conditions. Recent guidelines advocate for regular BP monitoring as part of regular clinical visits or even at home. This increased utilization of BP measurement technologies has brought up significant concerns, regarding the accuracy of reported BP values across settings. In this survey, focusing mainly on cuff-based BP monitoring technologies, we highlight how BP measurements can demonstrate substantial biases and variances due to factors such as measurement and device errors, demographics, and body habitus. With these inherent biases, the development of a new generation of cuff-based BP devices which use artificial-intelligence (AI) has significant potential. We present future avenues where AI-assisted technologies can leverage the extensive clinical literature on BP-related studies together with the large collections of BP records available in electronic health records. These resources can be combined with machine learning approaches, including deep learning and Bayesian inference, to remove BP measurement biases and to provide individualized BP-related cardiovascular risk indexes.  ( 3 min )
    On the Expected Size of Conformal Prediction Sets. (arXiv:2306.07254v2 [stat.ML] UPDATED)
    While conformal predictors reap the benefits of rigorous statistical guarantees on their error frequency, the size of their corresponding prediction sets is critical to their practical utility. Unfortunately, there is currently a lack of finite-sample analysis and guarantees for their prediction set sizes. To address this shortfall, we theoretically quantify the expected size of the prediction sets under the split conformal prediction framework. As this precise formulation cannot usually be calculated directly, we further derive point estimates and high-probability interval bounds that can be empirically computed, providing a practical method for characterizing the expected set size. We corroborate the efficacy of our results with experiments on real-world datasets for both regression and classification problems.  ( 2 min )
    Multi-modal Latent Diffusion. (arXiv:2306.04445v2 [cs.LG] UPDATED)
    Multi-modal data-sets are ubiquitous in modern applications, and multi-modal Variational Autoencoders are a popular family of models that aim to learn a joint representation of the different modalities. However, existing approaches suffer from a coherence-quality tradeoff, where models with good generation quality lack generative coherence across modalities, and vice versa. We discuss the limitations underlying the unsatisfactory performance of existing methods, to motivate the need for a different approach. We propose a novel method that uses a set of independently trained, uni-modal, deterministic autoencoders. Individual latent variables are concatenated into a common latent space, which is fed to a masked diffusion model to enable generative modeling. We also introduce a new multi-time training method to learn the conditional score network for multi-modal diffusion. Our methodology substantially outperforms competitors in both generation quality and coherence, as shown through an extensive experimental campaign.  ( 2 min )
    Sequential Principal-Agent Problems with Communication: Efficient Computation and Learning. (arXiv:2306.03832v2 [cs.GT] UPDATED)
    We study a sequential decision making problem between a principal and an agent with incomplete information on both sides. In this model, the principal and the agent interact in a stochastic environment, and each is privy to observations about the state not available to the other. The principal has the power of commitment, both to elicit information from the agent and to provide signals about her own information. The principal and the agent communicate their signals to each other, and select their actions independently based on this communication. Each player receives a payoff based on the state and their joint actions, and the environment moves to a new state. The interaction continues over a finite time horizon, and both players act to optimize their own total payoffs over the horizon. Our model encompasses as special cases stochastic games of incomplete information and POMDPs, as well as sequential Bayesian persuasion and mechanism design problems. We study both computation of optimal policies and learning in our setting. While the general problems are computationally intractable, we study algorithmic solutions under a conditional independence assumption on the underlying state-observation distributions. We present a polynomial-time algorithm to compute the principal's optimal policy up to an additive approximation. Additionally, we show an efficient learning algorithm in the case where the transition probabilities are not known beforehand. The algorithm guarantees sublinear regret for both players.  ( 3 min )
    Learning Linear Causal Representations from Interventions under General Nonlinear Mixing. (arXiv:2306.02235v2 [cs.LG] UPDATED)
    We study the problem of learning causal representations from unknown, latent interventions in a general setting, where the latent distribution is Gaussian but the mixing function is completely general. We prove strong identifiability results given unknown single-node interventions, i.e., without having access to the intervention targets. This generalizes prior works which have focused on weaker classes, such as linear maps or paired counterfactual data. This is also the first instance of causal identifiability from non-paired interventions for deep neural network embeddings. Our proof relies on carefully uncovering the high-dimensional geometric structure present in the data distribution after a non-linear density transformation, which we capture by analyzing quadratic forms of precision matrices of the latent distributions. Finally, we propose a contrastive algorithm to identify the latent variables in practice and evaluate its performance on various tasks.  ( 2 min )
    Not All Neuro-Symbolic Concepts Are Created Equal: Analysis and Mitigation of Reasoning Shortcuts. (arXiv:2305.19951v2 [cs.LG] UPDATED)
    Neuro-Symbolic (NeSy) predictive models hold the promise of improved compliance with given constraints, systematic generalization, and interpretability, as they allow to infer labels that are consistent with some prior knowledge by reasoning over high-level concepts extracted from sub-symbolic inputs. It was recently shown that NeSy predictors are affected by reasoning shortcuts: they can attain high accuracy but by leveraging concepts with unintended semantics, thus coming short of their promised advantages. Yet, a systematic characterization of reasoning shortcuts and of potential mitigation strategies is missing. This work fills this gap by characterizing them as unintended optima of the learning objective and identifying four key conditions behind their occurrence. Based on this, we derive several natural mitigation strategies, and analyze their efficacy both theoretically and empirically. Our analysis shows reasoning shortcuts are difficult to deal with, casting doubts on the trustworthiness and interpretability of existing NeSy solutions.  ( 2 min )
    How Two-Layer Neural Networks Learn, One (Giant) Step at a Time. (arXiv:2305.18270v3 [stat.ML] UPDATED)
    We investigate theoretically how the features of a two-layer neural network adapt to the structure of the target function through a few large batch gradient descent steps, leading to improvement in the approximation capacity with respect to the initialization. We compare the influence of batch size and that of multiple (but finitely many) steps. For a single gradient step, a batch of size $n = \mathcal{O}(d)$ is both necessary and sufficient to align with the target function, although only a single direction can be learned. In contrast, $n = \mathcal{O}(d^2)$ is essential for neurons to specialize to multiple relevant directions of the target with a single gradient step. Even in this case, we show there might exist ``hard'' directions requiring $n = \mathcal{O}(d^\ell)$ samples to be learned, where $\ell$ is known as the leap index of the target. The picture drastically improves over multiple gradient steps: we show that a batch-size of $n = \mathcal{O}(d)$ is indeed enough to learn multiple target directions satisfying a staircase property, where more and more directions can be learned over time. Finally, we discuss how these directions allows to drastically improve the approximation capacity and generalization error over the initialization, illustrating a separation of scale between the random features/lazy regime, and the feature learning regime. Our technical analysis leverages a combination of techniques related to concentration, projection-based conditioning, and Gaussian equivalence which we believe are of independent interest. By pinning down the conditions necessary for specialization and learning, our results highlight the interaction between batch size and number of iterations, and lead to a hierarchical depiction where learning performance exhibits a stairway to accuracy over time and batch size, shedding new light on how neural networks adapt to features of the data.  ( 3 min )
    GLOBE-CE: A Translation-Based Approach for Global Counterfactual Explanations. (arXiv:2305.17021v2 [cs.LG] UPDATED)
    Counterfactual explanations have been widely studied in explainability, with a range of application dependent methods prominent in fairness, recourse and model understanding. The major shortcoming associated with these methods, however, is their inability to provide explanations beyond the local or instance-level. While many works touch upon the notion of a global explanation, typically suggesting to aggregate masses of local explanations in the hope of ascertaining global properties, few provide frameworks that are both reliable and computationally tractable. Meanwhile, practitioners are requesting more efficient and interactive explainability tools. We take this opportunity to propose Global & Efficient Counterfactual Explanations (GLOBE-CE), a flexible framework that tackles the reliability and scalability issues associated with current state-of-the-art, particularly on higher dimensional datasets and in the presence of continuous features. Furthermore, we provide a unique mathematical analysis of categorical feature translations, utilising it in our method. Experimental evaluation with publicly available datasets and user studies demonstrate that GLOBE-CE performs significantly better than the current state-of-the-art across multiple metrics (e.g., speed, reliability).  ( 2 min )
    FairGen: Towards Fair Graph Generation. (arXiv:2303.17743v3 [cs.LG] UPDATED)
    There have been tremendous efforts over the past decades dedicated to the generation of realistic graphs in a variety of domains, ranging from social networks to computer networks, from gene regulatory networks to online transaction networks. Despite the remarkable success, the vast majority of these works are unsupervised in nature and are typically trained to minimize the expected graph reconstruction loss, which would result in the representation disparity issue in the generated graphs, i.e., the protected groups (often minorities) contribute less to the objective and thus suffer from systematically higher errors. In this paper, we aim to tailor graph generation to downstream mining tasks by leveraging label information and user-preferred parity constraints. In particular, we start from the investigation of representation disparity in the context of graph generative models. To mitigate the disparity, we propose a fairness-aware graph generative model named FairGen. Our model jointly trains a label-informed graph generation module and a fair representation learning module by progressively learning the behaviors of the protected and unprotected groups, from the `easy' concepts to the `hard' ones. In addition, we propose a generic context sampling strategy for graph generative models, which is proven to be capable of fairly capturing the contextual information of each group with a high probability. Experimental results on seven real-world data sets, including web-based graphs, demonstrate that FairGen (1) obtains performance on par with state-of-the-art graph generative models across nine network properties, (2) mitigates the representation disparity issues in the generated graphs, and (3) substantially boosts the model performance by up to 17% in downstream tasks via data augmentation.  ( 3 min )
    FFT-based Dynamic Token Mixer for Vision. (arXiv:2303.03932v2 [cs.CV] UPDATED)
    Multi-head-self-attention (MHSA)-equipped models have achieved notable performance in computer vision. Their computational complexity is proportional to quadratic numbers of pixels in input feature maps, resulting in slow processing, especially when dealing with high-resolution images. New types of token-mixer are proposed as an alternative to MHSA to circumvent this problem: an FFT-based token-mixer involves global operations similar to MHSA but with lower computational complexity. However, despite its attractive properties, the FFT-based token-mixer has not been carefully examined in terms of its compatibility with the rapidly evolving MetaFormer architecture. Here, we propose a novel token-mixer called Dynamic Filter and novel image recognition models, DFFormer and CDFFormer, to close the gaps above. The results of image classification and downstream tasks, analysis, and visualization show that our models are helpful. Notably, their throughput and memory efficiency when dealing with high-resolution image recognition is remarkable. Our results indicate that Dynamic Filter is one of the token-mixer options that should be seriously considered. The code is available at https://github.com/okojoalg/dfformer  ( 2 min )
    DSD$^2$: Can We Dodge Sparse Double Descent and Compress the Neural Network Worry-Free?. (arXiv:2303.01213v2 [cs.LG] UPDATED)
    Neoteric works have shown that modern deep learning models can exhibit a sparse double descent phenomenon. Indeed, as the sparsity of the model increases, the test performance first worsens since the model is overfitting the training data; then, the overfitting reduces, leading to an improvement in performance, and finally, the model begins to forget critical information, resulting in underfitting. Such a behavior prevents using traditional early stop criteria. In this work, we have three key contributions. First, we propose a learning framework that avoids such a phenomenon and improves generalization. Second, we introduce an entropy measure providing more insights into the insurgence of this phenomenon and enabling the use of traditional stop criteria. Third, we provide a comprehensive quantitative analysis of contingent factors such as re-initialization methods, model width and depth, and dataset noise. The contributions are supported by empirical evidence in typical setups. Our code is available at https://github.com/VGCQ/DSD2.  ( 2 min )
    Continuous-Time Functional Diffusion Processes. (arXiv:2303.00800v3 [cs.LG] UPDATED)
    We introduce Functional Diffusion Processes (FDPs), which generalize score-based diffusion models to infinite-dimensional function spaces. FDPs require a new mathematical framework to describe the forward and backward dynamics, and several extensions to derive practical training objectives. These include infinite-dimensional versions of Girsanov theorem, in order to be able to compute an ELBO, and of the sampling theorem, in order to guarantee that functional evaluations in a countable set of points are equivalent to infinite-dimensional functions. We use FDPs to build a new breed of generative models in function spaces, which do not require specialized network architectures, and that can work with any kind of continuous data. Our results on real data show that FDPs achieve high-quality image generation, using a simple MLP architecture with orders of magnitude fewer parameters than existing diffusion models.  ( 2 min )
    Simplifying Momentum-based Positive-definite Submanifold Optimization with Applications to Deep Learning. (arXiv:2302.09738v9 [stat.ML] UPDATED)
    Riemannian submanifold optimization with momentum is computationally challenging because, to ensure that the iterates remain on the submanifold, we often need to solve difficult differential equations. Here, we simplify such difficulties for a class of sparse or structured symmetric positive-definite matrices with the affine-invariant metric. We do so by proposing a generalized version of the Riemannian normal coordinates that dynamically orthonormalizes the metric and locally converts the problem into an unconstrained problem in the Euclidean space. We use our approach to simplify existing approaches for structured covariances and develop matrix-inverse-free $2^\text{nd}$-order optimizers for deep learning with low precision by using only matrix multiplications. Code: https://github.com/yorkerlin/StructuredNGD-DL  ( 2 min )
    CrystalBox: Future-Based Explanations for Input-Driven Deep RL Systems. (arXiv:2302.13483v3 [cs.LG] UPDATED)
    We present CrystalBox, a novel, model-agnostic, posthoc explainability framework for Deep Reinforcement Learning (DRL) controllers in the large family of input-driven environments which includes computer systems. We combine the natural decomposability of reward functions in input-driven environments with the explanatory power of decomposed returns. We propose an efficient algorithm to generate future-based explanations across both discrete and continuous control environments. Using applications such as adaptive bitrate streaming and congestion control, we demonstrate CrystalBox's capability to generate high-fidelity explanations. We further illustrate its higher utility across three practical use cases: contrastive explanations, network observability, and guided reward design, as opposed to prior explainability techniques that identify salient features.  ( 2 min )
    Disentangled Representation for Causal Mediation Analysis. (arXiv:2302.09694v2 [cs.LG] UPDATED)
    Estimating direct and indirect causal effects from observational data is crucial to understanding the causal mechanisms and predicting the behaviour under different interventions. Causal mediation analysis is a method that is often used to reveal direct and indirect effects. Deep learning shows promise in mediation analysis, but the current methods only assume latent confounders that affect treatment, mediator and outcome simultaneously, and fail to identify different types of latent confounders (e.g., confounders that only affect the mediator or outcome). Furthermore, current methods are based on the sequential ignorability assumption, which is not feasible for dealing with multiple types of latent confounders. This work aims to circumvent the sequential ignorability assumption and applies the piecemeal deconfounding assumption as an alternative. We propose the Disentangled Mediation Analysis Variational AutoEncoder (DMAVAE), which disentangles the representations of latent confounders into three types to accurately estimate the natural direct effect, natural indirect effect and total effect. Experimental results show that the proposed method outperforms existing methods and has strong generalisation ability. We further apply the method to a real-world dataset to show its potential application.  ( 2 min )
    Variational Inference on the Final-Layer Output of Neural Networks. (arXiv:2302.02420v4 [cs.LG] UPDATED)
    Traditional neural networks are simple to train but they typically produce overconfident predictions. In contrast, Bayesian neural networks provide good uncertainty quantification but optimizing them is time consuming due to the large parameter space. This paper proposes to combine the advantages of both approaches by performing Variational Inference in the Final layer Output space (VIFO), because the output space is much smaller than the parameter space. We use neural networks to learn the mean and the variance of the probabilistic output. Like standard, non-Beyesian models, VIFO enjoys simple training and one can use Rademacher complexity to provide risk bounds for the model. On the other hand, using the Bayesian formulation we incorporate collapsed variational inference with VIFO which significantly improves the performance in practice. Experiments show that VIFO and ensembles of VIFO provide a good tradeoff in terms of run time and uncertainty quantification, especially for out of distribution data.  ( 2 min )
    Editing Language Model-based Knowledge Graph Embeddings. (arXiv:2301.10405v7 [cs.CL] UPDATED)
    Recently decades have witnessed the empirical success of framing Knowledge Graph (KG) embeddings via language models. However, language model-based KG embeddings are usually deployed as static artifacts, making them difficult to modify post-deployment without re-training after deployment. To address this issue, we propose a new task of editing language model-based KG embeddings in this paper. This task is designed to facilitate rapid, data-efficient updates to KG embeddings without compromising the performance of other aspects. We build four new datasets: E-FB15k237, A-FB15k237, E-WN18RR, and A-WN18RR, and evaluate several knowledge editing baselines demonstrating the limited ability of previous models to handle the proposed challenging task. We further propose a simple yet strong baseline dubbed KGEditor, which utilizes additional parametric layers of the hypernetwork to edit/add facts. Our comprehensive experimental results reveal that KGEditor excels in updating specific facts without impacting the overall performance, even when faced with limited training resources. Code and datasets are available in https://github.com/zjunlp/PromptKG/tree/main/deltaKG.  ( 3 min )
    Semiparametric Regression for Spatial Data via Deep Learning. (arXiv:2301.03747v2 [stat.ML] UPDATED)
    In this work, we propose a deep learning-based method to perform semiparametric regression analysis for spatially dependent data. To be specific, we use a sparsely connected deep neural network with rectified linear unit (ReLU) activation function to estimate the unknown regression function that describes the relationship between response and covariates in the presence of spatial dependence. Under some mild conditions, the estimator is proven to be consistent, and the rate of convergence is determined by three factors: (1) the architecture of neural network class, (2) the smoothness and (intrinsic) dimension of true mean function, and (3) the magnitude of spatial dependence. Our method can handle well large data set owing to the stochastic gradient descent optimization algorithm. Simulation studies on synthetic data are conducted to assess the finite sample performance, the results of which indicate that the proposed method is capable of picking up the intricate relationship between response and covariates. Finally, a real data analysis is provided to demonstrate the validity and effectiveness of the proposed method.  ( 2 min )
    Simple Binary Hypothesis Testing under Local Differential Privacy and Communication Constraints. (arXiv:2301.03566v2 [math.ST] UPDATED)
    We study simple binary hypothesis testing under both local differential privacy (LDP) and communication constraints. We qualify our results as either minimax optimal or instance optimal: the former hold for the set of distribution pairs with prescribed Hellinger divergence and total variation distance, whereas the latter hold for specific distribution pairs. For the sample complexity of simple hypothesis testing under pure LDP constraints, we establish instance-optimal bounds for distributions with binary support; minimax-optimal bounds for general distributions; and (approximately) instance-optimal, computationally efficient algorithms for general distributions. When both privacy and communication constraints are present, we develop instance-optimal, computationally efficient algorithms that achieve the minimum possible sample complexity (up to universal constants). Our results on instance-optimal algorithms hinge on identifying the extreme points of the joint range set $\mathcal A$ of two distributions $p$ and $q$, defined as $\mathcal A := \{(\mathbf T p, \mathbf T q) | \mathbf T \in \mathcal C\}$, where $\mathcal C$ is the set of channels characterizing the constraints.  ( 2 min )
    Efficient, Direct, and Restricted Black-Box Graph Evasion Attacks to Any-Layer Graph Neural Networks via Influence Function. (arXiv:2009.00203v3 [cs.CR] UPDATED)
    Graph neural network (GNN), the mainstream method to learn on graph data, is vulnerable to graph evasion attacks, where an attacker slightly perturbing the graph structure can fool trained GNN models. Existing work has at least one of the following drawbacks: 1) limited to directly attack two-layer GNNs; 2) inefficient; and 3) impractical, as they need to know full or part of GNN model parameters. We address the above drawbacks and propose an influence-based \emph{efficient, direct, and restricted black-box} evasion attack to \emph{any-layer} GNNs. Specifically, we first introduce two influence functions, i.e., feature-label influence and label influence, that are defined on GNNs and label propagation (LP), respectively. Then we observe that GNNs and LP are strongly connected in terms of our defined influences. Based on this, we can then reformulate the evasion attack to GNNs as calculating label influence on LP, which is \emph{inherently} applicable to any-layer GNNs, while no need to know information about the internal GNN model. Finally, we propose an efficient algorithm to calculate label influence. Experimental results on various graph datasets show that, compared to state-of-the-art white-box attacks, our attack can achieve comparable attack performance, but has a 5-50x speedup when attacking two-layer GNNs. Moreover, our attack is effective to attack multi-layer GNNs\footnote{Source code and full version is in the link: \url{https://github.com/ventr1c/InfAttack}}.  ( 3 min )
    On the Compression of Neural Networks Using $\ell_0$-Norm Regularization and Weight Pruning. (arXiv:2109.05075v3 [cs.LG] UPDATED)
    Despite the growing availability of high-capacity computational platforms, implementation complexity still has been a great concern for the real-world deployment of neural networks. This concern is not exclusively due to the huge costs of state-of-the-art network architectures, but also due to the recent push towards edge intelligence and the use of neural networks in embedded applications. In this context, network compression techniques have been gaining interest due to their ability for reducing deployment costs while keeping inference accuracy at satisfactory levels. The present paper is dedicated to the development of a novel compression scheme for neural networks. To this end, a new form of $\ell_0$-norm-based regularization is firstly developed, which is capable of inducing strong sparseness in the network during training. Then, targeting the smaller weights of the trained network with pruning techniques, smaller yet highly effective networks can be obtained. The proposed compression scheme also involves the use of $\ell_2$-norm regularization to avoid overfitting as well as fine tuning to improve the performance of the pruned network. Experimental results are presented aiming to show the effectiveness of the proposed scheme as well as to make comparisons with competing approaches.  ( 3 min )
    High-order Tensor Pooling with Attention for Action Recognition. (arXiv:2110.05216v4 [cs.CV] UPDATED)
    We aim at capturing high-order statistics of feature vectors formed by a neural network, and propose end-to-end second- and higher-order pooling to form a tensor descriptor. Tensor descriptors require a robust similarity measure due to low numbers of aggregated vectors and the burstiness phenomenon, when a given feature appears more/less frequently than statistically expected. The Heat Diffusion Process (HDP) on a graph Laplacian is closely related to the Eigenvalue Power Normalization (EPN) of the covariance/autocorrelation matrix, whose inverse forms a loopy graph Laplacian. We show that the HDP and the EPN play the same role, i.e., to boost or dampen the magnitude of the eigenspectrum thus preventing the burstiness. We equip higher-order tensors with EPN which acts as a spectral detector of higher-order occurrences to prevent burstiness. We also prove that for a tensor of order r built from d dimensional feature descriptors, such a detector gives the likelihood if at least one higher-order occurrence is 'projected' into one of binom(d,r) subspaces represented by the tensor; thus forming a tensor power normalization metric endowed with binom(d,r) such 'detectors'. For experimental contributions, we apply several second- and higher-order pooling variants to action recognition, provide previously not presented comparisons of such pooling variants, and show state-of-the-art results on HMDB-51, YUP++ and MPII Cooking Activities.  ( 3 min )
    Rethinking Transfer Learning for Medical Image Classification. (arXiv:2106.05152v7 [eess.IV] UPDATED)
    Transfer learning (TL) from pretrained deep models is a standard practice in modern medical image classification (MIC). However, what levels of features to be reused are problem-dependent, and uniformly finetuning all layers of pretrained models may be suboptimal. This insight has partly motivated the recent differential TL strategies, such as TransFusion (TF) and layer-wise finetuning (LWFT), which treat the layers in the pretrained models differentially. In this paper, we add one more strategy into this family, called TruncatedTL, which reuses and finetunes appropriate bottom layers and directly discards the remaining layers. This yields not only superior MIC performance but also compact models for efficient inference, compared to other differential TL methods. Our code is available at: https://github.com/sun-umn/TTL  ( 2 min )
    Student as an Inherent Denoiser of Noisy Teacher. (arXiv:2312.10185v1 [cs.LG])
    Knowledge distillation (KD) has been widely employed to transfer knowledge from a large language model (LLM) to a specialized model in low-data regimes through pseudo label learning. However, pseudo labels generated by teacher models are usually noisy and may influence KD performance. This study delves into KD with noisy teachers and uncovers that the student model can already generate more accurate predictions than the teacher labels used to train it during KD, indicating its inherent ability to denoise noisy teacher labels. Motivated by this finding, we propose Peer-Advised KD to improve vanilla KD from noisy teachers. Experiments show that Peer-Advised KD can outperform LLM by approximately 5% with 50 human-labeled data, and even competitive to standard supervised finetuning with 750 human-labeled data.  ( 2 min )
    TSRNet: Simple Framework for Real-time ECG Anomaly Detection with Multimodal Time and Spectrogram Restoration Network. (arXiv:2312.10187v1 [eess.SP])
    The electrocardiogram (ECG) is a valuable signal used to assess various aspects of heart health, such as heart rate and rhythm. It plays a crucial role in identifying cardiac conditions and detecting anomalies in ECG data. However, distinguishing between normal and abnormal ECG signals can be a challenging task. In this paper, we propose an approach that leverages anomaly detection to identify unhealthy conditions using solely normal ECG data for training. Furthermore, to enhance the information available and build a robust system, we suggest considering both the time series and time-frequency domain aspects of the ECG signal. As a result, we introduce a specialized network called the Multimodal Time and Spectrogram Restoration Network (TSRNet) designed specifically for detecting anomalies in ECG signals. TSRNet falls into the category of restoration-based anomaly detection and draws inspiration from both the time series and spectrogram domains. By extracting representations from both domains, TSRNet effectively captures the comprehensive characteristics of the ECG signal. This approach enables the network to learn robust representations with superior discrimination abilities, allowing it to distinguish between normal and abnormal ECG patterns more effectively. Furthermore, we introduce a novel inference method, termed Peak-based Error, that specifically focuses on ECG peaks, a critical component in detecting abnormalities. The experimental result on the large-scale dataset PTB-XL has demonstrated the effectiveness of our approach in ECG anomaly detection, while also prioritizing efficiency by minimizing the number of trainable parameters. Our code is available at https://github.com/UARK-AICV/TSRNet.  ( 3 min )
    Automatic nonlinear MPC approximation with closed-loop guarantees. (arXiv:2312.10199v1 [eess.SY])
    In this paper, we address the problem of automatically approximating nonlinear model predictive control (MPC) schemes with closed-loop guarantees. First, we discuss how this problem can be reduced to a function approximation problem, which we then tackle by proposing ALKIA-X, the Adaptive and Localized Kernel Interpolation Algorithm with eXtrapolated reproducing kernel Hilbert space norm. ALKIA-X is a non-iterative algorithm that ensures numerically well-conditioned computations, a fast-to-evaluate approximating function, and the guaranteed satisfaction of any desired bound on the approximation error. Hence, ALKIA-X automatically computes an explicit function that approximates the MPC, yielding a controller suitable for safety-critical systems and high sampling rates. In a numerical experiment, we apply ALKIA-X to a nonlinear MPC scheme, demonstrating reduced offline computation and online evaluation time compared to a state-of-the-art method.  ( 2 min )
    Towards the Unification of Generative and Discriminative Visual Foundation Model: A Survey. (arXiv:2312.10163v1 [cs.CV])
    The advent of foundation models, which are pre-trained on vast datasets, has ushered in a new era of computer vision, characterized by their robustness and remarkable zero-shot generalization capabilities. Mirroring the transformative impact of foundation models like large language models (LLMs) in natural language processing, visual foundation models (VFMs) have become a catalyst for groundbreaking developments in computer vision. This review paper delineates the pivotal trajectories of VFMs, emphasizing their scalability and proficiency in generative tasks such as text-to-image synthesis, as well as their adeptness in discriminative tasks including image segmentation. While generative and discriminative models have historically charted distinct paths, we undertake a comprehensive examination of the recent strides made by VFMs in both domains, elucidating their origins, seminal breakthroughs, and pivotal methodologies. Additionally, we collate and discuss the extensive resources that facilitate the development of VFMs and address the challenges that pave the way for future research endeavors. A crucial direction for forthcoming innovation is the amalgamation of generative and discriminative paradigms. The nascent application of generative models within discriminative contexts signifies the early stages of this confluence. This survey aspires to be a contemporary compendium for scholars and practitioners alike, charting the course of VFMs and illuminating their multifaceted landscape.  ( 2 min )
    ICD-LM: Configuring Vision-Language In-Context Demonstrations by Language Modeling. (arXiv:2312.10104v1 [cs.CV])
    This paper studies how to configure powerful In-Context Demonstration (ICD) sequences for a Large Vision-Language Model (LVLM) to solve Vision-Language tasks through In-Context Learning (ICL). After observing that configuring an ICD sequence is a mirror process of composing a sentence, i.e., just as a sentence can be composed word by word via a Language Model, an ICD sequence can also be configured one by one. Consequently, we introduce an ICD Language Model (ICD-LM) specifically designed to generate effective ICD sequences. This involves creating a dataset of hand-crafted ICD sequences for various query samples and using it to train the ICD-LM. Our approach, diverging from traditional methods in NLP that select and order ICDs separately, enables to simultaneously learn how to select and order ICDs, enhancing the effect of the sequences. Moreover, during data construction, we use the LVLM intended for ICL implementation to validate the strength of each ICD sequence, resulting in a model-specific dataset and the ICD-LM trained by this dataset is also model-specific. We validate our methodology through experiments in Visual Question Answering and Image Captioning, confirming the viability of using a Language Model for ICD configuration. Our comprehensive ablation studies further explore the impact of various dataset construction and ICD-LM development settings on the outcomes. The code is given in https://github.com/ForJadeForest/ICD-LM.  ( 3 min )
    An Information-Flow Perspective on Algorithmic Fairness. (arXiv:2312.10128v1 [cs.CR])
    This work presents insights gained by investigating the relationship between algorithmic fairness and the concept of secure information flow. The problem of enforcing secure information flow is well-studied in the context of information security: If secret information may "flow" through an algorithm or program in such a way that it can influence the program's output, then that is considered insecure information flow as attackers could potentially observe (parts of) the secret. There is a strong correspondence between secure information flow and algorithmic fairness: if protected attributes such as race, gender, or age are treated as secret program inputs, then secure information flow means that these ``secret'' attributes cannot influence the result of a program. While most research in algorithmic fairness evaluation concentrates on studying the impact of algorithms (often treating the algorithm as a black-box), the concepts derived from information flow can be used both for the analysis of disparate treatment as well as disparate impact w.r.t. a structural causal model. In this paper, we examine the relationship between quantitative as well as qualitative information-flow properties and fairness. Moreover, based on this duality, we derive a new quantitative notion of fairness called fairness spread, which can be easily analyzed using quantitative information flow and which strongly relates to counterfactual fairness. We demonstrate that off-the-shelf tools for information-flow properties can be used in order to formally analyze a program's algorithmic fairness properties, including the new notion of fairness spread as well as established notions such as demographic parity.  ( 3 min )
    Revisiting the Entropy Semiring for Neural Speech Recognition. (arXiv:2312.10087v1 [eess.AS])
    In streaming settings, speech recognition models have to map sub-sequences of speech to text before the full audio stream becomes available. However, since alignment information between speech and text is rarely available during training, models need to learn it in a completely self-supervised way. In practice, the exponential number of possible alignments makes this extremely challenging, with models often learning peaky or sub-optimal alignments. Prima facie, the exponential nature of the alignment space makes it difficult to even quantify the uncertainty of a model's alignment distribution. Fortunately, it has been known for decades that the entropy of a probabilistic finite state transducer can be computed in time linear to the size of the transducer via a dynamic programming reduction based on semirings. In this work, we revisit the entropy semiring for neural speech recognition models, and show how alignment entropy can be used to supervise models through regularization or distillation. We also contribute an open-source implementation of CTC and RNN-T in the semiring framework that includes numerically stable and highly parallel variants of the entropy semiring. Empirically, we observe that the addition of alignment distillation improves the accuracy and latency of an already well-optimized teacher-student distillation model, achieving state-of-the-art performance on the Librispeech dataset in the streaming scenario.  ( 2 min )
    Constrained Meta-Reinforcement Learning for Adaptable Safety Guarantee with Differentiable Convex Programming. (arXiv:2312.10230v1 [cs.AI])
    Despite remarkable achievements in artificial intelligence, the deployability of learning-enabled systems in high-stakes real-world environments still faces persistent challenges. For example, in safety-critical domains like autonomous driving, robotic manipulation, and healthcare, it is crucial not only to achieve high performance but also to comply with given constraints. Furthermore, adaptability becomes paramount in non-stationary domains, where environmental parameters are subject to change. While safety and adaptability are recognized as key qualities for the new generation of AI, current approaches have not demonstrated effective adaptable performance in constrained settings. Hence, this paper breaks new ground by studying the unique challenges of ensuring safety in non-stationary environments by solving constrained problems through the lens of the meta-learning approach (learning-to-learn). While unconstrained meta-learning al-ready encounters complexities in end-to-end differentiation of the loss due to the bi-level nature, its constrained counterpart introduces an additional layer of difficulty, since the constraints imposed on task-level updates complicate the differentiation process. To address the issue, we first employ successive convex-constrained policy updates across multiple tasks with differentiable convexprogramming, which allows meta-learning in constrained scenarios by enabling end-to-end differentiation. This approach empowers the agent to rapidly adapt to new tasks under non-stationarity while ensuring compliance with safety constraints.  ( 2 min )
    Privacy-Aware Document Visual Question Answering. (arXiv:2312.10108v1 [cs.CV])
    Document Visual Question Answering (DocVQA) is a fast growing branch of document understanding. Despite the fact that documents contain sensitive or copyrighted information, none of the current DocVQA methods offers strong privacy guarantees. In this work, we explore privacy in the domain of DocVQA for the first time. We highlight privacy issues in state of the art multi-modal LLM models used for DocVQA, and explore possible solutions. Specifically, we focus on the invoice processing use case as a realistic, widely used scenario for document understanding, and propose a large scale DocVQA dataset comprising invoice documents and associated questions and answers. We employ a federated learning scheme, that reflects the real-life distribution of documents in different businesses, and we explore the use case where the ID of the invoice issuer is the sensitive information to be protected. We demonstrate that non-private models tend to memorise, behaviour that can lead to exposing private information. We then evaluate baseline training schemes employing federated learning and differential privacy in this multi-modal scenario, where the sensitive information might be exposed through any of the two input modalities: vision (document image) or language (OCR tokens). Finally, we design an attack exploiting the memorisation effect of the model, and demonstrate its effectiveness in probing different DocVQA models.  ( 2 min )
    Building symmetries into data-driven manifold dynamics models for complex flows. (arXiv:2312.10235v1 [cs.LG])
    Symmetries in a dynamical system provide an opportunity to dramatically improve the performance of data-driven models. For fluid flows, such models are needed for tasks related to design, understanding, prediction, and control. In this work we exploit the symmetries of the Navier-Stokes equations (NSE) and use simulation data to find the manifold where the long-time dynamics live, which has many fewer degrees of freedom than the full state representation, and the evolution equation for the dynamics on that manifold. We call this method ''symmetry charting''. The first step is to map to a ''fundamental chart'', which is a region in the state space of the flow to which all other regions can be mapped by a symmetry operation. To map to the fundamental chart we identify a set of indicators from the Fourier transform that uniquely identify the symmetries of the system. We then find a low-dimensional coordinate representation of the data in the fundamental chart with the use of an autoencoder. We use a variation called an implicit rank minimizing autoencoder with weight decay, which in addition to compressing the dimension of the data, also gives estimates of how many dimensions are needed to represent the data: i.e. the dimension of the invariant manifold of the long-time dynamics. Finally, we learn dynamics on this manifold with the use of neural ordinary differential equations. We apply symmetry charting to two-dimensional Kolmogorov flow in a chaotic bursting regime. This system has a continuous translation symmetry, and discrete rotation and shift-reflect symmetries. With this framework we observe that less data is needed to learn accurate data-driven models, more robust estimates of the manifold dimension are obtained, equivariance of the NSE is satisfied, better short-time tracking with respect to the true data is observed, and long-time statistics are correctly captured.  ( 3 min )
    The Limits of Fair Medical Imaging AI In The Wild. (arXiv:2312.10083v1 [cs.CY])
    As artificial intelligence (AI) rapidly approaches human-level performance in medical imaging, it is crucial that it does not exacerbate or propagate healthcare disparities. Prior research has established AI's capacity to infer demographic data from chest X-rays, leading to a key concern: do models using demographic shortcuts have unfair predictions across subpopulations? In this study, we conduct a thorough investigation into the extent to which medical AI utilizes demographic encodings, focusing on potential fairness discrepancies within both in-distribution training sets and external test sets. Our analysis covers three key medical imaging disciplines: radiology, dermatology, and ophthalmology, and incorporates data from six global chest X-ray datasets. We confirm that medical imaging AI leverages demographic shortcuts in disease classification. While correcting shortcuts algorithmically effectively addresses fairness gaps to create "locally optimal" models within the original data distribution, this optimality is not true in new test settings. Surprisingly, we find that models with less encoding of demographic attributes are often most "globally optimal", exhibiting better fairness during model evaluation in new test environments. Our work establishes best practices for medical imaging models which maintain their performance and fairness in deployments beyond their initial training contexts, underscoring critical considerations for AI clinical deployments across populations and sites.  ( 2 min )
    Data-Efficient Multimodal Fusion on a Single GPU. (arXiv:2312.10144v1 [cs.LG])
    The goal of multimodal alignment is to learn a single latent space that is shared between multimodal inputs. The most powerful models in this space have been trained using massive datasets of paired inputs and large-scale computational resources, making them prohibitively expensive to train in many practical scenarios. We surmise that existing unimodal encoders pre-trained on large amounts of unimodal data should provide an effective bootstrap to create multimodal models from unimodal ones at much lower costs. We therefore propose FuseMix, a multimodal augmentation scheme that operates on the latent spaces of arbitrary pre-trained unimodal encoders. Using FuseMix for multimodal alignment, we achieve competitive performance -- and in certain cases outperform state-of-the art methods -- in both image-text and audio-text retrieval, with orders of magnitude less compute and data: for example, we outperform CLIP on the Flickr30K text-to-image retrieval task with $\sim \! 600\times$ fewer GPU days and $\sim \! 80\times$ fewer image-text pairs. Additionally, we show how our method can be applied to convert pre-trained text-to-image generative models into audio-to-image ones. Code is available at: https://github.com/layer6ai-labs/fusemix.  ( 2 min )
    Finding Paths for Explainable MOOC Recommendation: A Learner Perspective. (arXiv:2312.10082v1 [cs.IR])
    The increasing availability of Massive Open Online Courses (MOOCs) has created a necessity for personalized course recommendation systems. These systems often combine neural networks with Knowledge Graphs (KGs) to achieve richer representations of learners and courses. While these enriched representations allow more accurate and personalized recommendations, explainability remains a significant challenge which is especially problematic for certain domains with significant impact such as education and online learning. Recently, a novel class of recommender systems that uses reinforcement learning and graph reasoning over KGs has been proposed to generate explainable recommendations in the form of paths over a KG. Despite their accuracy and interpretability on e-commerce datasets, these approaches have scarcely been applied to the educational domain and their use in practice has not been studied. In this work, we propose an explainable recommendation system for MOOCs that uses graph reasoning. To validate the practical implications of our approach, we conducted a user study examining user perceptions of our new explainable recommendations. We demonstrate the generalizability of our approach by conducting experiments on two educational datasets: COCO and Xuetang.  ( 2 min )
    Coupling Fairness and Pruning in a Single Run: a Bi-level Optimization Perspective. (arXiv:2312.10181v1 [cs.LG])
    Deep neural networks have demonstrated remarkable performance in various tasks. With a growing need for sparse deep learning, model compression techniques, especially pruning, have gained significant attention. However, conventional pruning techniques can inadvertently exacerbate algorithmic bias, resulting in unequal predictions. To address this, we define a fair pruning task where a sparse model is derived subject to fairness requirements. In particular, we propose a framework to jointly optimize the pruning mask and weight update processes with fairness constraints. This framework is engineered to compress models that maintain performance while ensuring fairness in a single execution. To this end, we formulate the fair pruning problem as a novel constrained bi-level optimization task and derive efficient and effective solving strategies. We design experiments spanning various datasets and settings to validate our proposed method. Our empirical analysis contrasts our framework with several mainstream pruning strategies, emphasizing our method's superiority in maintaining model fairness, performance, and efficiency.  ( 2 min )
    A Remark on Concept Drift for Dependent Data. (arXiv:2312.10212v1 [cs.LG])
    Concept drift, i.e., the change of the data generating distribution, can render machine learning models inaccurate. Several works address the phenomenon of concept drift in the streaming context usually assuming that consecutive data points are independent of each other. To generalize to dependent data, many authors link the notion of concept drift to time series. In this work, we show that the temporal dependencies are strongly influencing the sampling process. Thus, the used definitions need major modifications. In particular, we show that the notion of stationarity is not suited for this setup and discuss alternatives. We demonstrate that these alternative formal notions describe the observable learning behavior in numerical experiments.  ( 2 min )
    NM-FlowGAN: Modeling sRGB Noise with a Hybrid Approach based on Normalizing Flows and Generative Adversarial Networks. (arXiv:2312.10112v1 [cs.CV])
    Modeling and synthesizing real sRGB noise is crucial for various low-level vision tasks. The distribution of real sRGB noise is highly complex and affected by a multitude of factors, making its accurate modeling extremely challenging. Therefore, recent studies have proposed methods that employ data-driven generative models, such as generative adversarial networks (GAN) and Normalizing Flows. These studies achieve more accurate modeling of sRGB noise compared to traditional noise modeling methods. However, there are performance limitations due to the inherent characteristics of each generative model. To address this issue, we propose NM-FlowGAN, a hybrid approach that exploits the strengths of both GAN and Normalizing Flows. We simultaneously employ a pixel-wise noise modeling network based on Normalizing Flows, and spatial correlation modeling networks based on GAN. In our experiments, our NM-FlowGAN outperforms other baselines on the sRGB noise synthesis task. Moreover, the denoising neural network, trained with synthesized image pairs from our model, also shows superior performance compared to other baselines. Our code is available at: https://github.com/YoungJooHan/NM-FlowGAN  ( 2 min )
    Closing the Gap: Achieving Better Accuracy-Robustness Tradeoffs Against Query-Based Attacks. (arXiv:2312.10132v1 [cs.CV])
    Although promising, existing defenses against query-based attacks share a common limitation: they offer increased robustness against attacks at the price of a considerable accuracy drop on clean samples. In this work, we show how to efficiently establish, at test-time, a solid tradeoff between robustness and accuracy when mitigating query-based attacks. Given that these attacks necessarily explore low-confidence regions, our insight is that activating dedicated defenses, such as RND (Qin et al., NeuRIPS 2021) and Random Image Transformations (Xie et al., ICLR 2018), only for low-confidence inputs is sufficient to prevent them. Our approach is independent of training and supported by theory. We verify the effectiveness of our approach for various existing defenses by conducting extensive experiments on CIFAR-10, CIFAR-100, and ImageNet. Our results confirm that our proposal can indeed enhance these defenses by providing better tradeoffs between robustness and accuracy when compared to state-of-the-art approaches while being completely training-free.  ( 2 min )
    Towards Context-Aware Domain Generalization: Representing Environments with Permutation-Invariant Networks. (arXiv:2312.10107v1 [cs.LG])
    In this work, we show that information about the context of an input $X$ can improve the predictions of deep learning models when applied in new domains or production environments. We formalize the notion of context as a permutation-invariant representation of a set of data points that originate from the same environment/domain as the input itself. These representations are jointly learned with a standard supervised learning objective, providing incremental information about the unknown outcome. Furthermore, we offer a theoretical analysis of the conditions under which our approach can, in principle, yield benefits, and formulate two necessary criteria that can be easily verified in practice. Additionally, we contribute insights into the kind of distribution shifts for which our approach promises robustness. Our empirical evaluation demonstrates the effectiveness of our approach for both low-dimensional and high-dimensional data sets. Finally, we demonstrate that we can reliably detect scenarios where a model is tasked with unwarranted extrapolation in out-of-distribution (OOD) domains, identifying potential failure cases. Consequently, we showcase a method to select between the most predictive and the most robust model, circumventing the well-known trade-off between predictive performance and robustness.  ( 2 min )
    3FM: Multi-modal Meta-learning for Federated Tasks. (arXiv:2312.10179v1 [cs.LG])
    We present a novel approach in the domain of federated learning (FL), particularly focusing on addressing the challenges posed by modality heterogeneity, variability in modality availability across clients, and the prevalent issue of missing data. We introduce a meta-learning framework specifically designed for multimodal federated tasks. Our approach is motivated by the need to enable federated models to robustly adapt when exposed to new modalities, a common scenario in FL where clients often differ in the number of available modalities. The effectiveness of our proposed framework is demonstrated through extensive experimentation on an augmented MNIST dataset, enriched with audio and sign language data. We demonstrate that the proposed algorithm achieves better performance than the baseline on a subset of missing modality scenarios with careful tuning of the meta-learning rates. This is a shortened report, and our work will be extended and updated soon.  ( 2 min )
    Improving new physics searches with diffusion models for event observables and jet constituents. (arXiv:2312.10130v1 [physics.data-an])
    We introduce a new technique called Drapes to enhance the sensitivity in searches for new physics at the LHC. By training diffusion models on side-band data, we show how background templates for the signal region can be generated either directly from noise, or by partially applying the diffusion process to existing data. In the partial diffusion case, data can be drawn from side-band regions, with the inverse diffusion performed for new target conditional values, or from the signal region, preserving the distribution over the conditional property that defines the signal region. We apply this technique to the hunt for resonances using the LHCO di-jet dataset, and achieve state-of-the-art performance for background template generation using high level input features. We also show how Drapes can be applied to low level inputs with jet constituents, reducing the model dependence on the choice of input observables. Using jet constituents we can further improve sensitivity to the signal process, but observe a loss in performance where the signal significance before applying any selection is below 4$\sigma$.  ( 2 min )
    Beyond Empirical Windowing: An Attention-Based Approach for Trust Prediction in Autonomous Vehicles. (arXiv:2312.10209v1 [cs.HC])
    Humans' internal states play a key role in human-machine interaction, leading to the rise of human state estimation as a prominent field. Compared to swift state changes such as surprise and irritation, modeling gradual states like trust and satisfaction are further challenged by label sparsity: long time-series signals are usually associated with a single label, making it difficult to identify the critical span of state shifts. Windowing has been one widely-used technique to enable localized analysis of long time-series data. However, the performance of downstream models can be sensitive to the window size, and determining the optimal window size demands domain expertise and extensive search. To address this challenge, we propose a Selective Windowing Attention Network (SWAN), which employs window prompts and masked attention transformation to enable the selection of attended intervals with flexible lengths. We evaluate SWAN on the task of trust prediction on a new multimodal driving simulation dataset. Experiments show that SWAN significantly outperforms an existing empirical window selection baseline and neural network baselines including CNN-LSTM and Transformer. Furthermore, it shows robustness across a wide span of windowing ranges, compared to the traditional windowing approach.  ( 2 min )
    How Does It Function? Characterizing Long-term Trends in Production Serverless Workloads. (arXiv:2312.10127v1 [cs.PF])
    This paper releases and analyzes two new Huawei cloud serverless traces. The traces span a period of over 7 months with over 1.4 trillion function invocations combined. The first trace is derived from Huawei's internal workloads and contains detailed per-second statistics for 200 functions running across multiple Huawei cloud data centers. The second trace is a representative workload from Huawei's public FaaS platform. This trace contains per-minute arrival rates for over 5000 functions running in a single Huawei data center. We present the internals of a production FaaS platform by characterizing resource consumption, cold-start times, programming languages used, periodicity, per-second versus per-minute burstiness, correlations, and popularity. Our findings show that there is considerable diversity in how serverless functions behave: requests vary by up to 9 orders of magnitude across functions, with some functions executed over 1 billion times per day; scheduling time, execution time and cold-start distributions vary across 2 to 4 orders of magnitude and have very long tails; and function invocation counts demonstrate strong periodicity for many individual functions and on an aggregate level. Our analysis also highlights the need for further research in estimating resource reservations and time-series prediction to account for the huge diversity in how serverless functions behave. Datasets and code available at https://github.com/sir-lab/data-release  ( 3 min )
    Advancements in Content-Based Image Retrieval: A Comprehensive Survey of Relevance Feedback Techniques. (arXiv:2312.10089v1 [cs.CV])
    Content-based image retrieval (CBIR) systems have emerged as crucial tools in the field of computer vision, allowing for image search based on visual content rather than relying solely on metadata. This survey paper presents a comprehensive overview of CBIR, emphasizing its role in object detection and its potential to identify and retrieve visually similar images based on content features. Challenges faced by CBIR systems, including the semantic gap and scalability, are discussed, along with potential solutions. It elaborates on the semantic gap, which arises from the disparity between low-level features and high-level semantic concepts, and explores approaches to bridge this gap. One notable solution is the integration of relevance feedback (RF), empowering users to provide feedback on retrieved images and refine search results iteratively. The survey encompasses long-term and short-term learning approaches that leverage RF for enhanced CBIR accuracy and relevance. These methods focus on weight optimization and the utilization of active learning algorithms to select samples for training classifiers. Furthermore, the paper investigates machine learning techniques and the utilization of deep learning and convolutional neural networks to enhance CBIR performance. This survey paper plays a significant role in advancing the understanding of CBIR and RF techniques. It guides researchers and practitioners in comprehending existing methodologies, challenges, and potential solutions while fostering knowledge dissemination and identifying research gaps. By addressing future research directions, it sets the stage for advancements in CBIR that will enhance retrieval accuracy, usability, and effectiveness in various application domains.  ( 3 min )
    Enhancing Cognitive Diagnosis using Un-interacted Exercises: A Collaboration-aware Mixed Sampling Approach. (arXiv:2312.10110v1 [cs.CY])
    Cognitive diagnosis is a crucial task in computational education, aimed at evaluating students' proficiency levels across various knowledge concepts through exercises. Current models, however, primarily rely on students' answered exercises, neglecting the complex and rich information contained in un-interacted exercises. While recent research has attempted to leverage the data within un-interacted exercises linked to interacted knowledge concepts, aiming to address the long-tail issue, these studies fail to fully explore the informative, un-interacted exercises related to broader knowledge concepts. This oversight results in diminished performance when these models are applied to comprehensive datasets. In response to this gap, we present the Collaborative-aware Mixed Exercise Sampling (CMES) framework, which can effectively exploit the information present in un-interacted exercises linked to un-interacted knowledge concepts. Specifically, we introduce a novel universal sampling module where the training samples comprise not merely raw data slices, but enhanced samples generated by combining weight-enhanced attention mixture techniques. Given the necessity of real response labels in cognitive diagnosis, we also propose a ranking-based pseudo feedback module to regulate students' responses on generated exercises. The versatility of the CMES framework bolsters existing models and improves their adaptability. Finally, we demonstrate the effectiveness and interpretability of our framework through comprehensive experiments on real-world datasets.  ( 2 min )
    Early ChatGPT User Portrait through the Lens of Data. (arXiv:2312.10078v1 [cs.HC])
    Since its launch, ChatGPT has achieved remarkable success as a versatile conversational AI platform, drawing millions of users worldwide and garnering widespread recognition across academic, industrial, and general communities. This paper aims to point a portrait of early GPT users and understand how they evolved. Specific questions include their topics of interest and their potential careers; and how this changes over time. We conduct a detailed analysis of real-world ChatGPT datasets with multi-turn conversations between users and ChatGPT. Through a multi-pronged approach, we quantify conversation dynamics by examining the number of turns, then gauge sentiment to understand user sentiment variations, and finally employ Latent Dirichlet Allocation (LDA) to discern overarching topics within the conversation. By understanding shifts in user demographics and interests, we aim to shed light on the changing nature of human-AI interaction and anticipate future trends in user engagement with language models.  ( 2 min )
    No prejudice! Fair Federated Graph Neural Networks for Personalized Recommendation. (arXiv:2312.10080v1 [cs.IR])
    Ensuring fairness in Recommendation Systems (RSs) across demographic groups is critical due to the increased integration of RSs in applications such as personalized healthcare, finance, and e-commerce. Graph-based RSs play a crucial role in capturing intricate higher-order interactions among entities. However, integrating these graph models into the Federated Learning (FL) paradigm with fairness constraints poses formidable challenges as this requires access to the entire interaction graph and sensitive user information (such as gender, age, etc.) at the central server. This paper addresses the pervasive issue of inherent bias within RSs for different demographic groups without compromising the privacy of sensitive user attributes in FL environment with the graph-based model. To address the group bias, we propose F2PGNN (Fair Federated Personalized Graph Neural Network), a novel framework that leverages the power of Personalized Graph Neural Network (GNN) coupled with fairness considerations. Additionally, we use differential privacy techniques to fortify privacy protection. Experimental evaluation on three publicly available datasets showcases the efficacy of F2PGNN in mitigating group unfairness by 47% - 99% compared to the state-of-the-art while preserving privacy and maintaining the utility. The results validate the significance of our framework in achieving equitable and personalized recommendations using GNN within the FL landscape.  ( 2 min )
    Look Before You Leap: A Universal Emergent Decomposition of Retrieval Tasks in Language Models. (arXiv:2312.10091v1 [cs.IR])
    When solving challenging problems, language models (LMs) are able to identify relevant information from long and complicated contexts. To study how LMs solve retrieval tasks in diverse situations, we introduce ORION, a collection of structured retrieval tasks spanning six domains, from text understanding to coding. Each task in ORION can be represented abstractly by a request (e.g. a question) that retrieves an attribute (e.g. the character name) from a context (e.g. a story). We apply causal analysis on 18 open-source language models with sizes ranging from 125 million to 70 billion parameters. We find that LMs internally decompose retrieval tasks in a modular way: middle layers at the last token position process the request, while late layers retrieve the correct entity from the context. After causally enforcing this decomposition, models are still able to solve the original task, preserving 70% of the original correct token probability in 98 of the 106 studied model-task pairs. We connect our macroscopic decomposition with a microscopic description by performing a fine-grained case study of a question-answering task on Pythia-2.8b. Building on our high-level understanding, we demonstrate a proof of concept application for scalable internal oversight of LMs to mitigate prompt-injection while requiring human supervision on only a single input. Our solution improves accuracy drastically (from 15.5% to 97.5% on Pythia-12b). This work presents evidence of a universal emergent modular processing of tasks across varied domains and models and is a pioneering effort in applying interpretability for scalable internal oversight of LMs.  ( 3 min )
    On Robustness to Missing Video for Audiovisual Speech Recognition. (arXiv:2312.10088v1 [eess.AS])
    It has been shown that learning audiovisual features can lead to improved speech recognition performance over audio-only features, especially for noisy speech. However, in many common applications, the visual features are partially or entirely missing, e.g.~the speaker might move off screen. Multi-modal models need to be robust: missing video frames should not degrade the performance of an audiovisual model to be worse than that of a single-modality audio-only model. While there have been many attempts at building robust models, there is little consensus on how robustness should be evaluated. To address this, we introduce a framework that allows claims about robustness to be evaluated in a precise and testable way. We also conduct a systematic empirical study of the robustness of common audiovisual speech recognition architectures on a range of acoustic noise conditions and test suites. Finally, we show that an architecture-agnostic solution based on cascades can consistently achieve robustness to missing video, even in settings where existing techniques for robustness like dropout fall short.  ( 2 min )
    WordScape: a Pipeline to extract multilingual, visually rich Documents with Layout Annotations from Web Crawl Data. (arXiv:2312.10188v1 [cs.LG])
    We introduce WordScape, a novel pipeline for the creation of cross-disciplinary, multilingual corpora comprising millions of pages with annotations for document layout detection. Relating visual and textual items on document pages has gained further significance with the advent of multimodal models. Various approaches proved effective for visual question answering or layout segmentation. However, the interplay of text, tables, and visuals remains challenging for a variety of document understanding tasks. In particular, many models fail to generalize well to diverse domains and new languages due to insufficient availability of training data. WordScape addresses these limitations. Our automatic annotation pipeline parses the Open XML structure of Word documents obtained from the web, jointly providing layout-annotated document images and their textual representations. In turn, WordScape offers unique properties as it (1) leverages the ubiquity of the Word file format on the internet, (2) is readily accessible through the Common Crawl web corpus, (3) is adaptive to domain-specific documents, and (4) offers culturally and linguistically diverse document pages with natural semantic structure and high-quality text. Together with the pipeline, we will additionally release 9.5M urls to word documents which can be processed using WordScape to create a dataset of over 40M pages. Finally, we investigate the quality of text and layout annotations extracted by WordScape, assess the impact on document understanding benchmarks, and demonstrate that manual labeling costs can be substantially reduced.  ( 3 min )
    Adaptive Computation Modules: Granular Conditional Computation For Efficient Inference. (arXiv:2312.10193v1 [cs.LG])
    The computational cost of transformer models makes them inefficient in low-latency or low-power applications. While techniques such as quantization or linear attention can reduce the computational load, they may incur a reduction in accuracy. In addition, globally reducing the cost for all inputs may be sub-optimal. We observe that for each layer, the full width of the layer may be needed only for a small subset of tokens inside a batch and that the "effective" width needed to process a token can vary from layer to layer. Motivated by this observation, we introduce the Adaptive Computation Module (ACM), a generic module that dynamically adapts its computational load to match the estimated difficulty of the input on a per-token basis. An ACM consists of a sequence of learners that progressively refine the output of their preceding counterparts. An additional gating mechanism determines the optimal number of learners to execute for each token. We also describe a distillation technique to replace any pre-trained model with an "ACMized" variant. The distillation phase is designed to be highly parallelizable across layers while being simple to plug-and-play into existing networks. Our evaluation of transformer models in computer vision and speech recognition demonstrates that substituting layers with ACMs significantly reduces inference costs without degrading the downstream accuracy for a wide interval of user-defined budgets.  ( 2 min )
    Bayesian Estimate of Mean Proper Scores for Diversity-Enhanced Active Learning. (arXiv:2312.10116v1 [cs.LG])
    The effectiveness of active learning largely depends on the sampling efficiency of the acquisition function. Expected Loss Reduction (ELR) focuses on a Bayesian estimate of the reduction in classification error, and more general costs fit in the same framework. We propose Bayesian Estimate of Mean Proper Scores (BEMPS) to estimate the increase in strictly proper scores such as log probability or negative mean square error within this framework. We also prove convergence results for this general class of costs. To facilitate better experimentation with the new acquisition functions, we develop a complementary batch AL algorithm that encourages diversity in the vector of expected changes in scores for unlabeled data. To allow high-performance classifiers, we combine deep ensembles, and dynamic validation set construction on pretrained models, and further speed up the ensemble process with the idea of Monte Carlo Dropout. Extensive experiments on both texts and images show that the use of mean square error and log probability with BEMPS yields robust acquisition functions and well-calibrated classifiers, and consistently outperforms the others tested. The advantages of BEMPS over the others are further supported by a set of qualitative analyses, where we visualise their sampling behaviour using data maps and t-SNE plots.  ( 3 min )
    Bayesian Metaplasticity from Synaptic Uncertainty. (arXiv:2312.10153v1 [cs.LG])
    Catastrophic forgetting remains a challenge for neural networks, especially in lifelong learning scenarios. In this study, we introduce MEtaplasticity from Synaptic Uncertainty (MESU), inspired by metaplasticity and Bayesian inference principles. MESU harnesses synaptic uncertainty to retain information over time, with its update rule closely approximating the diagonal Newton's method for synaptic updates. Through continual learning experiments on permuted MNIST tasks, we demonstrate MESU's remarkable capability to maintain learning performance across 100 tasks without the need of explicit task boundaries.  ( 2 min )
    Data-Adaptive Dimensional Analysis for Accurate Interpolation and Extrapolation in Computer Experiments. (arXiv:2312.10100v1 [cs.LG])
    Dimensional analysis (DA) pays attention to fundamental physical dimensions such as length and mass when modelling scientific and engineering systems. It goes back at least a century to Buckingham's Pi theorem, which characterizes a scientifically meaningful model in terms of a limited number of dimensionless variables. The methodology has only been exploited relatively recently by statisticians for design and analysis of experiments, however, and computer experiments in particular. The basic idea is to build models in terms of new dimensionless quantities derived from the original input and output variables. A scientifically valid formulation has the potential for improved prediction accuracy in principle, but the implementation of DA is far from straightforward. There can be a combinatorial number of possible models satisfying the conditions of the theory. Empirical approaches for finding effective derived variables will be described, and improvements in prediction accuracy will be demonstrated. As DA's dimensionless quantities for a statistical model typically compare the original variables rather than use their absolute magnitudes, DA is less dependent on the choice of experimental ranges in the training data. Hence, we are also able to illustrate sustained accuracy gains even when extrapolating substantially outside the training data.  ( 2 min )
    Robust Estimation of Causal Heteroscedastic Noise Models. (arXiv:2312.10102v1 [stat.ML])
    Distinguishing the cause and effect from bivariate observational data is the foundational problem that finds applications in many scientific disciplines. One solution to this problem is assuming that cause and effect are generated from a structural causal model, enabling identification of the causal direction after estimating the model in each direction. The heteroscedastic noise model is a type of structural causal model where the cause can contribute to both the mean and variance of the noise. Current methods for estimating heteroscedastic noise models choose the Gaussian likelihood as the optimization objective which can be suboptimal and unstable when the data has a non-Gaussian distribution. To address this limitation, we propose a novel approach to estimating this model with Student's $t$-distribution, which is known for its robustness in accounting for sampling variability with smaller sample sizes and extreme values without significantly altering the overall distribution shape. This adaptability is beneficial for capturing the parameters of the noise distribution in heteroscedastic noise models. Our empirical evaluations demonstrate that our estimators are more robust and achieve better overall performance across synthetic and real benchmarks.  ( 2 min )
    Assessing the Usability of GutGPT: A Simulation Study of an AI Clinical Decision Support System for Gastrointestinal Bleeding Risk. (arXiv:2312.10072v1 [cs.HC])
    Applications of large language models (LLMs) like ChatGPT have potential to enhance clinical decision support through conversational interfaces. However, challenges of human-algorithmic interaction and clinician trust are poorly understood. GutGPT, a LLM for gastrointestinal (GI) bleeding risk prediction and management guidance, was deployed in clinical simulation scenarios alongside the electronic health record (EHR) with emergency medicine physicians, internal medicine physicians, and medical students to evaluate its effect on physician acceptance and trust in AI clinical decision support systems (AI-CDSS). GutGPT provides risk predictions from a validated machine learning model and evidence-based answers by querying extracted clinical guidelines. Participants were randomized to GutGPT and an interactive dashboard, or the interactive dashboard and a search engine. Surveys and educational assessments taken before and after measured technology acceptance and content mastery. Preliminary results showed mixed effects on acceptance after using GutGPT compared to the dashboard or search engine but appeared to improve content mastery based on simulation performance. Overall, this study demonstrates LLMs like GutGPT could enhance effective AI-CDSS if implemented optimally and paired with interactive interfaces.  ( 3 min )
    Pareto Envelope Augmented with Reinforcement Learning: Multi-objective reinforcement learning-based approach for Large-Scale Constrained Pressurized Water Reactor optimization. (arXiv:2312.10194v1 [cs.LG])
    A novel method, the Pareto Envelope Augmented with Reinforcement Learning (PEARL), has been developed to address the challenges posed by multi-objective problems, particularly in the field of engineering where the evaluation of candidate solutions can be time-consuming. PEARL distinguishes itself from traditional policy-based multi-objective Reinforcement Learning methods by learning a single policy, eliminating the need for multiple neural networks to independently solve simpler sub-problems. Several versions inspired from deep learning and evolutionary techniques have been crafted, catering to both unconstrained and constrained problem domains. Curriculum Learning is harnessed to effectively manage constraints in these versions. PEARL's performance is first evaluated on classical multi-objective benchmarks. Additionally, it is tested on two practical PWR core Loading Pattern optimization problems to showcase its real-world applicability. The first problem involves optimizing the Cycle length and the rod-integrated peaking factor as the primary objectives, while the second problem incorporates the mean average enrichment as an additional objective. Furthermore, PEARL addresses three types of constraints related to boron concentration, peak pin burnup, and peak pin power. The results are systematically compared against a conventional approach, the Non-dominated Sorting Genetic Algorithm. Notably, PEARL, specifically the PEARL-NdS variant, efficiently uncovers a Pareto front without necessitating additional efforts from the algorithm designer, as opposed to a single optimization with scaled objectives. It also outperforms the classical approach across multiple performance metrics, including the Hyper-volume.  ( 3 min )
    A Generic Stochastic Hybrid Car-following Model Based on Approximate Bayesian Computation. (arXiv:2312.10042v1 [cs.LG])
    Car following (CF) models are fundamental to describing traffic dynamics. However, the CF behavior of human drivers is highly stochastic and nonlinear. As a result, identifying the best CF model has been challenging and controversial despite decades of research. Introduction of automated vehicles has further complicated this matter as their CF controllers remain proprietary, though their behavior appears different than human drivers. This paper develops a stochastic learning approach to integrate multiple CF models, rather than relying on a single model. The framework is based on approximate Bayesian computation that probabilistically concatenates a pool of CF models based on their relative likelihood of describing observed behavior. The approach, while data-driven, retains physical tractability and interpretability. Evaluation results using two datasets show that the proposed approach can better reproduce vehicle trajectories for both human driven and automated vehicles than any single CF model considered.  ( 2 min )
    ESTformer: Transformer Utilizing Spatiotemporal Dependencies for EEG Super-resolution. (arXiv:2312.10052v1 [eess.SP])
    Towards practical applications of Electroencephalography (EEG) data, lightweight acquisition devices, equipped with a few electrodes, result in a predicament where analysis methods can only leverage EEG data with extremely low spatial resolution. Recent methods mainly focus on using mathematical interpolation methods and Convolutional Neural Networks for EEG super-resolution (SR), but they suffer from high computation costs, extra bias, and few insights in spatiotemporal dependency modeling. To this end, we propose the ESTformer, an EEG SR framework utilizing spatiotemporal dependencies based on the Transformer. The ESTformer applies positional encoding methods and the Multi-head Self-attention mechanism to the space and time dimensions, which can learn spatial structural information and temporal functional variation. The ESTformer, with the fixed masking strategy, adopts a mask token to up-sample the low-resolution (LR) EEG data in case of disturbance from mathematical interpolation methods. On this basis, we design various Transformer blocks to construct the Spatial Interpolation Module (SIM) and the Temporal Reconstruction Module (TRM). Finally, the ESTformer cascades the SIM and the TRM to capture and model spatiotemporal dependencies for EEG SR with fidelity. Extensive experimental results on two EEG datasets show the effectiveness of the ESTformer against previous state-of-the-art methods and verify the superiority of the SR data to the LR data in EEG-based downstream tasks of person identification and emotion recognition. The proposed ESTformer demonstrates the versatility of the Transformer for EEG SR tasks.  ( 2 min )
    Deep Metric Learning for Computer Vision: A Brief Overview. (arXiv:2312.10046v1 [cs.CV])
    Objective functions that optimize deep neural networks play a vital role in creating an enhanced feature representation of the input data. Although cross-entropy-based loss formulations have been extensively used in a variety of supervised deep-learning applications, these methods tend to be less adequate when there is large intra-class variance and low inter-class variance in input data distribution. Deep Metric Learning seeks to develop methods that aim to measure the similarity between data samples by learning a representation function that maps these data samples into a representative embedding space. It leverages carefully designed sampling strategies and loss functions that aid in optimizing the generation of a discriminative embedding space even for distributions having low inter-class and high intra-class variances. In this chapter, we will provide an overview of recent progress in this area and discuss state-of-the-art Deep Metric Learning approaches.  ( 2 min )
    Estimation of Physical Parameters of Waveforms With Neural Networks. (arXiv:2312.10068v1 [eess.SP])
    Light Detection and Ranging (LiDAR) are fast emerging sensors in the field of Earth Observation. It is a remote sensing technology that utilizes laser beams to measure distances and create detailed three-dimensional representations of objects and environments. The potential of Full Waveform LiDAR is much greater than just height estimation and 3D reconstruction only. Overall shape of signal provides important information about properties of water body. However, the shape of FWL is unexplored as most LiDAR software work on point cloud by utilizing the maximum value within the waveform. Existing techniques in the field of LiDAR data analysis include depth estimation through inverse modeling and regression of logarithmic intensity and depth for approximating the attenuation coefficient. However, these methods suffer from limitations in accuracy. Depth estimation through inverse modeling provides only approximate values and does not account for variations in surface properties, while the regression approach for the attenuation coefficient is only able to generalize a value through several data points which lacks precision and may lead to significant errors in estimation. Additionally, there is currently no established modeling method available for predicting bottom reflectance. This research proposed a novel solution based on neural networks for parameter estimation in LIDAR data analysis. By leveraging the power of neural networks, the proposed solution successfully learned the inversion model, was able to do prediction of parameters such as depth, attenuation coefficient, and bottom reflectance. Performance of model was validated by testing it on real LiDAR data. In future, more data availability would enable more accuracy and reliability of such models.  ( 3 min )
    Interpretable Knowledge Tracing via Response Influence-based Counterfactual Reasoning. (arXiv:2312.10045v1 [cs.CY])
    Knowledge tracing (KT) plays a crucial role in computer-aided education and intelligent tutoring systems, aiming to assess students' knowledge proficiency by predicting their future performance on new questions based on their past response records. While existing deep learning knowledge tracing (DLKT) methods have significantly improved prediction accuracy and achieved state-of-the-art results, they often suffer from a lack of interpretability. To address this limitation, current approaches have explored incorporating psychological influences to achieve more explainable predictions, but they tend to overlook the potential influences of historical responses. In fact, understanding how models make predictions based on response influences can enhance the transparency and trustworthiness of the knowledge tracing process, presenting an opportunity for a new paradigm of interpretable KT. However, measuring unobservable response influences is challenging. In this paper, we resort to counterfactual reasoning that intervenes in each response to answer \textit{what if a student had answered a question incorrectly that he/she actually answered correctly, and vice versa}. Based on this, we propose RCKT, a novel response influence-based counterfactual knowledge tracing framework. RCKT generates response influences by comparing prediction outcomes from factual sequences and constructed counterfactual sequences after interventions. Additionally, we introduce maximization and inference techniques to leverage accumulated influences from different past responses, further improving the model's performance and credibility. Extensive experimental results demonstrate that our RCKT method outperforms state-of-the-art knowledge tracing methods on four datasets against six baselines, and provides credible interpretations of response influences.  ( 3 min )
    Robust Errant Beam Prognostics with Conditional Modeling for Particle Accelerators. (arXiv:2312.10040v1 [physics.acc-ph])
    Particle accelerators are complex and comprise thousands of components, with many pieces of equipment running at their peak power. Consequently, particle accelerators can fault and abort operations for numerous reasons. These faults impact the availability of particle accelerators during scheduled run-time and hamper the efficiency and the overall science output. To avoid these faults, we apply anomaly detection techniques to predict any unusual behavior and perform preemptive actions to improve the total availability of particle accelerators. Semi-supervised Machine Learning (ML) based anomaly detection approaches such as autoencoders and variational autoencoders are often used for such tasks. However, supervised ML techniques such as Siamese Neural Network (SNN) models can outperform unsupervised or semi-supervised approaches for anomaly detection by leveraging the label information. One of the challenges specific to anomaly detection for particle accelerators is the data's variability due to system configuration changes. To address this challenge, we employ Conditional Siamese Neural Network (CSNN) models and Conditional Variational Auto Encoder (CVAE) models to predict errant beam pulses at the Spallation Neutron Source (SNS) under different system configuration conditions and compare their performance. We demonstrate that CSNN outperforms CVAE in our application.  ( 2 min )
    ProtoEEGNet: An Interpretable Approach for Detecting Interictal Epileptiform Discharges. (arXiv:2312.10056v1 [eess.SP])
    In electroencephalogram (EEG) recordings, the presence of interictal epileptiform discharges (IEDs) serves as a critical biomarker for seizures or seizure-like events.Detecting IEDs can be difficult; even highly trained experts disagree on the same sample. As a result, specialists have turned to machine-learning models for assistance. However, many existing models are black boxes and do not provide any human-interpretable reasoning for their decisions. In high-stakes medical applications, it is critical to have interpretable models so that experts can validate the reasoning of the model before making important diagnoses. We introduce ProtoEEGNet, a model that achieves state-of-the-art accuracy for IED detection while additionally providing an interpretable justification for its classifications. Specifically, it can reason that one EEG looks similar to another ''prototypical'' EEG that is known to contain an IED. ProtoEEGNet can therefore help medical professionals effectively detect IEDs while maintaining a transparent decision-making process.  ( 2 min )
    Understanding Representations Pretrained with Auxiliary Losses for Embodied Agent Planning. (arXiv:2312.10069v1 [cs.RO])
    Pretrained representations from large-scale vision models have boosted the performance of downstream embodied policy learning. We look to understand whether additional self-supervised pretraining on exploration trajectories can build on these general-purpose visual representations to better support embodied planning in realistic environments. We evaluated four common auxiliary losses in embodied AI, two hindsight-based losses, and a standard imitation learning loss, by pretraining the agent's visual compression module and state belief representations with each objective and using CLIP as a representative visual backbone. The learned representations are then frozen for downstream multi-step evaluation on two goal-directed tasks. Surprisingly, we find that imitation learning on these exploration trajectories out-performs all other auxiliary losses even despite the exploration trajectories being dissimilar from the downstream tasks. This suggests that imitation of exploration may be ''all you need'' for building powerful planning representations. Additionally, we find that popular auxiliary losses can benefit from simple modifications to improve their support for downstream planning ability.  ( 2 min )
  • Open

    Continuous-Time Functional Diffusion Processes. (arXiv:2303.00800v3 [cs.LG] UPDATED)
    We introduce Functional Diffusion Processes (FDPs), which generalize score-based diffusion models to infinite-dimensional function spaces. FDPs require a new mathematical framework to describe the forward and backward dynamics, and several extensions to derive practical training objectives. These include infinite-dimensional versions of Girsanov theorem, in order to be able to compute an ELBO, and of the sampling theorem, in order to guarantee that functional evaluations in a countable set of points are equivalent to infinite-dimensional functions. We use FDPs to build a new breed of generative models in function spaces, which do not require specialized network architectures, and that can work with any kind of continuous data. Our results on real data show that FDPs achieve high-quality image generation, using a simple MLP architecture with orders of magnitude fewer parameters than existing diffusion models.  ( 2 min )
    Variational Inference on the Final-Layer Output of Neural Networks. (arXiv:2302.02420v4 [cs.LG] UPDATED)
    Traditional neural networks are simple to train but they typically produce overconfident predictions. In contrast, Bayesian neural networks provide good uncertainty quantification but optimizing them is time consuming due to the large parameter space. This paper proposes to combine the advantages of both approaches by performing Variational Inference in the Final layer Output space (VIFO), because the output space is much smaller than the parameter space. We use neural networks to learn the mean and the variance of the probabilistic output. Like standard, non-Beyesian models, VIFO enjoys simple training and one can use Rademacher complexity to provide risk bounds for the model. On the other hand, using the Bayesian formulation we incorporate collapsed variational inference with VIFO which significantly improves the performance in practice. Experiments show that VIFO and ensembles of VIFO provide a good tradeoff in terms of run time and uncertainty quantification, especially for out of distribution data.  ( 2 min )
    Extrapolated cross-validation for randomized ensembles. (arXiv:2302.13511v3 [stat.ME] UPDATED)
    Ensemble methods such as bagging and random forests are ubiquitous in various fields, from finance to genomics. Despite their prevalence, the question of the efficient tuning of ensemble parameters has received relatively little attention. This paper introduces a cross-validation method, ECV (Extrapolated Cross-Validation), for tuning the ensemble and subsample sizes in randomized ensembles. Our method builds on two primary ingredients: initial estimators for small ensemble sizes using out-of-bag errors and a novel risk extrapolation technique that leverages the structure of prediction risk decomposition. By establishing uniform consistency of our risk extrapolation technique over ensemble and subsample sizes, we show that ECV yields $\delta$-optimal (with respect to the oracle-tuned risk) ensembles for squared prediction risk. Our theory accommodates general ensemble predictors, only requires mild moment assumptions, and allows for high-dimensional regimes where the feature dimension grows with the sample size. As a practical case study, we employ ECV to predict surface protein abundances from gene expressions in single-cell multiomics using random forests. In comparison to sample-split cross-validation and $K$-fold cross-validation, ECV achieves higher accuracy avoiding sample splitting. At the same time, its computational cost is considerably lower owing to the use of the risk extrapolation technique. Additional numerical results validate the finite-sample accuracy of ECV for several common ensemble predictors under a computational constraint on the maximum ensemble size.  ( 3 min )
    Simplifying Momentum-based Positive-definite Submanifold Optimization with Applications to Deep Learning. (arXiv:2302.09738v9 [stat.ML] UPDATED)
    Riemannian submanifold optimization with momentum is computationally challenging because, to ensure that the iterates remain on the submanifold, we often need to solve difficult differential equations. Here, we simplify such difficulties for a class of sparse or structured symmetric positive-definite matrices with the affine-invariant metric. We do so by proposing a generalized version of the Riemannian normal coordinates that dynamically orthonormalizes the metric and locally converts the problem into an unconstrained problem in the Euclidean space. We use our approach to simplify existing approaches for structured covariances and develop matrix-inverse-free $2^\text{nd}$-order optimizers for deep learning with low precision by using only matrix multiplications. Code: https://github.com/yorkerlin/StructuredNGD-DL  ( 2 min )
    Optimality of Message-Passing Architectures for Sparse Graphs. (arXiv:2305.10391v2 [cs.LG] UPDATED)
    We study the node classification problem on feature-decorated graphs in the sparse setting, i.e., when the expected degree of a node is $O(1)$ in the number of nodes, in the fixed-dimensional asymptotic regime, i.e., the dimension of the feature data is fixed while the number of nodes is large. Such graphs are typically known to be locally tree-like. We introduce a notion of Bayes optimality for node classification tasks, called asymptotic local Bayes optimality, and compute the optimal classifier according to this criterion for a fairly general statistical data model with arbitrary distributions of the node features and edge connectivity. The optimal classifier is implementable using a message-passing graph neural network architecture. We then compute the generalization error of this classifier and compare its performance against existing learning methods theoretically on a well-studied statistical model with naturally identifiable signal-to-noise ratios (SNRs) in the data. We find that the optimal message-passing architecture interpolates between a standard MLP in the regime of low graph signal and a typical convolution in the regime of high graph signal. Furthermore, we prove a corresponding non-asymptotic result.  ( 2 min )
    GLOBE-CE: A Translation-Based Approach for Global Counterfactual Explanations. (arXiv:2305.17021v2 [cs.LG] UPDATED)
    Counterfactual explanations have been widely studied in explainability, with a range of application dependent methods prominent in fairness, recourse and model understanding. The major shortcoming associated with these methods, however, is their inability to provide explanations beyond the local or instance-level. While many works touch upon the notion of a global explanation, typically suggesting to aggregate masses of local explanations in the hope of ascertaining global properties, few provide frameworks that are both reliable and computationally tractable. Meanwhile, practitioners are requesting more efficient and interactive explainability tools. We take this opportunity to propose Global & Efficient Counterfactual Explanations (GLOBE-CE), a flexible framework that tackles the reliability and scalability issues associated with current state-of-the-art, particularly on higher dimensional datasets and in the presence of continuous features. Furthermore, we provide a unique mathematical analysis of categorical feature translations, utilising it in our method. Experimental evaluation with publicly available datasets and user studies demonstrate that GLOBE-CE performs significantly better than the current state-of-the-art across multiple metrics (e.g., speed, reliability).  ( 2 min )
    On the Expected Size of Conformal Prediction Sets. (arXiv:2306.07254v2 [stat.ML] UPDATED)
    While conformal predictors reap the benefits of rigorous statistical guarantees on their error frequency, the size of their corresponding prediction sets is critical to their practical utility. Unfortunately, there is currently a lack of finite-sample analysis and guarantees for their prediction set sizes. To address this shortfall, we theoretically quantify the expected size of the prediction sets under the split conformal prediction framework. As this precise formulation cannot usually be calculated directly, we further derive point estimates and high-probability interval bounds that can be empirically computed, providing a practical method for characterizing the expected set size. We corroborate the efficacy of our results with experiments on real-world datasets for both regression and classification problems.  ( 2 min )
    Using Property Elicitation to Understand the Impacts of Fairness Regularizers. (arXiv:2309.11343v2 [cs.LG] UPDATED)
    Predictive algorithms are often trained by optimizing some loss function, to which regularization functions are added to impose a penalty for violating constraints. As expected, the addition of such regularization functions can change the minimizer of the objective. It is not well-understood which regularizers change the minimizer of the loss, and, when the minimizer does change, how it changes. We use property elicitation to take first steps towards understanding the joint relationship between the loss and regularization functions and the optimal decision for a given problem instance. In particular, we give a necessary and sufficient condition on loss and regularizer pairs for when a property changes with the addition of the regularizer, and examine some regularizers satisfying this condition standard in the fair machine learning literature. We empirically demonstrate how algorithmic decision-making changes as a function of both data distribution changes and hardness of the constraints.  ( 2 min )
    Not All Neuro-Symbolic Concepts Are Created Equal: Analysis and Mitigation of Reasoning Shortcuts. (arXiv:2305.19951v2 [cs.LG] UPDATED)
    Neuro-Symbolic (NeSy) predictive models hold the promise of improved compliance with given constraints, systematic generalization, and interpretability, as they allow to infer labels that are consistent with some prior knowledge by reasoning over high-level concepts extracted from sub-symbolic inputs. It was recently shown that NeSy predictors are affected by reasoning shortcuts: they can attain high accuracy but by leveraging concepts with unintended semantics, thus coming short of their promised advantages. Yet, a systematic characterization of reasoning shortcuts and of potential mitigation strategies is missing. This work fills this gap by characterizing them as unintended optima of the learning objective and identifying four key conditions behind their occurrence. Based on this, we derive several natural mitigation strategies, and analyze their efficacy both theoretically and empirically. Our analysis shows reasoning shortcuts are difficult to deal with, casting doubts on the trustworthiness and interpretability of existing NeSy solutions.  ( 2 min )
    How Two-Layer Neural Networks Learn, One (Giant) Step at a Time. (arXiv:2305.18270v3 [stat.ML] UPDATED)
    We investigate theoretically how the features of a two-layer neural network adapt to the structure of the target function through a few large batch gradient descent steps, leading to improvement in the approximation capacity with respect to the initialization. We compare the influence of batch size and that of multiple (but finitely many) steps. For a single gradient step, a batch of size $n = \mathcal{O}(d)$ is both necessary and sufficient to align with the target function, although only a single direction can be learned. In contrast, $n = \mathcal{O}(d^2)$ is essential for neurons to specialize to multiple relevant directions of the target with a single gradient step. Even in this case, we show there might exist ``hard'' directions requiring $n = \mathcal{O}(d^\ell)$ samples to be learned, where $\ell$ is known as the leap index of the target. The picture drastically improves over multiple gradient steps: we show that a batch-size of $n = \mathcal{O}(d)$ is indeed enough to learn multiple target directions satisfying a staircase property, where more and more directions can be learned over time. Finally, we discuss how these directions allows to drastically improve the approximation capacity and generalization error over the initialization, illustrating a separation of scale between the random features/lazy regime, and the feature learning regime. Our technical analysis leverages a combination of techniques related to concentration, projection-based conditioning, and Gaussian equivalence which we believe are of independent interest. By pinning down the conditions necessary for specialization and learning, our results highlight the interaction between batch size and number of iterations, and lead to a hierarchical depiction where learning performance exhibits a stairway to accuracy over time and batch size, shedding new light on how neural networks adapt to features of the data.  ( 3 min )
    ZeroSCROLLS: A Zero-Shot Benchmark for Long Text Understanding. (arXiv:2305.14196v3 [cs.CL] UPDATED)
    We introduce ZeroSCROLLS, a zero-shot benchmark for natural language understanding over long texts, which contains only test and small validation sets, without training data. We adapt six tasks from the SCROLLS benchmark, and add four new datasets, including two novel information fusing tasks, such as aggregating the percentage of positive reviews. Using ZeroSCROLLS, we conduct a comprehensive evaluation of both open-source and closed large language models, finding that Claude outperforms ChatGPT, and that GPT-4 achieves the highest average score. However, there is still room for improvement on multiple open challenges in ZeroSCROLLS, such as aggregation tasks, where models struggle to pass the naive baseline. As the state of the art is a moving target, we invite researchers to evaluate their ideas on the live ZeroSCROLLS leaderboard.  ( 2 min )
    On the connections between optimization algorithms, Lyapunov functions, and differential equations: theory and insights. (arXiv:2305.08658v2 [math.OC] UPDATED)
    We revisit the general framework introduced by Fazylab et al. (SIAM J. Optim. 28, 2018) to construct Lyapunov functions for optimization algorithms in discrete and continuous time. For smooth, strongly convex objective functions, we relax the requirements necessary for such a construction. As a result we are able to prove for Polyak's ordinary differential equations and for a two-parameter family of Nesterov algorithms rates of convergence that improve on those available in the literature. We analyse the interpretation of Nesterov algorithms as discretizations of the Polyak equation. We show that the algorithms are instances of Additive Runge-Kutta integrators and discuss the reasons why most discretizations of the differential equation do not result in optimization algorithms with acceleration. We also introduce a modification of Polyak's equation and study its convergence properties. Finally we extend the general framework to the stochastic scenario and consider an application to random algorithms with acceleration for overparameterized models; again we are able to prove convergence rates that improve on those in the literature.  ( 2 min )
    Geometric structure of Deep Learning networks and construction of global ${\mathcal L}^2$ minimizers. (arXiv:2309.10639v3 [cs.LG] UPDATED)
    In this paper, we provide a geometric interpretation of the structure of Deep Learning (DL) networks, characterized by $L$ hidden layers, a ReLU ramp activation function, an $\mathcal{L}^2$ Schatten class (or Hilbert-Schmidt) cost function, and input and output spaces $\mathbb{R}^Q$ with equal dimension $Q\geq1$. The hidden layers are also defined on $\mathbb{R}^{Q}$; the training input size $N$ can be arbitrarily large - thus, we are considering the underparametrized regime. We apply our recent results on shallow neural networks to construct an explicit family of minimizers for the global minimum of the cost function in the case $L\geq Q$, which we show to be degenerate. In the context presented here, the hidden layers of the DL network "curate" the training inputs by recursive application of a truncation map that minimizes the noise to signal ratio of the training inputs. Moreover, we determine a set of $2^Q-1$ distinct degenerate local minima of the cost function. Our constructions make no use of gradient descent algorithms at all.  ( 3 min )
    Learning Linear Causal Representations from Interventions under General Nonlinear Mixing. (arXiv:2306.02235v2 [cs.LG] UPDATED)
    We study the problem of learning causal representations from unknown, latent interventions in a general setting, where the latent distribution is Gaussian but the mixing function is completely general. We prove strong identifiability results given unknown single-node interventions, i.e., without having access to the intervention targets. This generalizes prior works which have focused on weaker classes, such as linear maps or paired counterfactual data. This is also the first instance of causal identifiability from non-paired interventions for deep neural network embeddings. Our proof relies on carefully uncovering the high-dimensional geometric structure present in the data distribution after a non-linear density transformation, which we capture by analyzing quadratic forms of precision matrices of the latent distributions. Finally, we propose a contrastive algorithm to identify the latent variables in practice and evaluate its performance on various tasks.  ( 2 min )
    Semiparametric Regression for Spatial Data via Deep Learning. (arXiv:2301.03747v2 [stat.ML] UPDATED)
    In this work, we propose a deep learning-based method to perform semiparametric regression analysis for spatially dependent data. To be specific, we use a sparsely connected deep neural network with rectified linear unit (ReLU) activation function to estimate the unknown regression function that describes the relationship between response and covariates in the presence of spatial dependence. Under some mild conditions, the estimator is proven to be consistent, and the rate of convergence is determined by three factors: (1) the architecture of neural network class, (2) the smoothness and (intrinsic) dimension of true mean function, and (3) the magnitude of spatial dependence. Our method can handle well large data set owing to the stochastic gradient descent optimization algorithm. Simulation studies on synthetic data are conducted to assess the finite sample performance, the results of which indicate that the proposed method is capable of picking up the intricate relationship between response and covariates. Finally, a real data analysis is provided to demonstrate the validity and effectiveness of the proposed method.  ( 2 min )
    Proximal Mean Field Learning in Shallow Neural Networks. (arXiv:2210.13879v3 [cs.LG] UPDATED)
    We propose a custom learning algorithm for shallow over-parameterized neural networks, i.e., networks with single hidden layer having infinite width. The infinite width of the hidden layer serves as an abstraction for the over-parameterization. Building on the recent mean field interpretations of learning dynamics in shallow neural networks, we realize mean field learning as a computational algorithm, rather than as an analytical tool. Specifically, we design a Sinkhorn regularized proximal algorithm to approximate the distributional flow for the learning dynamics over weighted point clouds. In this setting, a contractive fixed point recursion computes the time-varying weights, numerically realizing the interacting Wasserstein gradient flow of the parameter distribution supported over the neuronal ensemble. An appealing aspect of the proposed algorithm is that the measure-valued recursions allow meshless computation. We demonstrate the proposed computational framework of interacting weighted particle evolution on binary and multi-class classification. Our algorithm performs gradient descent of the free energy associated with the risk functional.  ( 2 min )
    Commutativity and Disentanglement from the Manifold Perspective. (arXiv:2210.07857v4 [stat.ML] UPDATED)
    In this paper, we interpret disentanglement as the discovery of local charts of the data manifold and trace how this definition naturally leads to an equivalent condition for disentanglement: commutativity between factors of variation. We study the impact of this manifold framework to two classes of problems: learning matrix exponential operators and compressing data-generating models. In each problem, the manifold perspective yields interesting results about the feasibility and fruitful approaches their solutions. We also link our manifold framework to two other common disentanglement paradigms: group theoretic and probabilistic approaches to disentanglement. In each case, we show how these frameworks can be merged with our manifold perspective. Importantly, we recover commutativity as a central property in both alternative frameworks, further highlighting its importance in disentanglement.  ( 2 min )
    Simple Binary Hypothesis Testing under Local Differential Privacy and Communication Constraints. (arXiv:2301.03566v2 [math.ST] UPDATED)
    We study simple binary hypothesis testing under both local differential privacy (LDP) and communication constraints. We qualify our results as either minimax optimal or instance optimal: the former hold for the set of distribution pairs with prescribed Hellinger divergence and total variation distance, whereas the latter hold for specific distribution pairs. For the sample complexity of simple hypothesis testing under pure LDP constraints, we establish instance-optimal bounds for distributions with binary support; minimax-optimal bounds for general distributions; and (approximately) instance-optimal, computationally efficient algorithms for general distributions. When both privacy and communication constraints are present, we develop instance-optimal, computationally efficient algorithms that achieve the minimum possible sample complexity (up to universal constants). Our results on instance-optimal algorithms hinge on identifying the extreme points of the joint range set $\mathcal A$ of two distributions $p$ and $q$, defined as $\mathcal A := \{(\mathbf T p, \mathbf T q) | \mathbf T \in \mathcal C\}$, where $\mathcal C$ is the set of channels characterizing the constraints.  ( 2 min )
    Data Banzhaf: A Robust Data Valuation Framework for Machine Learning. (arXiv:2205.15466v7 [cs.LG] UPDATED)
    Data valuation has wide use cases in machine learning, including improving data quality and creating economic incentives for data sharing. This paper studies the robustness of data valuation to noisy model performance scores. Particularly, we find that the inherent randomness of the widely used stochastic gradient descent can cause existing data value notions (e.g., the Shapley value and the Leave-one-out error) to produce inconsistent data value rankings across different runs. To address this challenge, we introduce the concept of safety margin, which measures the robustness of a data value notion. We show that the Banzhaf value, a famous value notion that originated from cooperative game theory literature, achieves the largest safety margin among all semivalues (a class of value notions that satisfy crucial properties entailed by ML applications and include the famous Shapley value and Leave-one-out error). We propose an algorithm to efficiently estimate the Banzhaf value based on the Maximum Sample Reuse (MSR) principle. Our evaluation demonstrates that the Banzhaf value outperforms the existing semivalue-based data value notions on several ML tasks such as learning with weighted samples and noisy label detection. Overall, our study suggests that when the underlying ML algorithm is stochastic, the Banzhaf value is a promising alternative to the other semivalue-based data value schemes given its computational advantage and ability to robustly differentiate data quality.  ( 3 min )
    Using Model-Based Trees with Boosting to Fit Low-Order Functional ANOVA Models. (arXiv:2207.06950v5 [stat.ML] UPDATED)
    Low-order functional ANOVA (fANOVA) models have been rediscovered in the machine learning (ML) community under the guise of inherently interpretable machine learning. Explainable Boosting Machines or EBM (Lou et al. 2013) and GAMI-Net (Yang et al. 2021) are two recently proposed ML algorithms for fitting functional main effects and second-order interactions. We propose a new algorithm, called GAMI-Tree, that is similar to EBM, but has a number of features that lead to better performance. It uses model-based trees as base learners and incorporates a new interaction filtering method that is better at capturing the underlying interactions. In addition, our iterative training method converges to a model with better predictive performance, and the embedded purification ensures that interactions are hierarchically orthogonal to main effects. The algorithm does not need extensive tuning, and our implementation is fast and efficient. We use simulated and real datasets to compare the performance and interpretability of GAMI-Tree with EBM and GAMI-Net.  ( 2 min )
    Dirichlet-based Uncertainty Quantification for Personalized Federated Learning with Improved Posterior Networks. (arXiv:2312.11230v1 [stat.ML])
    In modern federated learning, one of the main challenges is to account for inherent heterogeneity and the diverse nature of data distributions for different clients. This problem is often addressed by introducing personalization of the models towards the data distribution of the particular client. However, a personalized model might be unreliable when applied to the data that is not typical for this client. Eventually, it may perform worse for these data than the non-personalized global model trained in a federated way on the data from all the clients. This paper presents a new approach to federated learning that allows selecting a model from global and personalized ones that would perform better for a particular input point. It is achieved through a careful modeling of predictive uncertainties that helps to detect local and global in- and out-of-distribution data and use this information to select the model that is confident in a prediction. The comprehensive experimental evaluation on the popular real-world image datasets shows the superior performance of the model in the presence of out-of-distribution data while performing on par with state-of-the-art personalized federated learning algorithms in the standard scenarios.  ( 2 min )
    Detection of Model-based Planted Pseudo-cliques in Random Dot Product Graphs by the Adjacency Spectral Embedding and the Graph Encoder Embedding. (arXiv:2312.11054v1 [stat.ME])
    In this paper, we explore the capability of both the Adjacency Spectral Embedding (ASE) and the Graph Encoder Embedding (GEE) for capturing an embedded pseudo-clique structure in the random dot product graph setting. In both theory and experiments, we demonstrate that this pairing of model and methods can yield worse results than the best existing spectral clique detection methods, demonstrating at once the methods' potential inability to capture even modestly sized pseudo-cliques and the methods' robustness to the model contamination giving rise to the pseudo-clique structure. To further enrich our analysis, we also consider the Variational Graph Auto-Encoder (VGAE) model in our simulation and real data experiments.  ( 2 min )
    Targeted Machine Learning for Average Causal Effect Estimation Using the Front-Door Functional. (arXiv:2312.10234v1 [stat.ME])
    Evaluating the average causal effect (ACE) of a treatment on an outcome often involves overcoming the challenges posed by confounding factors in observational studies. A traditional approach uses the back-door criterion, seeking adjustment sets to block confounding paths between treatment and outcome. However, this method struggles with unmeasured confounders. As an alternative, the front-door criterion offers a solution, even in the presence of unmeasured confounders between treatment and outcome. This method relies on identifying mediators that are not directly affected by these confounders and that completely mediate the treatment's effect. Here, we introduce novel estimation strategies for the front-door criterion based on the targeted minimum loss-based estimation theory. Our estimators work across diverse scenarios, handling binary, continuous, and multivariate mediators. They leverage data-adaptive machine learning algorithms, minimizing assumptions and ensuring key statistical properties like asymptotic linearity, double-robustness, efficiency, and valid estimates within the target parameter space. We establish conditions under which the nuisance functional estimations ensure the root n-consistency of ACE estimators. Our numerical experiments show the favorable finite sample performance of the proposed estimators. We demonstrate the applicability of these estimators to analyze the effect of early stage academic performance on future yearly income using data from the Finnish Social Science Data Archive.  ( 2 min )
    Policy Learning with Competing Agents. (arXiv:2204.01884v3 [stat.ML] UPDATED)
    Decision makers often aim to learn a treatment assignment policy under a capacity constraint on the number of agents that they can treat. When agents can respond strategically to such policies, competition arises, complicating estimation of the optimal policy. In this paper, we study capacity-constrained treatment assignment in the presence of such interference. We consider a dynamic model where the decision maker allocates treatments at each time step and heterogeneous agents myopically best respond to the previous treatment assignment policy. When the number of agents is large but finite, we show that the threshold for receiving treatment under a given policy converges to the policy's mean-field equilibrium threshold. Based on this result, we develop a consistent estimator for the policy gradient. In simulations and a semi-synthetic experiment with data from the National Education Longitudinal Study of 1988, we demonstrate that this estimator can be used for learning capacity-constrained policies in the presence of strategic behavior.  ( 2 min )
    Hypothesis Testing for Class-Conditional Noise Using Local Maximum Likelihood. (arXiv:2312.10238v1 [cs.LG])
    In supervised learning, automatically assessing the quality of the labels before any learning takes place remains an open research question. In certain particular cases, hypothesis testing procedures have been proposed to assess whether a given instance-label dataset is contaminated with class-conditional label noise, as opposed to uniform label noise. The existing theory builds on the asymptotic properties of the Maximum Likelihood Estimate for parametric logistic regression. However, the parametric assumptions on top of which these approaches are constructed are often too strong and unrealistic in practice. To alleviate this problem, in this paper we propose an alternative path by showing how similar procedures can be followed when the underlying model is a product of Local Maximum Likelihood Estimation that leads to more flexible nonparametric logistic regression models, which in turn are less susceptible to model misspecification. This different view allows for wider applicability of the tests by offering users access to a richer model class. Similarly to existing works, we assume we have access to anchor points which are provided by the users. We introduce the necessary ingredients for the adaptation of the hypothesis tests to the case of nonparametric logistic regression and empirically compare against the parametric approach presenting both synthetic and real-world case studies and discussing the advantages and limitations of the proposed approach.  ( 2 min )
    Convergence and complexity of block majorization-minimization for constrained block-Riemannian optimization. (arXiv:2312.10330v1 [math.OC])
    Block majorization-minimization (BMM) is a simple iterative algorithm for nonconvex optimization that sequentially minimizes a majorizing surrogate of the objective function in each block coordinate while the other block coordinates are held fixed. We consider a family of BMM algorithms for minimizing smooth nonconvex objectives, where each parameter block is constrained within a subset of a Riemannian manifold. We establish that this algorithm converges asymptotically to the set of stationary points, and attains an $\epsilon$-stationary point within $\widetilde{O}(\epsilon^{-2})$ iterations. In particular, the assumptions for our complexity results are completely Euclidean when the underlying manifold is a product of Euclidean or Stiefel manifolds, although our analysis makes explicit use of the Riemannian geometry. Our general analysis applies to a wide range of algorithms with Riemannian constraints: Riemannian MM, block projected gradient descent, optimistic likelihood estimation, geodesically constrained subspace tracking, robust PCA, and Riemannian CP-dictionary-learning. We experimentally validate that our algorithm converges faster than standard Euclidean algorithms applied to the Riemannian setting.  ( 2 min )
    One step closer to unbiased aleatoric uncertainty estimation. (arXiv:2312.10469v1 [cs.LG])
    Neural networks are powerful tools in various applications, and quantifying their uncertainty is crucial for reliable decision-making. In the deep learning field, the uncertainties are usually categorized into aleatoric (data) and epistemic (model) uncertainty. In this paper, we point out that the existing popular variance attenuation method highly overestimates aleatoric uncertainty. To address this issue, we propose a new estimation method by actively de-noising the observed data \footnote{Source code available at \url{https://github.com/wz16/DVA}.}. By conducting a broad range of experiments, we demonstrate that our proposed approach provides a much closer approximation to the actual data uncertainty than the standard method.  ( 2 min )
    Cardiac and extracardiac discharge diagnosis prediction from emergency department ECGs using deep learning. (arXiv:2312.11050v1 [eess.SP])
    Current deep learning algorithms designed for automatic ECG analysis have exhibited notable accuracy. However, akin to traditional electrocardiography, they tend to be narrowly focused and typically address a singular diagnostic condition. In this study, we specifically demonstrate the capability of a single model to predict a diverse range of both cardiac and non-cardiac discharge diagnoses based on a sole ECG collected in the emergency department. Among the 1,076 hierarchically structured ICD codes considered, our model achieves an AUROC exceeding 0.8 in 439 of them. This underscores the models proficiency in handling a wide array of diagnostic scenarios. We emphasize the potential of utilizing this model as a screening tool, potentially integrated into a holistic clinical decision support system for efficiently triaging patients in the emergency department. This research underscores the remarkable capabilities of comprehensive ECG analysis algorithms and the extensive range of possibilities facilitated by the open MIMIC-IV-ECG dataset. Finally, our data may play a pivotal role in revolutionizing the way ECG analysis is performed, marking a significant advancement in the field.  ( 2 min )
    Deep Feature Screening: Feature Selection for Ultra High-Dimensional Data via Deep Neural Networks. (arXiv:2204.01682v3 [stat.ML] UPDATED)
    The applications of traditional statistical feature selection methods to high-dimension, low sample-size data often struggle and encounter challenging problems, such as overfitting, curse of dimensionality, computational infeasibility, and strong model assumption. In this paper, we propose a novel two-step nonparametric approach called Deep Feature Screening (DeepFS) that can overcome these problems and identify significant features with high precision for ultra high-dimensional, low-sample-size data. This approach first extracts a low-dimensional representation of input data and then applies feature screening based on multivariate rank distance correlation recently developed by Deb and Sen (2021). This approach combines the strengths of both deep neural networks and feature screening, and thereby has the following appealing features in addition to its ability of handling ultra high-dimensional data with small number of samples: (1) it is model free and distribution free; (2) it can be used for both supervised and unsupervised feature selection; and (3) it is capable of recovering the original input data. The superiority of DeepFS is demonstrated via extensive simulation studies and real data analyses.  ( 2 min )
    Random Models for Fuzzy Clustering Similarity Measures. (arXiv:2312.10270v1 [stat.ML])
    The Adjusted Rand Index (ARI) is a widely used method for comparing hard clusterings, but requires a choice of random model that is often left implicit. Several recent works have extended the Rand Index to fuzzy clusterings, but the assumptions of the most common random model is difficult to justify in fuzzy settings. We propose a single framework for computing the ARI with three random models that are intuitive and explainable for both hard and fuzzy clusterings, along with the benefit of lower computational complexity. The theory and assumptions of the proposed models are contrasted with the existing permutation model. Computations on synthetic and benchmark data show that each model has distinct behaviour, meaning that accurate model selection is important for the reliability of results.  ( 2 min )
    Random Forest Variable Importance-based Selection Algorithm in Class Imbalance Problem. (arXiv:2312.10573v1 [stat.ML])
    Random Forest is a machine learning method that offers many advantages, including the ability to easily measure variable importance. Class balancing technique is a well-known solution to deal with class imbalance problem. However, it has not been actively studied on RF variable importance. In this paper, we study the effect of class balancing on RF variable importance. Our simulation results show that over-sampling is effective in correctly measuring variable importance in class imbalanced situations with small sample size, while under-sampling fails to differentiate important and non-informative variables. We then propose a variable selection algorithm that utilizes RF variable importance and its confidence interval. Through an experimental study using many real and artificial datasets, we demonstrate that our proposed algorithm efficiently selects an optimal feature set, leading to improved prediction performance in class imbalance problem.  ( 2 min )
    Bayesian Model Selection via Mean-Field Variational Approximation. (arXiv:2312.10607v1 [stat.ME])
    This article considers Bayesian model selection via mean-field (MF) variational approximation. Towards this goal, we study the non-asymptotic properties of MF inference under the Bayesian framework that allows latent variables and model mis-specification. Concretely, we show a Bernstein von-Mises (BvM) theorem for the variational distribution from MF under possible model mis-specification, which implies the distributional convergence of MF variational approximation to a normal distribution centering at the maximal likelihood estimator (within the specified model). Motivated by the BvM theorem, we propose a model selection criterion using the evidence lower bound (ELBO), and demonstrate that the model selected by ELBO tends to asymptotically agree with the one selected by the commonly used Bayesian information criterion (BIC) as sample size tends to infinity. Comparing to BIC, ELBO tends to incur smaller approximation error to the log-marginal likelihood (a.k.a. model evidence) due to a better dimension dependence and full incorporation of the prior information. Moreover, we show the geometric convergence of the coordinate ascent variational inference (CAVI) algorithm under the parametric model framework, which provides a practical guidance on how many iterations one typically needs to run when approximating the ELBO. These findings demonstrate that variational inference is capable of providing a computationally efficient alternative to conventional approaches in tasks beyond obtaining point estimates, which is also empirically demonstrated by our extensive numerical experiments.  ( 2 min )
    Rotting Infinitely Many-armed Bandits. (arXiv:2201.12975v3 [cs.LG] UPDATED)
    We consider the infinitely many-armed bandit problem with rotting rewards, where the mean reward of an arm decreases at each pull of the arm according to an arbitrary trend with maximum rotting rate $\varrho=o(1)$. We show that this learning problem has an $\Omega(\max\{\varrho^{1/3}T,\sqrt{T}\})$ worst-case regret lower bound where $T$ is the horizon time. We show that a matching upper bound $\tilde{O}(\max\{\varrho^{1/3}T,\sqrt{T}\})$, up to a poly-logarithmic factor, can be achieved by an algorithm that uses a UCB index for each arm and a threshold value to decide whether to continue pulling an arm or remove the arm from further consideration, when the algorithm knows the value of the maximum rotting rate $\varrho$. We also show that an $\tilde{O}(\max\{\varrho^{1/3}T,T^{3/4}\})$ regret upper bound can be achieved by an algorithm that does not know the value of $\varrho$, by using an adaptive UCB index along with an adaptive threshold value.  ( 2 min )
    Communication-constrained hypothesis testing: Optimality, robustness, and reverse data processing inequalities. (arXiv:2206.02765v2 [math.ST] UPDATED)
    We study hypothesis testing under communication constraints, where each sample is quantized before being revealed to a statistician. Without communication constraints, it is well known that the sample complexity of simple binary hypothesis testing is characterized by the Hellinger distance between the distributions. We show that the sample complexity of simple binary hypothesis testing under communication constraints is at most a logarithmic factor larger than in the unconstrained setting and this bound is tight. We develop a polynomial-time algorithm that achieves the aforementioned sample complexity. Our framework extends to robust hypothesis testing, where the distributions are corrupted in the total variation distance. Our proofs rely on a new reverse data processing inequality and a reverse Markov inequality, which may be of independent interest. For simple $M$-ary hypothesis testing, the sample complexity in the absence of communication constraints has a logarithmic dependence on $M$. We show that communication constraints can cause an exponential blow-up leading to $\Omega(M)$ sample complexity even for adaptive algorithms.  ( 2 min )
    Vesicoureteral Reflux Detection with Reliable Probabilistic Outputs. (arXiv:2312.11355v1 [cs.LG])
    Vesicoureteral Reflux (VUR) is a pediatric disorder in which urine flows backwards from the bladder to the upper urinary tract. Its detection is of great importance as it increases the risk of a Urinary Tract Infection, which can then lead to a kidney infection since bacteria may have direct access to the kidneys. Unfortunately the detection of VUR requires a rather painful medical examination, called voiding cysteourethrogram (VCUG), that exposes the child to radiation. In an effort to avoid the exposure to radiation required by VCUG some recent studies examined the use of machine learning techniques for the detection of VUR based on data that can be obtained without exposing the child to radiation. This work takes one step further by proposing an approach that provides lower and upper bounds for the conditional probability of a given child having VUR. The important property of these bounds is that they are guaranteed (up to statistical fluctuations) to contain well-calibrated probabilities with the only requirement that observations are independent and identically distributed (i.i.d.). Therefore they are much more informative and reliable than the plain yes/no answers provided by other techniques.  ( 2 min )
    Distributed Collapsed Gibbs Sampler for Dirichlet Process Mixture Models in Federated Learning. (arXiv:2312.11169v1 [stat.ML])
    Dirichlet Process Mixture Models (DPMMs) are widely used to address clustering problems. Their main advantage lies in their ability to automatically estimate the number of clusters during the inference process through the Bayesian non-parametric framework. However, the inference becomes considerably slow as the dataset size increases. This paper proposes a new distributed Markov Chain Monte Carlo (MCMC) inference method for DPMMs (DisCGS) using sufficient statistics. Our approach uses the collapsed Gibbs sampler and is specifically designed to work on distributed data across independent and heterogeneous machines, which habilitates its use in horizontal federated learning. Our method achieves highly promising results and notable scalability. For instance, with a dataset of 100K data points, the centralized algorithm requires approximately 12 hours to complete 100 iterations while our approach achieves the same number of iterations in just 3 minutes, reducing the execution time by a factor of 200 without compromising clustering performance. The code source is publicly available at https://github.com/redakhoufache/DisCGS.  ( 2 min )
    Amortized Reparametrization: Efficient and Scalable Variational Inference for Latent SDEs. (arXiv:2312.10550v1 [cs.LG])
    We consider the problem of inferring latent stochastic differential equations (SDEs) with a time and memory cost that scales independently with the amount of data, the total length of the time series, and the stiffness of the approximate differential equations. This is in stark contrast to typical methods for inferring latent differential equations which, despite their constant memory cost, have a time complexity that is heavily dependent on the stiffness of the approximate differential equation. We achieve this computational advancement by removing the need to solve differential equations when approximating gradients using a novel amortization strategy coupled with a recently derived reparametrization of expectations under linear SDEs. We show that, in practice, this allows us to achieve similar performance to methods based on adjoint sensitivities with more than an order of magnitude fewer evaluations of the model in training.  ( 2 min )
    Gibbs Sampling from Human Feedback: A Provable KL- constrained Framework for RLHF. (arXiv:2312.11456v1 [cs.LG])
    This paper studies the theoretical framework of the alignment process of generative models with Reinforcement Learning from Human Feedback (RLHF). We consider a standard mathematical formulation, the reverse-KL regularized contextual bandit for RLHF. Despite its widespread practical application, a rigorous theoretical analysis of this formulation remains open. We investigate its theoretical properties both in offline and online settings and propose efficient algorithms with finite-sample theoretical guarantees. Our work bridges the gap between theory and practice by linking our theoretical insights with existing practical alignment algorithms such as Direct Preference Optimization (DPO) and Rejection Sampling Optimization (RSO). Furthermore, these findings and connections also offer both theoretical and practical communities new tools and insights for future algorithmic design of alignment algorithms.  ( 2 min )
    Sparse Learning and Class Probability Estimation with Weighted Support Vector Machines. (arXiv:2312.10618v1 [stat.ME])
    Classification and probability estimation have broad applications in modern machine learning and data science applications, including biology, medicine, engineering, and computer science. The recent development of a class of weighted Support Vector Machines (wSVMs) has shown great values in robustly predicting the class probability and classification for various problems with high accuracy. The current framework is based on the $\ell^2$-norm regularized binary wSVMs optimization problem, which only works with dense features and has poor performance at sparse features with redundant noise in most real applications. The sparse learning process requires a prescreen of the important variables for each binary wSVMs for accurately estimating pairwise conditional probability. In this paper, we proposed novel wSVMs frameworks that incorporate automatic variable selection with accurate probability estimation for sparse learning problems. We developed efficient algorithms for effective variable selection for solving either the $\ell^1$-norm or elastic net regularized binary wSVMs optimization problems. The binary class probability is then estimated either by the $\ell^2$-norm regularized wSVMs framework with selected variables or by elastic net regularized wSVMs directly. The two-step approach of $\ell^1$-norm followed by $\ell^2$-norm wSVMs show a great advantage in both automatic variable selection and reliable probability estimators with the most efficient time. The elastic net regularized wSVMs offer the best performance in terms of variable selection and probability estimation with the additional advantage of variable grouping in the compensation of more computation time for high dimensional problems. The proposed wSVMs-based sparse learning methods have wide applications and can be further extended to $K$-class problems through ensemble learning.  ( 3 min )
    Effectiveness of Constant Stepsize in Markovian LSA and Statistical Inference. (arXiv:2312.10894v1 [stat.ML])
    In this paper, we study the effectiveness of using a constant stepsize in statistical inference via linear stochastic approximation (LSA) algorithms with Markovian data. After establishing a Central Limit Theorem (CLT), we outline an inference procedure that uses averaged LSA iterates to construct confidence intervals (CIs). Our procedure leverages the fast mixing property of constant-stepsize LSA for better covariance estimation and employs Richardson-Romberg (RR) extrapolation to reduce the bias induced by constant stepsize and Markovian data. We develop theoretical results for guiding stepsize selection in RR extrapolation, and identify several important settings where the bias provably vanishes even without extrapolation. We conduct extensive numerical experiments and compare against classical inference approaches. Our results show that using a constant stepsize enjoys easy hyperparameter tuning, fast convergence, and consistently better CI coverage, especially when data is limited.  ( 2 min )
    Uncertainty-based Fairness Measures. (arXiv:2312.11299v1 [cs.LG])
    Unfair predictions of machine learning (ML) models impede their broad acceptance in real-world settings. Tackling this arduous challenge first necessitates defining what it means for an ML model to be fair. This has been addressed by the ML community with various measures of fairness that depend on the prediction outcomes of the ML models, either at the group level or the individual level. These fairness measures are limited in that they utilize point predictions, neglecting their variances, or uncertainties, making them susceptible to noise, missingness and shifts in data. In this paper, we first show that an ML model may appear to be fair with existing point-based fairness measures but biased against a demographic group in terms of prediction uncertainties. Then, we introduce new fairness measures based on different types of uncertainties, namely, aleatoric uncertainty and epistemic uncertainty. We demonstrate on many datasets that (i) our uncertainty-based measures are complementary to existing measures of fairness, and (ii) they provide more insights about the underlying issues leading to bias.  ( 2 min )
    Robust Estimation of Causal Heteroscedastic Noise Models. (arXiv:2312.10102v1 [stat.ML])
    Distinguishing the cause and effect from bivariate observational data is the foundational problem that finds applications in many scientific disciplines. One solution to this problem is assuming that cause and effect are generated from a structural causal model, enabling identification of the causal direction after estimating the model in each direction. The heteroscedastic noise model is a type of structural causal model where the cause can contribute to both the mean and variance of the noise. Current methods for estimating heteroscedastic noise models choose the Gaussian likelihood as the optimization objective which can be suboptimal and unstable when the data has a non-Gaussian distribution. To address this limitation, we propose a novel approach to estimating this model with Student's $t$-distribution, which is known for its robustness in accounting for sampling variability with smaller sample sizes and extreme values without significantly altering the overall distribution shape. This adaptability is beneficial for capturing the parameters of the noise distribution in heteroscedastic noise models. Our empirical evaluations demonstrate that our estimators are more robust and achieve better overall performance across synthetic and real benchmarks.  ( 2 min )
    Continuous Diffusion for Mixed-Type Tabular Data. (arXiv:2312.10431v1 [cs.LG])
    Score-based generative models (or diffusion models for short) have proven successful across many domains in generating text and image data. However, the consideration of mixed-type tabular data with this model family has fallen short so far. Existing research mainly combines different diffusion processes without explicitly accounting for the feature heterogeneity inherent to tabular data. In this paper, we combine score matching and score interpolation to ensure a common type of continuous noise distribution that affects both continuous and categorical features alike. Further, we investigate the impact of distinct noise schedules per feature or per data type. We allow for adaptive, learnable noise schedules to ensure optimally allocated model capacity and balanced generative capability. Results show that our model consistently outperforms state-of-the-art benchmark models and that accounting for heterogeneity within the noise schedule design boosts the sample quality.  ( 2 min )
    Interventionally Consistent Surrogates for Agent-based Simulators. (arXiv:2312.11158v1 [cs.MA])
    Agent-based simulators provide granular representations of complex intelligent systems by directly modelling the interactions of the system's constituent agents. Their high-fidelity nature enables hyper-local policy evaluation and testing of what-if scenarios, but is associated with large computational costs that inhibits their widespread use. Surrogate models can address these computational limitations, but they must behave consistently with the agent-based model under policy interventions of interest. In this paper, we capitalise on recent developments on causal abstractions to develop a framework for learning interventionally consistent surrogate models for agent-based simulators. Our proposed approach facilitates rapid experimentation with policy interventions in complex systems, while inducing surrogates to behave consistently with high probability with respect to the agent-based simulator across interventions of interest. We demonstrate with empirical studies that observationally trained surrogates can misjudge the effect of interventions and misguide policymakers towards suboptimal policies, while surrogates trained for interventional consistency with our proposed method closely mimic the behaviour of an agent-based model under interventions of interest.  ( 2 min )
    Human mobility is well described by closed-form gravity-like models learned automatically from data. (arXiv:2312.11281v1 [physics.soc-ph])
    Modeling of human mobility is critical to address questions in urban planning and transportation, as well as global challenges in sustainability, public health, and economic development. However, our understanding and ability to model mobility flows within and between urban areas are still incomplete. At one end of the modeling spectrum we have simple so-called gravity models, which are easy to interpret and provide modestly accurate predictions of mobility flows. At the other end, we have complex machine learning and deep learning models, with tens of features and thousands of parameters, which predict mobility more accurately than gravity models at the cost of not being interpretable and not providing insight on human behavior. Here, we show that simple machine-learned, closed-form models of mobility are able to predict mobility flows more accurately, overall, than either gravity or complex machine and deep learning models. At the same time, these models are simple and gravity-like, and can be interpreted in terms similar to standard gravity models. Furthermore, these models work for different datasets and at different scales, suggesting that they may capture the fundamental universal features of human mobility.  ( 2 min )
    A Concentration Bound for TD(0) with Function Approximation. (arXiv:2312.10424v1 [cs.LG])
    We derive a concentration bound of the type `for all $n \geq n_0$ for some $n_0$' for TD(0) with linear function approximation. We work with online TD learning with samples from a single sample path of the underlying Markov chain. This makes our analysis significantly different from offline TD learning or TD learning with access to independent samples from the stationary distribution of the Markov chain. We treat TD(0) as a contractive stochastic approximation algorithm, with both martingale and Markov noises. Markov noise is handled using the Poisson equation and the lack of almost sure guarantees on boundedness of iterates is handled using the concept of relaxed concentration inequalities.  ( 2 min )

  • Open

    VideoPoet: A large language model for zero-shot video generation
    Posted by Dan Kondratyuk and David Ross, Software Engineers, Google Research A recent wave of video generation models has burst onto the scene, in many cases showcasing stunning picturesque quality. One of the current bottlenecks in video generation is in the ability to produce coherent large motions. In many cases, even the current leading models either generate small motion or, when producing larger motions, exhibit noticeable artifacts. To explore the application of language models in video generation, we introduce VideoPoet, a large language model (LLM) that is capable of a wide variety of video generation tasks, including text-to-video, image-to-video, video stylization, video inpainting and outpainting, and video-to-audio. One notable observation is that the leading video gene…  ( 93 min )
    Simulations illuminate the path to post-event traffic flow
    Posted by Yechen Li and Neha Arora, Software Engineers, Google Research Fifteen minutes. That’s how long it took to empty the Colosseum, an engineering marvel that’s still standing as the largest amphitheater in the world. Two thousand years later, this design continues to work well to move enormous crowds out of sporting and entertainment venues. But of course, exiting the arena is only the first step. Next, people must navigate the traffic that builds up in the surrounding streets. This is an age-old problem that remains unsolved to this day. In Rome, they addressed the issue by prohibiting private traffic on the street that passes directly by the Colosseum. This policy worked there, but what if you’re not in Rome? What if you’re at the Superbowl? Or at a Taylor Swift concert? …  ( 92 min )
  • Open

    DSC Weekly 19 December 2023
    Announcements Top Stories In-Depth The post DSC Weekly 19 December 2023 appeared first on Data Science Central.  ( 20 min )
    LLMs: Can intelligence be caged?
    It may depend on the level of intelligence and the perception of the strength of that cage by the captors. The cage may be strong enough that without any unknown or unexpected event, it would hold up. However, the heterogeneity of intelligence may result in forms [or messages] with which escapes can be made, without leaving… Read More »LLMs: Can intelligence be caged? The post LLMs: Can intelligence be caged? appeared first on Data Science Central.  ( 20 min )
    Why FAIR data assets are essential to AI data management
    One of the efforts our Dataworthy Collective will be ramping up in 2024 involves standardizing the building of logical knowledge graphs at the level of the document object. The goal is to make spreadsheets trustworthy, sharable and reusable on a standalone basis at web scale.  Lead Charles Hoffman and others on his team believe that… Read More »Why FAIR data assets are essential to AI data management The post Why FAIR data assets are essential to AI data management appeared first on Data Science Central.  ( 21 min )
  • Open

    NVIDIA to Reveal New AI Innovations at CES 2024
    In the lead-up to next month’s CES trade show in Las Vegas, NVIDIA will unveil its latest advancements in artificial intelligence — including generative AI — and a spectrum of other cutting-edge technologies. Scheduled for Monday, Jan. 8, at 8 a.m. PT, the company’s special address will be publicly streamed. Save the date and plan Read article >  ( 5 min )
    DLSS 3.5 Integration in D5 Render Marks New Era of Real-Time Rendering
    NVIDIA DLSS 3.5 for realistic ray-traced visuals is now available on D5 Render, a real-time 3D creation software.  ( 7 min )
  • Open

    Driving advanced analytics outcomes at scale using Amazon SageMaker powered PwC’s Machine Learning Ops Accelerator
    This post was written in collaboration with Ankur Goyal and Karthikeyan Chokappa from PwC Australia’s Cloud & Digital business. Artificial intelligence (AI) and machine learning (ML) are becoming an integral part of systems and processes, enabling decisions in real time, thereby driving top and bottom-line improvements across organizations. However, putting an ML model into production […]  ( 10 min )
  • Open

    Fermat curve
    The Fermat curve of order n is the set of points satisfying xn + yn = 1 for a positive integer n. Fermat’s last theorem is equivalent to saying there are no non-trivial rational points on the Fermat curve of order n > 2. (The trivial points have x or y equal to 0.) Parameterization The […] Fermat curve first appeared on John D. Cook.  ( 5 min )
    Conformal map between disk and equilateral triangle
    The Dixon elliptic functions sm and cm are in some ways analogous to sine and cosine. However, whereas sine and cosine satisfy the Dixon functions satisfy The exponent 3 foreshadows the fact that these functions have a sort of three-fold symmetry. In particular, the function sm maps an equilateral triangle in the complex plane to […] Conformal map between disk and equilateral triangle first appeared on John D. Cook.  ( 5 min )
  • Open

    OEBench: Investigating Open Environment Challenges in Real-World Relational Data Streams. (arXiv:2308.15059v3 [cs.LG] UPDATED)
    How to get insights from relational data streams in a timely manner is a hot research topic. Data streams can present unique challenges, such as distribution drifts, outliers, emerging classes, and changing features, which have recently been described as open environment challenges for machine learning. While existing studies have been done on incremental learning for data streams, their evaluations are mostly conducted with synthetic datasets. Thus, a natural question is how those open environment challenges look like and how existing incremental learning algorithms perform on real-world relational data streams. To fill this gap, we develop an Open Environment Benchmark named OEBench to evaluate open environment challenges in real-world relational data streams. Specifically, we investigate 55 real-world relational data streams and establish that open environment scenarios are indeed widespread, which presents significant challenges for stream learning algorithms. Through benchmarks with existing incremental learning algorithms, we find that increased data quantity may not consistently enhance the model accuracy when applied in open environment scenarios, where machine learning models can be significantly compromised by missing values, distribution drifts, or anomalies in real-world data streams. The current techniques are insufficient in effectively mitigating these challenges brought by open environments. More researches are needed to address real-world open environment challenges. All datasets and code are open-sourced in https://github.com/sjtudyq/OEBench.  ( 3 min )
    Is ChatGPT a game changer for geocoding -- a benchmark for geocoding address parsing techniques. (arXiv:2310.14360v4 [cs.CL] UPDATED)
    The remarkable success of GPT models across various tasks, including toponymy recognition motivates us to assess the performance of the GPT-3 model in the geocoding address parsing task. To ensure that the evaluation more accurately mirrors performance in real-world scenarios with diverse user input qualities and resolve the pressing need for a 'gold standard' evaluation dataset for geocoding systems, we introduce a benchmark dataset of low-quality address descriptions synthesized based on human input patterns mining from actual input logs of a geocoding system in production. This dataset has 21 different input errors and variations; contains over 239,000 address records that are uniquely selected from streets across all U.S. 50 states and D.C.; and consists of three subsets to be used as training, validation, and testing sets. Building on this, we train and gauge the performance of the GPT-3 model in extracting address components, contrasting its performance with transformer-based and LSTM-based models. The evaluation results indicate that Bidirectional LSTM-CRF model has achieved the best performance over these transformer-based models and GPT-3 model. Transformer-based models demonstrate very comparable results compared to the Bidirectional LSTM-CRF model. The GPT-3 model, though trailing in performance, showcases potential in the address parsing task with few-shot examples, exhibiting room for improvement with additional fine-tuning. We open source the code and data of this presented benchmark so that researchers can utilize it for future model development or extend it to evaluate similar tasks, such as document geocoding.  ( 3 min )
    Multi-class Support Vector Machine with Maximizing Minimum Margin. (arXiv:2312.06578v2 [cs.LG] UPDATED)
    Support Vector Machine (SVM) stands out as a prominent machine learning technique widely applied in practical pattern recognition tasks. It achieves binary classification by maximizing the "margin", which represents the minimum distance between instances and the decision boundary. Although many efforts have been dedicated to expanding SVM for multi-class case through strategies such as one versus one and one versus the rest, satisfactory solutions remain to be developed. In this paper, we propose a novel method for multi-class SVM that incorporates pairwise class loss considerations and maximizes the minimum margin. Adhering to this concept, we embrace a new formulation that imparts heightened flexibility to multi-class SVM. Furthermore, the correlations between the proposed method and multiple forms of multi-class SVM are analyzed. The proposed regularizer, akin to the concept of "margin", can serve as a seamless enhancement over the softmax in deep learning, providing guidance for network parameter learning. Empirical evaluations demonstrate the effectiveness and superiority of our proposed method over existing multi-classification methods.Code is available at https://github.com/zz-haooo/M3SVM.  ( 2 min )
    Convergent Data-driven Regularizations for CT Reconstruction. (arXiv:2212.07786v2 [math.NA] UPDATED)
    The reconstruction of images from their corresponding noisy Radon transform is a typical example of an ill-posed linear inverse problem as arising in the application of computerized tomography (CT). As the (naive) solution does not depend on the measured data continuously, regularization is needed to re-establish a continuous dependence. In this work, we investigate simple, but yet still provably convergent approaches to learning linear regularization methods from data. More specifically, we analyze two approaches: One generic linear regularization that learns how to manipulate the singular values of the linear operator in an extension of our previous work, and one tailored approach in the Fourier domain that is specific to CT-reconstruction. We prove that such approaches become convergent regularization methods as well as the fact that the reconstructions they provide are typically much smoother than the training data they were trained on. Finally, we compare the spectral as well as the Fourier-based approaches for CT-reconstruction numerically, discuss their advantages and disadvantages and investigate the effect of discretization errors at different resolutions.  ( 2 min )
    Distributed Matrix-Based Sampling for Graph Neural Network Training. (arXiv:2311.02909v2 [cs.LG] UPDATED)
    The primary contribution of this paper is new methods for reducing communication in the sampling step for distributed GNN training. Here, we propose a matrix-based bulk sampling approach that expresses sampling as a sparse matrix multiplication (SpGEMM) and samples multiple minibatches at once. When the input graph topology does not fit on a single device, our method distributes the graph and use communication-avoiding SpGEMM algorithms to scale GNN minibatch sampling, enabling GNN training on much larger graphs than those that can fit into a single device memory. When the input graph topology (but not the embeddings) fits in the memory of one GPU, our approach (1) performs sampling without communication, (2) amortizes the overheads of sampling a minibatch, and (3) can represent multiple sampling algorithms by simply using different matrix constructions. In addition to new methods for sampling, we show that judiciously replicating feature data with a simple all-to-all exchange can outperform current methods for the feature extraction step in distributed GNN training. We provide experimental results on the largest Open Graph Benchmark (OGB) datasets on $128$ GPUs, and show that our pipeline is $2.5\times$ faster Quiver (a distributed extension to PyTorch-Geometric) on a $3$-layer GraphSAGE network. On datasets outside of OGB, we show a $8.46\times$ speedup on $128$ GPUs in-per epoch time. Finally, we show scaling when the graph is distributed across GPUs and scaling for both node-wise and layer-wise sampling algorithms  ( 3 min )
    Mava: a research library for distributed multi-agent reinforcement learning in JAX. (arXiv:2107.01460v2 [cs.LG] UPDATED)
    Multi-agent reinforcement learning (MARL) research is inherently computationally expensive and it is often difficult to obtain a sufficient number of experiment samples to test hypotheses and make robust statistical claims. Furthermore, MARL algorithms are typically complex in their design and can be tricky to implement correctly. These aspects of MARL present a difficult challenge when it comes to creating useful software for advanced research. Our criteria for such software is that it should be simple enough to use to implement new ideas quickly, while at the same time be scalable and fast enough to test those ideas in a reasonable amount of time. In this preliminary technical report, we introduce Mava, a research library for MARL written purely in JAX, that aims to fulfill these criteria. We discuss the design and core features of Mava, and demonstrate its use and performance across a variety of environments. In particular, we show Mava's substantial speed advantage, with improvements of 10-100x compared to other popular MARL frameworks, while maintaining strong performance. This allows for researchers to test ideas in a few minutes instead of several hours. Finally, Mava forms part of an ecosystem of libraries that seamlessly integrate with each other to help facilitate advanced research in MARL. We hope Mava will benefit the community and help drive scientifically sound and statistically robust research in the field. The open-source repository for Mava is available at https://github.com/instadeepai/Mava.  ( 3 min )
    Robustness May be More Brittle than We Think under Different Degrees of Distribution Shifts. (arXiv:2310.06622v2 [cs.LG] UPDATED)
    Out-of-distribution (OOD) generalization is a complicated problem due to the idiosyncrasies of possible distribution shifts between training and test domains. Most benchmarks employ diverse datasets to address this issue; however, the degree of the distribution shift between the training domains and the test domains of each dataset remains largely fixed. This may lead to biased conclusions that either underestimate or overestimate the actual OOD performance of a model. Our study delves into a more nuanced evaluation setting that covers a broad range of shift degrees. We show that the robustness of models can be quite brittle and inconsistent under different degrees of distribution shifts, and therefore one should be more cautious when drawing conclusions from evaluations under a limited range of degrees. In addition, we observe that large-scale pre-trained models, such as CLIP, are sensitive to even minute distribution shifts of novel downstream tasks. This indicates that while pre-trained representations may help improve downstream in-distribution performance, they could have minimal or even adverse effects on generalization in certain OOD scenarios of the downstream task if not used properly. In light of these findings, we encourage future research to conduct evaluations across a broader range of shift degrees whenever possible.  ( 3 min )
    Dementia Assessment Using Mandarin Speech with an Attention-based Speech Recognition Encoder. (arXiv:2310.03985v2 [cs.CL] UPDATED)
    Dementia diagnosis requires a series of different testing methods, which is complex and time-consuming. Early detection of dementia is crucial as it can prevent further deterioration of the condition. This paper utilizes a speech recognition model to construct a dementia assessment system tailored for Mandarin speakers during the picture description task. By training an attention-based speech recognition model on voice data closely resembling real-world scenarios, we have significantly enhanced the model's recognition capabilities. Subsequently, we extracted the encoder from the speech recognition model and added a linear layer for dementia assessment. We collected Mandarin speech data from 99 subjects and acquired their clinical assessments from a local hospital. We achieved an accuracy of 92.04% in Alzheimer's disease detection and a mean absolute error of 9% in clinical dementia rating score prediction.  ( 2 min )
    Greedy Shapley Client Selection for Communication-Efficient Federated Learning. (arXiv:2312.09108v2 [cs.LG] UPDATED)
    The standard client selection algorithms for Federated Learning (FL) are often unbiased and involve uniform random sampling of clients. This has been proven sub-optimal for fast convergence under practical settings characterized by significant heterogeneity in data distribution, computing, and communication resources across clients. For applications having timing constraints due to limited communication opportunities with the parameter server (PS), the client selection strategy is critical to complete model training within the fixed budget of communication rounds. To address this, we develop a biased client selection strategy, GreedyFed, that identifies and greedily selects the most contributing clients in each communication round. This method builds on a fast approximation algorithm for the Shapley Value at the PS, making the computation tractable for real-world applications with many clients. Compared to various client selection strategies on several real-world datasets, GreedyFed demonstrates fast and stable convergence with high accuracy under timing constraints and when imposing a higher degree of heterogeneity in data distribution, systems constraints, and privacy requirements.  ( 2 min )
    Very high resolution canopy height maps from RGB imagery using self-supervised vision transformer and convolutional decoder trained on Aerial Lidar. (arXiv:2304.07213v3 [cs.CV] UPDATED)
    Vegetation structure mapping is critical for understanding the global carbon cycle and monitoring nature-based approaches to climate adaptation and mitigation. Repeated measurements of these data allow for the observation of deforestation or degradation of existing forests, natural forest regeneration, and the implementation of sustainable agricultural practices like agroforestry. Assessments of tree canopy height and crown projected area at a high spatial resolution are also important for monitoring carbon fluxes and assessing tree-based land uses, since forest structures can be highly spatially heterogeneous, especially in agroforestry systems. Very high resolution satellite imagery (less than one meter (1m) Ground Sample Distance) makes it possible to extract information at the tree level while allowing monitoring at a very large scale. This paper presents the first high-resolution canopy height map concurrently produced for multiple sub-national jurisdictions. Specifically, we produce very high resolution canopy height maps for the states of California and Sao Paulo, a significant improvement in resolution over the ten meter (10m) resolution of previous Sentinel / GEDI based worldwide maps of canopy height. The maps are generated by the extraction of features from a self-supervised model trained on Maxar imagery from 2017 to 2020, and the training of a dense prediction decoder against aerial lidar maps. We also introduce a post-processing step using a convolutional network trained on GEDI observations. We evaluate the proposed maps with set-aside validation lidar data as well as by comparing with other remotely sensed maps and field-collected data, and find our model produces an average Mean Absolute Error (MAE) of 2.8 meters and Mean Error (ME) of 0.6 meters.  ( 3 min )
    3D-MuPPET: 3D Multi-Pigeon Pose Estimation and Tracking. (arXiv:2308.15316v3 [cs.CV] UPDATED)
    Markerless methods for animal posture tracking have been rapidly developing recently, but frameworks and benchmarks for tracking large animal groups in 3D are still lacking. To overcome this gap in the literature, we present 3D-MuPPET, a framework to estimate and track 3D poses of up to 10 pigeons at interactive speed using multiple camera views. We train a pose estimator to infer 2D keypoints and bounding boxes of multiple pigeons, then triangulate the keypoints to 3D. For identity matching of individuals in all views, we first dynamically match 2D detections to global identities in the first frame, then use a 2D tracker to maintain IDs across views in subsequent frames. We achieve comparable accuracy to a state of the art 3D pose estimator in terms of median error and Percentage of Correct Keypoints. Additionally, we benchmark the inference speed of 3D-MuPPET, with up to 9.45 fps in 2D and 1.89 fps in 3D, and perform quantitative tracking evaluation, which yields encouraging results. Finally, we showcase two novel applications for 3D-MuPPET. First, we train a model with data of single pigeons and achieve comparable results in 2D and 3D posture estimation for up to 5 pigeons. Second, we show that 3D-MuPPET also works in outdoors without additional annotations from natural environments. Both use cases simplify the domain shift to new species and environments, largely reducing annotation effort needed for 3D posture tracking. To the best of our knowledge we are the first to present a framework for 2D/3D animal posture and trajectory tracking that works in both indoor and outdoor environments for up to 10 individuals. We hope that the framework can open up new opportunities in studying animal collective behaviour and encourages further developments in 3D multi-animal posture tracking.  ( 3 min )
    Streaming Active Learning for Regression Problems Using Regression via Classification. (arXiv:2309.01013v2 [cs.LG] UPDATED)
    One of the challenges in deploying a machine learning model is that the model's performance degrades as the operating environment changes. To maintain the performance, streaming active learning is used, in which the model is retrained by adding a newly annotated sample to the training dataset if the prediction of the sample is not certain enough. Although many streaming active learning methods have been proposed for classification, few efforts have been made for regression problems, which are often handled in the industrial field. In this paper, we propose to use the regression-via-classification framework for streaming active learning for regression. Regression-via-classification transforms regression problems into classification problems so that streaming active learning methods proposed for classification problems can be applied directly to regression problems. Experimental validation on four real data sets shows that the proposed method can perform regression with higher accuracy at the same annotation cost.  ( 2 min )
    Debiased Machine Learning and Network Cohesion for Doubly-Robust Differential Reward Models in Contextual Bandits. (arXiv:2312.06403v2 [stat.ML] UPDATED)
    A common approach to learning mobile health (mHealth) intervention policies is linear Thompson sampling. Two desirable mHealth policy features are (1) pooling information across individuals and time and (2) incorporating a time-varying baseline reward. Previous approaches pooled information across individuals but not time, failing to capture trends in treatment effects over time. In addition, these approaches did not explicitly model the baseline reward, which limited the ability to precisely estimate the parameters in the differential reward model. In this paper, we propose a novel Thompson sampling algorithm, termed ''DML-TS-NNR'' that leverages (1) nearest-neighbors to efficiently pool information on the differential reward function across users and time and (2) the Double Machine Learning (DML) framework to explicitly model baseline rewards and stay agnostic to the supervised learning algorithms used. By explicitly modeling baseline rewards, we obtain smaller confidence sets for the differential reward parameters. We offer theoretical guarantees on the pseudo-regret, which are supported by empirical results. Importantly, the DML-TS-NNR algorithm demonstrates robustness to potential misspecifications in the baseline reward model.  ( 2 min )
    Optimal Data Selection: An Online Distributed View. (arXiv:2201.10547v3 [cs.LG] UPDATED)
    The blessing of ubiquitous data also comes with a curse: the communication, storage, and labeling of massive, mostly redundant datasets. We seek to solve this problem at its core, collecting only valuable data and throwing out the rest via submodular maximization. Specifically, we develop algorithms for the online and distributed version of the problem, where data selection occurs in an uncoordinated fashion across multiple data streams. We design a general and flexible core selection routine for our algorithms which, given any stream of data, any assessment of its value, and any formulation of its selection cost, extracts the most valuable subset of the stream up to a constant factor while using minimal memory. Notably, our methods have the same theoretical guarantees as their offline counterparts, and, as far as we know, provide the first guarantees for online distributed submodular optimization in the literature. Finally, in learning tasks on ImageNet and MNIST, we show that our selection methods outperform random selection by $5-20\%$.  ( 2 min )
    Distilling Multi-Level X-vector Knowledge for Small-footprint Speaker Verification. (arXiv:2303.01125v2 [cs.SD] UPDATED)
    Even though deep speaker models have demonstrated impressive accuracy in speaker verification tasks, this often comes at the expense of increased model size and computation time, presenting challenges for deployment in resource-constrained environments. Our research focuses on addressing this limitation through the development of small footprint deep speaker embedding extraction using knowledge distillation. While previous work in this domain has concentrated on speaker embedding extraction at the utterance level, our approach involves amalgamating embeddings from different levels of the x-vector model (teacher network) to train a compact student network. The results highlight the significance of frame-level information, with the student models exhibiting a remarkable size reduction of 85%-91% compared to their teacher counterparts, depending on the size of the teacher embeddings. Notably, by concatenating teacher embeddings, we achieve student networks that maintain comparable performance to the teacher while enjoying a substantial 75% reduction in model size. These findings and insights extend to other x-vector variants, underscoring the broad applicability of our approach.  ( 2 min )
    Efficiently Adapting Pretrained Language Models To New Languages. (arXiv:2311.05741v2 [cs.CL] UPDATED)
    Recent large language models (LLM) exhibit sub-optimal performance on low-resource languages, as the training data of these models is usually dominated by English and other high-resource languages. Furthermore, it is challenging to train models for low-resource languages, especially from scratch, due to a lack of high quality training data. Adapting pretrained LLMs reduces the need for data in the new language while also providing cross lingual transfer capabilities. However, naively adapting to new languages leads to catastrophic forgetting and poor tokenizer efficiency. In this work, we study how to efficiently adapt any existing pretrained LLM to a new language without running into these issues. In particular, we improve the encoding efficiency of the tokenizer by adding new tokens from the target language and study the data mixing recipe to mitigate forgetting. Our experiments on adapting an English LLM to Hungarian and Thai show that our recipe can reach better performance than open source models on the target language, with minimal regressions on English.  ( 2 min )
    Entropy Causal Graphs for Multivariate Time Series Anomaly Detection. (arXiv:2312.09478v1 [cs.LG])
    Many multivariate time series anomaly detection frameworks have been proposed and widely applied. However, most of these frameworks do not consider intrinsic relationships between variables in multivariate time series data, thus ignoring the causal relationship among variables and degrading anomaly detection performance. This work proposes a novel framework called CGAD, an entropy Causal Graph for multivariate time series Anomaly Detection. CGAD utilizes transfer entropy to construct graph structures that unveil the underlying causal relationships among time series data. Weighted graph convolutional networks combined with causal convolutions are employed to model both the causal graph structures and the temporal patterns within multivariate time series data. Furthermore, CGAD applies anomaly scoring, leveraging median absolute deviation-based normalization to improve the robustness of the anomaly identification process. Extensive experiments demonstrate that CGAD outperforms state-of-the-art methods on real-world datasets with a 15% average improvement based on three different multivariate time series anomaly detection metrics.  ( 2 min )
    A Novel Ehanced Move Recognition Algorithm Based on Pre-trained Models with Positional Embeddings. (arXiv:2308.10822v2 [cs.CL] UPDATED)
    The recognition of abstracts is crucial for effectively locating the content and clarifying the article. Existing move recognition algorithms lack the ability to learn word position information to obtain contextual semantics. This paper proposes a novel enhanced move recognition algorithm with an improved pre-trained model and a gated network with attention mechanism for unstructured abstracts of Chinese scientific and technological papers. The proposed algorithm first performs summary data segmentation and vocabulary training. The EP-ERNIE$\_$AT-GRU framework is leveraged to incorporate word positional information, facilitating deep semantic learning and targeted feature extraction. Experimental results demonstrate that the proposed algorithm achieves 13.37$\%$ higher accuracy on the split dataset than on the original dataset and a 7.55$\%$ improvement in accuracy over the basic comparison model.  ( 2 min )
    A systematic review of the use of Deep Learning in Satellite Imagery for Agriculture. (arXiv:2210.01272v2 [cs.CV] UPDATED)
    Agricultural research is essential for increasing food production to meet the requirements of an increasing population in the coming decades. Recently, satellite technology has been improving rapidly and deep learning has seen much success in generic computer vision tasks and many application areas which presents an important opportunity to improve analysis of agricultural land. Here we present a systematic review of 150 studies to find the current uses of deep learning on satellite imagery for agricultural research. Although we identify 5 categories of agricultural monitoring tasks, the majority of the research interest is in crop segmentation and yield prediction. We found that, when used, modern deep learning methods consistently outperformed traditional machine learning across most tasks; the only exception was that Long Short-Term Memory (LSTM) Recurrent Neural Networks did not consistently outperform Random Forests (RF) for yield prediction. The reviewed studies have largely adopted methodologies from generic computer vision, except for one major omission: benchmark datasets are not utilised to evaluate models across studies, making it difficult to compare results. Additionally, some studies have specifically utilised the extra spectral resolution available in satellite imagery, but other divergent properties of satellite images - such as the hugely different scales of spatial patterns - are not being taken advantage of in the reviewed studies.  ( 3 min )
    Risk-Aware Continuous Control with Neural Contextual Bandits. (arXiv:2312.09961v1 [cs.LG])
    Recent advances in learning techniques have garnered attention for their applicability to a diverse range of real-world sequential decision-making problems. Yet, many practical applications have critical constraints for operation in real environments. Most learning solutions often neglect the risk of failing to meet these constraints, hindering their implementation in real-world contexts. In this paper, we propose a risk-aware decision-making framework for contextual bandit problems, accommodating constraints and continuous action spaces. Our approach employs an actor multi-critic architecture, with each critic characterizing the distribution of performance and constraint metrics. Our framework is designed to cater to various risk levels, effectively balancing constraint satisfaction against performance. To demonstrate the effectiveness of our approach, we first compare it against state-of-the-art baseline methods in a synthetic environment, highlighting the impact of intrinsic environmental noise across different risk configurations. Finally, we evaluate our framework in a real-world use case involving a 5G mobile network where only our approach consistently satisfies the system constraint (a signal processing reliability target) with a small performance toll (8.5% increase in power consumption).  ( 2 min )
    Calibrated One Round Federated Learning with Bayesian Inference in the Predictive Space. (arXiv:2312.09817v1 [cs.LG])
    Federated Learning (FL) involves training a model over a dataset distributed among clients, with the constraint that each client's dataset is localized and possibly heterogeneous. In FL, small and noisy datasets are common, highlighting the need for well-calibrated models that represent the uncertainty of predictions. The closest FL techniques to achieving such goals are the Bayesian FL methods which collect parameter samples from local posteriors, and aggregate them to approximate the global posterior. To improve scalability for larger models, one common Bayesian approach is to approximate the global predictive posterior by multiplying local predictive posteriors. In this work, we demonstrate that this method gives systematically overconfident predictions, and we remedy this by proposing $\beta$-Predictive Bayes, a Bayesian FL algorithm that interpolates between a mixture and product of the predictive posteriors, using a tunable parameter $\beta$. This parameter is tuned to improve the global ensemble's calibration, before it is distilled to a single model. Our method is evaluated on a variety of regression and classification datasets to demonstrate its superiority in calibration to other baselines, even as data heterogeneity increases. Code available at https://github.com/hasanmohsin/betaPredBayes_FL  ( 2 min )
    Stochastic interpolants with data-dependent couplings. (arXiv:2310.03725v2 [cs.LG] UPDATED)
    Generative models inspired by dynamical transport of measure -- such as flows and diffusions -- construct a continuous-time map between two probability densities. Conventionally, one of these is the target density, only accessible through samples, while the other is taken as a simple base density that is data-agnostic. In this work, using the framework of stochastic interpolants, we formalize how to \textit{couple} the base and the target densities, whereby samples from the base are computed conditionally given samples from the target in a way that is different from (but does preclude) incorporating information about class labels or continuous embeddings. This enables us to construct dynamical transport maps that serve as conditional generative models. We show that these transport maps can be learned by solving a simple square loss regression problem analogous to the standard independent setting. We demonstrate the usefulness of constructing dependent couplings in practice through experiments in super-resolution and in-painting.  ( 2 min )
    Selective Knowledge Sharing for Privacy-Preserving Federated Distillation without A Good Teacher. (arXiv:2304.01731v4 [cs.LG] UPDATED)
    While federated learning is promising for privacy-preserving collaborative learning without revealing local data, it remains vulnerable to white-box attacks and struggles to adapt to heterogeneous clients. Federated distillation (FD), built upon knowledge distillation--an effective technique for transferring knowledge from a teacher model to student models--emerges as an alternative paradigm, which provides enhanced privacy guarantees and addresses model heterogeneity. Nevertheless, challenges arise due to variations in local data distributions and the absence of a well-trained teacher model, which leads to misleading and ambiguous knowledge sharing that significantly degrades model performance. To address these issues, this paper proposes a selective knowledge sharing mechanism for FD, termed Selective-FD. It includes client-side selectors and a server-side selector to accurately and precisely identify knowledge from local and ensemble predictions, respectively. Empirical studies, backed by theoretical insights, demonstrate that our approach enhances the generalization capabilities of the FD framework and consistently outperforms baseline methods.  ( 2 min )
    Data Compression and Inference in Cosmology with Self-Supervised Machine Learning. (arXiv:2308.09751v2 [astro-ph.CO] UPDATED)
    The influx of massive amounts of data from current and upcoming cosmological surveys necessitates compression schemes that can efficiently summarize the data with minimal loss of information. We introduce a method that leverages the paradigm of self-supervised machine learning in a novel manner to construct representative summaries of massive datasets using simulation-based augmentations. Deploying the method on hydrodynamical cosmological simulations, we show that it can deliver highly informative summaries, which can be used for a variety of downstream tasks, including precise and accurate parameter inference. We demonstrate how this paradigm can be used to construct summary representations that are insensitive to prescribed systematic effects, such as the influence of baryonic physics. Our results indicate that self-supervised machine learning techniques offer a promising new approach for compression of cosmological data as well its analysis.  ( 2 min )
    Modeling Unknown Stochastic Dynamical System via Autoencoder. (arXiv:2312.10001v1 [cs.LG])
    We present a numerical method to learn an accurate predictive model for an unknown stochastic dynamical system from its trajectory data. The method seeks to approximate the unknown flow map of the underlying system. It employs the idea of autoencoder to identify the unobserved latent random variables. In our approach, we design an encoding function to discover the latent variables, which are modeled as unit Gaussian, and a decoding function to reconstruct the future states of the system. Both the encoder and decoder are expressed as deep neural networks (DNNs). Once the DNNs are trained by the trajectory data, the decoder serves as a predictive model for the unknown stochastic system. Through an extensive set of numerical examples, we demonstrate that the method is able to produce long-term system predictions by using short bursts of trajectory data. It is also applicable to systems driven by non-Gaussian noises.  ( 2 min )
    A Comparative Evaluation of Additive Separability Tests for Physics-Informed Machine Learning. (arXiv:2312.09775v1 [cs.LG])
    Many functions characterising physical systems are additively separable. This is the case, for instance, of mechanical Hamiltonian functions in physics, population growth equations in biology, and consumer preference and utility functions in economics. We consider the scenario in which a surrogate of a function is to be tested for additive separability. The detection that the surrogate is additively separable can be leveraged to improve further learning. Hence, it is beneficial to have the ability to test for such separability in surrogates. The mathematical approach is to test if the mixed partial derivative of the surrogate is zero; or empirically, lower than a threshold. We present and comparatively and empirically evaluate the eight methods to compute the mixed partial derivative of a surrogate function.  ( 2 min )
    Collaborating Foundation models for Domain Generalized Semantic Segmentation. (arXiv:2312.09788v1 [cs.CV])
    Domain Generalized Semantic Segmentation (DGSS) deals with training a model on a labeled source domain with the aim of generalizing to unseen domains during inference. Existing DGSS methods typically effectuate robust features by means of Domain Randomization (DR). Such an approach is often limited as it can only account for style diversification and not content. In this work, we take an orthogonal approach to DGSS and propose to use an assembly of CoLlaborative FOUndation models for Domain Generalized Semantic Segmentation (CLOUDS). In detail, CLOUDS is a framework that integrates FMs of various kinds: (i) CLIP backbone for its robust feature representation, (ii) generative models to diversify the content, thereby covering various modes of the possible target distribution, and (iii) Segment Anything Model (SAM) for iteratively refining the predictions of the segmentation model. Extensive experiments show that our CLOUDS excels in adapting from synthetic to real DGSS benchmarks and under varying weather conditions, notably outperforming prior methods by 5.6% and 6.7% on averaged miou, respectively. The code is available at : https://github.com/yasserben/CLOUDS  ( 2 min )
    Sketch and shift: a robust decoder for compressive clustering. (arXiv:2312.09940v1 [cs.LG])
    Compressive learning is an emerging approach to drastically reduce the memory footprint of large-scale learning, by first summarizing a large dataset into a low-dimensional sketch vector, and then decoding from this sketch the latent information needed for learning. In light of recent progress on information preservation guarantees for sketches based on random features, a major objective is to design easy-to-tune algorithms (called decoders) to robustly and efficiently extract this information. To address the underlying non-convex optimization problems, various heuristics have been proposed. In the case of compressive clustering, the standard heuristic is CL-OMPR, a variant of sliding Frank-Wolfe. Yet, CL-OMPR is hard to tune, and the examination of its robustness was overlooked. In this work, we undertake a scrutinized examination of CL-OMPR to circumvent its limitations. In particular, we show how this algorithm can fail to recover the clusters even in advantageous scenarios. To gain insight, we show how the deficiencies of this algorithm can be attributed to optimization difficulties related to the structure of a correlation function appearing at core steps of the algorithm. To address these limitations, we propose an alternative decoder offering substantial improvements over CL-OMPR. Its design is notably inspired from the mean shift algorithm, a classic approach to detect the local maxima of kernel density estimators. The proposed algorithm can extract clustering information from a sketch of the MNIST dataset that is 10 times smaller than previously.  ( 3 min )
    One Self-Configurable Model to Solve Many Abstract Visual Reasoning Problems. (arXiv:2312.09997v1 [cs.AI])
    Abstract Visual Reasoning (AVR) comprises a wide selection of various problems similar to those used in human IQ tests. Recent years have brought dynamic progress in solving particular AVR tasks, however, in the contemporary literature AVR problems are largely dealt with in isolation, leading to highly specialized task-specific methods. With the aim of developing universal learning systems in the AVR domain, we propose the unified model for solving Single-Choice Abstract visual Reasoning tasks (SCAR), capable of solving various single-choice AVR tasks, without making any a priori assumptions about the task structure, in particular the number and location of panels. The proposed model relies on a novel Structure-Aware dynamic Layer (SAL), which adapts its weights to the structure of the considered AVR problem. Experiments conducted on Raven's Progressive Matrices, Visual Analogy Problems, and Odd One Out problems show that SCAR (SAL-based models, in general) effectively solves diverse AVR tasks, and its performance is on par with the state-of-the-art task-specific baselines. What is more, SCAR demonstrates effective knowledge reuse in multi-task and transfer learning settings. To our knowledge, this work is the first successful attempt to construct a general single-choice AVR solver relying on self-configurable architecture and unified solving method. With this work we aim to stimulate and foster progress on task-independent research paths in the AVR domain, with the long-term goal of development of a general AVR solver.  ( 3 min )
    Distributed Learning of Mixtures of Experts. (arXiv:2312.09877v1 [cs.LG])
    In modern machine learning problems we deal with datasets that are either distributed by nature or potentially large for which distributing the computations is usually a standard way to proceed, since centralized algorithms are in general ineffective. We propose a distributed learning approach for mixtures of experts (MoE) models with an aggregation strategy to construct a reduction estimator from local estimators fitted parallelly to distributed subsets of the data. The aggregation is based on an optimal minimization of an expected transportation divergence between the large MoE composed of local estimators and the unknown desired MoE model. We show that the provided reduction estimator is consistent as soon as the local estimators to be aggregated are consistent, and its construction is performed by a proposed majorization-minimization (MM) algorithm that is computationally effective. We study the statistical and numerical properties for the proposed reduction estimator on experiments that demonstrate its performance compared to namely the global estimator constructed in a centralized way from the full dataset. For some situations, the computation time is more than ten times faster, for a comparable performance. Our source codes are publicly available on Github.  ( 2 min )
    GraphRARE: Reinforcement Learning Enhanced Graph Neural Network with Relative Entropy. (arXiv:2312.09708v1 [cs.LG])
    Graph neural networks (GNNs) have shown advantages in graph-based analysis tasks. However, most existing methods have the homogeneity assumption and show poor performance on heterophilic graphs, where the linked nodes have dissimilar features and different class labels, and the semantically related nodes might be multi-hop away. To address this limitation, this paper presents GraphRARE, a general framework built upon node relative entropy and deep reinforcement learning, to strengthen the expressive capability of GNNs. An innovative node relative entropy, which considers node features and structural similarity, is used to measure mutual information between node pairs. In addition, to avoid the sub-optimal solutions caused by mixing useful information and noises of remote nodes, a deep reinforcement learning-based algorithm is developed to optimize the graph topology. This algorithm selects informative nodes and discards noisy nodes based on the defined node relative entropy. Extensive experiments are conducted on seven real-world datasets. The experimental results demonstrate the superiority of GraphRARE in node classification and its capability to optimize the original graph topology.  ( 2 min )
    Peer Learning: Learning Complex Policies in Groups from Scratch via Action Recommendations. (arXiv:2312.09950v1 [cs.LG])
    Peer learning is a novel high-level reinforcement learning framework for agents learning in groups. While standard reinforcement learning trains an individual agent in trial-and-error fashion, all on its own, peer learning addresses a related setting in which a group of agents, i.e., peers, learns to master a task simultaneously together from scratch. Peers are allowed to communicate only about their own states and actions recommended by others: "What would you do in my situation?". Our motivation is to study the learning behavior of these agents. We formalize the teacher selection process in the action advice setting as a multi-armed bandit problem and therefore highlight the need for exploration. Eventually, we analyze the learning behavior of the peers and observe their ability to rank the agents' performance within the study group and understand which agents give reliable advice. Further, we compare peer learning with single agent learning and a state-of-the-art action advice baseline. We show that peer learning is able to outperform single-agent learning and the baseline in several challenging discrete and continuous OpenAI Gym domains. Doing so, we also show that within such a framework complex policies from action recommendations beyond discrete action spaces can evolve.  ( 2 min )
    Simple Weak Coresets for Non-Decomposable Classification Measures. (arXiv:2312.09885v1 [cs.LG])
    While coresets have been growing in terms of their application, barring few exceptions, they have mostly been limited to unsupervised settings. We consider supervised classification problems, and non-decomposable evaluation measures in such settings. We show that stratified uniform sampling based coresets have excellent empirical performance that are backed by theoretical guarantees too. We focus on the F1 score and Matthews Correlation Coefficient, two widely used non-decomposable objective functions that are nontrivial to optimize for and show that uniform coresets attain a lower bound for coreset size, and have good empirical performance, comparable with ``smarter'' coreset construction strategies.  ( 2 min )
    Nonlinear Meta-Learning Can Guarantee Faster Rates. (arXiv:2307.10870v2 [stat.ML] UPDATED)
    Many recent theoretical works on \emph{meta-learning} aim to achieve guarantees in leveraging similar representational structures from related tasks towards simplifying a target task. Importantly, the main aim in theory works on the subject is to understand the extent to which convergence rates -- in learning a common representation -- \emph{may scale with the number $N$ of tasks} (as well as the number of samples per task). First steps in this setting demonstrate this property when both the shared representation amongst tasks, and task-specific regression functions, are linear. This linear setting readily reveals the benefits of aggregating tasks, e.g., via averaging arguments. In practice, however, the representation is often highly nonlinear, introducing nontrivial biases in each task that cannot easily be averaged out as in the linear case. In the present work, we derive theoretical guarantees for meta-learning with nonlinear representations. In particular, assuming the shared nonlinearity maps to an infinite-dimensional RKHS, we show that additional biases can be mitigated with careful regularization that leverages the smoothness of task-specific regression functions,  ( 2 min )
    Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization. (arXiv:2310.03708v3 [cs.LG] UPDATED)
    A single language model (LM), despite aligning well with an average labeler through reinforcement learning from human feedback (RLHF), may not universally suit diverse human preferences. Recent approaches therefore opt for customization by collecting multi-dimensional feedback and creating distinct reward models (RMs) for each dimension (e.g., helpfulness, harmlessness, or honesty). Different LMs can then be optimized for different preferences using multi-objective RLHF (MORLHF) with different reward weightings. Yet, RL fine-tuning is unstable and resource-heavy, especially for MORLHF with diverse and usually conflicting objectives. In this paper, we present Multi-Objective Direct Preference Optimization (MODPO), an RL-free algorithm that extends Direct Preference Optimization (DPO) for multiple alignment objectives with minimal overheads. Essentially, MODPO folds language modeling directly into reward modeling, training LMs as implicit collective reward models (cRMs) that combine all objectives with specific weightings. While theoretically guaranteed to produce the same optimal solutions as MORLHF, MODPO is practically more stable and computationally efficient. Empirical results from safety alignment and long-form question answering confirm that MODPO matches or outperforms existing methods, consistently producing a Pareto front of LMs that cater to diverse preferences with 3 times less computational resources compared to MORLHF.  ( 2 min )
    W-MAE: Pre-trained weather model with masked autoencoder for multi-variable weather forecasting. (arXiv:2304.08754v2 [cs.LG] UPDATED)
    Weather forecasting is a long-standing computational challenge with direct societal and economic impacts. This task involves a large amount of continuous data collection and exhibits rich spatiotemporal dependencies over long periods, making it highly suitable for deep learning models. In this paper, we apply pre-training techniques to weather forecasting and propose W-MAE, a Weather model with Masked AutoEncoder pre-training for weather forecasting. W-MAE is pre-trained in a self-supervised manner to reconstruct spatial correlations within meteorological variables. On the temporal scale, we fine-tune the pre-trained W-MAE to predict the future states of meteorological variables, thereby modeling the temporal dependencies present in weather data. We conduct our experiments using the fifth-generation ECMWF Reanalysis (ERA5) data, with samples selected every six hours. Experimental results show that our W-MAE framework offers three key benefits: 1) when predicting the future state of meteorological variables, the utilization of our pre-trained W-MAE can effectively alleviate the problem of cumulative errors in prediction, maintaining stable performance in the short-to-medium term; 2) when predicting diagnostic variables (e.g., total precipitation), our model exhibits significant performance advantages over FourCastNet; 3) Our task-agnostic pre-training schema can be easily integrated with various task-specific models. When our pre-training framework is applied to FourCastNet, it yields an average 20% performance improvement in Anomaly Correlation Coefficient (ACC).  ( 3 min )
    The Optimal Approximation Factors in Misspecified Off-Policy Value Function Estimation. (arXiv:2307.13332v2 [cs.LG] UPDATED)
    Theoretical guarantees in reinforcement learning (RL) are known to suffer multiplicative blow-up factors with respect to the misspecification error of function approximation. Yet, the nature of such \emph{approximation factors} -- especially their optimal form in a given learning problem -- is poorly understood. In this paper we study this question in linear off-policy value function estimation, where many open questions remain. We study the approximation factor in a broad spectrum of settings, such as with the weighted $L_2$-norm (where the weighting is the offline state distribution), the $L_\infty$ norm, the presence vs. absence of state aliasing, and full vs. partial coverage of the state space. We establish the optimal asymptotic approximation factors (up to constants) for all of these settings. In particular, our bounds identify two instance-dependent factors for the $L_2(\mu)$ norm and only one for the $L_\infty$ norm, which are shown to dictate the hardness of off-policy evaluation under misspecification.  ( 2 min )
    DemoFusion: Democratising High-Resolution Image Generation With No $$$. (arXiv:2311.16973v2 [cs.CV] UPDATED)
    High-resolution image generation with Generative Artificial Intelligence (GenAI) has immense potential but, due to the enormous capital investment required for training, it is increasingly centralised to a few large corporations, and hidden behind paywalls. This paper aims to democratise high-resolution GenAI by advancing the frontier of high-resolution generation while remaining accessible to a broad audience. We demonstrate that existing Latent Diffusion Models (LDMs) possess untapped potential for higher-resolution image generation. Our novel DemoFusion framework seamlessly extends open-source GenAI models, employing Progressive Upscaling, Skip Residual, and Dilated Sampling mechanisms to achieve higher-resolution image generation. The progressive nature of DemoFusion requires more passes, but the intermediate results can serve as "previews", facilitating rapid prompt iteration.  ( 2 min )
    Online Submodular Maximization via Online Convex Optimization. (arXiv:2309.04339v3 [cs.LG] UPDATED)
    We study monotone submodular maximization under general matroid constraints in the online setting. We prove that online optimization of a large class of submodular functions, namely, weighted threshold potential functions, reduces to online convex optimization (OCO). This is precisely because functions in this class admit a concave relaxation; as a result, OCO policies, coupled with an appropriate rounding scheme, can be used to achieve sublinear regret in the combinatorial setting. We show that our reduction extends to many different versions of the online learning problem, including the dynamic regret, bandit, and optimistic-learning settings.  ( 2 min )
    Adaptive action supervision in reinforcement learning from real-world multi-agent demonstrations. (arXiv:2305.13030v3 [cs.AI] UPDATED)
    Modeling of real-world biological multi-agents is a fundamental problem in various scientific and engineering fields. Reinforcement learning (RL) is a powerful framework to generate flexible and diverse behaviors in cyberspace; however, when modeling real-world biological multi-agents, there is a domain gap between behaviors in the source (i.e., real-world data) and the target (i.e., cyberspace for RL), and the source environment parameters are usually unknown. In this paper, we propose a method for adaptive action supervision in RL from real-world demonstrations in multi-agent scenarios. We adopt an approach that combines RL and supervised learning by selecting actions of demonstrations in RL based on the minimum distance of dynamic time warping for utilizing the information of the unknown source dynamics. This approach can be easily applied to many existing neural network architectures and provide us with an RL model balanced between reproducibility as imitation and generalization ability to obtain rewards in cyberspace. In the experiments, using chase-and-escape and football tasks with the different dynamics between the unknown source and target environments, we show that our approach achieved a balance between the reproducibility and the generalization ability compared with the baselines. In particular, we used the tracking data of professional football players as expert demonstrations in football and show successful performances despite the larger gap between behaviors in the source and target environments than the chase-and-escape task.  ( 3 min )
    Associative Learning Mechanism for Drug-Target Interaction Prediction. (arXiv:2205.15364v5 [q-bio.BM] UPDATED)
    As a necessary process in drug development, finding a drug compound that can selectively bind to a specific protein is highly challenging and costly. Drug-target affinity (DTA), which represents the strength of drug-target interaction (DTI), has played an important role in the DTI prediction task over the past decade. Although deep learning has been applied to DTA-related research, existing solutions ignore fundamental correlations between molecular substructures in molecular representation learning of drug compound molecules/protein targets. Moreover, traditional methods lack the interpretability of the DTA prediction process. This results in missing feature information of intermolecular interactions, thereby affecting prediction performance. Therefore, this paper proposes a DTA prediction method with interactive learning and an autoencoder mechanism. The proposed model enhances the corresponding ability to capture the feature information of a single molecular sequence by the drug/protein molecular representation learning module and supplements the information interaction between molecular sequence pairs by the interactive information learning module. The DTA value prediction module fuses the drug-target pair interaction information to output the predicted value of DTA. Additionally, this paper theoretically proves that the proposed method maximizes evidence lower bound (ELBO) for the joint distribution of the DTA prediction model, which enhances the consistency of the probability distribution between the actual value and the predicted value. The experimental results confirm mutual transformer-drug target affinity (MT-DTA) achieves better performance than other comparative methods.  ( 3 min )
    Using machine learning to understand causal relationships between urban form and travel CO2 emissions across continents. (arXiv:2308.16599v2 [cs.LG] UPDATED)
    Climate change mitigation in urban mobility requires policies reconfiguring urban form to increase accessibility and facilitate low-carbon modes of transport. However, current policy research has insufficiently assessed urban form effects on car travel at three levels: (1) Causality -- Can causality be established beyond theoretical and correlation-based analyses? (2) Generalizability -- Do relationships hold across different cities and world regions? (3) Context specificity -- How do relationships vary across neighborhoods of a city? Here, we address all three gaps via causal graph discovery and explainable machine learning to detect urban form effects on intra-city car travel, based on mobility data of six cities across three continents. We find significant causal effects of urban form on trip emissions and inter-feature effects, which had been neglected in previous work. Our results demonstrate that destination accessibility matters most overall, while low density and low connectivity also sharply increase CO$_2$ emissions. These general trends are similar across cities but we find idiosyncratic effects that can lead to substantially different recommendations. In more monocentric cities, we identify spatial corridors -- about 10--50 km from the city center -- where subcenter-oriented development is more relevant than increased access to the main center. Our work demonstrates a novel application of machine learning that enables new research addressing the needs of causality, generalizability, and contextual specificity for scaling evidence-based urban climate solutions.  ( 3 min )
    Assume-Guarantee Reinforcement Learning. (arXiv:2312.09938v1 [cs.LG])
    We present a modular approach to \emph{reinforcement learning} (RL) in environments consisting of simpler components evolving in parallel. A monolithic view of such modular environments may be prohibitively large to learn, or may require unrealizable communication between the components in the form of a centralized controller. Our proposed approach is based on the assume-guarantee paradigm where the optimal control for the individual components is synthesized in isolation by making \emph{assumptions} about the behaviors of neighboring components, and providing \emph{guarantees} about their own behavior. We express these \emph{assume-guarantee contracts} as regular languages and provide automatic translations to scalar rewards to be used in RL. By combining local probabilities of satisfaction for each component, we provide a lower bound on the probability of satisfaction of the complete system. By solving a Markov game for each component, RL can produce a controller for each component that maximizes this lower bound. The controller utilizes the information it receives through communication, observations, and any knowledge of a coarse model of other agents. We experimentally demonstrate the efficiency of the proposed approach on a variety of case studies.  ( 2 min )
    GeoTMI:Predicting quantum chemical property with easy-to-obtain geometry via positional denoising. (arXiv:2304.03724v3 [physics.chem-ph] UPDATED)
    As quantum chemical properties have a dependence on their geometries, graph neural networks (GNNs) using 3D geometric information have achieved high prediction accuracy in many tasks. However, they often require 3D geometries obtained from high-level quantum mechanical calculations, which are practically infeasible, limiting their applicability to real-world problems. To tackle this, we propose a new training framework, GeoTMI, that employs denoising process to predict properties accurately using easy-to-obtain geometries (corrupted versions of correct geometries, such as those obtained from low-level calculations). Our starting point was the idea that the correct geometry is the best description of the target property. Hence, to incorporate information of the correct, GeoTMI aims to maximize mutual information between three variables: the correct and the corrupted geometries and the property. GeoTMI also explicitly updates the corrupted input to approach the correct geometry as it passes through the GNN layers, contributing to more effective denoising. We investigated the performance of the proposed method using 3D GNNs for three prediction tasks: molecular properties, a chemical reaction property, and relaxed energy in a heterogeneous catalytic system. Our results showed consistent improvements in accuracy across various tasks, demonstrating the effectiveness and robustness of GeoTMI.  ( 3 min )
    Learned Regularization for Inverse Problems: Insights from a Spectral Model. (arXiv:2312.09845v1 [math.NA])
    The aim of this paper is to provide a theoretically founded investigation of state-of-the-art learning approaches for inverse problems. We give an extended definition of regularization methods and their convergence in terms of the underlying data distributions, which paves the way for future theoretical studies. Based on a simple spectral learning model previously introduced for supervised learning, we investigate some key properties of different learning paradigms for inverse problems, which can be formulated independently of specific architectures. In particular we investigate the regularization properties, bias, and critical dependence on training data distributions. Moreover, our framework allows to highlight and compare the specific behavior of the different paradigms in the infinite-dimensional limit.  ( 2 min )
    SafeAR: Towards Safer Algorithmic Recourse by Risk-Aware Policies. (arXiv:2308.12367v2 [cs.LG] UPDATED)
    With the growing use of machine learning (ML) models in critical domains such as finance and healthcare, the need to offer recourse for those adversely affected by the decisions of ML models has become more important; individuals ought to be provided with recommendations on actions to take for improving their situation and thus receiving a favorable decision. Prior work on sequential algorithmic recourse -- which recommends a series of changes -- focuses on action feasibility and uses the proximity of feature changes to determine action costs. However, the uncertainties of feature changes and the risk of higher than average costs in recourse have not been considered. It is undesirable if a recourse could (with some probability) result in a worse situation from which recovery requires an extremely high cost. It is essential to incorporate risks when computing and evaluating recourse. We call the recourse computed with such risk considerations as Safer Algorithmic Recourse (SafeAR). The objective is to empower people to choose a recourse based on their risk tolerance. In this work, we discuss and show how existing recourse desiderata can fail to capture the risk of higher costs. We present a method to compute recourse policies that consider variability in cost and connect algorithmic recourse literature with risk-sensitive reinforcement learning. We also adopt measures "Value at Risk" and "Conditional Value at Risk" from the financial literature to summarize risk concisely. We apply our method to two real-world datasets and compare policies with different risk-aversion levels using risk measures and recourse desiderata (sparsity and proximity).  ( 3 min )
    Nonlinear Multi-objective Reinforcement Learning with Provable Guarantees. (arXiv:2311.02544v2 [cs.LG] UPDATED)
    We describe RA-E3 (Reward-Aware Explicit Explore or Exploit), an algorithm with provable guarantees for solving a single or multi-objective Markov Decision Process (MDP) where we want to maximize the expected value of a nonlinear function over accumulated rewards. This allows us to model fairness-aware welfare optimization for multi-objective reinforcement learning as well as risk-aware reinforcement learning with nonlinear Von Neumann-Morgenstern utility functions in the single objective setting. RA-E3 extends the classic E3 algorithm that solves MDPs with scalar rewards and linear preferences. We first state a distinct reward-aware version of value iteration that calculates a non-stationary policy that is approximately optimal for a given model of the environment. This sub-procedure is based on an extended form of Bellman optimality for nonlinear optimization that explicitly considers time and current accumulated reward. We then describe how to use this optimization procedure in a larger algorithm that must simultaneously learn a model of the environment. The algorithm learns an approximately optimal policy in time that depends polynomially on the MDP size, desired approximation, and smoothness of the nonlinear function, and exponentially on the number of objectives.  ( 2 min )
    GreenLightningAI: An Efficient AI System with Decoupled Structural and Quantitative Knowledge. (arXiv:2312.09971v1 [cs.LG])
    The number and complexity of artificial intelligence (AI) applications is growing relentlessly. As a result, even with the many algorithmic and mathematical advances experienced over past decades as well as the impressive energy efficiency and computational capacity of current hardware accelerators, training the most powerful and popular deep neural networks comes at very high economic and environmental costs. Recognising that additional optimisations of conventional neural network training is very difficult, this work takes a radically different approach by proposing GreenLightningAI, a new AI system design consisting of a linear model that is capable of emulating the behaviour of deep neural networks by subsetting the model for each particular sample. The new AI system stores the information required to select the system subset for a given sample (referred to as structural information) separately from the linear model parameters (referred to as quantitative knowledge). In this paper we present a proof of concept, showing that the structural information stabilises far earlier than the quantitative knowledge. Additionally, we show experimentally that the structural information can be kept unmodified when re-training the AI system with new samples while still achieving a validation accuracy similar to that obtained when re-training a neural network with similar size. Since the proposed AI system is based on a linear model, multiple copies of the model, trained with different datasets, can be easily combined. This enables faster and greener (re)-training algorithms, including incremental re-training and federated incremental re-training.  ( 3 min )
    $\nu^2$-Flows: Fast and improved neutrino reconstruction in multi-neutrino final states with conditional normalizing flows. (arXiv:2307.02405v3 [hep-ph] UPDATED)
    In this work we introduce $\nu^2$-Flows, an extension of the $\nu$-Flows method to final states containing multiple neutrinos. The architecture can natively scale for all combinations of object types and multiplicities in the final state for any desired neutrino multiplicities. In $t\bar{t}$ dilepton events, the momenta of both neutrinos and correlations between them are reconstructed more accurately than when using the most popular standard analytical techniques, and solutions are found for all events. Inference time is significantly faster than competing methods, and can be reduced further by evaluating in parallel on graphics processing units. We apply $\nu^2$-Flows to $t\bar{t}$ dilepton events and show that the per-bin uncertainties in unfolded distributions is much closer to the limit of performance set by perfect neutrino reconstruction than standard techniques. For the chosen double differential observables $\nu^2$-Flows results in improved statistical precision for each bin by a factor of 1.5 to 2 in comparison to the Neutrino Weighting method and up to a factor of four in comparison to the Ellipse approach.  ( 3 min )
    LLMs Can Understand Encrypted Prompt: Towards Privacy-Computing Friendly Transformers. (arXiv:2305.18396v3 [cs.LG] UPDATED)
    The community explored to build private inference frameworks for transformer-based large language models (LLMs) in a server-client setting, where the server holds the model parameters and the client inputs its private data (or prompt) for inference. However, these frameworks impose significant overhead when the private inputs are forward propagated through the original LLMs. In this paper, we show that substituting the computation- and communication-heavy operators in the transformer architecture with privacy-computing friendly approximations can greatly reduce the private inference costs while incurring very minor impact on model performance. Compared to state-of-the-art Iron (NeurIPS 2022), our privacy-computing friendly model inference pipeline achieves a $5\times$ acceleration in computation and an 80% reduction in communication overhead, while retaining nearly identical accuracy.  ( 2 min )
    UMedNeRF: Uncertainty-aware Single View Volumetric Rendering for Medical Neural Radiance Fields. (arXiv:2311.05836v4 [eess.IV] UPDATED)
    In the field of clinical medicine, computed tomography (CT) is an effective medical imaging modality for the diagnosis of various pathologies. Compared with X-ray images, CT images can provide more information, including multi-planar slices and three-dimensional structures for clinical diagnosis. However, CT imaging requires patients to be exposed to large doses of ionizing radiation for a long time, which may cause irreversible physical harm. In this paper, we propose an Uncertainty-aware MedNeRF (UMedNeRF) network based on generated radiation fields. The network can learn a continuous representation of CT projections from 2D X-ray images by obtaining the internal structure and depth information and using adaptive loss weights to ensure the quality of the generated images. Our model is trained on publicly available knee and chest datasets, and we show the results of CT projection rendering with a single X-ray and compare our method with other methods based on generated radiation fields.  ( 2 min )
    A novel dual-stream time-frequency contrastive pretext tasks framework for sleep stage classification. (arXiv:2312.09623v1 [eess.SP])
    Self-supervised learning addresses the challenge encountered by many supervised methods, i.e. the requirement of large amounts of annotated data. This challenge is particularly pronounced in fields such as the electroencephalography (EEG) research domain. Self-supervised learning operates instead by utilizing pseudo-labels, which are generated by pretext tasks, to obtain a rich and meaningful data representation. In this study, we aim at introducing a dual-stream pretext task architecture that operates both in the time and frequency domains. In particular, we have examined the incorporation of the novel Frequency Similarity (FS) pretext task into two existing pretext tasks, Relative Positioning (RP) and Temporal Shuffling (TS). We assess the accuracy of these models using the Physionet Challenge 2018 (PC18) dataset in the context of the downstream task sleep stage classification. The inclusion of FS resulted in a notable improvement in downstream task accuracy, with a 1.28 percent improvement on RP and a 2.02 percent improvement on TS. Furthermore, when visualizing the learned embeddings using Uniform Manifold Approximation and Projection (UMAP), distinct clusters emerge, indicating that the learned representations carry meaningful information.  ( 2 min )
    Learning Diverse Risk Preferences in Population-based Self-play. (arXiv:2305.11476v2 [cs.LG] UPDATED)
    Among the great successes of Reinforcement Learning (RL), self-play algorithms play an essential role in solving competitive games. Current self-play algorithms optimize the agent to maximize expected win-rates against its current or historical copies, making it often stuck in the local optimum and its strategy style simple and homogeneous. A possible solution is to improve the diversity of policies, which helps the agent break the stalemate and enhances its robustness when facing different opponents. However, enhancing diversity in the self-play algorithms is not trivial. In this paper, we aim to introduce diversity from the perspective that agents could have diverse risk preferences in the face of uncertainty. Specifically, we design a novel reinforcement learning algorithm called Risk-sensitive Proximal Policy Optimization (RPPO), which smoothly interpolates between worst-case and best-case policy learning and allows for policy learning with desired risk preferences. Seamlessly integrating RPPO with population-based self-play, agents in the population optimize dynamic risk-sensitive objectives with experiences from playing against diverse opponents. Empirical results show that our method achieves comparable or superior performance in competitive games and that diverse modes of behaviors emerge. Our code is public online at \url{https://github.com/Jackory/RPBT}.  ( 2 min )
    Understanding Probe Behaviors through Variational Bounds of Mutual Information. (arXiv:2312.10019v1 [cs.IT])
    With the success of self-supervised representations, researchers seek a better understanding of the information encapsulated within a representation. Among various interpretability methods, we focus on classification-based linear probing. We aim to foster a solid understanding and provide guidelines for linear probing by constructing a novel mathematical framework leveraging information theory. First, we connect probing with the variational bounds of mutual information (MI) to relax the probe design, equating linear probing with fine-tuning. Then, we investigate empirical behaviors and practices of probing through our mathematical framework. We analyze the layer-wise performance curve being convex, which seemingly violates the data processing inequality. However, we show that the intermediate representations can have the biggest MI estimate because of the tradeoff between better separability and decreasing MI. We further suggest that the margin of linearly separable representations can be a criterion for measuring the "goodness of representation." We also compare accuracy with MI as the measuring criteria. Finally, we empirically validate our claims by observing the self-supervised speech models on retaining word and phoneme information.  ( 2 min )
    Distributed Semi-Supervised Sparse Statistical Inference. (arXiv:2306.10395v2 [stat.ML] UPDATED)
    The debiased estimator is a crucial tool in statistical inference for high-dimensional model parameters. However, constructing such an estimator involves estimating the high-dimensional inverse Hessian matrix, incurring significant computational costs. This challenge becomes particularly acute in distributed setups, where traditional methods necessitate computing a debiased estimator on every machine. This becomes unwieldy, especially with a large number of machines. In this paper, we delve into semi-supervised sparse statistical inference in a distributed setup. An efficient multi-round distributed debiased estimator, which integrates both labeled and unlabelled data, is developed. We will show that the additional unlabeled data helps to improve the statistical rate of each round of iteration. Our approach offers tailored debiasing methods for $M$-estimation and generalized linear models according to the specific form of the loss function. Our method also applies to a non-smooth loss like absolute deviation loss. Furthermore, our algorithm is computationally efficient since it requires only one estimation of a high-dimensional inverse covariance matrix. We demonstrate the effectiveness of our method by presenting simulation studies and real data applications that highlight the benefits of incorporating unlabeled data.  ( 2 min )
    Federated Inference with Reliable Uncertainty Quantification over Wireless Channels via Conformal Prediction. (arXiv:2308.04237v2 [cs.IT] UPDATED)
    In this paper, we consider a wireless federated inference scenario in which devices and a server share a pre-trained machine learning model. The devices communicate statistical information about their local data to the server over a common wireless channel, aiming to enhance the quality of the inference decision at the server. Recent work has introduced federated conformal prediction (CP), which leverages devices-to-server communication to improve the reliability of the server's decision. With federated CP, devices communicate to the server information about the loss accrued by the shared pre-trained model on the local data, and the server leverages this information to calibrate a decision interval, or set, so that it is guaranteed to contain the correct answer with a pre-defined target reliability level. Previous work assumed noise-free communication, whereby devices can communicate a single real number to the server. In this paper, we study for the first time federated CP in a wireless setting. We introduce a novel protocol, termed wireless federated conformal prediction (WFCP), which builds on type-based multiple access (TBMA) and on a novel quantile correction strategy. WFCP is proved to provide formal reliability guarantees in terms of coverage of the predicted set produced by the server. Using numerical results, we demonstrate the significant advantages of WFCP against digital implementations of existing federated CP schemes, especially in regimes with limited communication resources and/or large number of devices.  ( 3 min )
    Accelerating Neural Network Training: A Brief Review. (arXiv:2312.10024v1 [cs.LG])
    The process of training a deep neural network is characterized by significant time requirements and associated costs. Although researchers have made considerable progress in this area, further work is still required due to resource constraints. This study examines innovative approaches to expedite the training process of deep neural networks (DNN), with specific emphasis on three state-of-the-art models such as ResNet50, Vision Transformer (ViT), and EfficientNet. The research utilizes sophisticated methodologies, including Gradient Accumulation (GA), Automatic Mixed Precision (AMP), and Pin Memory (PM), in order to optimize performance and accelerate the training procedure. The study examines the effects of these methodologies on the DNN models discussed earlier, assessing their efficacy with regard to training rate and computational efficacy. The study showcases the efficacy of including GA as a strategic approach, resulting in a noteworthy decrease in the duration required for training. This enables the models to converge at a faster pace. The utilization of AMP enhances the speed of computations by taking advantage of the advantages offered by lower precision arithmetic while maintaining the correctness of the model. Furthermore, this study investigates the application of Pin Memory as a strategy to enhance the efficiency of data transmission between the central processing unit and the graphics processing unit, thereby offering a promising opportunity for enhancing overall performance. The experimental findings demonstrate that the combination of these sophisticated methodologies significantly accelerates the training of DNNs, offering vital insights for experts seeking to improve the effectiveness of deep learning processes.  ( 3 min )
    The troublesome kernel -- On hallucinations, no free lunches and the accuracy-stability trade-off in inverse problems. (arXiv:2001.01258v3 [cs.LG] UPDATED)
    Methods inspired by Artificial Intelligence (AI) are starting to fundamentally change computational science and engineering through breakthrough performances on challenging problems. However, reliability and trustworthiness of such techniques is becoming a major concern. In inverse problems in imaging, the focus of this paper, there is increasing empirical evidence that methods may suffer from hallucinations, i.e., false, but realistic-looking artifacts; instability, i.e., sensitivity to perturbations in the data; and unpredictable generalization, i.e., excellent performance on some images, but significant deterioration on others. This paper presents a theoretical foundation for these phenomena. We give a mathematical framework describing how and when such effects arise in arbitrary reconstruction methods, not just AI-inspired techniques. Several of our results take the form of `no free lunch' theorems. Specifically, we show that (i) methods that overperform on a single image can wrongly transfer details from one image to another, creating a hallucination, (ii) methods that overperform on two or more images can hallucinate or be unstable, (iii) optimizing the accuracy-stability trade-off is generally difficult, (iv) hallucinations and instabilities, if they occur, are not rare events, and may be encouraged by standard training, (v) it may be impossible to construct optimal reconstruction maps for certain problems. Our results trace these effects to the kernel of the forward operator whenever it is nontrivial, but also extend to the case when the forward operator is ill-conditioned. Based on these insights, our work aims to spur research into new ways to develop robust and reliable AI-inspired methods for inverse problems in imaging.  ( 3 min )
    Decomposed Diffusion Sampler for Accelerating Large-Scale Inverse Problems. (arXiv:2303.05754v2 [cs.LG] UPDATED)
    Krylov subspace, which is generated by multiplying a given vector by the matrix of a linear transformation and its successive powers, has been extensively studied in classical optimization literature to design algorithms that converge quickly for large linear inverse problems. For example, the conjugate gradient method (CG), one of the most popular Krylov subspace methods, is based on the idea of minimizing the residual error in the Krylov subspace. However, with the recent advancement of high-performance diffusion solvers for inverse problems, it is not clear how classical wisdom can be synergistically combined with modern diffusion models. In this study, we propose a novel and efficient diffusion sampling strategy that synergistically combine the diffusion sampling and Krylov subspace methods. Specifically, we prove that if the tangent space at a denoised sample by Tweedie's formula forms a Krylov subspace, then the CG initialized with the denoised data ensures the data consistency update to remain in the tangent space. This negates the need to compute the manifold-constrained gradient (MCG), leading to a more efficient diffusion sampling method. Our method is applicable regardless of the parametrization and setting (i.e., VE, VP). Notably, we achieve state-of-the-art reconstruction quality on challenging real-world medical inverse imaging problems, including multi-coil MRI reconstruction and 3D CT reconstruction. Moreover, our proposed method achieves more than 80 times faster inference time than the previous state-of-the-art method.  ( 3 min )
    Physics-enhanced deep surrogates for partial differential equations. (arXiv:2111.05841v4 [cs.LG] UPDATED)
    Many physics and engineering applications demand Partial Differential Equations (PDE) property evaluations that are traditionally computed with resource-intensive high-fidelity numerical solvers. Data-driven surrogate models provide an efficient alternative but come with a significant cost of training. Emerging applications would benefit from surrogates with an improved accuracy-cost tradeoff, while studied at scale. Here we present a "physics-enhanced deep-surrogate" ("PEDS") approach towards developing fast surrogate models for complex physical systems, which is described by PDEs. Specifically, a combination of a low-fidelity, explainable physics simulator and a neural network generator is proposed, which is trained end-to-end to globally match the output of an expensive high-fidelity numerical solver. Experiments on three exemplar testcases, diffusion, reaction-diffusion, and electromagnetic scattering models, show that a PEDS surrogate can be up to 3$\times$ more accurate than an ensemble of feedforward neural networks with limited data ($\approx 10^3$ training points), and reduces the training data need by at least a factor of 100 to achieve a target error of 5%. Experiments reveal that PEDS provides a general, data-driven strategy to bridge the gap between a vast array of simplified physical models with corresponding brute-force numerical solvers modeling complex systems, offering accuracy, speed, data efficiency, as well as physical insights into the process.  ( 3 min )
    A Game-theoretic Framework for Privacy-preserving Federated Learning. (arXiv:2304.05836v2 [cs.LG] UPDATED)
    In federated learning, benign participants aim to optimize a global model collaboratively. However, the risk of \textit{privacy leakage} cannot be ignored in the presence of \textit{semi-honest} adversaries. Existing research has focused either on designing protection mechanisms or on inventing attacking mechanisms. While the battle between defenders and attackers seems never-ending, we are concerned with one critical question: is it possible to prevent potential attacks in advance? To address this, we propose the first game-theoretic framework that considers both FL defenders and attackers in terms of their respective payoffs, which include computational costs, FL model utilities, and privacy leakage risks. We name this game the federated learning privacy game (FLPG), in which neither defenders nor attackers are aware of all participants' payoffs. To handle the \textit{incomplete information} inherent in this situation, we propose associating the FLPG with an \textit{oracle} that has two primary responsibilities. First, the oracle provides lower and upper bounds of the payoffs for the players. Second, the oracle acts as a correlation device, privately providing suggested actions to each player. With this novel framework, we analyze the optimal strategies of defenders and attackers. Furthermore, we derive and demonstrate conditions under which the attacker, as a rational decision-maker, should always follow the oracle's suggestion \textit{not to attack}.  ( 3 min )
    FuXi-S2S: An accurate machine learning model for global subseasonal forecasts. (arXiv:2312.09926v1 [physics.ao-ph])
    Skillful subseasonal forecasts beyond 2 weeks are crucial for a wide range of applications across various sectors of society. Recently, state-of-the-art machine learning based weather forecasting models have made significant advancements, outperforming the high-resolution forecast (HRES) from the European Centre for Medium-Range Weather Forecasts (ECMWF). However, the full potential of machine learning models in subseasonal forecasts has yet to be fully explored. In this study, we introduce FuXi Subseasonal-to-Seasonal (FuXi-S2S), a machine learning based subseasonal forecasting model that provides global daily mean forecasts up to 42 days, covering 5 upper-air atmospheric variables at 13 pressure levels and 11 surface variables. FuXi-S2S integrates an enhanced FuXi base model with a perturbation module for flow-dependent perturbations in hidden features, and incorporates Perlin noise to perturb initial conditions. The model is developed using 72 years of daily statistics from ECMWF ERA5 reanalysis data. When compared to the ECMWF Subseasonal-to-Seasonal (S2S) reforecasts, the FuXi-S2S forecasts demonstrate superior deterministic and ensemble forecasts for total precipitation (TP), outgoing longwave radiation (OLR), and geopotential at 500 hPa (Z500). Although it shows slightly inferior performance in predicting 2-meter temperature (T2M), it has clear advantages over land area. Regarding the extreme forecasts, FuXi-S2S outperforms ECMWF S2S globally for TP. Furthermore, FuXi-S2S forecasts surpass the ECMWF S2S reforecasts in predicting the Madden Julian Oscillation (MJO), a key source of subseasonal predictability. They extend the skillful prediction of MJO from 30 days to 36 days.  ( 3 min )
    Parametric Classification for Generalized Category Discovery: A Baseline Study. (arXiv:2211.11727v4 [cs.CV] UPDATED)
    Generalized Category Discovery (GCD) aims to discover novel categories in unlabelled datasets using knowledge learned from labelled samples. Previous studies argued that parametric classifiers are prone to overfitting to seen categories, and endorsed using a non-parametric classifier formed with semi-supervised k-means. However, in this study, we investigate the failure of parametric classifiers, verify the effectiveness of previous design choices when high-quality supervision is available, and identify unreliable pseudo-labels as a key problem. We demonstrate that two prediction biases exist: the classifier tends to predict seen classes more often, and produces an imbalanced distribution across seen and novel categories. Based on these findings, we propose a simple yet effective parametric classification method that benefits from entropy regularisation, achieves state-of-the-art performance on multiple GCD benchmarks and shows strong robustness to unknown class numbers. We hope the investigation and proposed simple framework can serve as a strong baseline to facilitate future studies in this field. Our code is available at: https://github.com/CVMI-Lab/SimGCD.  ( 2 min )
    Improving Biomedical Entity Linking with Retrieval-enhanced Learning. (arXiv:2312.09806v1 [cs.CL])
    Biomedical entity linking (BioEL) has achieved remarkable progress with the help of pre-trained language models. However, existing BioEL methods usually struggle to handle rare and difficult entities due to long-tailed distribution. To address this limitation, we introduce a new scheme $k$NN-BioEL, which provides a BioEL model with the ability to reference similar instances from the entire training corpus as clues for prediction, thus improving the generalization capabilities. Moreover, we design a contrastive learning objective with dynamic hard negative sampling (DHNS) that improves the quality of the retrieved neighbors during inference. Extensive experimental results show that $k$NN-BioEL outperforms state-of-the-art baselines on several datasets.  ( 2 min )
    Machine learning for advancing low-temperature plasma modeling and simulation. (arXiv:2307.00131v2 [physics.plasm-ph] UPDATED)
    Machine learning has had an enormous impact in many scientific disciplines. Also in the field of low-temperature plasma modeling and simulation it has attracted significant interest within the past years. Whereas its application should be carefully assessed in general, many aspects of plasma modeling and simulation have benefited substantially from recent developments within the field of machine learning and data-driven modeling. In this survey, we approach two main objectives: (a) We review the state-of-the-art focusing on approaches to low-temperature plasma modeling and simulation. By dividing our survey into plasma physics, plasma chemistry, plasma-surface interactions, and plasma process control, we aim to extensively discuss relevant examples from literature. (b) We provide a perspective of potential advances to plasma science and technology. We specifically elaborate on advances possibly enabled by adaptation from other scientific disciplines. We argue that not only the known unknowns, but also unknown unknowns may be discovered due to the inherent propensity of data-driven methods to spotlight hidden patterns in data.  ( 2 min )
    Unsupervised Neighborhood Propagation Kernel Layers for Semi-supervised Node Classification. (arXiv:2301.13764v3 [cs.LG] UPDATED)
    We present a deep Graph Convolutional Kernel Machine (GCKM) for semi-supervised node classification in graphs. The method is built of two main types of blocks: (i) We introduce unsupervised kernel machine layers propagating the node features in a one-hop neighborhood, using implicit node feature mappings. (ii) We specify a semi-supervised classification kernel machine through the lens of the Fenchel-Young inequality. We derive an effective initialization scheme and efficient end-to-end training algorithm in the dual variables for the full architecture. The main idea underlying GCKM is that, because of the unsupervised core, the final model can achieve higher performance in semi-supervised node classification when few labels are available for training. Experimental results demonstrate the effectiveness of the proposed framework.  ( 2 min )
    Approaching Globally Optimal Energy Efficiency in Interference Networks via Machine Learning. (arXiv:2212.12329v2 [eess.SP] UPDATED)
    This work presents a machine learning approach to optimize the energy efficiency (EE) in a multi-cell wireless network. This optimization problem is non-convex and its global optimum is difficult to find. In the literature, either simple but suboptimal approaches or optimal methods with high complexity and poor scalability are proposed. In contrast, we propose a machine learning framework to approach the global optimum. While the neural network (NN) training takes moderate time, application with the trained model requires very low computational complexity. In particular, we introduce a novel objective function based on stochastic actions to solve the non-convex optimization problem. Besides, we design a dedicated NN architecture for the multi-cell network optimization problems that is permutation-equivariant. It classifies channels according to their roles in the EE computation. In this way, we encode our domain knowledge into the NN design and shed light into the black box of machine learning. Training and testing results show that the proposed method without supervision and with reasonable computational effort achieves an EE close to the global optimum found by the branch-and-bound algorithm. Hence, the proposed approach balances between computational complexity and performance.  ( 2 min )
    Symplectic Autoencoders for Model Reduction of Hamiltonian Systems. (arXiv:2312.10004v1 [cs.LG])
    Many applications, such as optimization, uncertainty quantification and inverse problems, require repeatedly performing simulations of large-dimensional physical systems for different choices of parameters. This can be prohibitively expensive. In order to save computational cost, one can construct surrogate models by expressing the system in a low-dimensional basis, obtained from training data. This is referred to as model reduction. Past investigations have shown that, when performing model reduction of Hamiltonian systems, it is crucial to preserve the symplectic structure associated with the system in order to ensure long-term numerical stability. Up to this point structure-preserving reductions have largely been limited to linear transformations. We propose a new neural network architecture in the spirit of autoencoders, which are established tools for dimension reduction and feature extraction in data science, to obtain more general mappings. In order to train the network, a non-standard gradient descent approach is applied that leverages the differential-geometric structure emerging from the network design. The new architecture is shown to significantly outperform existing designs in accuracy.  ( 2 min )
    PDE+: Enhancing Generalization via PDE with Adaptive Distributional Diffusion. (arXiv:2305.15835v2 [cs.LG] UPDATED)
    The generalization of neural networks is a central challenge in machine learning, especially concerning the performance under distributions that differ from training ones. Current methods, mainly based on the data-driven paradigm such as data augmentation, adversarial training, and noise injection, may encounter limited generalization due to model non-smoothness. In this paper, we propose to investigate generalization from a Partial Differential Equation (PDE) perspective, aiming to enhance it directly through the underlying function of neural networks, rather than focusing on adjusting input data. Specifically, we first establish the connection between neural network generalization and the smoothness of the solution to a specific PDE, namely "transport equation". Building upon this, we propose a general framework that introduces adaptive distributional diffusion into transport equation to enhance the smoothness of its solution, thereby improving generalization. In the context of neural networks, we put this theoretical framework into practice as $\textbf{PDE+}$ ($\textbf{PDE}$ with $\textbf{A}$daptive $\textbf{D}$istributional $\textbf{D}$iffusion) which diffuses each sample into a distribution covering semantically similar inputs. This enables better coverage of potentially unobserved distributions in training, thus improving generalization beyond merely data-driven methods. The effectiveness of PDE+ is validated through extensive experimental settings, demonstrating its superior performance compared to SOTA methods.  ( 3 min )
    Scalable and hyper-parameter-free non-parametric covariate shift adaptation with conditional sampling. (arXiv:2312.09969v1 [stat.ML])
    Many existing covariate shift adaptation methods estimate sample weights to be used in the risk estimation in order to mitigate the gap between the source and the target distribution. However, non-parametrically estimating the optimal weights typically involves computationally expensive hyper-parameter tuning that is crucial to the final performance. In this paper, we propose a new non-parametric approach to covariate shift adaptation which avoids estimating weights and has no hyper-parameter to be tuned. Our basic idea is to label unlabeled target data according to the $k$-nearest neighbors in the source dataset. Our analysis indicates that setting $k = 1$ is an optimal choice. Thanks to this property, there is no need to tune any hyper-parameters, unlike other non-parametric methods. Moreover, our method achieves a running time quasi-linear in the sample size with a theoretical guarantee, for the first time in the literature to the best of our knowledge. Our results include sharp rates of convergence for estimating the joint probability distribution of the target data. In particular, the variance of our estimators has the same rate of convergence as for standard parametric estimation despite their non-parametric nature. Our numerical experiments show that proposed method brings drastic reduction in the running time with accuracy comparable to that of the state-of-the-art methods.  ( 2 min )
    Dynamic Gradient Balancing for Enhanced Adversarial Attacks on Multi-Task Models. (arXiv:2305.12066v2 [cs.LG] UPDATED)
    Multi-task learning (MTL) creates a single machine learning model called multi-task model to simultaneously perform multiple tasks. Although the security of single task classifiers has been extensively studied, there are several critical security research questions for multi-task models including 1) How secure are multi-task models to single task adversarial machine learning attacks, 2) Can adversarial attacks be designed to attack multiple tasks simultaneously, and 3) Does task sharing and adversarial training increase multi-task model robustness to adversarial attacks? In this paper, we answer these questions through careful analysis and rigorous experimentation. First, we develop na\"ive adaptation of single-task white-box attacks and analyze their inherent drawbacks. We then propose a novel attack framework, Dynamic Gradient Balancing Attack (DGBA). Our framework poses the problem of attacking a multi-task model as an optimization problem based on averaged relative loss change, which can be solved by approximating the problem as an integer linear programming problem. Extensive evaluation on two popular MTL benchmarks, NYUv2 and Tiny-Taxonomy, demonstrates the effectiveness of DGBA compared to na\"ive multi-task attack baselines on both clean and adversarially trained multi-task models. The results also reveal a fundamental trade-off between improving task accuracy by sharing parameters across tasks and undermining model robustness due to increased attack transferability from parameter sharing. DGBA is open-sourced and available at https://github.com/zhanglijun95/MTLAttack-DGBA.  ( 3 min )
    Disentangling Linear Mode-Connectivity. (arXiv:2312.09832v1 [cs.LG])
    Linear mode-connectivity (LMC) (or lack thereof) is one of the intriguing characteristics of neural network loss landscapes. While empirically well established, it unfortunately still lacks a proper theoretical understanding. Even worse, although empirical data points are abound, a systematic study of when networks exhibit LMC is largely missing in the literature. In this work we aim to close this gap. We explore how LMC is affected by three factors: (1) architecture (sparsity, weight-sharing), (2) training strategy (optimization setup) as well as (3) the underlying dataset. We place particular emphasis on minimal but non-trivial settings, removing as much unnecessary complexity as possible. We believe that our insights can guide future theoretical works on uncovering the inner workings of LMC.  ( 2 min )
    Toward Computationally Efficient Inverse Reinforcement Learning via Reward Shaping. (arXiv:2312.09983v1 [cs.LG])
    Inverse reinforcement learning (IRL) is computationally challenging, with common approaches requiring the solution of multiple reinforcement learning (RL) sub-problems. This work motivates the use of potential-based reward shaping to reduce the computational burden of each RL sub-problem. This work serves as a proof-of-concept and we hope will inspire future developments towards computationally efficient IRL.  ( 2 min )
    Challenges with unsupervised LLM knowledge discovery. (arXiv:2312.10029v1 [cs.LG])
    We show that existing unsupervised methods on large language model (LLM) activations do not discover knowledge -- instead they seem to discover whatever feature of the activations is most prominent. The idea behind unsupervised knowledge elicitation is that knowledge satisfies a consistency structure, which can be used to discover knowledge. We first prove theoretically that arbitrary features (not just knowledge) satisfy the consistency structure of a particular leading unsupervised knowledge-elicitation method, contrast-consistent search (Burns et al. - arXiv:2212.03827). We then present a series of experiments showing settings in which unsupervised methods result in classifiers that do not predict knowledge, but instead predict a different prominent feature. We conclude that existing unsupervised methods for discovering latent knowledge are insufficient, and we contribute sanity checks to apply to evaluating future knowledge elicitation methods. Conceptually, we hypothesise that the identification issues explored here, e.g. distinguishing a model's knowledge from that of a simulated character's, will persist for future unsupervised methods.  ( 2 min )
    Optimal Estimation of Generic Dynamics by Path-Dependent Neural Jump ODEs. (arXiv:2206.14284v5 [stat.ML] UPDATED)
    This paper studies the problem of forecasting general stochastic processes using a path-dependent extension of the Neural Jump ODE (NJ-ODE) framework \citep{herrera2021neural}. While NJ-ODE was the first framework to establish convergence guarantees for the prediction of irregularly observed time series, these results were limited to data stemming from It\^o-diffusions with complete observations, in particular Markov processes, where all coordinates are observed simultaneously. In this work, we generalise these results to generic, possibly non-Markovian or discontinuous, stochastic processes with incomplete observations, by utilising the reconstruction properties of the signature transform. These theoretical results are supported by empirical studies, where it is shown that the path-dependent NJ-ODE outperforms the original NJ-ODE framework in the case of non-Markovian data. Moreover, we show that PD-NJ-ODE can be applied successfully to classical stochastic filtering problems and to limit order book (LOB) data.  ( 2 min )
    Quantum Generative Adversarial Networks: Bridging Classical and Quantum Realms. (arXiv:2312.09939v1 [quant-ph])
    In this pioneering research paper, we present a groundbreaking exploration into the synergistic fusion of classical and quantum computing paradigms within the realm of Generative Adversarial Networks (GANs). Our objective is to seamlessly integrate quantum computational elements into the conventional GAN architecture, thereby unlocking novel pathways for enhanced training processes. Drawing inspiration from the inherent capabilities of quantum bits (qubits), we delve into the incorporation of quantum data representation methodologies within the GAN framework. By capitalizing on the unique quantum features, we aim to accelerate the training process of GANs, offering a fresh perspective on the optimization of generative models. Our investigation deals with theoretical considerations and evaluates the potential quantum advantages that may manifest in terms of training efficiency and generative quality. We confront the challenges inherent in the quantum-classical amalgamation, addressing issues related to quantum hardware constraints, error correction mechanisms, and scalability considerations. This research is positioned at the forefront of quantum-enhanced machine learning, presenting a critical stride towards harnessing the computational power of quantum systems to expedite the training of Generative Adversarial Networks. Through our comprehensive examination of the interface between classical and quantum realms, we aim to uncover transformative insights that will propel the field forward, fostering innovation and advancing the frontier of quantum machine learning.  ( 2 min )
    Quilt: Robust Data Segment Selection against Concept Drifts. (arXiv:2312.09691v1 [cs.LG])
    Continuous machine learning pipelines are common in industrial settings where models are periodically trained on data streams. Unfortunately, concept drifts may occur in data streams where the joint distribution of the data X and label y, P(X, y), changes over time and possibly degrade model accuracy. Existing concept drift adaptation approaches mostly focus on updating the model to the new data possibly using ensemble techniques of previous models and tend to discard the drifted historical data. However, we contend that explicitly utilizing the drifted data together leads to much better model accuracy and propose Quilt, a data-centric framework for identifying and selecting data segments that maximize model accuracy. To address the potential downside of efficiency, Quilt extends existing data subset selection techniques, which can be used to reduce the training data without compromising model accuracy. These techniques cannot be used as is because they only assume virtual drifts where the posterior probabilities P(y|X) are assumed not to change. In contrast, a key challenge in our setup is to also discard undesirable data segments with concept drifts. Quilt thus discards drifted data segments and selects data segment subsets holistically for accurate and efficient model training. The two operations use gradient-based scores, which have little computation overhead. In our experiments, we show that Quilt outperforms state-of-the-art drift adaptation and data selection baselines on synthetic and real datasets.  ( 3 min )
    Generic Unsupervised Optimization for a Latent Variable Model With Exponential Family Observables. (arXiv:2003.02214v3 [cs.LG] UPDATED)
    Latent variable models (LVMs) represent observed variables by parameterized functions of latent variables. Prominent examples of LVMs for unsupervised learning are probabilistic PCA or probabilistic SC which both assume a weighted linear summation of the latents to determine the mean of a Gaussian distribution for the observables. In many cases, however, observables do not follow a Gaussian distribution. For unsupervised learning, LVMs which assume specific non-Gaussian observables have therefore been considered. Already for specific choices of distributions, parameter optimization is challenging and only a few previous contributions considered LVMs with more generally defined observable distributions. Here, we consider LVMs that are defined for a range of different distributions, i.e., observables can follow any (regular) distribution of the exponential family. The novel class of LVMs presented is defined for binary latents, and it uses maximization in place of summation to link the latents to observables. To derive an optimization procedure, we follow an EM approach for maximum likelihood parameter estimation. We show that a set of very concise parameter update equations can be derived which feature the same functional form for all exponential family distributions. The derived generic optimization can consequently be applied to different types of metric data as well as to different types of discrete data. Also, the derived optimization equations can be combined with a recently suggested variational acceleration which is likewise generically applicable to the LVMs considered here. So, the combination maintains generic and direct applicability of the derived optimization procedure, but, crucially, enables efficient scalability. We numerically verify our analytical results and discuss some potential applications such as learning of variance structure, noise type estimation and denoising.  ( 3 min )
    Multiple Instance Learning for Uplift Modeling. (arXiv:2312.09639v1 [cs.LG])
    Uplift modeling is widely used in performance marketing to estimate effects of promotion campaigns (e.g., increase of customer retention rate). Since it is impossible to observe outcomes of a recipient in treatment (e.g., receiving a certain promotion) and control (e.g., without promotion) groups simultaneously (i.e., counter-factual), uplift models are mainly trained on instances of treatment and control groups separately to form two models respectively, and uplifts are predicted by the difference of predictions from these two models (i.e., two-model method). When responses are noisy and the treatment effect is fractional, induced individual uplift predictions will be inaccurate, resulting in targeting undesirable customers. Though it is impossible to obtain the ideal ground-truth individual uplifts, known as Individual Treatment Effects (ITEs), alternatively, an average uplift of a group of users, called Average Treatment Effect (ATE), can be observed from experimental deliveries. Upon this, similar to Multiple Instance Learning (MIL) in which each training sample is a bag of instances, our framework sums up individual user uplift predictions for each bag of users as its bag-wise ATE prediction, and regularizes it to its ATE label, thus learning more accurate individual uplifts. Additionally, to amplify the fractional treatment effect, bags are composed of instances with adjacent individual uplift predictions, instead of random instances. Experiments conducted on two datasets show the effectiveness and universality of the proposed framework.  ( 3 min )
    PulseImpute: A Novel Benchmark Task for Pulsative Physiological Signal Imputation. (arXiv:2212.07514v2 [cs.LG] UPDATED)
    The promise of Mobile Health (mHealth) is the ability to use wearable sensors to monitor participant physiology at high frequencies during daily life to enable temporally-precise health interventions. However, a major challenge is frequent missing data. Despite a rich imputation literature, existing techniques are ineffective for the pulsative signals which comprise many mHealth applications, and a lack of available datasets has stymied progress. We address this gap with PulseImpute, the first large-scale pulsative signal imputation challenge which includes realistic mHealth missingness models, an extensive set of baselines, and clinically-relevant downstream tasks. Our baseline models include a novel transformer-based architecture designed to exploit the structure of pulsative signals. We hope that PulseImpute will enable the ML community to tackle this significant and challenging task.  ( 2 min )
    Faithful Persona-based Conversational Dataset Generation with Large Language Models. (arXiv:2312.10007v1 [cs.CL])
    High-quality conversational datasets are essential for developing AI models that can communicate with users. One way to foster deeper interactions between a chatbot and its user is through personas, aspects of the user's character that provide insights into their personality, motivations, and behaviors. Training Natural Language Processing (NLP) models on a diverse and comprehensive persona-based dataset can lead to conversational models that create a deeper connection with the user, and maintain their engagement. In this paper, we leverage the power of Large Language Models (LLMs) to create a large, high-quality conversational dataset from a seed dataset. We propose a Generator-Critic architecture framework to expand the initial dataset, while improving the quality of its conversations. The Generator is an LLM prompted to output conversations. The Critic consists of a mixture of expert LLMs that control the quality of the generated conversations. These experts select the best generated conversations, which we then use to improve the Generator. We release Synthetic-Persona-Chat, consisting of 20k conversations seeded from Persona-Chat. We evaluate the quality of Synthetic-Persona-Chat and our generation framework on different dimensions through extensive experiments, and observe that the losing rate of Synthetic-Persona-Chat against Persona-Chat during Turing test decreases from 17.2% to 8.8% over three iterations.  ( 2 min )
    Machine-Learned Exclusion Limits without Binning. (arXiv:2211.04806v2 [hep-ph] UPDATED)
    Machine-Learned Likelihoods (MLL) combines machine-learning classification techniques with likelihood-based inference tests to estimate the experimental sensitivity of high-dimensional data sets. We extend the MLL method by including Kernel Density Estimators (KDE) to avoid binning the classifier output to extract the resulting one-dimensional signal and background probability density functions. We first test our method on toy models generated with multivariate Gaussian distributions, where the true probability distribution functions are known. Later, we apply the method to two cases of interest at the LHC: a search for exotic Higgs bosons, and a $Z'$ boson decaying into lepton pairs. In contrast to physical-based quantities, the typical fluctuations of the ML outputs give non-smooth probability distributions for pure-signal and pure-background samples. The non-smoothness is propagated into the density estimation due to the good performance and flexibility of the KDE method. We study its impact on the final significance computation, and we compare the results using the average of several independent ML output realizations, which allows us to obtain smoother distributions. We conclude that the significance estimation turns out to be not sensible to this issue.  ( 3 min )
    A Kronecker product accelerated efficient sparse Gaussian Process (E-SGP) for flow emulation. (arXiv:2312.10023v1 [cs.LG])
    In this paper, we introduce an efficient sparse Gaussian process (E-SGP) for the surrogate modelling of fluid mechanics. This novel Bayesian machine learning algorithm allows efficient model training using databases of different structures. It is a further development of the approximated sparse GP algorithm, combining the concept of efficient GP (E-GP) and variational energy free sparse Gaussian process (VEF-SGP). The developed E-SGP approach exploits the arbitrariness of inducing points and the monotonically increasing nature of the objective function with respect to the number of inducing points in VEF-SGP. By specifying the inducing points on the orthogonal grid/input subspace and using the Kronecker product, E-SGP significantly improves computational efficiency without imposing any constraints on the covariance matrix or increasing the number of parameters that need to be optimised during training. The E-SGP algorithm developed in this paper outperforms E-GP not only in scalability but also in model quality in terms of mean standardized logarithmic loss (MSLL). The computational complexity of E-GP suffers from the cubic growth regarding the growing structured training database. However, E-SGP maintains computational efficiency whilst the resolution of the model, (i.e., the number of inducing points) remains fixed. The examples show that E-SGP produces more accurate predictions in comparison with E-GP when the model resolutions are similar in both. E-GP benefits from more training data but comes with higher computational demands, while E-SGP achieves a comparable level of accuracy but is more computationally efficient, making E-SGP a potentially preferable choice for fluid mechanic problems. Furthermore, E-SGP can produce more reasonable estimates of model uncertainty, whilst E-GP is more likely to produce over-confident predictions.  ( 3 min )
    Celestial Machine Learning: From Data to Mars and Beyond with AI Feynman. (arXiv:2312.09766v1 [cs.LG])
    Can a machine or algorithm discover or learn Kepler's first law from astronomical sightings alone? We emulate Johannes Kepler's discovery of the equation of the orbit of Mars with the Rudolphine tables using AI Feynman, a physics-inspired tool for symbolic regression.  ( 2 min )
    Automatic Rao-Blackwellization for Sequential Monte Carlo with Belief Propagation. (arXiv:2312.09860v1 [cs.LG])
    Exact Bayesian inference on state-space models~(SSM) is in general untractable, and unfortunately, basic Sequential Monte Carlo~(SMC) methods do not yield correct approximations for complex models. In this paper, we propose a mixed inference algorithm that computes closed-form solutions using belief propagation as much as possible, and falls back to sampling-based SMC methods when exact computations fail. This algorithm thus implements automatic Rao-Blackwellization and is even exact for Gaussian tree models.  ( 2 min )
    Movement Primitive Diffusion: Learning Gentle Robotic Manipulation of Deformable Objects. (arXiv:2312.10008v1 [cs.RO])
    Policy learning in robot-assisted surgery (RAS) lacks data efficient and versatile methods that exhibit the desired motion quality for delicate surgical interventions. To this end, we introduce Movement Primitive Diffusion (MPD), a novel method for imitation learning (IL) in RAS that focuses on gentle manipulation of deformable objects. The approach combines the versatility of diffusion-based imitation learning (DIL) with the high-quality motion generation capabilities of Probabilistic Dynamic Movement Primitives (ProDMPs). This combination enables MPD to achieve gentle manipulation of deformable objects, while maintaining data efficiency critical for RAS applications where demonstration data is scarce. We evaluate MPD across various simulated tasks and a real world robotic setup on both state and image observations. MPD outperforms state-of-the-art DIL methods in success rate, motion quality, and data efficiency.  ( 2 min )
    ACPO: AI-Enabled Compiler-Driven Program Optimization. (arXiv:2312.09982v1 [cs.PL])
    The key to performance optimization of a program is to decide correctly when a certain transformation should be applied by a compiler. Traditionally, such profitability decisions are made by hand-coded algorithms tuned for a very small number of benchmarks, usually requiring a great deal of effort to be retuned when the benchmark suite changes. This is an ideal opportunity to apply machine-learning models to speed up the tuning process; while this realization has been around since the late 90s, only recent advancements in ML enabled a practical application of ML to compilers as an end-to-end framework. Even so, seamless integration of ML into the compiler would require constant rebuilding of the compiler when models are updated. This paper presents ACPO: \textbf{\underline{A}}I-Enabled \textbf{\underline{C}}ompiler-driven \textbf{\underline{P}}rogram \textbf{\underline{O}}ptimization; a novel framework to provide LLVM with simple and comprehensive tools to benefit from employing ML models for different optimization passes. We first showcase the high-level view, class hierarchy, and functionalities of ACPO and subsequently, demonstrate \taco{a couple of use cases of ACPO by ML-enabling the Loop Unroll and Function Inlining passes and describe how ACPO can be leveraged to optimize other passes. Experimental results reveal that ACPO model for Loop Unroll is able to gain on average 4\% and 3\%, 5.4\%, 0.2\% compared to LLVM's O3 optimization when deployed on Polybench, Coral-2, CoreMark, and Graph-500, respectively. Furthermore, by adding the Inliner model as well, ACPO is able to provide up to 4.5\% and 2.4\% on Polybench and Cbench compared with LLVM's O3 optimization, respectively.  ( 3 min )
    Small jet engine reservoir computing digital twin. (arXiv:2312.09978v1 [cs.LG])
    Machine learning was applied to create a digital twin of a numerical simulation of a single-scroll jet engine. A similar model based on the insights gained from this numerical study was used to create a digital twin of a JetCat P100-RX jet engine using only experimental data. Engine data was collected from a custom sensor system measuring parameters such as thrust, exhaust gas temperature, shaft speed, weather conditions, etc. Data was gathered while the engine was placed under different test conditions by controlling shaft speed. The machine learning model was generated (trained) using a next-generation reservoir computer, a best-in-class machine learning algorithm for dynamical systems. Once the model was trained, it was used to predict behavior it had never seen with an accuracy of better than 1.8% when compared to the testing data.  ( 2 min )
    Learning in Online Principle-Agent Interactions: The Power of Menus. (arXiv:2312.09869v1 [cs.GT])
    We study a ubiquitous learning challenge in online principal-agent problems during which the principal learns the agent's private information from the agent's revealed preferences in historical interactions. This paradigm includes important special cases such as pricing and contract design, which have been widely studied in recent literature. However, existing work considers the case where the principal can only choose a single strategy at every round to interact with the agent and then observe the agent's revealed preference through their actions. In this paper, we extend this line of study to allow the principal to offer a menu of strategies to the agent and learn additionally from observing the agent's selection from the menu. We provide a thorough investigation of several online principal-agent problem settings and characterize their sample complexities, accompanied by the corresponding algorithms we have developed. We instantiate this paradigm to several important design problems $-$ including Stackelberg (security) games, contract design, and information design. Finally, we also explore the connection between our findings and existing results about online learning in Stackelberg games, and we offer a solution that can overcome a key hard instance of Peng et al. (2019).  ( 2 min )
    Hard Negative Sampling via Regularized Optimal Transport for Contrastive Representation Learning. (arXiv:2111.03169v3 [cs.LG] UPDATED)
    We study the problem of designing hard negative sampling distributions for unsupervised contrastive representation learning. We propose and analyze a novel min-max framework that seeks a representation which minimizes the maximum (worst-case) generalized contrastive learning loss over all couplings (joint distributions between positive and negative samples subject to marginal constraints) and prove that the resulting min-max optimum representation will be degenerate. This provides the first theoretical justification for incorporating additional regularization constraints on the couplings. We re-interpret the min-max problem through the lens of Optimal Transport (OT) theory and utilize regularized transport couplings to control the degree of hardness of negative examples. Through experiments we demonstrate that the negative samples generated from our designed negative distribution are more similar to the anchor than those generated from the baseline negative distribution. We also demonstrate that entropic regularization yields negative sampling distributions with parametric form similar to that in a recent state-of-the-art negative sampling design and has similar performance in multiple datasets. Utilizing the uncovered connection with OT, we propose a new ground cost for designing the negative distribution and show improved performance of the learned representation on downstream tasks compared to the representation learned when using squared Euclidean cost.  ( 3 min )
    Dynamic Heterogeneous Federated Learning with Multi-Level Prototypes. (arXiv:2312.09881v1 [cs.LG])
    Federated learning shows promise as a privacy-preserving collaborative learning technique. Existing heterogeneous federated learning mainly focuses on skewing the label distribution across clients. However, most approaches suffer from catastrophic forgetting and concept drift, mainly when the global distribution of all classes is extremely unbalanced and the data distribution of the client dynamically evolves over time. In this paper, we study the new task, i.e., Dynamic Heterogeneous Federated Learning (DHFL), which addresses the practical scenario where heterogeneous data distributions exist among different clients and dynamic tasks within the client. Accordingly, we propose a novel federated learning framework named Federated Multi-Level Prototypes (FedMLP) and design federated multi-level regularizations. To mitigate concept drift, we construct prototypes and semantic prototypes to provide fruitful generalization knowledge and ensure the continuity of prototype spaces. To maintain the model stability and consistency of convergence, three regularizations are introduced as training losses, i.e., prototype-based regularization, semantic prototype-based regularization, and federated inter-task regularization. Extensive experiments show that the proposed method achieves state-of-the-art performance in various settings.  ( 2 min )
    Reliable Probabilistic Classification with Neural Networks. (arXiv:2312.09912v1 [cs.LG])
    Venn Prediction (VP) is a new machine learning framework for producing well-calibrated probabilistic predictions. In particular it provides well-calibrated lower and upper bounds for the conditional probability of an example belonging to each possible class of the problem at hand. This paper proposes five VP methods based on Neural Networks (NNs), which is one of the most widely used machine learning techniques. The proposed methods are evaluated experimentally on four benchmark datasets and the obtained results demonstrate the empirical well-calibratedness of their outputs and their superiority over the outputs of the traditional NN classifier.  ( 2 min )
    Deep Unsupervised Domain Adaptation for Time Series Classification: a Benchmark. (arXiv:2312.09857v1 [cs.LG])
    Unsupervised Domain Adaptation (UDA) aims to harness labeled source data to train models for unlabeled target data. Despite extensive research in domains like computer vision and natural language processing, UDA remains underexplored for time series data, which has widespread real-world applications ranging from medicine and manufacturing to earth observation and human activity recognition. Our paper addresses this gap by introducing a comprehensive benchmark for evaluating UDA techniques for time series classification, with a focus on deep learning methods. We provide seven new benchmark datasets covering various domain shifts and temporal dynamics, facilitating fair and standardized UDA method assessments with state of the art neural network backbones (e.g. Inception) for time series data. This benchmark offers insights into the strengths and limitations of the evaluated approaches while preserving the unsupervised nature of domain adaptation, making it directly applicable to practical problems. Our paper serves as a vital resource for researchers and practitioners, advancing domain adaptation solutions for time series data and fostering innovation in this critical field. The implementation code of this benchmark is available at https://github.com/EricssonResearch/UDA-4-TSC.  ( 2 min )
    Automating reward function configuration for drug design. (arXiv:2312.09865v1 [cs.LG])
    Designing reward functions that guide generative molecular design (GMD) algorithms to desirable areas of chemical space is of critical importance in AI-driven drug discovery. Traditionally, this has been a manual and error-prone task; the selection of appropriate computational methods to approximate biological assays is challenging and the aggregation of computed values into a single score even more so, leading to potential reliance on trial-and-error approaches. We propose a novel approach for automated reward configuration that relies solely on experimental data, mitigating the challenges of manual reward adjustment on drug discovery projects. Our method achieves this by constructing a ranking over experimental data based on Pareto dominance over the multi-objective space, then training a neural network to approximate the reward function such that rankings determined by the predicted reward correlate with those determined by the Pareto dominance relation. We validate our method using two case studies. In the first study we simulate Design-Make-Test-Analyse (DMTA) cycles by alternating reward function updates and generative runs guided by that function. We show that the learned function adapts over time to yield compounds that score highly with respect to evaluation functions taken from the literature. In the second study we apply our algorithm to historical data from four real drug discovery projects. We show that our algorithm yields reward functions that outperform the predictive accuracy of human-defined functions, achieving an improvement of up to 0.4 in Spearman's correlation against a ground truth evaluation function that encodes the target drug profile for that project. Our method provides an efficient data-driven way to configure reward functions for GMD, and serves as a strong baseline for future research into transformative approaches for the automation of drug discovery.  ( 3 min )
    Latent Diffusion Models with Image-Derived Annotations for Enhanced AI-Assisted Cancer Diagnosis in Histopathology. (arXiv:2312.09792v1 [cs.CV])
    Artificial Intelligence (AI) based image analysis has an immense potential to support diagnostic histopathology, including cancer diagnostics. However, developing supervised AI methods requires large-scale annotated datasets. A potentially powerful solution is to augment training data with synthetic data. Latent diffusion models, which can generate high-quality, diverse synthetic images, are promising. However, the most common implementations rely on detailed textual descriptions, which are not generally available in this domain. This work proposes a method that constructs structured textual prompts from automatically extracted image features. We experiment with the PCam dataset, composed of tissue patches only loosely annotated as healthy or cancerous. We show that including image-derived features in the prompt, as opposed to only healthy and cancerous labels, improves the Fr\'echet Inception Distance (FID) from 178.8 to 90.2. We also show that pathologists find it challenging to detect synthetic images, with a median sensitivity/specificity of 0.55/0.55. Finally, we show that synthetic data effectively trains AI models.  ( 3 min )
    Probabilistic learning of the Purkinje network from the electrocardiogram. (arXiv:2312.09887v1 [stat.ML])
    The identification of the Purkinje conduction system in the heart is a challenging task, yet essential for a correct definition of cardiac digital twins for precision cardiology. Here, we propose a probabilistic approach for identifying the Purkinje network from non-invasive clinical data such as the standard electrocardiogram (ECG). We use cardiac imaging to build an anatomically accurate model of the ventricles; we algorithmically generate a rule-based Purkinje network tailored to the anatomy; we simulate physiological electrocardiograms with a fast model; we identify the geometrical and electrical parameters of the Purkinje-ECG model with Bayesian optimization and approximate Bayesian computation. The proposed approach is inherently probabilistic and generates a population of plausible Purkinje networks, all fitting the ECG within a given tolerance. In this way, we can estimate the uncertainty of the parameters, thus providing reliable predictions. We test our methodology in physiological and pathological scenarios, showing that we are able to accurately recover the ECG with our model. We propagate the uncertainty in the Purkinje network parameters in a simulation of conduction system pacing therapy. Our methodology is a step forward in creation of digital twins from non-invasive data in precision medicine. An open source implementation can be found at this http URL  ( 2 min )
    Learning Distributions on Manifolds with Free-form Flows. (arXiv:2312.09852v1 [cs.LG])
    Many real world data, particularly in the natural sciences and computer vision, lie on known Riemannian manifolds such as spheres, tori or the group of rotation matrices. The predominant approaches to learning a distribution on such a manifold require solving a differential equation in order to sample from the model and evaluate densities. The resulting sampling times are slowed down by a high number of function evaluations. In this work, we propose an alternative approach which only requires a single function evaluation followed by a projection to the manifold. Training is achieved by an adaptation of the recently proposed free-form flow framework to Riemannian manifolds. The central idea is to estimate the gradient of the negative log-likelihood via a trace evaluated in the tangent space. We evaluate our method on various manifolds, and find significantly faster inference at competitive performance compared to previous work. We make our code public at https://github.com/vislearn/FFF.  ( 2 min )
    Small Dataset, Big Gains: Enhancing Reinforcement Learning by Offline Pre-Training with Model Based Augmentation. (arXiv:2312.09844v1 [cs.LG])
    Offline reinforcement learning leverages pre-collected datasets of transitions to train policies. It can serve as effective initialization for online algorithms, enhancing sample efficiency and speeding up convergence. However, when such datasets are limited in size and quality, offline pre-training can produce sub-optimal policies and lead to degraded online reinforcement learning performance. In this paper we propose a model-based data augmentation strategy to maximize the benefits of offline reinforcement learning pre-training and reduce the scale of data needed to be effective. Our approach leverages a world model of the environment trained on the offline dataset to augment states during offline pre-training. We evaluate our approach on a variety of MuJoCo robotic tasks and our results show it can jump-start online fine-tuning and substantially reduce - in some cases by an order of magnitude - the required number of environment interactions.  ( 2 min )
    A Synthesis of Green Architectural Tactics for ML-Enabled Systems. (arXiv:2312.09610v1 [cs.SE])
    The rapid adoption of artificial intelligence (AI) and machine learning (ML) has generated growing interest in understanding their environmental impact and the challenges associated with designing environmentally friendly ML-enabled systems. While Green AI research, i.e., research that tries to minimize the energy footprint of AI, is receiving increasing attention, very few concrete guidelines are available on how ML-enabled systems can be designed to be more environmentally sustainable. In this paper, we provide a catalog of 30 green architectural tactics for ML-enabled systems to fill this gap. An architectural tactic is a high-level design technique to improve software quality, in our case environmental sustainability. We derived the tactics from the analysis of 51 peer-reviewed publications that primarily explore Green AI, and validated them using a focus group approach with three experts. The 30 tactics we identified are aimed to serve as an initial reference guide for further exploration into Green AI from a software engineering perspective, and assist in designing sustainable ML-enabled systems. To enhance transparency and facilitate their widespread use and extension, we make the tactics available online in easily consumable formats. Wide-spread adoption of these tactics has the potential to substantially reduce the societal impact of ML-enabled systems regarding their energy and carbon footprint.  ( 2 min )
    On the locality of local neural operator in learning fluid dynamics. (arXiv:2312.09820v1 [physics.flu-dyn])
    This paper launches a thorough discussion on the locality of local neural operator (LNO), which is the core that enables LNO great flexibility on varied computational domains in solving transient partial differential equations (PDEs). We investigate the locality of LNO by looking into its receptive field and receptive range, carrying a main concern about how the locality acts in LNO training and applications. In a large group of LNO training experiments for learning fluid dynamics, it is found that an initial receptive range compatible with the learning task is crucial for LNO to perform well. On the one hand, an over-small receptive range is fatal and usually leads LNO to numerical oscillation; on the other hand, an over-large receptive range hinders LNO from achieving the best accuracy. We deem rules found in this paper general when applying LNO to learn and solve transient PDEs in diverse fields. Practical examples of applying the pre-trained LNOs in flow prediction are presented to confirm the findings further. Overall, with the architecture properly designed with a compatible receptive range, the pre-trained LNO shows commendable accuracy and efficiency in solving practical cases.  ( 2 min )
    SQA-SAM: Segmentation Quality Assessment for Medical Images Utilizing the Segment Anything Model. (arXiv:2312.09899v1 [eess.IV])
    Segmentation quality assessment (SQA) plays a critical role in the deployment of a medical image based AI system. Users need to be informed/alerted whenever an AI system generates unreliable/incorrect predictions. With the introduction of the Segment Anything Model (SAM), a general foundation segmentation model, new research opportunities emerged in how one can utilize SAM for medical image segmentation. In this paper, we propose a novel SQA method, called SQA-SAM, which exploits SAM to enhance the accuracy of quality assessment for medical image segmentation. When a medical image segmentation model (MedSeg) produces predictions for a test image, we generate visual prompts based on the predictions, and SAM is utilized to generate segmentation maps corresponding to the visual prompts. How well MedSeg's segmentation aligns with SAM's segmentation indicates how well MedSeg's segmentation aligns with the general perception of objectness and image region partition. We develop a score measure for such alignment. In experiments, we find that the generated scores exhibit moderate to strong positive correlation (in Pearson correlation and Spearman correlation) with Dice coefficient scores reflecting the true segmentation quality.  ( 2 min )
    ChemTime: Rapid and Early Classification for Multivariate Time Series Classification of Chemical Sensors. (arXiv:2312.09871v1 [cs.LG])
    Multivariate time series data are ubiquitous in the application of machine learning to problems in the physical sciences. Chemiresistive sensor arrays are highly promising in chemical detection tasks relevant to industrial, safety, and military applications. Sensor arrays are an inherently multivariate time series data collection tool which demand rapid and accurate classification of arbitrary chemical analytes. Previous research has benchmarked data-agnostic multivariate time series classifiers across diverse multivariate time series supervised tasks in order to find general-purpose classification algorithms. To our knowledge, there has yet to be an effort to survey machine learning and time series classification approaches to chemiresistive hardware sensor arrays for the detection of chemical analytes. In addition to benchmarking existing approaches to multivariate time series classifiers, we incorporate findings from a model survey to propose the novel \textit{ChemTime} approach to sensor array classification for chemical sensing. We design experiments addressing the unique challenges of hardware sensor arrays classification including the rapid classification ability of classifiers and minimization of inference time while maintaining performance for deployed lightweight hardware sensing devices. We find that \textit{ChemTime} is uniquely positioned for the chemical sensing task by combining rapid and early classification of time series with beneficial inference and high accuracy.  ( 3 min )
    Style Generation in Robot Calligraphy with Deep Generative Adversarial Networks. (arXiv:2312.09673v1 [cs.CV])
    Robot calligraphy is an emerging exploration of artificial intelligence in the fields of art and education. Traditional calligraphy generation researches mainly focus on methods such as tool-based image processing, generative models, and style transfer. Unlike the English alphabet, the number of Chinese characters is tens of thousands, which leads to difficulties in the generation of a style consistent Chinese calligraphic font with over 6000 characters. Due to the lack of high-quality data sets, formal definitions of calligraphy knowledge, and scientific art evaluation methods, The results generated are frequently of low quality and falls short of professional-level requirements. To address the above problem, this paper proposes an automatic calligraphy generation model based on deep generative adversarial networks (deepGAN) that can generate style calligraphy fonts with professional standards. The key highlights of the proposed method include: (1) The datasets use a high-precision calligraphy synthesis method to ensure its high quality and sufficient quantity; (2) Professional calligraphers are invited to conduct a series of Turing tests to evaluate the gap between model generation results and human artistic level; (3) Experimental results indicate that the proposed model is the state-of-the-art among current calligraphy generation methods. The Turing tests and similarity evaluations validate the effectiveness of the proposed method.  ( 2 min )
    TF-CLIP: Learning Text-free CLIP for Video-based Person Re-Identification. (arXiv:2312.09627v1 [cs.CV])
    Large-scale language-image pre-trained models (e.g., CLIP) have shown superior performances on many cross-modal retrieval tasks. However, the problem of transferring the knowledge learned from such models to video-based person re-identification (ReID) has barely been explored. In addition, there is a lack of decent text descriptions in current ReID benchmarks. To address these issues, in this work, we propose a novel one-stage text-free CLIP-based learning framework named TF-CLIP for video-based person ReID. More specifically, we extract the identity-specific sequence feature as the CLIP-Memory to replace the text feature. Meanwhile, we design a Sequence-Specific Prompt (SSP) module to update the CLIP-Memory online. To capture temporal information, we further propose a Temporal Memory Diffusion (TMD) module, which consists of two key components: Temporal Memory Construction (TMC) and Memory Diffusion (MD). Technically, TMC allows the frame-level memories in a sequence to communicate with each other, and to extract temporal information based on the relations within the sequence. MD further diffuses the temporal memories to each token in the original features to obtain more robust sequence features. Extensive experiments demonstrate that our proposed method shows much better results than other state-of-the-art methods on MARS, LS-VID and iLIDS-VID. The code is available at https://github.com/AsuradaYuci/TF-CLIP.  ( 3 min )
    Rethinking Causal Relationships Learning in Graph Neural Networks. (arXiv:2312.09613v1 [cs.LG])
    Graph Neural Networks (GNNs) demonstrate their significance by effectively modeling complex interrelationships within graph-structured data. To enhance the credibility and robustness of GNNs, it becomes exceptionally crucial to bolster their ability to capture causal relationships. However, despite recent advancements that have indeed strengthened GNNs from a causal learning perspective, conducting an in-depth analysis specifically targeting the causal modeling prowess of GNNs remains an unresolved issue. In order to comprehensively analyze various GNN models from a causal learning perspective, we constructed an artificially synthesized dataset with known and controllable causal relationships between data and labels. The rationality of the generated data is further ensured through theoretical foundations. Drawing insights from analyses conducted using our dataset, we introduce a lightweight and highly adaptable GNN module designed to strengthen GNNs' causal learning capabilities across a diverse range of tasks. Through a series of experiments conducted on both synthetic datasets and other real-world datasets, we empirically validate the effectiveness of the proposed module.  ( 2 min )
    PELP: Pioneer Event Log Prediction Using Sequence-to-Sequence Neural Networks. (arXiv:2312.09741v1 [cs.LG])
    Process mining, a data-driven approach for analyzing, visualizing, and improving business processes using event logs, has emerged as a powerful technique in the field of business process management. Process forecasting is a sub-field of process mining that studies how to predict future processes and process models. In this paper, we introduce and motivate the problem of event log prediction and present our approach to solving the event log prediction problem, in particular, using the sequence-to-sequence deep learning approach. We evaluate and analyze the prediction outcomes on a variety of synthetic logs and seven real-life logs and show that our approach can generate perfect predictions on synthetic logs and that deep learning techniques have the potential to be applied in real-world event log prediction tasks. We further provide practical recommendations for event log predictions grounded in the outcomes of the conducted experiments.  ( 2 min )
    Urban Region Embedding via Multi-View Contrastive Prediction. (arXiv:2312.09681v1 [cs.LG])
    Recently, learning urban region representations utilizing multi-modal data (information views) has become increasingly popular, for deep understanding of the distributions of various socioeconomic features in cities. However, previous methods usually blend multi-view information in a posteriors stage, falling short in learning coherent and consistent representations across different views. In this paper, we form a new pipeline to learn consistent representations across varying views, and propose the multi-view Contrastive Prediction model for urban Region embedding (ReCP), which leverages the multiple information views from point-of-interest (POI) and human mobility data. Specifically, ReCP comprises two major modules, namely an intra-view learning module utilizing contrastive learning and feature reconstruction to capture the unique information from each single view, and inter-view learning module that perceives the consistency between the two views using a contrastive prediction learning scheme. We conduct thorough experiments on two downstream tasks to assess the proposed model, i.e., land use clustering and region popularity prediction. The experimental results demonstrate that our model outperforms state-of-the-art baseline methods significantly in urban region representation learning.  ( 2 min )
    Fragility, Robustness and Antifragility in Deep Learning. (arXiv:2312.09821v1 [cs.LG])
    We propose a systematic analysis of deep neural networks (DNNs) based on a signal processing technique for network parameter removal, in the form of synaptic filters that identifies the fragility, robustness and antifragility characteristics of DNN parameters. Our proposed analysis investigates if the DNN performance is impacted negatively, invariantly, or positively on both clean and adversarially perturbed test datasets when the DNN undergoes synaptic filtering. We define three \textit{filtering scores} for quantifying the fragility, robustness and antifragility characteristics of DNN parameters based on the performances for (i) clean dataset, (ii) adversarial dataset, and (iii) the difference in performances of clean and adversarial datasets. We validate the proposed systematic analysis on ResNet-18, ResNet-50, SqueezeNet-v1.1 and ShuffleNet V2 x1.0 network architectures for MNIST, CIFAR10 and Tiny ImageNet datasets. The filtering scores, for a given network architecture, identify network parameters that are invariant in characteristics across different datasets over learning epochs. Vice-versa, for a given dataset, the filtering scores identify the parameters that are invariant in characteristics across different network architectures. We show that our synaptic filtering method improves the test accuracy of ResNet and ShuffleNet models on adversarial datasets when only the robust and antifragile parameters are selectively retrained at any given epoch, thus demonstrating applications of the proposed strategy in improving model robustness.  ( 2 min )
    Socio-Economic Deprivation Analysis: Diffusion Maps. (arXiv:2312.09830v1 [cs.LG])
    This report proposes a model to predict the location of the most deprived areas in a city using data from the census. A census data is very high dimensional and needs to be simplified. We use a novel algorithm to reduce dimensionality and find patterns: The diffusion map. Features are defined by eigenvectors of the Laplacian matrix that defines the diffusion map. Eigenvectors corresponding to the smallest eigenvalues indicate specific population features. Previous work has found qualitatively that the second most important dimension for describing the census data in Bristol is linked to deprivation. In this report, we analyse how good this dimension is as a model for predicting deprivation by comparing with the recognised measures. The Pearson correlation coefficient was found to be over 0.7. The top 10 per cent of deprived areas in the UK which also locate in Bristol are extracted to test the accuracy of the model. There are 52 most deprived areas, and 38 areas are correctly identified by comparing to the model. The influence of scores of IMD domains that do not correlate with the models, Eigenvector 2 entries of non-deprived OAs and orthogonality of Eigenvectors cause the model to fail the prediction of 14 deprived areas. However, overall, the model shows a high performance to predict the future deprivation of overall areas where the project considers. This project is expected to support the government to allocate resources and funding.  ( 2 min )
    Part Representation Learning with Teacher-Student Decoder for Occluded Person Re-identification. (arXiv:2312.09797v1 [cs.CV])
    Occluded person re-identification (ReID) is a very challenging task due to the occlusion disturbance and incomplete target information. Leveraging external cues such as human pose or parsing to locate and align part features has been proven to be very effective in occluded person ReID. Meanwhile, recent Transformer structures have a strong ability of long-range modeling. Considering the above facts, we propose a Teacher-Student Decoder (TSD) framework for occluded person ReID, which utilizes the Transformer decoder with the help of human parsing. More specifically, our proposed TSD consists of a Parsing-aware Teacher Decoder (PTD) and a Standard Student Decoder (SSD). PTD employs human parsing cues to restrict Transformer's attention and imparts this information to SSD through feature distillation. Thereby, SSD can learn from PTD to aggregate information of body parts automatically. Moreover, a mask generator is designed to provide discriminative regions for better ReID. In addition, existing occluded person ReID benchmarks utilize occluded samples as queries, which will amplify the role of alleviating occlusion interference and underestimate the impact of the feature absence issue. Contrastively, we propose a new benchmark with non-occluded queries, serving as a complement to the existing benchmark. Extensive experiments demonstrate that our proposed method is superior and the new benchmark is essential. The source codes are available at https://github.com/hh23333/TSD.  ( 3 min )
    Concept Prerequisite Relation Prediction by Using Permutation-Equivariant Directed Graph Neural Networks. (arXiv:2312.09802v1 [cs.LG])
    This paper studies the problem of CPRP, concept prerequisite relation prediction, which is a fundamental task in using AI for education. CPRP is usually formulated into a link-prediction task on a relationship graph of concepts and solved by training the graph neural network (GNN) model. However, current directed GNNs fail to manage graph isomorphism which refers to the invariance of non-isomorphic graphs, reducing the expressivity of resulting representations. We present a permutation-equivariant directed GNN model by introducing the Weisfeiler-Lehman test into directed GNN learning. Our method is then used for CPRP and evaluated on three public datasets. The experimental results show that our model delivers better prediction performance than the state-of-the-art methods.  ( 2 min )
    Optimization meets Machine Learning: An Exact Algorithm for Semi-Supervised Support Vector Machines. (arXiv:2312.09789v1 [math.OC])
    Support vector machines (SVMs) are well-studied supervised learning models for binary classification. In many applications, large amounts of samples can be cheaply and easily obtained. What is often a costly and error-prone process is to manually label these instances. Semi-supervised support vector machines (S3VMs) extend the well-known SVM classifiers to the semi-supervised approach, aiming at maximizing the margin between samples in the presence of unlabeled data. By leveraging both labeled and unlabeled data, S3VMs attempt to achieve better accuracy and robustness compared to traditional SVMs. Unfortunately, the resulting optimization problem is non-convex and hence difficult to solve exactly. In this paper, we present a new branch-and-cut approach for S3VMs using semidefinite programming (SDP) relaxations. We apply optimality-based bound tightening to bound the feasible set. Box constraints allow us to include valid inequalities, strengthening the lower bound. The resulting SDP relaxation provides bounds significantly stronger than the ones available in the literature. For the upper bound, instead, we define a local search exploiting the solution of the SDP relaxation. Computational results highlight the efficiency of the algorithm, showing its capability to solve instances with a number of data points 10 times larger than the ones solved in the literature.  ( 2 min )
    Toward Deep Drum Source Separation. (arXiv:2312.09663v1 [eess.AS])
    In the past, the field of drum source separation faced significant challenges due to limited data availability, hindering the adoption of cutting-edge deep learning methods that have found success in other related audio applications. In this manuscript, we introduce StemGMD, a large-scale audio dataset of isolated single-instrument drum stems. Each audio clip is synthesized from MIDI recordings of expressive drums performances using ten real-sounding acoustic drum kits. Totaling 1224 hours, StemGMD is the largest audio dataset of drums to date and the first to comprise isolated audio clips for every instrument in a canonical nine-piece drum kit. We leverage StemGMD to develop LarsNet, a novel deep drum source separation model. Through a bank of dedicated U-Nets, LarsNet can separate five stems from a stereo drum mixture faster than real-time and is shown to significantly outperform state-of-the-art nonnegative spectro-temporal factorization methods.  ( 2 min )
    A Malware Classification Survey on Adversarial Attacks and Defences. (arXiv:2312.09636v1 [cs.CR])
    As the number and complexity of malware attacks continue to increase, there is an urgent need for effective malware detection systems. While deep learning models are effective at detecting malware, they are vulnerable to adversarial attacks. Attacks like this can create malicious files that are resistant to detection, creating a significant cybersecurity risk. Recent research has seen the development of several adversarial attack and response approaches aiming at strengthening deep learning models' resilience to such attacks. This survey study offers an in-depth look at current research in adversarial attack and defensive strategies for malware classification in cybersecurity. The methods are classified into four categories: generative models, feature-based approaches, ensemble methods, and hybrid tactics. The article outlines cutting-edge procedures within each area, assessing their benefits and drawbacks. Each topic presents cutting-edge approaches and explores their advantages and disadvantages. In addition, the study discusses the datasets and assessment criteria that are often utilized on this subject. Finally, it identifies open research difficulties and suggests future study options. This document is a significant resource for malware categorization and cyber security researchers and practitioners.  ( 2 min )
    Keep the Faith: Faithful Explanations in Convolutional Neural Networks for Case-Based Reasoning. (arXiv:2312.09783v1 [cs.LG])
    Explaining predictions of black-box neural networks is crucial when applied to decision-critical tasks. Thus, attribution maps are commonly used to identify important image regions, despite prior work showing that humans prefer explanations based on similar examples. To this end, ProtoPNet learns a set of class-representative feature vectors (prototypes) for case-based reasoning. During inference, similarities of latent features to prototypes are linearly classified to form predictions and attribution maps are provided to explain the similarity. In this work, we evaluate whether architectures for case-based reasoning fulfill established axioms required for faithful explanations using the example of ProtoPNet. We show that such architectures allow the extraction of faithful explanations. However, we prove that the attribution maps used to explain the similarities violate the axioms. We propose a new procedure to extract explanations for trained ProtoPNets, named ProtoPFaith. Conceptually, these explanations are Shapley values, calculated on the similarity scores of each prototype. They allow to faithfully answer which prototypes are present in an unseen image and quantify each pixel's contribution to that presence, thereby complying with all axioms. The theoretical violations of ProtoPNet manifest in our experiments on three datasets (CUB-200-2011, Stanford Dogs, RSNA) and five architectures (ConvNet, ResNet, ResNet50, WideResNet50, ResNeXt50). Our experiments show a qualitative difference between the explanations given by ProtoPNet and ProtoPFaith. Additionally, we quantify the explanations with the Area Over the Perturbation Curve, on which ProtoPFaith outperforms ProtoPNet on all experiments by a factor $>10^3$.  ( 3 min )
    Vectorizing string entries for data processing on tables: when are larger language models better?. (arXiv:2312.09634v1 [stat.ML])
    There are increasingly efficient data processing pipelines that work on vectors of numbers, for instance most machine learning models, or vector databases for fast similarity search. These require converting the data to numbers. While this conversion is easy for simple numerical and categorical entries, databases are strife with text entries, such as names or descriptions. In the age of large language models, what's the best strategies to vectorize tables entries, baring in mind that larger models entail more operational complexity? We study the benefits of language models in 14 analytical tasks on tables while varying the training size, as well as for a fuzzy join benchmark. We introduce a simple characterization of a column that reveals two settings: 1) a dirty categories setting, where strings share much similarities across entries, and conversely 2) a diverse entries setting. For dirty categories, pretrained language models bring little-to-no benefit compared to simpler string models. For diverse entries, we show that larger language models improve data processing. For these we investigate the complexity-performance tradeoffs and show that they reflect those of classic text embedding: larger models tend to perform better, but it is useful to fine tune them for embedding purposes.  ( 2 min )
    PAC-Bayes Generalisation Bounds for Dynamical Systems Including Stable RNNs. (arXiv:2312.09793v1 [cs.LG])
    In this paper, we derive a PAC-Bayes bound on the generalisation gap, in a supervised time-series setting for a special class of discrete-time non-linear dynamical systems. This class includes stable recurrent neural networks (RNN), and the motivation for this work was its application to RNNs. In order to achieve the results, we impose some stability constraints, on the allowed models. Here, stability is understood in the sense of dynamical systems. For RNNs, these stability conditions can be expressed in terms of conditions on the weights. We assume the processes involved are essentially bounded and the loss functions are Lipschitz. The proposed bound on the generalisation gap depends on the mixing coefficient of the data distribution, and the essential supremum of the data. Furthermore, the bound converges to zero as the dataset size increases. In this paper, we 1) formalize the learning problem, 2) derive a PAC-Bayesian error bound for such systems, 3) discuss various consequences of this error bound, and 4) show an illustrative example, with discussions on computing the proposed bound. Unlike other available bounds the derived bound holds for non i.i.d. data (time-series) and it does not grow with the number of steps of the RNN.  ( 2 min )
    Physics-informed Neural Network Estimation of Material Properties in Soft Tissue Nonlinear Biomechanical Models. (arXiv:2312.09787v1 [cs.LG])
    The development of biophysical models for clinical applications is rapidly advancing in the research community, thanks to their predictive nature and their ability to assist the interpretation of clinical data. However, high-resolution and accurate multi-physics computational models are computationally expensive and their personalisation involves fine calibration of a large number of parameters, which may be space-dependent, challenging their clinical translation. In this work, we propose a new approach which relies on the combination of physics-informed neural networks (PINNs) with three-dimensional soft tissue nonlinear biomechanical models, capable of reconstructing displacement fields and estimating heterogeneous patient-specific biophysical properties. The proposed learning algorithm encodes information from a limited amount of displacement and, in some cases, strain data, that can be routinely acquired in the clinical setting, and combines it with the physics of the problem, represented by a mathematical model based on partial differential equations, to regularise the problem and improve its convergence properties. Several benchmarks are presented to show the accuracy and robustness of the proposed method and its great potential to enable the robust and effective identification of patient-specific, heterogeneous physical properties, s.a. tissue stiffness properties. In particular, we demonstrate the capability of the PINN to detect the presence, location and severity of scar tissue, which is beneficial to develop personalised simulation models for disease diagnosis, especially for cardiac applications.  ( 3 min )
    Verification-Friendly Deep Neural Networks. (arXiv:2312.09748v1 [cs.LG])
    Machine learning techniques often lack formal correctness guarantees. This is evidenced by the widespread adversarial examples that plague most deep-learning applications. This resulted in several research efforts that aim at verifying deep neural networks, with a particular focus on safety-critical applications. However, formal verification techniques still face major scalability and precision challenges when dealing with the complexity of such networks. The over-approximation introduced during the formal verification process to tackle the scalability challenge often results in inconclusive analysis. To address this challenge, we propose a novel framework to generate Verification-friendly Neural Networks (VNNs). We present a post-training optimization framework to achieve a balance between preserving prediction performance and robustness in the resulting networks. Our proposed framework proves to result in networks that are comparable to the original ones in terms of prediction performance, while amenable to verification. This essentially enables us to establish robustness for more VNNs than their deep neural network counterparts, in a more time-efficient manner.  ( 2 min )
    Diagnosing and Rectifying Fake OOD Invariance: A Restructured Causal Approach. (arXiv:2312.09758v1 [cs.LG])
    Invariant representation learning (IRL) encourages the prediction from invariant causal features to labels de-confounded from the environments, advancing the technical roadmap of out-of-distribution (OOD) generalization. Despite spotlights around, recent theoretical results verified that some causal features recovered by IRLs merely pretend domain-invariantly in the training environments but fail in unseen domains. The \emph{fake invariance} severely endangers OOD generalization since the trustful objective can not be diagnosed and existing causal surgeries are invalid to rectify. In this paper, we review a IRL family (InvRat) under the Partially and Fully Informative Invariant Feature Structural Causal Models (PIIF SCM /FIIF SCM) respectively, to certify their weaknesses in representing fake invariant features, then, unify their causal diagrams to propose ReStructured SCM (RS-SCM). RS-SCM can ideally rebuild the spurious and the fake invariant features simultaneously. Given this, we further develop an approach based on conditional mutual information with respect to RS-SCM, then rigorously rectify the spurious and fake invariant effects. It can be easily implemented by a small feature selection subnet introduced in the IRL family, which is alternatively optimized to achieve our goal. Experiments verified the superiority of our approach to fight against the fake invariant issue across a variety of OOD generalization benchmarks.  ( 2 min )
    Bridging the Semantic-Numerical Gap: A Numerical Reasoning Method of Cross-modal Knowledge Graph for Material Property Prediction. (arXiv:2312.09744v1 [cs.LG])
    Using machine learning (ML) techniques to predict material properties is a crucial research topic. These properties depend on numerical data and semantic factors. Due to the limitations of small-sample datasets, existing methods typically adopt ML algorithms to regress numerical properties or transfer other pre-trained knowledge graphs (KGs) to the material. However, these methods cannot simultaneously handle semantic and numerical information. In this paper, we propose a numerical reasoning method for material KGs (NR-KG), which constructs a cross-modal KG using semantic nodes and numerical proxy nodes. It captures both types of information by projecting KG into a canonical KG and utilizes a graph neural network to predict material properties. In this process, a novel projection prediction loss is proposed to extract semantic features from numerical information. NR-KG facilitates end-to-end processing of cross-modal data, mining relationships and cross-modal information in small-sample datasets, and fully utilizes valuable experimental data to enhance material prediction. We further propose two new High-Entropy Alloys (HEA) property datasets with semantic descriptions. NR-KG outperforms state-of-the-art (SOTA) methods, achieving relative improvements of 25.9% and 16.1% on two material datasets. Besides, NR-KG surpasses SOTA methods on two public physical chemistry molecular datasets, showing improvements of 22.2% and 54.3%, highlighting its potential application and generalizability. We hope the proposed datasets, algorithms, and pre-trained models can facilitate the communities of KG and AI for materials.  ( 3 min )
    Optimal Regret Bounds for Collaborative Learning in Bandits. (arXiv:2312.09674v1 [cs.LG])
    We consider regret minimization in a general collaborative multi-agent multi-armed bandit model, in which each agent faces a finite set of arms and may communicate with other agents through a central controller. The optimal arm for each agent in this model is the arm with the largest expected mixed reward, where the mixed reward of each arm is a weighted average of its rewards across all agents, making communication among agents crucial. While near-optimal sample complexities for best arm identification are known under this collaborative model, the question of optimal regret remains open. In this work, we address this problem and propose the first algorithm with order optimal regret bounds under this collaborative bandit model. Furthermore, we show that only a small constant number of expected communication rounds is needed.  ( 2 min )
    What to Remember: Self-Adaptive Continual Learning for Audio Deepfake Detection. (arXiv:2312.09651v1 [cs.SD])
    The rapid evolution of speech synthesis and voice conversion has raised substantial concerns due to the potential misuse of such technology, prompting a pressing need for effective audio deepfake detection mechanisms. Existing detection models have shown remarkable success in discriminating known deepfake audio, but struggle when encountering new attack types. To address this challenge, one of the emergent effective approaches is continual learning. In this paper, we propose a continual learning approach called Radian Weight Modification (RWM) for audio deepfake detection. The fundamental concept underlying RWM involves categorizing all classes into two groups: those with compact feature distributions across tasks, such as genuine audio, and those with more spread-out distributions, like various types of fake audio. These distinctions are quantified by means of the in-class cosine distance, which subsequently serves as the basis for RWM to introduce a trainable gradient modification direction for distinct data types. Experimental evaluations against mainstream continual learning methods reveal the superiority of RWM in terms of knowledge acquisition and mitigating forgetting in audio deepfake detection. Furthermore, RWM's applicability extends beyond audio deepfake detection, demonstrating its potential significance in diverse machine learning domains such as image recognition.  ( 2 min )
    Hypergraph-MLP: Learning on Hypergraphs without Message Passing. (arXiv:2312.09778v1 [cs.LG])
    Hypergraphs are vital in modelling data with higher-order relations containing more than two entities, gaining prominence in machine learning and signal processing. Many hypergraph neural networks leverage message passing over hypergraph structures to enhance node representation learning, yielding impressive performances in tasks like hypergraph node classification. However, these message-passing-based models face several challenges, including oversmoothing as well as high latency and sensitivity to structural perturbations at inference time. To tackle those challenges, we propose an alternative approach where we integrate the information about hypergraph structures into training supervision without explicit message passing, thus also removing the reliance on it at inference. Specifically, we introduce Hypergraph-MLP, a novel learning framework for hypergraph-structured data, where the learning model is a straightforward multilayer perceptron (MLP) supervised by a loss function based on a notion of signal smoothness on hypergraphs. Experiments on hypergraph node classification tasks demonstrate that Hypergraph-MLP achieves competitive performance compared to existing baselines, and is considerably faster and more robust against structural perturbations at inference.  ( 2 min )
    End-to-End Training of Neural Networks for Automotive Radar Interference Mitigation. (arXiv:2312.09790v1 [cs.LG])
    In this paper we propose a new method for training neural networks (NNs) for frequency modulated continuous wave (FMCW) radar mutual interference mitigation. Instead of training NNs to regress from interfered to clean radar signals as in previous work, we train NNs directly on object detection maps. We do so by performing a continuous relaxation of the cell-averaging constant false alarm rate (CA-CFAR) peak detector, which is a well-established algorithm for object detection using radar. With this new training objective we are able to increase object detection performance by a large margin. Furthermore, we introduce separable convolution kernels to strongly reduce the number of parameters and computational complexity of convolutional NN architectures for radar applications. We validate our contributions with experiments on real-world measurement data and compare them against signal processing interference mitigation methods.  ( 2 min )
    Learning of Hamiltonian Dynamics with Reproducing Kernel Hilbert Spaces. (arXiv:2312.09734v1 [cs.RO])
    This paper presents a method for learning Hamiltonian dynamics from a limited set of data points. The Hamiltonian vector field is found by regularized optimization over a reproducing kernel Hilbert space of vector fields that are inherently Hamiltonian, and where the vector field is required to be odd or even. This is done with a symplectic kernel, and it is shown how this symplectic kernel can be modified to be odd or even. The performance of the method is validated in simulations for two Hamiltonian systems. It is shown that the learned dynamics are Hamiltonian, and that the learned Hamiltonian vector field can be prescribed to be odd or even.  ( 2 min )
    Reliable Prediction Intervals with Regression Neural Networks. (arXiv:2312.09606v1 [cs.LG])
    This paper proposes an extension to conventional regression Neural Networks (NNs) for replacing the point predictions they produce with prediction intervals that satisfy a required level of confidence. Our approach follows a novel machine learning framework, called Conformal Prediction (CP), for assigning reliable confidence measures to predictions without assuming anything more than that the data are independent and identically distributed (i.i.d.). We evaluate the proposed method on four benchmark datasets and on the problem of predicting Total Electron Content (TEC), which is an important parameter in trans-ionospheric links; for the latter we use a dataset of more than 60000 TEC measurements collected over a period of 11 years. Our experimental results show that the prediction intervals produced by our method are both well-calibrated and tight enough to be useful in practice.  ( 2 min )
    Taming Waves: A Physically-Interpretable Machine Learning Framework for Realizable Control of Wave Dynamics. (arXiv:2312.09460v1 [eess.SP])
    Controlling systems governed by partial differential equations is an inherently hard problem. Specifically, control of wave dynamics is challenging due to additional physical constraints and intrinsic properties of wave phenomena such as dissipation, attenuation, reflection, and scattering. In this work, we introduce an environment designed for the study of the control of acoustic waves by actuated metamaterial designs. We utilize this environment for the development of a novel machine-learning method, based on deep neural networks, for efficiently learning the dynamics of an acoustic PDE from samples. Our model is fully interpretable and maps physical constraints and intrinsic properties of the real acoustic environment into its latent representation of information. Within our model we use a trainable perfectly matched layer to explicitly learn the property of acoustic energy dissipation. Our model can be used to predict and control scattered wave energy. The capabilities of our model are demonstrated on an important problem in acoustics, which is the minimization of total scattered energy. Furthermore, we show that the prediction of scattered energy by our model generalizes in time and can be extended to long time horizons. We make our code repository publicly available.  ( 2 min )
    Safe Reinforcement Learning in a Simulated Robotic Arm. (arXiv:2312.09468v1 [cs.RO])
    Reinforcement learning (RL) agents need to explore their environments in order to learn optimal policies. In many environments and tasks, safety is of critical importance. The widespread use of simulators offers a number of advantages, including safe exploration which will be inevitable in cases when RL systems need to be trained directly in the physical environment (e.g. in human-robot interaction). The popular Safety Gym library offers three mobile agent types that can learn goal-directed tasks while considering various safety constraints. In this paper, we extend the applicability of safe RL algorithms by creating a customized environment with Panda robotic arm where Safety Gym algorithms can be tested. We performed pilot experiments with the popular PPO algorithm comparing the baseline with the constrained version and show that the constrained version is able to learn the equally good policy while better complying with safety constraints and taking longer training time as expected.  ( 2 min )
    Pioneering EEG Motor Imagery Classification Through Counterfactual Analysis. (arXiv:2312.09456v1 [eess.SP])
    The application of counterfactual explanation (CE) techniques in the realm of electroencephalography (EEG) classification has been relatively infrequent in contemporary research. In this study, we attempt to introduce and explore a novel non-generative approach to CE, specifically tailored for the analysis of EEG signals. This innovative approach assesses the model's decision-making process by strategically swapping patches derived from time-frequency analyses. By meticulously examining the variations and nuances introduced in the classification outcomes through this method, we aim to derive insights that can enhance interpretability. The empirical results obtained from our experimental investigations serve not only to validate the efficacy of our proposed approach but also to reinforce human confidence in the model's predictive capabilities. Consequently, these findings underscore the significance and potential value of conducting further, more extensive research in this promising direction.  ( 2 min )
    Riemannian Prediction of Anatomical Diagnoses in Congenital Heart Disease based on 12-lead ECGs. (arXiv:2312.09437v1 [eess.SP])
    Congenital heart disease (CHD) is a relatively rare disease that affects patients at birth and results in extremely heterogeneous anatomical and functional defects. 12-lead ECG signal is routinely collected in CHD patients because it provides significant biomarkers for disease prognosis. However, developing accurate machine learning models is challenging due to the lack of large available datasets. Here, we suggest exploiting the Riemannian geometry of the spatial covariance structure of the ECG signal to improve classification. Firstly, we use covariance augmentation to mix samples across the Riemannian geodesic between corresponding classes. Secondly, we suggest to project the covariance matrices to their respective class Riemannian mean to enhance the quality of feature extraction via tangent space projection. We perform several ablation experiments and demonstrate significant improvement compared to traditional machine learning models and deep learning on ECG time series data.  ( 2 min )
    Adaptive Integration of Partial Label Learning and Negative Learning for Enhanced Noisy Label Learning. (arXiv:2312.09505v1 [cs.LG])
    There has been significant attention devoted to the effectiveness of various domains, such as semi-supervised learning, contrastive learning, and meta-learning, in enhancing the performance of methods for noisy label learning (NLL) tasks. However, most existing methods still depend on prior assumptions regarding clean samples amidst different sources of noise (\eg, a pre-defined drop rate or a small subset of clean samples). In this paper, we propose a simple yet powerful idea called \textbf{NPN}, which revolutionizes \textbf{N}oisy label learning by integrating \textbf{P}artial label learning (PLL) and \textbf{N}egative learning (NL). Toward this goal, we initially decompose the given label space adaptively into the candidate and complementary labels, thereby establishing the conditions for PLL and NL. We propose two adaptive data-driven paradigms of label disambiguation for PLL: hard disambiguation and soft disambiguation. Furthermore, we generate reliable complementary labels using all non-candidate labels for NL to enhance model robustness through indirect supervision. To maintain label reliability during the later stage of model training, we introduce a consistency regularization term that encourages agreement between the outputs of multiple augmentations. Experiments conducted on both synthetically corrupted and real-world noisy datasets demonstrate the superiority of NPN compared to other state-of-the-art (SOTA) methods. The source code has been made available at {\color{purple}{\url{https://github.com/NUST-Machine-Intelligence-Laboratory/NPN}}}.  ( 3 min )
    Enhancing Trajectory Prediction through Self-Supervised Waypoint Noise Prediction. (arXiv:2312.09466v1 [cs.RO])
    Trajectory prediction is an important task that involves modeling the indeterminate nature of traffic actors to forecast future trajectories given the observed trajectory sequences. However, current methods confine themselves to presumed data manifolds, assuming that trajectories strictly adhere to these manifolds, resulting in overly simplified predictions. To this end, we propose a novel approach called SSWNP (Self-Supervised Waypoint Noise Prediction). In our approach, we first create clean and noise-augmented views of past observed trajectories across the spatial domain of waypoints. We then compel the trajectory prediction model to maintain spatial consistency between predictions from these two views, in addition to the trajectory prediction task. Introducing the noise-augmented view mitigates the model's reliance on a narrow interpretation of the data manifold, enabling it to learn more plausible and diverse representations. We also predict the noise present in the two views of past observed trajectories as an auxiliary self-supervised task, enhancing the model's understanding of the underlying representation and future predictions. Empirical evidence demonstrates that the incorporation of SSWNP into the model learning process significantly improves performance, even in noisy environments, when compared to baseline methods. Our approach can complement existing trajectory prediction methods. To showcase the effectiveness of our approach, we conducted extensive experiments on three datasets: NBA Sports VU, ETH-UCY, and TrajNet++, with experimental results highlighting the substantial improvement achieved in trajectory prediction tasks.  ( 2 min )
    Self-Supervised Learning for Anomalous Sound Detection. (arXiv:2312.09578v1 [eess.AS])
    State-of-the-art anomalous sound detection (ASD) systems are often trained by using an auxiliary classification task to learn an embedding space. Doing so enables the system to learn embeddings that are robust to noise and are ignoring non-target sound events but requires manually annotated meta information to be used as class labels. However, the less difficult the classification task becomes, the less informative are the embeddings and the worse is the resulting ASD performance. A solution to this problem is to utilize self-supervised learning (SSL). In this work, feature exchange (FeatEx), a simple yet effective SSL approach for ASD, is proposed. In addition, FeatEx is compared to and combined with existing SSL approaches. As the main result, a new state-of-the-art performance for the DCASE2023 ASD dataset is obtained that outperforms all other published results on this dataset by a large margin.  ( 2 min )
    Clinical Text Deduplication Practices for Efficient Pretraining and Improved Clinical Tasks. (arXiv:2312.09469v1 [cs.CL])
    Despite being a unique source of information on patients' status and disease progression, clinical notes are characterized by high levels of duplication and information redundancy. In general domain text, it has been shown that deduplication does not harm language model (LM) pretraining, thus helping reduce the training cost. Although large LMs have proven to learn medical knowledge, they still require specialized domain adaptation for improved downstream clinical tasks. By leveraging large real-world clinical corpora, we first provided a fine-grained characterization of duplicates stemming from common writing practices and clinical relevancy. Second, we demonstrated that deduplicating clinical text can help clinical LMs encode less redundant information in a more efficient manner and do not harm classification tasks via prompt-based learning.  ( 2 min )
    A Compact LSTM-SVM Fusion Model for Long-Duration Cardiovascular Diseases Detection. (arXiv:2312.09442v1 [eess.SP])
    Globally, cardiovascular diseases (CVDs) are the leading cause of mortality, accounting for an estimated 17.9 million deaths annually. One critical clinical objective is the early detection of CVDs using electrocardiogram (ECG) data, an area that has received significant attention from the research community. Recent advancements based on machine learning and deep learning have achieved great progress in this domain. However, existing methodologies exhibit inherent limitations, including inappropriate model evaluations and instances of data leakage. In this study, we present a streamlined workflow paradigm for preprocessing ECG signals into consistent 10-second durations, eliminating the need for manual feature extraction/beat detection. We also propose a hybrid model of Long Short-Term Memory (LSTM) with Support Vector Machine (SVM) for fraud detection. This architecture consists of two LSTM layers and an SVM classifier, which achieves a SOTA results with an Average precision score of 0.9402 on the MIT-BIH arrhythmia dataset and 0.9563 on the MIT-BIH atrial fibrillation dataset. Based on the results, we believe our method can significantly benefit the early detection and management of CVDs.  ( 2 min )
    Binary Code Summarization: Benchmarking ChatGPT/GPT-4 and Other Large Language Models. (arXiv:2312.09601v1 [cs.CR])
    Binary code summarization, while invaluable for understanding code semantics, is challenging due to its labor-intensive nature. This study delves into the potential of large language models (LLMs) for binary code comprehension. To this end, we present BinSum, a comprehensive benchmark and dataset of over 557K binary functions and introduce a novel method for prompt synthesis and optimization. To more accurately gauge LLM performance, we also propose a new semantic similarity metric that surpasses traditional exact-match approaches. Our extensive evaluation of prominent LLMs, including ChatGPT, GPT-4, Llama 2, and Code Llama, reveals 10 pivotal insights. This evaluation generates 4 billion inference tokens, incurred a total expense of 11,418 US dollars and 873 NVIDIA A100 GPU hours. Our findings highlight both the transformative potential of LLMs in this field and the challenges yet to be overcome.  ( 2 min )
    STEAM & MoSAFE: SOTIF Error-and-Failure Model & Analysis for AI-Enabled Driving Automation. (arXiv:2312.09559v1 [cs.LG])
    Driving Automation Systems (DAS) are subject to complex road environments and vehicle behaviors and increasingly rely on sophisticated sensors and Artificial Intelligence (AI). These properties give rise to unique safety faults stemming from specification insufficiencies and technological performance limitations, where sensors and AI introduce errors that vary in magnitude and temporal patterns, posing potential safety risks. The Safety of the Intended Functionality (SOTIF) standard emerges as a promising framework for addressing these concerns, focusing on scenario-based analysis to identify hazardous behaviors and their causes. Although the current standard provides a basic cause-and-effect model and high-level process guidance, it lacks concepts required to identify and evaluate hazardous errors, especially within the context of AI. This paper introduces two key contributions to bridge this gap. First, it defines the SOTIF Temporal Error and Failure Model (STEAM) as a refinement of the SOTIF cause-and-effect model, offering a comprehensive system-design perspective. STEAM refines error definitions, introduces error sequences, and classifies them as error sequence patterns, providing particular relevance to systems employing advanced sensors and AI. Second, this paper proposes the Model-based SOTIF Analysis of Failures and Errors (MoSAFE) method, which allows instantiating STEAM based on system-design models by deriving hazardous error sequence patterns at module level from hazardous behaviors at vehicle level via weakest precondition reasoning. Finally, the paper presents a case study centered on an automated speed-control feature, illustrating the practical applicability of the refined model and the MoSAFE method in addressing complex safety challenges in DAS.  ( 3 min )
    Improving Generalization of Drowsiness State Classification by Domain-Specific Normalization. (arXiv:2312.09461v1 [eess.SP])
    Abnormal driver states, particularly have been major concerns for road safety, emphasizing the importance of accurate drowsiness detection to prevent accidents. Electroencephalogram (EEG) signals are recognized for their effectiveness in monitoring a driver's mental state by monitoring brain activities. However, the challenge lies in the requirement for prior calibration due to the variation of EEG signals among and within individuals. The necessity of calibration has made the brain-computer interface (BCI) less accessible. We propose a practical generalized framework for classifying driver drowsiness states to improve accessibility and convenience. We separate the normalization process for each driver, treating them as individual domains. The goal of developing a general model is similar to that of domain generalization. The framework considers the statistics of each domain separately since they vary among domains. We experimented with various normalization methods to enhance the ability to generalize across subjects, i.e. the model's generalization performance of unseen domains. The experiments showed that applying individual domain-specific normalization yielded an outstanding improvement in generalizability. Furthermore, our framework demonstrates the potential and accessibility by removing the need for calibration in BCI applications.  ( 2 min )
    vEEGNet: learning latent representations to reconstruct EEG raw data via variational autoencoders. (arXiv:2312.09449v1 [eess.SP])
    Electroencephalografic (EEG) data are complex multi-dimensional time-series that are very useful in many applications, from diagnostics to driving brain-computer interface systems. Their classification is still a challenging task, due to the inherent within- and between-subject variability and their low signal-to-noise ratio. On the other hand, the reconstruction of raw EEG data is even more difficult because of the high temporal resolution of these signals. Recent literature has proposed numerous machine and deep learning models that could classify, e.g., different types of movements, with an accuracy in the range 70% to 80% (with 4 classes). On the other hand, a limited number of works targeted the reconstruction problem, with very limited results. In this work, we propose vEEGNet, a DL architecture with two modules, i.e., an unsupervised module based on variational autoencoders to extract a latent representation of the data, and a supervised module based on a feed-forward neural network to classify different movements. To build the encoder and the decoder of VAE we exploited the well-known EEGNet network. We implemented two slightly different architectures of vEEGNet, thus showing state-of-the-art classification performance, and the ability to reconstruct both low-frequency and middle-range components of the raw EEG. Although preliminary, this work is promising as we found out that the low-frequency reconstructed signals are consistent with the so-called motor-related cortical potentials, well-known motor-related EEG patterns and we could improve over previous literature by reconstructing faster EEG components, too. Further investigations are needed to explore the potentialities of vEEGNet in reconstructing the full EEG data, generating new samples, and studying the relationship between classification and reconstruction performance.  ( 3 min )
    Stethoscope-guided Supervised Contrastive Learning for Cross-domain Adaptation on Respiratory Sound Classification. (arXiv:2312.09603v1 [cs.SD])
    Despite the remarkable advances in deep learning technology, achieving satisfactory performance in lung sound classification remains a challenge due to the scarcity of available data. Moreover, the respiratory sound samples are collected from a variety of electronic stethoscopes, which could potentially introduce biases into the trained models. When a significant distribution shift occurs within the test dataset or in a practical scenario, it can substantially decrease the performance. To tackle this issue, we introduce cross-domain adaptation techniques, which transfer the knowledge from a source domain to a distinct target domain. In particular, by considering different stethoscope types as individual domains, we propose a novel stethoscope-guided supervised contrastive learning approach. This method can mitigate any domain-related disparities and thus enables the model to distinguish respiratory sounds of the recording variation of the stethoscope. The experimental results on the ICBHI dataset demonstrate that the proposed methods are effective in reducing the domain dependency and achieving the ICBHI Score of 61.71%, which is a significant improvement of 2.16% over the baseline.  ( 2 min )
    Joint State Estimation and Noise Identification Based on Variational Optimization. (arXiv:2312.09585v1 [eess.SY])
    In this article, the state estimation problems with unknown process noise and measurement noise covariances for both linear and nonlinear systems are considered. By formulating the joint estimation of system state and noise parameters into an optimization problem, a novel adaptive Kalman filter method based on conjugate-computation variational inference, referred to as CVIAKF, is proposed to approximate the joint posterior probability density function of the latent variables. Unlike the existing adaptive Kalman filter methods utilizing variational inference in natural-parameter space, CVIAKF performs optimization in expectation-parameter space, resulting in a faster and simpler solution. Meanwhile, CVIAKF divides optimization objectives into conjugate and non-conjugate parts of nonlinear dynamical models, whereas conjugate computations and stochastic mirror-descent are applied, respectively. Remarkably, the reparameterization trick is used to reduce the variance of stochastic gradients of the non-conjugate parts. The effectiveness of CVIAKF is validated through synthetic and real-world datasets of maneuvering target tracking.  ( 2 min )
    Adversarial Robustness on Image Classification with $k$-means. (arXiv:2312.09533v1 [cs.LG])
    In this paper we explore the challenges and strategies for enhancing the robustness of $k$-means clustering algorithms against adversarial manipulations. We evaluate the vulnerability of clustering algorithms to adversarial attacks, emphasising the associated security risks. Our study investigates the impact of incremental attack strength on training, introduces the concept of transferability between supervised and unsupervised models, and highlights the sensitivity of unsupervised models to sample distributions. We additionally introduce and evaluate an adversarial training method that improves testing performance in adversarial scenarios, and we highlight the importance of various parameters in the proposed training method, such as continuous learning, centroid initialisation, and adversarial step-count.  ( 2 min )
    A Novel Hybrid Ordinal Learning Model with Health Care Application. (arXiv:2312.09540v1 [cs.LG])
    Ordinal learning (OL) is a type of machine learning models with broad utility in health care applications such as diagnosis of different grades of a disease (e.g., mild, modest, severe) and prediction of the speed of disease progression (e.g., very fast, fast, moderate, slow). This paper aims to tackle a situation when precisely labeled samples are limited in the training set due to cost or availability constraints, whereas there could be an abundance of samples with imprecise labels. We focus on imprecise labels that are intervals, i.e., one can know that a sample belongs to an interval of labels but cannot know which unique label it has. This situation is quite common in health care datasets due to limitations of the diagnostic instrument, sparse clinical visits, or/and patient dropout. Limited research has been done to develop OL models with imprecise/interval labels. We propose a new Hybrid Ordinal Learner (HOL) to integrate samples with both precise and interval labels to train a robust OL model. We also develop a tractable and efficient optimization algorithm to solve the HOL formulation. We compare HOL with several recently developed OL methods on four benchmarking datasets, which demonstrate the superior performance of HOL. Finally, we apply HOL to a real-world dataset for predicting the speed of progressing to Alzheimer's Disease (AD) for individuals with Mild Cognitive Impairment (MCI) based on a combination of multi-modality neuroimaging and demographic/clinical datasets. HOL achieves high accuracy in the prediction and outperforms existing methods. The capability of accurately predicting the speed of progression to AD for each individual with MCI has the potential for helping facilitate more individually-optimized interventional strategies.  ( 3 min )
    IncepSE: Leveraging InceptionTime's performance with Squeeze and Excitation mechanism in ECG analysis. (arXiv:2312.09445v1 [eess.SP])
    Our study focuses on the potential for modifications of Inception-like architecture within the electrocardiogram (ECG) domain. To this end, we introduce IncepSE, a novel network characterized by strategic architectural incorporation that leverages the strengths of both InceptionTime and channel attention mechanisms. Furthermore, we propose a training setup that employs stabilization techniques that are aimed at tackling the formidable challenges of severe imbalance dataset PTB-XL and gradient corruption. By this means, we manage to set a new height for deep learning model in a supervised learning manner across the majority of tasks. Our model consistently surpasses InceptionTime by substantial margins compared to other state-of-the-arts in this domain, noticeably 0.013 AUROC score improvement in the "all" task, while also mitigating the inherent dataset fluctuations during training.  ( 2 min )
    Multi-stage Learning for Radar Pulse Activity Segmentation. (arXiv:2312.09489v1 [cs.LG])
    Radio signal recognition is a crucial function in electronic warfare. Precise identification and localisation of radar pulse activities are required by electronic warfare systems to produce effective countermeasures. Despite the importance of these tasks, deep learning-based radar pulse activity recognition methods have remained largely underexplored. While deep learning for radar modulation recognition has been explored previously, classification tasks are generally limited to short and non-interleaved IQ signals, limiting their applicability to military applications. To address this gap, we introduce an end-to-end multi-stage learning approach to detect and localise pulse activities of interleaved radar signals across an extended time horizon. We propose a simple, yet highly effective multi-stage architecture for incrementally predicting fine-grained segmentation masks that localise radar pulse activities across multiple channels. We demonstrate the performance of our approach against several reference models on a novel radar dataset, while also providing a first-of-its-kind benchmark for radar pulse activity segmentation.  ( 2 min )
    Deep Generative Models for Detector Signature Simulation: An Analytical Taxonomy. (arXiv:2312.09597v1 [physics.ins-det])
    In modern collider experiments, the quest to explore fundamental interactions between elementary particles has reached unparalleled levels of precision. Signatures from particle physics detectors are low-level objects encoding the physics of collisions. The complete simulation of them in a detector is a memory and storage-intensive task. To address this computational bottleneck in particle physics, "Fast Simulation" has been introduced and refined over the years. The field has seen a surge in interest in surrogate modeling the detector simulation, fueled by the advancements in deep generative models. These models aim to generate responses that are statistically identical to the observed data. In this paper, we conduct a comprehensive and exhaustive taxonomic review of the existing literature on the simulation of detector signatures from both methodological and application-wise perspectives. Initially, we formulate the problem of detector signature simulation and discuss its different variations that can be unified. Next, we classify the state-of-the-art methods into four distinct categories based on their underlying model architectures, summarizing their respective generation strategies. We then identify and discuss three key application areas. Finally, we shed light on the challenges and opportunities that lie ahead in detector signature simulation, setting the stage for future research and development.  ( 2 min )
    Sequence adaptive field-imperfection estimation (SAFE): retrospective estimation and correction of $B_1^+$ and $B_0$ inhomogeneities for enhanced MRF quantification. (arXiv:2312.09488v1 [eess.IV])
    $B_1^+$ and $B_0$ field-inhomogeneities can significantly reduce accuracy and robustness of MRF's quantitative parameter estimates. Additional $B_1^+$ and $B_0$ calibration scans can mitigate this but add scan time and cannot be applied retrospectively to previously collected data. Here, we proposed a calibration-free sequence-adaptive deep-learning framework, to estimate and correct for $B_1^+$ and $B_0$ effects of any MRF sequence. We demonstrate its capability on arbitrary MRF sequences at 3T, where no training data were previously obtained. Such approach can be applied to any previously-acquired and future MRF-scans. The flexibility in directly applying this framework to other quantitative sequences is also highlighted.  ( 2 min )
    Unraveling Batch Normalization for Realistic Test-Time Adaptation. (arXiv:2312.09486v1 [cs.CV])
    While recent test-time adaptations exhibit efficacy by adjusting batch normalization to narrow domain disparities, their effectiveness diminishes with realistic mini-batches due to inaccurate target estimation. As previous attempts merely introduce source statistics to mitigate this issue, the fundamental problem of inaccurate target estimation still persists, leaving the intrinsic test-time domain shifts unresolved. This paper delves into the problem of mini-batch degradation. By unraveling batch normalization, we discover that the inexact target statistics largely stem from the substantially reduced class diversity in batch. Drawing upon this insight, we introduce a straightforward tool, Test-time Exponential Moving Average (TEMA), to bridge the class diversity gap between training and testing batches. Importantly, our TEMA adaptively extends the scope of typical methods beyond the current batch to incorporate a diverse set of class information, which in turn boosts an accurate target estimation. Built upon this foundation, we further design a novel layer-wise rectification strategy to consistently promote test-time performance. Our proposed method enjoys a unique advantage as it requires neither training nor tuning parameters, offering a truly hassle-free solution. It significantly enhances model robustness against shifted domains and maintains resilience in diverse real-world scenarios with various batch sizes, achieving state-of-the-art performance on several major benchmarks. Code is available at \url{https://github.com/kiwi12138/RealisticTTA}.  ( 2 min )
    Applying Machine Learning Models on Metrology Data for Predicting Device Electrical Performance. (arXiv:2312.09462v1 [eess.SP])
    Moore Law states that transistor density will double every two years, which is sustained until today due to continuous multi-directional innovations, such as extreme ultraviolet lithography, novel patterning techniques etc., leading the semiconductor industry towards 3nm node and beyond. For any patterning scheme, the most important metric to evaluate the quality of printed patterns is EPE, with overlay being its largest contribution. Overlay errors can lead to fatal failures of IC devices such as short circuits or broken connections in terms of P2P electrical contacts. Therefore, it is essential to develop effective overlay analysis and control techniques to ensure good functionality of fabricated semiconductor devices. In this work we have used an imec N14 BEOL process flow using LELE patterning technique to print metal layers with minimum pitch of 48nm with 193i lithography. FF structures are decomposed into two mask layers (M1A and M1B) and then the LELE flow is carried out to make the final patterns. Since a single M1 layer is decomposed into two masks, control of overlay between the two masks is critical. The goal of this work is of two-fold as, (a) to quantify the impact of overlay on capacitance and (b) to see if we can predict the final capacitance measurements with selected machine learning models at an early stage. To do so, scatterometry spectra are collected on these electrical test structures at (a)post litho, (b)post TiN hardmask etch, and (c)post Cu plating and CMP. Critical Dimension and overlay measurements for line-space pattern are done with SEM post litho, post etch and post Cu CMP. Various machine learning models are applied to do the capacitance prediction with multiple metrology inputs at different steps of wafer processing. Finally, we demonstrate that by using appropriate machine learning models we are able to do better prediction of electrical results.  ( 3 min )
    Uncertainty Quantification in Machine Learning for Biosignal Applications -- A Review. (arXiv:2312.09454v1 [eess.SP])
    Uncertainty Quantification (UQ) has gained traction in an attempt to fix the black-box nature of Deep Learning. Specifically (medical) biosignals such as electroencephalography (EEG), electrocardiography (ECG), electroocculography (EOG) and electromyography (EMG) could benefit from good UQ, since these suffer from a poor signal to noise ratio, and good human interpretability is pivotal for medical applications and Brain Computer Interfaces. In this paper, we review the state of the art at the intersection of Uncertainty Quantification and Biosignal with Machine Learning. We present various methods, shortcomings, uncertainty measures and theoretical frameworks that currently exist in this application domain. Overall it can be concluded that promising UQ methods are available, but that research is needed on how people and systems may interact with an uncertainty model in a (clinical) environment.  ( 2 min )
    Continual Adversarial Defense. (arXiv:2312.09481v1 [cs.CV])
    In response to the rapidly evolving nature of adversarial attacks on a monthly basis, numerous defenses have been proposed to generalize against as many known attacks as possible. However, designing a defense method that can generalize to all types of attacks, including unseen ones, is not realistic because the environment in which defense systems operate is dynamic and comprises various unique attacks used by many attackers. The defense system needs to upgrade itself by utilizing few-shot defense feedback and efficient memory. Therefore, we propose the first continual adversarial defense (CAD) framework that adapts to any attacks in a dynamic scenario, where various attacks emerge stage by stage. In practice, CAD is modeled under four principles: (1) continual adaptation to new attacks without catastrophic forgetting, (2) few-shot adaptation, (3) memory-efficient adaptation, and (4) high accuracy on both clean and adversarial images. We leverage cutting-edge continual learning, few-shot learning, and ensemble learning techniques to qualify the principles. Experiments conducted on CIFAR-10 and ImageNet-100 validate the effectiveness of our approach against multiple stages of 10 modern adversarial attacks and significant improvements over 10 baseline methods. In particular, CAD is capable of quickly adapting with minimal feedback and a low cost of defense failure, while maintaining good performance against old attacks. Our research sheds light on a brand-new paradigm for continual defense adaptation against dynamic and evolving attacks.  ( 2 min )
    Combinatorial Complexes: Bridging the Gap Between Cell Complexes and Hypergraphs. (arXiv:2312.09504v1 [cs.LG])
    Graph-based signal processing techniques have become essential for handling data in non-Euclidean spaces. However, there is a growing awareness that these graph models might need to be expanded into `higher-order' domains to effectively represent the complex relations found in high-dimensional data. Such higher-order domains are typically modeled either as hypergraphs, or as simplicial, cubical or other cell complexes. In this context, cell complexes are often seen as a subclass of hypergraphs with additional algebraic structure that can be exploited, e.g., to develop a spectral theory. In this article, we promote an alternative perspective. We argue that hypergraphs and cell complexes emphasize \emph{different} types of relations, which may have different utility depending on the application context. Whereas hypergraphs are effective in modeling set-type, multi-body relations between entities, cell complexes provide an effective means to model hierarchical, interior-to-boundary type relations. We discuss the relative advantages of these two choices and elaborate on the previously introduced concept of a combinatorial complex that enables co-existing set-type and hierarchical relations. Finally, we provide a brief numerical experiment to demonstrate that this modelling flexibility can be advantageous in learning tasks.  ( 2 min )
    Neural Gaussian Similarity Modeling for Differential Graph Structure Learning. (arXiv:2312.09498v1 [cs.LG])
    Graph Structure Learning (GSL) has demonstrated considerable potential in the analysis of graph-unknown non-Euclidean data across a wide range of domains. However, constructing an end-to-end graph structure learning model poses a challenge due to the impediment of gradient flow caused by the nearest neighbor sampling strategy. In this paper, we construct a differential graph structure learning model by replacing the non-differentiable nearest neighbor sampling with a differentiable sampling using the reparameterization trick. Under this framework, we argue that the act of sampling \mbox{nearest} neighbors may not invariably be essential, particularly in instances where node features exhibit a significant degree of similarity. To alleviate this issue, the bell-shaped Gaussian Similarity (GauSim) modeling is proposed to sample non-nearest neighbors. To adaptively model the similarity, we further propose Neural Gaussian Similarity (NeuralGauSim) with learnable parameters featuring flexible sampling behaviors. In addition, we develop a scalable method by transferring the large-scale graph to the transition graph to significantly reduce the complexity. Experimental results demonstrate the effectiveness of the proposed methods.  ( 2 min )
    Decoding EEG-based Workload Levels Using Spatio-temporal Features Under Flight Environment. (arXiv:2312.09423v1 [eess.SP])
    The detection of pilots' mental states is important due to the potential for their abnormal mental states to result in catastrophic accidents. This study introduces the feasibility of employing deep learning techniques to classify different workload levels, specifically normal state, low workload, and high workload. To the best of our knowledge, this study is the first attempt to classify workload levels of pilots. Our approach involves the hybrid deep neural network that consists of five convolutional blocks and one long short-term memory block to extract the significant features from electroencephalography signals. Ten pilots participated in the experiment, which was conducted within the simulated flight environment. In contrast to four conventional models, our proposed model achieved a superior grand--average accuracy of 0.8613, surpassing other conventional models by at least 0.0597 in classifying workload levels across all participants. Our model not only successfully classified workload levels but also provided valuable feedback to the participants. Hence, we anticipate that our study will make the significant contributions to the advancement of autonomous flight and driving leveraging artificial intelligence technology in the future.  ( 2 min )
    Deep Learning Models for Arrhythmia Classification Using Stacked Time-frequency Scalogram Images from ECG Signals. (arXiv:2312.09426v1 [eess.SP])
    Electrocardiograms (ECGs), a medical monitoring technology recording cardiac activity, are widely used for diagnosing cardiac arrhythmia. The diagnosis is based on the analysis of the deformation of the signal shapes due to irregular heart rates associated with heart diseases. Due to the infeasibility of manual examination of large volumes of ECG data, this paper aims to propose an automated AI based system for ECG-based arrhythmia classification. To this front, a deep learning based solution has been proposed for ECG-based arrhythmia classification. Twelve lead electrocardiograms (ECG) of length 10 sec from 45, 152 individuals from Shaoxing People's Hospital (SPH) dataset from PhysioNet with four different types of arrhythmias were used. The sampling frequency utilized was 500 Hz. Median filtering was used to preprocess the ECG signals. For every 1 sec of ECG signal, the time-frequency (TF) scalogram was estimated and stacked row wise to obtain a single image from 12 channels, resulting in 10 stacked TF scalograms for each ECG signal. These stacked TF scalograms are fed to the pretrained convolutional neural network (CNN), 1D CNN, and 1D CNN-LSTM (Long short-term memory) models, for arrhythmia classification. The fine-tuned CNN models obtained the best test accuracy of about 98% followed by 95% test accuracy by basic CNN-LSTM in arrhythmia classification.  ( 3 min )
    OTOv3: Automatic Architecture-Agnostic Neural Network Training and Compression from Structured Pruning to Erasing Operators. (arXiv:2312.09411v1 [cs.LG])
    Compressing a predefined deep neural network (DNN) into a compact sub-network with competitive performance is crucial in the efficient machine learning realm. This topic spans various techniques, from structured pruning to neural architecture search, encompassing both pruning and erasing operators perspectives. Despite advancements, existing methods suffers from complex, multi-stage processes that demand substantial engineering and domain knowledge, limiting their broader applications. We introduce the third-generation Only-Train-Once (OTOv3), which first automatically trains and compresses a general DNN through pruning and erasing operations, creating a compact and competitive sub-network without the need of fine-tuning. OTOv3 simplifies and automates the training and compression process, minimizes the engineering efforts required from users. It offers key technological advancements: (i) automatic search space construction for general DNNs based on dependency graph analysis; (ii) Dual Half-Space Projected Gradient (DHSPG) and its enhanced version with hierarchical search (H2SPG) to reliably solve (hierarchical) structured sparsity problems and ensure sub-network validity; and (iii) automated sub-network construction using solutions from DHSPG/H2SPG and dependency graphs. Our empirical results demonstrate the efficacy of OTOv3 across various benchmarks in structured pruning and neural architecture search. OTOv3 produces sub-networks that match or exceed the state-of-the-arts. The source code will be available at https://github.com/tianyic/only_train_once.  ( 3 min )
    Physics-Informed Deep Learning of Rate-and-State Fault Friction. (arXiv:2312.09403v1 [math-ph])
    Direct observations of earthquake nucleation and propagation are few and yet the next decade will likely see an unprecedented increase in indirect, surface observations that must be integrated into modeling efforts. Machine learning (ML) excels in the presence of large data and is an actively growing field in seismology. However, not all ML methods incorporate rigorous physics, and purely data-driven models can predict physically unrealistic outcomes due to observational bias or extrapolation. Our work focuses on the recently emergent Physics-Informed Neural Network (PINN), which seamlessly integrates data while ensuring that model outcomes satisfy rigorous physical constraints. In this work we develop a multi-network PINN for both the forward problem as well as for direct inversion of nonlinear fault friction parameters, constrained by the physics of motion in the solid Earth, which have direct implications for assessing seismic hazard. We present the computational PINN framework for strike-slip faults in 1D and 2D subject to rate-and-state friction. Initial and boundary conditions define the data on which the PINN is trained. While the PINN is capable of approximating the solution to the governing equations to low-errors, our primary interest lies in the network's capacity to infer friction parameters during the training loop. We find that the network for the parameter inversion at the fault performs much better than the network for material displacements to which it is coupled. Additional training iterations and model tuning resolves this discrepancy, enabling a robust surrogate model for solving both forward and inverse problems relevant to seismic faulting.  ( 3 min )
    Temporal Transfer Learning for Traffic Optimization with Coarse-grained Advisory Autonomy. (arXiv:2312.09436v1 [cs.RO])
    The recent development of connected and automated vehicle (CAV) technologies has spurred investigations to optimize dense urban traffic. This paper considers advisory autonomy, in which real-time driving advisories are issued to drivers, thus blending the CAV and the human driver. Due to the complexity of traffic systems, recent studies of coordinating CAVs have resorted to leveraging deep reinforcement learning (RL). Advisory autonomy is formalized as zero-order holds, and we consider a range of hold duration from 0.1 to 40 seconds. However, despite the similarity of the higher frequency tasks on CAVs, a direct application of deep RL fails to be generalized to advisory autonomy tasks. We introduce Temporal Transfer Learning (TTL) algorithms to select source tasks, systematically leveraging the temporal structure to solve the full range of tasks. TTL selects the most suitable source tasks to maximize the performance of the range of tasks. We validate our algorithms on diverse mixed-traffic scenarios, demonstrating that TTL more reliably solves the tasks than baselines. This paper underscores the potential of coarse-grained advisory autonomy with TTL in traffic flow optimization.  ( 2 min )
    Deep Representation Learning for Open Vocabulary Electroencephalography-to-Text Decoding. (arXiv:2312.09430v1 [eess.SP])
    Previous research has demonstrated the potential of using pre-trained language models for decoding open vocabulary Electroencephalography (EEG) signals captured through a non-invasive Brain-Computer Interface (BCI). However, the impact of embedding EEG signals in the context of language models and the effect of subjectivity, remain unexplored, leading to uncertainty about the best approach to enhance decoding performance. Additionally, current evaluation metrics used to assess decoding effectiveness are predominantly syntactic and do not provide insights into the comprehensibility of the decoded output for human understanding. We present an end-to-end deep learning framework for non-invasive brain recordings that brings modern representational learning approaches to neuroscience. Our proposal introduces the following innovations: 1) an end-to-end deep learning architecture for open vocabulary EEG decoding, incorporating a subject-dependent representation learning module for raw EEG encoding, a BART language model, and a GPT-4 sentence refinement module; 2) a more comprehensive sentence-level evaluation metric based on the BERTScore; 3) an ablation study that analyses the contributions of each module within our proposal, providing valuable insights for future research. We evaluate our approach on two publicly available datasets, ZuCo v1.0 and v2.0, comprising EEG recordings of 30 subjects engaged in natural reading tasks. Our model achieves a BLEU-1 score of 42.75%, a ROUGE-1-F of 33.28%, and a BERTScore-F of 53.86%, outperforming the previous state-of-the-art methods by 3.38%, 8.43%, and 6.31%, respectively.  ( 2 min )
    Point-of-Care Real-Time Signal Quality for Fetal Doppler Ultrasound Using a Deep Learning Approach. (arXiv:2312.09433v1 [eess.SP])
    In this study, we present a deep learning framework designed to integrate with our previously developed system that facilitates large-scale 1D fetal Doppler data collection, aiming to enhance data quality. This system, tailored for traditional Indigenous midwives in low-resource communities, leverages a cost-effective Android phone to improve the quality of recorded signals. We have shown that the Doppler data can be used to identify fetal growth restriction, hypertension, and other concerning issues during pregnancy. However, the quality of the signal is dependent on many factors, including radio frequency interference, position of the fetus, maternal body habitus, and usage of the Doppler by the birth attendants. In order to provide instant feedback to allow correction of the data at source, a signal quality metric is required that can run in real-time on the mobile phone. In this study, 191 DUS signals with durations mainly in the range between 5 to 10 minutes were evaluated for quality and classified into five categories: Good, Poor, (Radiofrequency) Interference, Talking, and Silent, at a resolution of 3.75 seconds. A deep neural network was trained on each 3.75-second segment from these recordings and validated using five-fold cross-validation. An average micro F1 = 97.4\% and macro F1 = 94.2\% were achieved, with F1 = 99.2\% for `Good' quality data. These results indicate that the algorithm, which will now be implemented in the midwives' app, should allow a significant increase in the quality of data at the time of capture.  ( 3 min )
    Predicting Multi-Joint Kinematics of the Upper Limb from EMG Signals Across Varied Loads with a Physics-Informed Neural Network. (arXiv:2312.09418v1 [eess.SP])
    In this research, we present an innovative method known as a physics-informed neural network (PINN) model to predict multi-joint kinematics using electromyography (EMG) signals recorded from the muscles surrounding these joints across various loads. The primary aim is to simultaneously predict both the shoulder and elbow joint angles while executing elbow flexion-extension (FE) movements, especially under varying load conditions. The PINN model is constructed by combining a feed-forward Artificial Neural Network (ANN) with a joint torque computation model. During the training process, the model utilizes a custom loss function derived from an inverse dynamics joint torque musculoskeletal model, along with a mean square angle loss. The training dataset for the PINN model comprises EMG and time data collected from four different subjects. To assess the model's performance, we conducted a comparison between the predicted joint angles and experimental data using a testing data set. The results demonstrated strong correlations of 58% to 83% in joint angle prediction. The findings highlight the potential of incorporating physical principles into the model, not only increasing its versatility but also enhancing its accuracy. The findings could have significant implications for the precise estimation of multi-joint kinematics in dynamic scenarios, particularly concerning the advancement of human-machine interfaces (HMIs) for exoskeletons and prosthetic control systems.  ( 3 min )
    Prediction of rare events in the operation of household equipment using co-evolving time series. (arXiv:2312.09410v1 [cs.LG])
    In this study, we propose an approach for predicting rare events by exploiting time series in coevolution. Our approach involves a weighted autologistic regression model, where we leverage the temporal behavior of the data to enhance predictive capabilities. By addressing the issue of imbalanced datasets, we establish constraints leading to weight estimation and to improved performance. Evaluation on synthetic and real-world datasets confirms that our approach outperform state-of-the-art of predicting home equipment failure methods.  ( 2 min )
    DTP-Net: Learning to Reconstruct EEG signals in Time-Frequency Domain by Multi-scale Feature Reuse. (arXiv:2312.09417v1 [eess.SP])
    Electroencephalography (EEG) signals are easily corrupted by various artifacts, making artifact removal crucial for improving signal quality in scenarios such as disease diagnosis and brain-computer interface (BCI). In this paper, we present a fully convolutional neural architecture, called DTP-Net, which consists of a Densely Connected Temporal Pyramid (DTP) sandwiched between a pair of learnable time-frequency transformations for end-to-end electroencephalogram (EEG) denoising. The proposed method first transforms a single-channel EEG signal of arbitrary length into the time-frequency domain via an Encoder layer. Then, noises, such as ocular and muscle artifacts, are extracted by DTP in a multi-scale fashion and reduced. Finally, a Decoder layer is employed to reconstruct the artifact-reduced EEG signal. Additionally, we conduct an in-depth analysis of the representation learning behavior of each module in DTP-Net to substantiate its robustness and reliability. Extensive experiments conducted on two public semi-simulated datasets demonstrate the effective artifact removal performance of DTP-Net, which outperforms state-of-art approaches. Experimental results demonstrate cleaner waveforms and significant improvement in Signal-to-Noise Ratio (SNR) and Relative Root Mean Square Error (RRMSE) after denoised by the proposed model. Moreover, the proposed DTP-Net is applied in a specific BCI downstream task, improving the classification accuracy by up to 5.55% compared to that of the raw signals, validating its potential applications in the fields of EEG-based neuroscience and neuro-engineering.  ( 3 min )
    Deep Learning-Enabled Swallowing Monitoring and Postoperative Recovery Biosensing System. (arXiv:2312.09429v1 [eess.SP])
    This study introduces an innovative 3D printed dry electrode tailored for biosensing in postoperative recovery scenarios. Fabricated through a drop coating process, the electrode incorporates a novel 2D material.  ( 2 min )
    Joint Alignment of Multivariate Quasi-Periodic Functional Data Using Deep Learning. (arXiv:2312.09422v1 [eess.SP])
    The joint alignment of multivariate functional data plays an important role in various fields such as signal processing, neuroscience and medicine, including the statistical analysis of data from wearable devices. Traditional methods often ignore the phase variability and instead focus on the variability in the observed amplitude. We present a novel method for joint alignment of multivariate quasi-periodic functions using deep neural networks, decomposing, but retaining all the information in the data by preserving both phase and amplitude variability. Our proposed neural network uses a special activation of the output that builds on the unit simplex transformation, and we utilize a loss function based on the Fisher-Rao metric to train our model. Furthermore, our method is unsupervised and can provide an optimal common template function as well as subject-specific templates. We demonstrate our method on two simulated datasets and one real example, comprising data from 12-lead 10s electrocardiogram recordings.  ( 3 min )
    Unbiasing Enhanced Sampling on a High-dimensional Free Energy Surface with Deep Generative Model. (arXiv:2312.09404v1 [cs.LG])
    Biased enhanced sampling methods utilizing collective variables (CVs) are powerful tools for sampling conformational ensembles. Due to high intrinsic dimensions, efficiently generating conformational ensembles for complex systems requires enhanced sampling on high-dimensional free energy surfaces. While methods like temperature-accelerated molecular dynamics (TAMD) can adopt many CVs in a simulation, unbiasing the simulation requires accurate modeling of a high-dimensional CV probability distribution, which is challenging for traditional density estimation techniques. Here we propose an unbiasing method based on the score-based diffusion model, a deep generative learning method that excels in density estimation across complex data landscapes. We test the score-based diffusion unbiasing method on TAMD simulations. The results demonstrate that this unbiasing approach significantly outperforms traditional unbiasing methods, and can generate accurate unbiased conformational ensembles for simulations with a number of CVs higher than usual ranges.  ( 2 min )
    Exploiting Symmetric Temporally Sparse BPTT for Efficient RNN Training. (arXiv:2312.09391v1 [cs.LG])
    Recurrent Neural Networks (RNNs) are useful in temporal sequence tasks. However, training RNNs involves dense matrix multiplications which require hardware that can support a large number of arithmetic operations and memory accesses. Implementing online training of RNNs on the edge calls for optimized algorithms for an efficient deployment on hardware. Inspired by the spiking neuron model, the Delta RNN exploits temporal sparsity during inference by skipping over the update of hidden states from those inactivated neurons whose change of activation across two timesteps is below a defined threshold. This work describes a training algorithm for Delta RNNs that exploits temporal sparsity in the backward propagation phase to reduce computational requirements for training on the edge. Due to the symmetric computation graphs of forward and backward propagation during training, the gradient computation of inactivated neurons can be skipped. Results show a reduction of $\sim$80% in matrix operations for training a 56k parameter Delta LSTM on the Fluent Speech Commands dataset with negligible accuracy loss. Logic simulations of a hardware accelerator designed for the training algorithm show 2-10X speedup in matrix computations for an activation sparsity range of 50%-90%. Additionally, we show that the proposed Delta RNN training will be useful for online incremental learning on edge devices with limited computing resources.  ( 3 min )
    iOn-Profiler: intelligent Online multi-objective VNF Profiling with Reinforcement Learning. (arXiv:2312.09355v1 [cs.NI])
    Leveraging the potential of Virtualised Network Functions (VNFs) requires a clear understanding of the link between resource consumption and performance. The current state of the art tries to do that by utilising Machine Learning (ML) and specifically Supervised Learning (SL) models for given network environments and VNF types assuming single-objective optimisation targets. Taking a different approach poses a novel VNF profiler optimising multi-resource type allocation and performance objectives using adapted Reinforcement Learning (RL). Our approach can meet Key Performance Indicator (KPI) targets while minimising multi-resource type consumption and optimising the VNF output rate compared to existing single-objective solutions. Our experimental evaluation with three real-world VNF types over a total of 39 study scenarios (13 per VNF), for three resource types (virtual CPU, memory, and network link capacity), verifies the accuracy of resource allocation predictions and corresponding successful profiling decisions via a benchmark comparison between our RL model and SL models. We also conduct a complementary exhaustive search-space study revealing that different resources impact performance in varying ways per VNF type, implying the necessity of multi-objective optimisation, individualised examination per VNF type, and adaptable online profile learning, such as with the autonomous online learning approach of iOn-Profiler.  ( 3 min )
    RTRA: Rapid Training of Regularization-based Approaches in Continual Learning. (arXiv:2312.09361v1 [cs.LG])
    Catastrophic forgetting(CF) is a significant challenge in continual learning (CL). In regularization-based approaches to mitigate CF, modifications to important training parameters are penalized in subsequent tasks using an appropriate loss function. We propose the RTRA, a modification to the widely used Elastic Weight Consolidation (EWC) regularization scheme, using the Natural Gradient for loss function optimization. Our approach improves the training of regularization-based methods without sacrificing test-data performance. We compare the proposed RTRA approach against EWC using the iFood251 dataset. We show that RTRA has a clear edge over the state-of-the-art approaches.  ( 2 min )
    DSS: A Diverse Sample Selection Method to Preserve Knowledge in Class-Incremental Learning. (arXiv:2312.09357v1 [cs.LG])
    Rehearsal-based techniques are commonly used to mitigate catastrophic forgetting (CF) in Incremental learning (IL). The quality of the exemplars selected is important for this purpose and most methods do not ensure the appropriate diversity of the selected exemplars. We propose a new technique "DSS" -- Diverse Selection of Samples from the input data stream in the Class-incremental learning (CIL) setup under both disjoint and fuzzy task boundary scenarios. Our method outperforms state-of-the-art methods and is much simpler to understand and implement.  ( 2 min )
    PBES: PCA Based Exemplar Sampling Algorithm for Continual Learning. (arXiv:2312.09352v1 [cs.LG])
    We propose a novel exemplar selection approach based on Principal Component Analysis (PCA) and median sampling, and a neural network training regime in the setting of class-incremental learning. This approach avoids the pitfalls due to outliers in the data and is both simple to implement and use across various incremental machine learning models. It also has independent usage as a sampling algorithm. We achieve better performance compared to state-of-the-art methods.  ( 2 min )
    Random resistive memory-based deep extreme point learning machine for unified visual processing. (arXiv:2312.09262v1 [cs.LG])
    Visual sensors, including 3D LiDAR, neuromorphic DVS sensors, and conventional frame cameras, are increasingly integrated into edge-side intelligent machines. Realizing intensive multi-sensory data analysis directly on edge intelligent machines is crucial for numerous emerging edge applications, such as augmented and virtual reality and unmanned aerial vehicles, which necessitates unified data representation, unprecedented hardware energy efficiency and rapid model training. However, multi-sensory data are intrinsically heterogeneous, causing significant complexity in the system development for edge-side intelligent machines. In addition, the performance of conventional digital hardware is limited by the physically separated processing and memory units, known as the von Neumann bottleneck, and the physical limit of transistor scaling, which contributes to the slowdown of Moore's law. These limitations are further intensified by the tedious training of models with ever-increasing sizes. We propose a novel hardware-software co-design, random resistive memory-based deep extreme point learning machine (DEPLM), that offers efficient unified point set analysis. We show the system's versatility across various data modalities and two different learning tasks. Compared to a conventional digital hardware-based system, our co-design system achieves huge energy efficiency improvements and training cost reduction when compared to conventional systems. Our random resistive memory-based deep extreme point learning machine may pave the way for energy-efficient and training-friendly edge AI across various data modalities and tasks.  ( 3 min )
    Acoustic models of Brazilian Portuguese Speech based on Neural Transformers. (arXiv:2312.09265v1 [cs.SD])
    An acoustic model, trained on a significant amount of unlabeled data, consists of a self-supervised learned speech representation useful for solving downstream tasks, perhaps after a fine-tuning of the model in the respective downstream task. In this work, we build an acoustic model of Brazilian Portuguese Speech through a Transformer neural network. This model was pretrained on more than $800$ hours of Brazilian Portuguese Speech, using a combination of pretraining techniques. Using a labeled dataset collected for the detection of respiratory insufficiency in Brazilian Portuguese speakers, we fine-tune the pretrained Transformer neural network on the following tasks: respiratory insufficiency detection, gender recognition and age group classification. We compare the performance of pretrained Transformers on these tasks with that of Transformers without previous pretraining, noting a significant improvement. In particular, the performance of respiratory insufficiency detection obtains the best reported results so far, indicating this kind of acoustic model as a promising tool for speech-as-biomarker approach. Moreover, the performance of gender recognition is comparable to the state of the art models in English.  ( 2 min )
    Self-Evaluation Improves Selective Generation in Large Language Models. (arXiv:2312.09300v1 [cs.CL])
    Safe deployment of large language models (LLMs) may benefit from a reliable method for assessing their generated content to determine when to abstain or to selectively generate. While likelihood-based metrics such as perplexity are widely employed, recent research has demonstrated the limitations of using sequence-level probability estimates given by LLMs as reliable indicators of generation quality. Conversely, LLMs have demonstrated strong calibration at the token level, particularly when it comes to choosing correct answers in multiple-choice questions or evaluating true/false statements. In this work, we reformulate open-ended generation tasks into token-level prediction tasks, and leverage LLMs' superior calibration at the token level. We instruct an LLM to self-evaluate its answers, employing either a multi-way comparison or a point-wise evaluation approach, with the option to include a ``None of the above'' option to express the model's uncertainty explicitly. We benchmark a range of scoring methods based on self-evaluation and evaluate their performance in selective generation using TruthfulQA and TL;DR. Through experiments with PaLM-2 and GPT-3, we demonstrate that self-evaluation based scores not only improve accuracy, but also correlate better with the overall quality of generated content.  ( 2 min )
    Perspectives on the State and Future of Deep Learning -- 2023. (arXiv:2312.09323v1 [cs.AI])
    The goal of this series is to chronicle opinions and issues in the field of machine learning as they stand today and as they change over time. The plan is to host this survey periodically until the AI singularity paperclip-frenzy-driven doomsday, keeping an updated list of topical questions and interviewing new community members for each edition. In this issue, we probed people's opinions on interpretable AI, the value of benchmarking in modern NLP, the state of progress towards understanding deep learning, and the future of academia.  ( 2 min )
    Livestock feeding behavior: A tutorial review on automated techniques for ruminant monitoring. (arXiv:2312.09259v1 [eess.SP])
    Livestock feeding behavior is an influential research area for those involved in animal husbandry and agriculture. In recent years, there has been a growing interest in automated systems for monitoring the behavior of ruminants. Despite the developments accomplished in the last decade, there is still much to do and learn about the methods for measuring and analyzing livestock feeding behavior. Automated monitoring systems mainly use motion, acoustic, and image sensors to collect animal behavioral data. The performance evaluation of existing methods is a complex task and direct comparisons between studies are difficult. Several factors prevent a direct comparison, starting from the diversity of data and performance metrics used in the experiments. To the best of our knowledge, this work represents the first tutorial-style review on the analysis of the feeding behavior of ruminants, emphasizing the relationship between sensing methodologies, signal processing and computational intelligence methods. It assesses the main sensing methodologies (i.e. based on movement, sound, images/videos and pressure) and the main techniques to measure and analyze the signals associated with feeding behavior, evaluating their use in different settings and situations. It also highlights the potentiality of automated monitoring systems to provide valuable information that improves our understanding of livestock feeding behavior. The relevance of these systems is increasingly important due to their impact on production systems and research. Finally, the paper closes by discussing future challenges and opportunities in livestock feeding behavior monitoring.  ( 3 min )
    Brain-Inspired Machine Intelligence: A Survey of Neurobiologically-Plausible Credit Assignment. (arXiv:2312.09257v1 [cs.NE])
    In this survey, we examine algorithms for conducting credit assignment in artificial neural networks that are inspired or motivated by neurobiology, unifying these various processes under one possible taxonomy. Our proposed taxonomy is constructed based on how a learning algorithm answers a central question underpinning the mechanisms of synaptic plasticity in complex adaptive neuronal systems: where do the signals that drive the learning in individual elements of a network come from and how are they produced? In this unified treatment, we organize the ever-growing set of brain-inspired learning processes into six general families and consider these in the context of backpropagation of errors and its known criticisms. The results of this review are meant to encourage future developments in neuro-mimetic systems and their constituent learning processes, wherein lies the opportunity to build a strong bridge between machine learning, computational neuroscience, and cognitive science.  ( 2 min )
    Efficient speech detection in environmental audio using acoustic recognition and knowledge distillation. (arXiv:2312.09269v1 [cs.SD])
    The ongoing biodiversity crisis, driven by factors such as land-use change and global warming, emphasizes the need for effective ecological monitoring methods. Acoustic monitoring of biodiversity has emerged as an important monitoring tool. Detecting human voices in soundscape monitoring projects is useful both for analysing human disturbance and for privacy filtering. Despite significant strides in deep learning in recent years, the deployment of large neural networks on compact devices poses challenges due to memory and latency constraints. Our approach focuses on leveraging knowledge distillation techniques to design efficient, lightweight student models for speech detection in bioacoustics. In particular, we employed the MobileNetV3-Small-Pi model to create compact yet effective student architectures to compare against the larger EcoVAD teacher model, a well-regarded voice detection architecture in eco-acoustic monitoring. The comparative analysis included examining various configurations of the MobileNetV3-Small-Pi derived student models to identify optimal performance. Additionally, a thorough evaluation of different distillation techniques was conducted to ascertain the most effective method for model selection. Our findings revealed that the distilled models exhibited comparable performance to the EcoVAD teacher model, indicating a promising approach to overcoming computational barriers for real-time ecological monitoring.  ( 2 min )
    Weight subcloning: direct initialization of transformers using larger pretrained ones. (arXiv:2312.09299v1 [cs.LG])
    Training large transformer models from scratch for a target task requires lots of data and is computationally demanding. The usual practice of transfer learning overcomes this challenge by initializing the model with weights of a pretrained model of the same size and specification to increase the convergence and training speed. However, what if no pretrained model of the required size is available? In this paper, we introduce a simple yet effective technique to transfer the knowledge of a pretrained model to smaller variants. Our approach called weight subcloning expedites the training of scaled-down transformers by initializing their weights from larger pretrained models. Weight subcloning involves an operation on the pretrained model to obtain the equivalent initialized scaled-down model. It consists of two key steps: first, we introduce neuron importance ranking to decrease the embedding dimension per layer in the pretrained model. Then, we remove blocks from the transformer model to match the number of layers in the scaled-down network. The result is a network ready to undergo training, which gains significant improvements in training speed compared to random initialization. For instance, we achieve 4x faster training for vision transformers in image classification and language models designed for next token prediction.  ( 2 min )
    A Hierarchical Nearest Neighbour Approach to Contextual Bandits. (arXiv:2312.09332v1 [cs.LG])
    In this paper we consider the adversarial contextual bandit problem in metric spaces. The paper "Nearest neighbour with bandit feedback" tackled this problem but when there are many contexts near the decision boundary of the comparator policy it suffers from a high regret. In this paper we eradicate this problem, designing an algorithm in which we can hold out any set of contexts when computing our regret term. Our algorithm builds on that of "Nearest neighbour with bandit feedback" and hence inherits its extreme computational efficiency.  ( 2 min )
    Well-calibrated Confidence Measures for Multi-label Text Classification with a Large Number of Labels. (arXiv:2312.09304v1 [cs.LG])
    We extend our previous work on Inductive Conformal Prediction (ICP) for multi-label text classification and present a novel approach for addressing the computational inefficiency of the Label Powerset (LP) ICP, arrising when dealing with a high number of unique labels. We present experimental results using the original and the proposed efficient LP-ICP on two English and one Czech language data-sets. Specifically, we apply the LP-ICP on three deep Artificial Neural Network (ANN) classifiers of two types: one based on contextualised (bert) and two on non-contextualised (word2vec) word-embeddings. In the LP-ICP setting we assign nonconformity scores to label-sets from which the corresponding p-values and prediction-sets are determined. Our approach deals with the increased computational burden of LP by eliminating from consideration a significant number of label-sets that will surely have p-values below the specified significance level. This reduces dramatically the computational complexity of the approach while fully respecting the standard CP guarantees. Our experimental results show that the contextualised-based classifier surpasses the non-contextualised-based ones and obtains state-of-the-art performance for all data-sets examined. The good performance of the underlying classifiers is carried on to their ICP counterparts without any significant accuracy loss, but with the added benefits of ICP, i.e. the confidence information encapsulated in the prediction sets. We experimentally demonstrate that the resulting prediction sets can be tight enough to be practically useful even though the set of all possible label-sets contains more than $1e+16$ combinations. Additionally, the empirical error rates of the obtained prediction-sets confirm that our outputs are well-calibrated.  ( 3 min )
  • Open

    Stochastic interpolants with data-dependent couplings. (arXiv:2310.03725v2 [cs.LG] UPDATED)
    Generative models inspired by dynamical transport of measure -- such as flows and diffusions -- construct a continuous-time map between two probability densities. Conventionally, one of these is the target density, only accessible through samples, while the other is taken as a simple base density that is data-agnostic. In this work, using the framework of stochastic interpolants, we formalize how to \textit{couple} the base and the target densities, whereby samples from the base are computed conditionally given samples from the target in a way that is different from (but does preclude) incorporating information about class labels or continuous embeddings. This enables us to construct dynamical transport maps that serve as conditional generative models. We show that these transport maps can be learned by solving a simple square loss regression problem analogous to the standard independent setting. We demonstrate the usefulness of constructing dependent couplings in practice through experiments in super-resolution and in-painting.  ( 2 min )
    Variational excess risk bound for general state space models. (arXiv:2312.09607v1 [stat.ME])
    In this paper, we consider variational autoencoders (VAE) for general state space models. We consider a backward factorization of the variational distributions to analyze the excess risk associated with VAE. Such backward factorizations were recently proposed to perform online variational learning and to obtain upper bounds on the variational estimation error. When independent trajectories of sequences are observed and under strong mixing assumptions on the state space model and on the variational distribution, we provide an oracle inequality explicit in the number of samples and in the length of the observation sequences. We then derive consequences of this theoretical result. In particular, when the data distribution is given by a state space model, we provide an upper bound for the Kullback-Leibler divergence between the data distribution and its estimator and between the variational posterior and the estimated state space posterior distributions.Under classical assumptions, we prove that our results can be applied to Gaussian backward kernels built with dense and recurrent neural networks.  ( 2 min )
    Unsupervised and Supervised learning by Dense Associative Memory under replica symmetry breaking. (arXiv:2312.09638v1 [cond-mat.dis-nn])
    Statistical mechanics of spin glasses is one of the main strands toward a comprehension of information processing by neural networks and learning machines. Tackling this approach, at the fairly standard replica symmetric level of description, recently Hebbian attractor networks with multi-node interactions (often called Dense Associative Memories) have been shown to outperform their classical pairwise counterparts in a number of tasks, from their robustness against adversarial attacks and their capability to work with prohibitively weak signals to their supra-linear storage capacities. Focusing on mathematical techniques more than computational aspects, in this paper we relax the replica symmetric assumption and we derive the one-step broken-replica-symmetry picture of supervised and unsupervised learning protocols for these Dense Associative Memories: a phase diagram in the space of the control parameters is achieved, independently, both via the Parisi's hierarchy within then replica trick as well as via the Guerra's telescope within the broken-replica interpolation. Further, an explicit analytical investigation is provided to deepen both the big-data and ground state limits of these networks as well as a proof that replica symmetry breaking does not alter the thresholds for learning and slightly increases the maximal storage capacity. Finally the De Almeida and Thouless line, depicting the onset of instability of a replica symmetric description, is also analytically derived highlighting how, crossed this boundary, the broken replica description should be preferred.  ( 3 min )
    The Optimal Approximation Factors in Misspecified Off-Policy Value Function Estimation. (arXiv:2307.13332v2 [cs.LG] UPDATED)
    Theoretical guarantees in reinforcement learning (RL) are known to suffer multiplicative blow-up factors with respect to the misspecification error of function approximation. Yet, the nature of such \emph{approximation factors} -- especially their optimal form in a given learning problem -- is poorly understood. In this paper we study this question in linear off-policy value function estimation, where many open questions remain. We study the approximation factor in a broad spectrum of settings, such as with the weighted $L_2$-norm (where the weighting is the offline state distribution), the $L_\infty$ norm, the presence vs. absence of state aliasing, and full vs. partial coverage of the state space. We establish the optimal asymptotic approximation factors (up to constants) for all of these settings. In particular, our bounds identify two instance-dependent factors for the $L_2(\mu)$ norm and only one for the $L_\infty$ norm, which are shown to dictate the hardness of off-policy evaluation under misspecification.  ( 2 min )
    Nonlinear Meta-Learning Can Guarantee Faster Rates. (arXiv:2307.10870v2 [stat.ML] UPDATED)
    Many recent theoretical works on \emph{meta-learning} aim to achieve guarantees in leveraging similar representational structures from related tasks towards simplifying a target task. Importantly, the main aim in theory works on the subject is to understand the extent to which convergence rates -- in learning a common representation -- \emph{may scale with the number $N$ of tasks} (as well as the number of samples per task). First steps in this setting demonstrate this property when both the shared representation amongst tasks, and task-specific regression functions, are linear. This linear setting readily reveals the benefits of aggregating tasks, e.g., via averaging arguments. In practice, however, the representation is often highly nonlinear, introducing nontrivial biases in each task that cannot easily be averaged out as in the linear case. In the present work, we derive theoretical guarantees for meta-learning with nonlinear representations. In particular, assuming the shared nonlinearity maps to an infinite-dimensional RKHS, we show that additional biases can be mitigated with careful regularization that leverages the smoothness of task-specific regression functions,  ( 2 min )
    Distributed Semi-Supervised Sparse Statistical Inference. (arXiv:2306.10395v2 [stat.ML] UPDATED)
    The debiased estimator is a crucial tool in statistical inference for high-dimensional model parameters. However, constructing such an estimator involves estimating the high-dimensional inverse Hessian matrix, incurring significant computational costs. This challenge becomes particularly acute in distributed setups, where traditional methods necessitate computing a debiased estimator on every machine. This becomes unwieldy, especially with a large number of machines. In this paper, we delve into semi-supervised sparse statistical inference in a distributed setup. An efficient multi-round distributed debiased estimator, which integrates both labeled and unlabelled data, is developed. We will show that the additional unlabeled data helps to improve the statistical rate of each round of iteration. Our approach offers tailored debiasing methods for $M$-estimation and generalized linear models according to the specific form of the loss function. Our method also applies to a non-smooth loss like absolute deviation loss. Furthermore, our algorithm is computationally efficient since it requires only one estimation of a high-dimensional inverse covariance matrix. We demonstrate the effectiveness of our method by presenting simulation studies and real data applications that highlight the benefits of incorporating unlabeled data.  ( 2 min )
    Decomposed Diffusion Sampler for Accelerating Large-Scale Inverse Problems. (arXiv:2303.05754v2 [cs.LG] UPDATED)
    Krylov subspace, which is generated by multiplying a given vector by the matrix of a linear transformation and its successive powers, has been extensively studied in classical optimization literature to design algorithms that converge quickly for large linear inverse problems. For example, the conjugate gradient method (CG), one of the most popular Krylov subspace methods, is based on the idea of minimizing the residual error in the Krylov subspace. However, with the recent advancement of high-performance diffusion solvers for inverse problems, it is not clear how classical wisdom can be synergistically combined with modern diffusion models. In this study, we propose a novel and efficient diffusion sampling strategy that synergistically combine the diffusion sampling and Krylov subspace methods. Specifically, we prove that if the tangent space at a denoised sample by Tweedie's formula forms a Krylov subspace, then the CG initialized with the denoised data ensures the data consistency update to remain in the tangent space. This negates the need to compute the manifold-constrained gradient (MCG), leading to a more efficient diffusion sampling method. Our method is applicable regardless of the parametrization and setting (i.e., VE, VP). Notably, we achieve state-of-the-art reconstruction quality on challenging real-world medical inverse imaging problems, including multi-coil MRI reconstruction and 3D CT reconstruction. Moreover, our proposed method achieves more than 80 times faster inference time than the previous state-of-the-art method.  ( 3 min )
    Machine-Learned Exclusion Limits without Binning. (arXiv:2211.04806v2 [hep-ph] UPDATED)
    Machine-Learned Likelihoods (MLL) combines machine-learning classification techniques with likelihood-based inference tests to estimate the experimental sensitivity of high-dimensional data sets. We extend the MLL method by including Kernel Density Estimators (KDE) to avoid binning the classifier output to extract the resulting one-dimensional signal and background probability density functions. We first test our method on toy models generated with multivariate Gaussian distributions, where the true probability distribution functions are known. Later, we apply the method to two cases of interest at the LHC: a search for exotic Higgs bosons, and a $Z'$ boson decaying into lepton pairs. In contrast to physical-based quantities, the typical fluctuations of the ML outputs give non-smooth probability distributions for pure-signal and pure-background samples. The non-smoothness is propagated into the density estimation due to the good performance and flexibility of the KDE method. We study its impact on the final significance computation, and we compare the results using the average of several independent ML output realizations, which allows us to obtain smoother distributions. We conclude that the significance estimation turns out to be not sensible to this issue.  ( 3 min )
    Optimal Estimation of Generic Dynamics by Path-Dependent Neural Jump ODEs. (arXiv:2206.14284v5 [stat.ML] UPDATED)
    This paper studies the problem of forecasting general stochastic processes using a path-dependent extension of the Neural Jump ODE (NJ-ODE) framework \citep{herrera2021neural}. While NJ-ODE was the first framework to establish convergence guarantees for the prediction of irregularly observed time series, these results were limited to data stemming from It\^o-diffusions with complete observations, in particular Markov processes, where all coordinates are observed simultaneously. In this work, we generalise these results to generic, possibly non-Markovian or discontinuous, stochastic processes with incomplete observations, by utilising the reconstruction properties of the signature transform. These theoretical results are supported by empirical studies, where it is shown that the path-dependent NJ-ODE outperforms the original NJ-ODE framework in the case of non-Markovian data. Moreover, we show that PD-NJ-ODE can be applied successfully to classical stochastic filtering problems and to limit order book (LOB) data.  ( 2 min )
    Generic Unsupervised Optimization for a Latent Variable Model With Exponential Family Observables. (arXiv:2003.02214v3 [cs.LG] UPDATED)
    Latent variable models (LVMs) represent observed variables by parameterized functions of latent variables. Prominent examples of LVMs for unsupervised learning are probabilistic PCA or probabilistic SC which both assume a weighted linear summation of the latents to determine the mean of a Gaussian distribution for the observables. In many cases, however, observables do not follow a Gaussian distribution. For unsupervised learning, LVMs which assume specific non-Gaussian observables have therefore been considered. Already for specific choices of distributions, parameter optimization is challenging and only a few previous contributions considered LVMs with more generally defined observable distributions. Here, we consider LVMs that are defined for a range of different distributions, i.e., observables can follow any (regular) distribution of the exponential family. The novel class of LVMs presented is defined for binary latents, and it uses maximization in place of summation to link the latents to observables. To derive an optimization procedure, we follow an EM approach for maximum likelihood parameter estimation. We show that a set of very concise parameter update equations can be derived which feature the same functional form for all exponential family distributions. The derived generic optimization can consequently be applied to different types of metric data as well as to different types of discrete data. Also, the derived optimization equations can be combined with a recently suggested variational acceleration which is likewise generically applicable to the LVMs considered here. So, the combination maintains generic and direct applicability of the derived optimization procedure, but, crucially, enables efficient scalability. We numerically verify our analytical results and discuss some potential applications such as learning of variance structure, noise type estimation and denoising.  ( 3 min )
    Modeling Unknown Stochastic Dynamical System via Autoencoder. (arXiv:2312.10001v1 [cs.LG])
    We present a numerical method to learn an accurate predictive model for an unknown stochastic dynamical system from its trajectory data. The method seeks to approximate the unknown flow map of the underlying system. It employs the idea of autoencoder to identify the unobserved latent random variables. In our approach, we design an encoding function to discover the latent variables, which are modeled as unit Gaussian, and a decoding function to reconstruct the future states of the system. Both the encoder and decoder are expressed as deep neural networks (DNNs). Once the DNNs are trained by the trajectory data, the decoder serves as a predictive model for the unknown stochastic system. Through an extensive set of numerical examples, we demonstrate that the method is able to produce long-term system predictions by using short bursts of trajectory data. It is also applicable to systems driven by non-Gaussian noises.  ( 2 min )
    Scalable and hyper-parameter-free non-parametric covariate shift adaptation with conditional sampling. (arXiv:2312.09969v1 [stat.ML])
    Many existing covariate shift adaptation methods estimate sample weights to be used in the risk estimation in order to mitigate the gap between the source and the target distribution. However, non-parametrically estimating the optimal weights typically involves computationally expensive hyper-parameter tuning that is crucial to the final performance. In this paper, we propose a new non-parametric approach to covariate shift adaptation which avoids estimating weights and has no hyper-parameter to be tuned. Our basic idea is to label unlabeled target data according to the $k$-nearest neighbors in the source dataset. Our analysis indicates that setting $k = 1$ is an optimal choice. Thanks to this property, there is no need to tune any hyper-parameters, unlike other non-parametric methods. Moreover, our method achieves a running time quasi-linear in the sample size with a theoretical guarantee, for the first time in the literature to the best of our knowledge. Our results include sharp rates of convergence for estimating the joint probability distribution of the target data. In particular, the variance of our estimators has the same rate of convergence as for standard parametric estimation despite their non-parametric nature. Our numerical experiments show that proposed method brings drastic reduction in the running time with accuracy comparable to that of the state-of-the-art methods.  ( 2 min )
    Toward Computationally Efficient Inverse Reinforcement Learning via Reward Shaping. (arXiv:2312.09983v1 [cs.LG])
    Inverse reinforcement learning (IRL) is computationally challenging, with common approaches requiring the solution of multiple reinforcement learning (RL) sub-problems. This work motivates the use of potential-based reward shaping to reduce the computational burden of each RL sub-problem. This work serves as a proof-of-concept and we hope will inspire future developments towards computationally efficient IRL.  ( 2 min )
    Risk-Aware Continuous Control with Neural Contextual Bandits. (arXiv:2312.09961v1 [cs.LG])
    Recent advances in learning techniques have garnered attention for their applicability to a diverse range of real-world sequential decision-making problems. Yet, many practical applications have critical constraints for operation in real environments. Most learning solutions often neglect the risk of failing to meet these constraints, hindering their implementation in real-world contexts. In this paper, we propose a risk-aware decision-making framework for contextual bandit problems, accommodating constraints and continuous action spaces. Our approach employs an actor multi-critic architecture, with each critic characterizing the distribution of performance and constraint metrics. Our framework is designed to cater to various risk levels, effectively balancing constraint satisfaction against performance. To demonstrate the effectiveness of our approach, we first compare it against state-of-the-art baseline methods in a synthetic environment, highlighting the impact of intrinsic environmental noise across different risk configurations. Finally, we evaluate our framework in a real-world use case involving a 5G mobile network where only our approach consistently satisfies the system constraint (a signal processing reliability target) with a small performance toll (8.5% increase in power consumption).  ( 2 min )
    Probabilistic learning of the Purkinje network from the electrocardiogram. (arXiv:2312.09887v1 [stat.ML])
    The identification of the Purkinje conduction system in the heart is a challenging task, yet essential for a correct definition of cardiac digital twins for precision cardiology. Here, we propose a probabilistic approach for identifying the Purkinje network from non-invasive clinical data such as the standard electrocardiogram (ECG). We use cardiac imaging to build an anatomically accurate model of the ventricles; we algorithmically generate a rule-based Purkinje network tailored to the anatomy; we simulate physiological electrocardiograms with a fast model; we identify the geometrical and electrical parameters of the Purkinje-ECG model with Bayesian optimization and approximate Bayesian computation. The proposed approach is inherently probabilistic and generates a population of plausible Purkinje networks, all fitting the ECG within a given tolerance. In this way, we can estimate the uncertainty of the parameters, thus providing reliable predictions. We test our methodology in physiological and pathological scenarios, showing that we are able to accurately recover the ECG with our model. We propagate the uncertainty in the Purkinje network parameters in a simulation of conduction system pacing therapy. Our methodology is a step forward in creation of digital twins from non-invasive data in precision medicine. An open source implementation can be found at this http URL  ( 2 min )
    Sketch and shift: a robust decoder for compressive clustering. (arXiv:2312.09940v1 [cs.LG])
    Compressive learning is an emerging approach to drastically reduce the memory footprint of large-scale learning, by first summarizing a large dataset into a low-dimensional sketch vector, and then decoding from this sketch the latent information needed for learning. In light of recent progress on information preservation guarantees for sketches based on random features, a major objective is to design easy-to-tune algorithms (called decoders) to robustly and efficiently extract this information. To address the underlying non-convex optimization problems, various heuristics have been proposed. In the case of compressive clustering, the standard heuristic is CL-OMPR, a variant of sliding Frank-Wolfe. Yet, CL-OMPR is hard to tune, and the examination of its robustness was overlooked. In this work, we undertake a scrutinized examination of CL-OMPR to circumvent its limitations. In particular, we show how this algorithm can fail to recover the clusters even in advantageous scenarios. To gain insight, we show how the deficiencies of this algorithm can be attributed to optimization difficulties related to the structure of a correlation function appearing at core steps of the algorithm. To address these limitations, we propose an alternative decoder offering substantial improvements over CL-OMPR. Its design is notably inspired from the mean shift algorithm, a classic approach to detect the local maxima of kernel density estimators. The proposed algorithm can extract clustering information from a sketch of the MNIST dataset that is 10 times smaller than previously.  ( 3 min )
    Distributed Learning of Mixtures of Experts. (arXiv:2312.09877v1 [cs.LG])
    In modern machine learning problems we deal with datasets that are either distributed by nature or potentially large for which distributing the computations is usually a standard way to proceed, since centralized algorithms are in general ineffective. We propose a distributed learning approach for mixtures of experts (MoE) models with an aggregation strategy to construct a reduction estimator from local estimators fitted parallelly to distributed subsets of the data. The aggregation is based on an optimal minimization of an expected transportation divergence between the large MoE composed of local estimators and the unknown desired MoE model. We show that the provided reduction estimator is consistent as soon as the local estimators to be aggregated are consistent, and its construction is performed by a proposed majorization-minimization (MM) algorithm that is computationally effective. We study the statistical and numerical properties for the proposed reduction estimator on experiments that demonstrate its performance compared to namely the global estimator constructed in a centralized way from the full dataset. For some situations, the computation time is more than ten times faster, for a comparable performance. Our source codes are publicly available on Github.  ( 2 min )
    Deep Unsupervised Domain Adaptation for Time Series Classification: a Benchmark. (arXiv:2312.09857v1 [cs.LG])
    Unsupervised Domain Adaptation (UDA) aims to harness labeled source data to train models for unlabeled target data. Despite extensive research in domains like computer vision and natural language processing, UDA remains underexplored for time series data, which has widespread real-world applications ranging from medicine and manufacturing to earth observation and human activity recognition. Our paper addresses this gap by introducing a comprehensive benchmark for evaluating UDA techniques for time series classification, with a focus on deep learning methods. We provide seven new benchmark datasets covering various domain shifts and temporal dynamics, facilitating fair and standardized UDA method assessments with state of the art neural network backbones (e.g. Inception) for time series data. This benchmark offers insights into the strengths and limitations of the evaluated approaches while preserving the unsupervised nature of domain adaptation, making it directly applicable to practical problems. Our paper serves as a vital resource for researchers and practitioners, advancing domain adaptation solutions for time series data and fostering innovation in this critical field. The implementation code of this benchmark is available at https://github.com/EricssonResearch/UDA-4-TSC.  ( 2 min )
    Learning Distributions on Manifolds with Free-form Flows. (arXiv:2312.09852v1 [cs.LG])
    Many real world data, particularly in the natural sciences and computer vision, lie on known Riemannian manifolds such as spheres, tori or the group of rotation matrices. The predominant approaches to learning a distribution on such a manifold require solving a differential equation in order to sample from the model and evaluate densities. The resulting sampling times are slowed down by a high number of function evaluations. In this work, we propose an alternative approach which only requires a single function evaluation followed by a projection to the manifold. Training is achieved by an adaptation of the recently proposed free-form flow framework to Riemannian manifolds. The central idea is to estimate the gradient of the negative log-likelihood via a trace evaluated in the tangent space. We evaluate our method on various manifolds, and find significantly faster inference at competitive performance compared to previous work. We make our code public at https://github.com/vislearn/FFF.  ( 2 min )
    PAC-Bayes Generalisation Bounds for Dynamical Systems Including Stable RNNs. (arXiv:2312.09793v1 [cs.LG])
    In this paper, we derive a PAC-Bayes bound on the generalisation gap, in a supervised time-series setting for a special class of discrete-time non-linear dynamical systems. This class includes stable recurrent neural networks (RNN), and the motivation for this work was its application to RNNs. In order to achieve the results, we impose some stability constraints, on the allowed models. Here, stability is understood in the sense of dynamical systems. For RNNs, these stability conditions can be expressed in terms of conditions on the weights. We assume the processes involved are essentially bounded and the loss functions are Lipschitz. The proposed bound on the generalisation gap depends on the mixing coefficient of the data distribution, and the essential supremum of the data. Furthermore, the bound converges to zero as the dataset size increases. In this paper, we 1) formalize the learning problem, 2) derive a PAC-Bayesian error bound for such systems, 3) discuss various consequences of this error bound, and 4) show an illustrative example, with discussions on computing the proposed bound. Unlike other available bounds the derived bound holds for non i.i.d. data (time-series) and it does not grow with the number of steps of the RNN.  ( 2 min )
    Calibrated One Round Federated Learning with Bayesian Inference in the Predictive Space. (arXiv:2312.09817v1 [cs.LG])
    Federated Learning (FL) involves training a model over a dataset distributed among clients, with the constraint that each client's dataset is localized and possibly heterogeneous. In FL, small and noisy datasets are common, highlighting the need for well-calibrated models that represent the uncertainty of predictions. The closest FL techniques to achieving such goals are the Bayesian FL methods which collect parameter samples from local posteriors, and aggregate them to approximate the global posterior. To improve scalability for larger models, one common Bayesian approach is to approximate the global predictive posterior by multiplying local predictive posteriors. In this work, we demonstrate that this method gives systematically overconfident predictions, and we remedy this by proposing $\beta$-Predictive Bayes, a Bayesian FL algorithm that interpolates between a mixture and product of the predictive posteriors, using a tunable parameter $\beta$. This parameter is tuned to improve the global ensemble's calibration, before it is distilled to a single model. Our method is evaluated on a variety of regression and classification datasets to demonstrate its superiority in calibration to other baselines, even as data heterogeneity increases. Code available at https://github.com/hasanmohsin/betaPredBayes_FL  ( 2 min )
    Vectorizing string entries for data processing on tables: when are larger language models better?. (arXiv:2312.09634v1 [stat.ML])
    There are increasingly efficient data processing pipelines that work on vectors of numbers, for instance most machine learning models, or vector databases for fast similarity search. These require converting the data to numbers. While this conversion is easy for simple numerical and categorical entries, databases are strife with text entries, such as names or descriptions. In the age of large language models, what's the best strategies to vectorize tables entries, baring in mind that larger models entail more operational complexity? We study the benefits of language models in 14 analytical tasks on tables while varying the training size, as well as for a fuzzy join benchmark. We introduce a simple characterization of a column that reveals two settings: 1) a dirty categories setting, where strings share much similarities across entries, and conversely 2) a diverse entries setting. For dirty categories, pretrained language models bring little-to-no benefit compared to simpler string models. For diverse entries, we show that larger language models improve data processing. For these we investigate the complexity-performance tradeoffs and show that they reflect those of classic text embedding: larger models tend to perform better, but it is useful to fine tune them for embedding purposes.  ( 2 min )
    Optimal Regret Bounds for Collaborative Learning in Bandits. (arXiv:2312.09674v1 [cs.LG])
    We consider regret minimization in a general collaborative multi-agent multi-armed bandit model, in which each agent faces a finite set of arms and may communicate with other agents through a central controller. The optimal arm for each agent in this model is the arm with the largest expected mixed reward, where the mixed reward of each arm is a weighted average of its rewards across all agents, making communication among agents crucial. While near-optimal sample complexities for best arm identification are known under this collaborative model, the question of optimal regret remains open. In this work, we address this problem and propose the first algorithm with order optimal regret bounds under this collaborative bandit model. Furthermore, we show that only a small constant number of expected communication rounds is needed.  ( 2 min )
    Rethinking Causal Relationships Learning in Graph Neural Networks. (arXiv:2312.09613v1 [cs.LG])
    Graph Neural Networks (GNNs) demonstrate their significance by effectively modeling complex interrelationships within graph-structured data. To enhance the credibility and robustness of GNNs, it becomes exceptionally crucial to bolster their ability to capture causal relationships. However, despite recent advancements that have indeed strengthened GNNs from a causal learning perspective, conducting an in-depth analysis specifically targeting the causal modeling prowess of GNNs remains an unresolved issue. In order to comprehensively analyze various GNN models from a causal learning perspective, we constructed an artificially synthesized dataset with known and controllable causal relationships between data and labels. The rationality of the generated data is further ensured through theoretical foundations. Drawing insights from analyses conducted using our dataset, we introduce a lightweight and highly adaptable GNN module designed to strengthen GNNs' causal learning capabilities across a diverse range of tasks. Through a series of experiments conducted on both synthetic datasets and other real-world datasets, we empirically validate the effectiveness of the proposed module.  ( 2 min )
    Reliable Prediction Intervals with Regression Neural Networks. (arXiv:2312.09606v1 [cs.LG])
    This paper proposes an extension to conventional regression Neural Networks (NNs) for replacing the point predictions they produce with prediction intervals that satisfy a required level of confidence. Our approach follows a novel machine learning framework, called Conformal Prediction (CP), for assigning reliable confidence measures to predictions without assuming anything more than that the data are independent and identically distributed (i.i.d.). We evaluate the proposed method on four benchmark datasets and on the problem of predicting Total Electron Content (TEC), which is an important parameter in trans-ionospheric links; for the latter we use a dataset of more than 60000 TEC measurements collected over a period of 11 years. Our experimental results show that the prediction intervals produced by our method are both well-calibrated and tight enough to be useful in practice.  ( 2 min )
    Well-calibrated Confidence Measures for Multi-label Text Classification with a Large Number of Labels. (arXiv:2312.09304v1 [cs.LG])
    We extend our previous work on Inductive Conformal Prediction (ICP) for multi-label text classification and present a novel approach for addressing the computational inefficiency of the Label Powerset (LP) ICP, arrising when dealing with a high number of unique labels. We present experimental results using the original and the proposed efficient LP-ICP on two English and one Czech language data-sets. Specifically, we apply the LP-ICP on three deep Artificial Neural Network (ANN) classifiers of two types: one based on contextualised (bert) and two on non-contextualised (word2vec) word-embeddings. In the LP-ICP setting we assign nonconformity scores to label-sets from which the corresponding p-values and prediction-sets are determined. Our approach deals with the increased computational burden of LP by eliminating from consideration a significant number of label-sets that will surely have p-values below the specified significance level. This reduces dramatically the computational complexity of the approach while fully respecting the standard CP guarantees. Our experimental results show that the contextualised-based classifier surpasses the non-contextualised-based ones and obtains state-of-the-art performance for all data-sets examined. The good performance of the underlying classifiers is carried on to their ICP counterparts without any significant accuracy loss, but with the added benefits of ICP, i.e. the confidence information encapsulated in the prediction sets. We experimentally demonstrate that the resulting prediction sets can be tight enough to be practically useful even though the set of all possible label-sets contains more than $1e+16$ combinations. Additionally, the empirical error rates of the obtained prediction-sets confirm that our outputs are well-calibrated.  ( 3 min )
    Modeling and Predicting Epidemic Spread: A Gaussian Process Regression Approach. (arXiv:2312.09384v1 [stat.ML])
    Modeling and prediction of epidemic spread are critical to assist in policy-making for mitigation. Therefore, we present a new method based on Gaussian Process Regression to model and predict epidemics, and it quantifies prediction confidence through variance and high probability error bounds. Gaussian Process Regression excels in using small datasets and providing uncertainty bounds, and both of these properties are critical in modeling and predicting epidemic spreading processes with limited data. However, the derivation of formal uncertainty bounds remains lacking when using Gaussian Process Regression in the setting of epidemics, which limits its usefulness in guiding mitigation efforts. Therefore, in this work, we develop a novel bound on the variance of the prediction that quantifies the impact of the epidemic data on the predictions we make. Further, we develop a high probability error bound on the prediction, and we quantify how the epidemic spread, the infection data, and the length of the prediction horizon all affect this error bound. We also show that the error stays below a certain threshold based on the length of the prediction horizon. To illustrate this framework, we leverage Gaussian Process Regression to model and predict COVID-19 using real-world infection data from the United Kingdom.  ( 3 min )
    Combinatorial Complexes: Bridging the Gap Between Cell Complexes and Hypergraphs. (arXiv:2312.09504v1 [cs.LG])
    Graph-based signal processing techniques have become essential for handling data in non-Euclidean spaces. However, there is a growing awareness that these graph models might need to be expanded into `higher-order' domains to effectively represent the complex relations found in high-dimensional data. Such higher-order domains are typically modeled either as hypergraphs, or as simplicial, cubical or other cell complexes. In this context, cell complexes are often seen as a subclass of hypergraphs with additional algebraic structure that can be exploited, e.g., to develop a spectral theory. In this article, we promote an alternative perspective. We argue that hypergraphs and cell complexes emphasize \emph{different} types of relations, which may have different utility depending on the application context. Whereas hypergraphs are effective in modeling set-type, multi-body relations between entities, cell complexes provide an effective means to model hierarchical, interior-to-boundary type relations. We discuss the relative advantages of these two choices and elaborate on the previously introduced concept of a combinatorial complex that enables co-existing set-type and hierarchical relations. Finally, we provide a brief numerical experiment to demonstrate that this modelling flexibility can be advantageous in learning tasks.  ( 2 min )
    A Hierarchical Nearest Neighbour Approach to Contextual Bandits. (arXiv:2312.09332v1 [cs.LG])
    In this paper we consider the adversarial contextual bandit problem in metric spaces. The paper "Nearest neighbour with bandit feedback" tackled this problem but when there are many contexts near the decision boundary of the comparator policy it suffers from a high regret. In this paper we eradicate this problem, designing an algorithm in which we can hold out any set of contexts when computing our regret term. Our algorithm builds on that of "Nearest neighbour with bandit feedback" and hence inherits its extreme computational efficiency.  ( 2 min )

  • Open

    How can data science and AI help HR in workforce development, evaluation, and retention?
    There have been claims that artificial intelligence is bringing about increased productivity, accuracy, and a smarter workplace. In all of this excitement, it is difficult to differentiate between fact and fantasy. When it comes to the management of workforces, what is the truth there? Within the context of real-world applications, how much hype is there?… Read More »How can data science and AI help HR in workforce development, evaluation, and retention? The post How can data science and AI help HR in workforce development, evaluation, and retention? appeared first on Data Science Central.  ( 29 min )
    Data management implications of the AI Act
    Members of the European Parliament and the Council reached provisional agreement on the Artificial Intelligence Act on December 9th, 2023 after years of debate and discussion. The AI Act is broad in scope and is intended to protect public welfare, digital rights, democracy, and the rule of law from the dangers of AI. The Act… Read More »Data management implications of the AI Act The post Data management implications of the AI Act appeared first on Data Science Central.  ( 22 min )
  • Open

    ISO 42001: A new foundational global standard to advance responsible AI
    Artificial intelligence (AI) is one of the most transformational technologies of our generation and provides opportunities to be a force for good and drive economic growth. The growth of large language models (LLMs), with hundreds of billions of parameters, has unlocked new generative AI use cases to improve customer experiences, boost employee productivity, and so […]  ( 4 min )
    Accelerating time-to-insight with MongoDB time series collections and Amazon SageMaker Canvas
    This is a guest post co-written with Babu Srinivasan from MongoDB. As industries evolve in today’s fast-paced business landscape, the inability to have real-time forecasts poses significant challenges for industries heavily reliant on accurate and timely insights. The absence of real-time forecasts in various industries presents pressing business challenges that can significantly impact decision-making and […]  ( 8 min )
  • Open

    AI Frontiers: A deep dive into deep learning with Ashley Llorens and Chris Bishop
    In this episode of “AI Frontiers,” AI4Science Director Chris Bishop talks about the state of deep learning; his new textbook, “Deep Learning: Foundations and Concepts,” and the impact the field is having on the natural sciences. The post AI Frontiers: A deep dive into deep learning with Ashley Llorens and Chris Bishop appeared first on Microsoft Research.  ( 24 min )
  • Open

    Integrals involving secants and tangents
    As a student, I often made the mistake of thinking that if I knew a more powerful theorem, I didn’t need to learn a less powerful theorem. The reason this is a mistake is that the more powerful theorem may be better by one obvious criterion but not be better by other less-obvious criteria. The […] Integrals involving secants and tangents first appeared on John D. Cook.  ( 5 min )
  • Open

    A Single-Loop Algorithm for Decentralized Bilevel Optimization. (arXiv:2311.08945v2 [math.OC] UPDATED)
    Bilevel optimization has received more and more attention recently due to its wide applications in machine learning. In this paper, we consider bilevel optimization in decentralized networks. In particular, we propose a novel single-loop algorithm for solving decentralized bilevel optimization with strongly convex lower level problem. Our algorithm is fully single-loop and does not require heavy matrix-vector multiplications when approximating the hypergradient. Moreover, unlike existing methods for decentralized bilevel optimization and federated bilevel optimization, our algorithm does not require any gradient heterogeneity assumption. Our analysis shows that the proposed algorithm achieves a sublinear convergence rate. Experimental results on hyperparameter optimization problem with both synthetic and MNIST data sets demonstrate the efficiency of the proposed algorithm.  ( 2 min )

  • Open

    AI and Justice in a Brave New World: Part 3 – AI Governance
    In part 1 of the series “A Different AI Scenario: AI and Justice in a Brave New World,” I outlined some requirements for the role that AI would play in enforcing our laws and regulations in a more just and fair manner and what our human legislators must do to ensure that outcome.  In part… Read More »AI and Justice in a Brave New World: Part 3 – AI Governance The post AI and Justice in a Brave New World: Part 3 – AI Governance appeared first on Data Science Central.  ( 23 min )
  • Open

    GMTR: Graph Matching Transformers. (arXiv:2311.08141v2 [cs.CV] UPDATED)
    Vision transformers (ViTs) have recently been used for visual matching beyond object detection and segmentation. However, the original grid dividing strategy of ViTs neglects the spatial information of the keypoints, limiting the sensitivity to local information. Therefore, we propose QueryTrans (Query Transformer), which adopts a cross-attention module and keypoints-based center crop strategy for better spatial information extraction. We further integrate the graph attention module and devise a transformer-based graph matching approach GMTR (Graph Matching TRansformers) whereby the combinatorial nature of GM is addressed by a graph transformer neural GM solver. On standard GM benchmarks, GMTR shows competitive performance against the SOTA frameworks. Specifically, on Pascal VOC, GMTR achieves $\mathbf{83.6\%}$ accuracy, $\mathbf{0.9\%}$ higher than the SOTA framework. On Spair-71k, GMTR shows great potential and outperforms most of the previous works. Meanwhile, on Pascal VOC, QueryTrans improves the accuracy of NGMv2 from $80.1\%$ to $\mathbf{83.3\%}$, and BBGM from $79.0\%$ to $\mathbf{84.5\%}$. On Spair-71k, QueryTrans improves NGMv2 from $80.6\%$ to $\mathbf{82.5\%}$, and BBGM from $82.1\%$ to $\mathbf{83.9\%}$. Source code will be made publicly available.  ( 2 min )

  • Open

    Structure-Preserving Transformers for Sequences of SPD Matrices. (arXiv:2309.07579v4 [cs.LG] UPDATED)
    In recent years, Transformer-based auto-attention mechanisms have been successfully applied to the analysis of a variety of context-reliant data types, from texts to images and beyond, including data from non-Euclidean geometries. In this paper, we present such a mechanism, designed to classify sequences of Symmetric Positive Definite matrices while preserving their Riemannian geometry throughout the analysis. We apply our method to automatic sleep staging on timeseries of EEG-derived covariance matrices from a standard dataset, obtaining high levels of stage-wise performance.  ( 2 min )
    Recovering from Privacy-Preserving Masking with Large Language Models. (arXiv:2309.08628v3 [cs.CL] UPDATED)
    Model adaptation is crucial to handle the discrepancy between proxy training data and actual users data received. To effectively perform adaptation, textual data of users is typically stored on servers or their local devices, where downstream natural language processing (NLP) models can be directly trained using such in-domain data. However, this might raise privacy and security concerns due to the extra risks of exposing user information to adversaries. Replacing identifying information in textual data with a generic marker has been recently explored. In this work, we leverage large language models (LLMs) to suggest substitutes of masked tokens and have their effectiveness evaluated on downstream language modeling tasks. Specifically, we propose multiple pre-trained and fine-tuned LLM-based approaches and perform empirical studies on various datasets for the comparison of these methods. Experimental results show that models trained on the obfuscation corpora are able to achieve comparable performance with the ones trained on the original data without privacy-preserving token masking.  ( 2 min )
    Physics-informed neural networks for pathloss prediction. (arXiv:2211.12986v2 [stat.ML] UPDATED)
    This paper introduces a physics-informed machine learning approach for pathloss prediction. This is achieved by including in the training phase simultaneously (i) physical dependencies between spatial loss field and (ii) measured pathloss values in the field. It is shown that the solution to a proposed learning problem improves generalization and prediction quality with a small number of neural network layers and parameters. The latter leads to fast inference times which are favorable for downstream tasks such as localization. Moreover, the physics-informed formulation allows training and prediction with a small amount of training data which makes it appealing for a wide range of practical pathloss prediction scenarios.  ( 2 min )
    Unsupervised Detection of Behavioural Drifts with Dynamic Clustering and Trajectory Analysis. (arXiv:2302.06228v2 [cs.LG] UPDATED)
    Real-time monitoring of human behaviours, especially in e-Health applications, has been an active area of research in the past decades. On top of IoT-based sensing environments, anomaly detection algorithms have been proposed for the early detection of abnormalities. Gradual change procedures, commonly referred to as drift anomalies, have received much less attention in the literature because they represent a much more challenging scenario than sudden temporary changes (point anomalies). In this paper, we propose, for the first time, a fully unsupervised real-time drift detection algorithm named DynAmo, which can identify drift periods as they are happening. DynAmo comprises a dynamic clustering component to capture the overall trends of monitored behaviours and a trajectory generation component, which extracts features from the densest cluster centroids. Finally, we apply an ensemble of divergence tests on sliding reference and detection windows to detect drift periods in the behavioural sequence.  ( 2 min )
    Symmetry Breaking and Equivariant Neural Networks. (arXiv:2312.09016v1 [cs.LG])
    Using symmetry as an inductive bias in deep learning has been proven to be a principled approach for sample-efficient model design. However, the relationship between symmetry and the imperative for equivariance in neural networks is not always obvious. Here, we analyze a key limitation that arises in equivariant functions: their incapacity to break symmetry at the level of individual data samples. In response, we introduce a novel notion of 'relaxed equivariance' that circumvents this limitation. We further demonstrate how to incorporate this relaxation into equivariant multilayer perceptrons (E-MLPs), offering an alternative to the noise-injection method. The relevance of symmetry breaking is then discussed in various application domains: physics, graph representation learning, combinatorial optimization and equivariant decoding.  ( 2 min )
    Distributed Stochastic Optimization under a General Variance Condition. (arXiv:2301.12677v3 [math.OC] UPDATED)
    Distributed stochastic optimization has drawn great attention recently due to its effectiveness in solving large-scale machine learning problems. Though numerous algorithms have been proposed and successfully applied to general practical problems, their theoretical guarantees mainly rely on certain boundedness conditions on the stochastic gradients, varying from uniform boundedness to the relaxed growth condition. In addition, how to characterize the data heterogeneity among the agents and its impacts on the algorithmic performance remains challenging. In light of such motivations, we revisit the classical Federated Averaging (FedAvg) algorithm (McMahan et al., 2017) as well as the more recent SCAFFOLD method (Karimireddy et al., 2020) for solving the distributed stochastic optimization problem and establish the convergence results under only a mild variance condition on the stochastic gradients for smooth nonconvex objective functions. Almost sure convergence to a stationary point is also established under the condition. Moreover, we discuss a more informative measurement for data heterogeneity as well as its implications.  ( 2 min )
    A Unified Experiment Design Approach for Cyclic and Acyclic Causal Models. (arXiv:2205.10083v3 [cs.LG] UPDATED)
    We study experiment design for unique identification of the causal graph of a simple SCM, where the graph may contain cycles. The presence of cycles in the structure introduces major challenges for experiment design as, unlike acyclic graphs, learning the skeleton of causal graphs with cycles may not be possible from merely the observational distribution. Furthermore, intervening on a variable in such graphs does not necessarily lead to orienting all the edges incident to it. In this paper, we propose an experiment design approach that can learn both cyclic and acyclic graphs and hence, unifies the task of experiment design for both types of graphs. We provide a lower bound on the number of experiments required to guarantee the unique identification of the causal graph in the worst case, showing that the proposed approach is order-optimal in terms of the number of experiments up to an additive logarithmic term. Moreover, we extend our result to the setting where the size of each experiment is bounded by a constant. For this case, we show that our approach is optimal in terms of the size of the largest experiment required for uniquely identifying the causal graph in the worst case.  ( 3 min )
    Learning from Polar Representation: An Extreme-Adaptive Model for Long-Term Time Series Forecasting. (arXiv:2312.08763v1 [cs.LG])
    In the hydrology field, time series forecasting is crucial for efficient water resource management, improving flood and drought control and increasing the safety and quality of life for the general population. However, predicting long-term streamflow is a complex task due to the presence of extreme events. It requires the capture of long-range dependencies and the modeling of rare but important extreme values. Existing approaches often struggle to tackle these dual challenges simultaneously. In this paper, we specifically delve into these issues and propose Distance-weighted Auto-regularized Neural network (DAN), a novel extreme-adaptive model for long-range forecasting of stremflow enhanced by polar representation learning. DAN utilizes a distance-weighted multi-loss mechanism and stackable blocks to dynamically refine indicator sequences from exogenous data, while also being able to handle uni-variate time-series by employing Gaussian Mixture probability modeling to improve robustness to severe events. We also introduce Kruskal-Wallis sampling and gate control vectors to handle imbalanced extreme data. On four real-life hydrologic streamflow datasets, we demonstrate that DAN significantly outperforms both state-of-the-art hydrologic time series prediction methods and general methods designed for long-term time series prediction.  ( 2 min )
    Learning to Optimize Permutation Flow Shop Scheduling via Graph-based Imitation Learning. (arXiv:2210.17178v2 [cs.LG] UPDATED)
    The permutation flow shop scheduling (PFSS), aiming at finding the optimal permutation of jobs, is widely used in manufacturing systems. When solving large-scale PFSS problems, traditional optimization algorithms such as heuristics could hardly meet the demands of both solution accuracy and computational efficiency, thus learning-based methods have recently garnered more attention. Some work attempts to solve the problems by reinforcement learning methods, which suffer from slow convergence issues during training and are still not accurate enough regarding the solutions. To that end, we propose to train the model via expert-driven imitation learning, which accelerates convergence more stably and accurately. Moreover, in order to extract better feature representations of input jobs, we incorporate the graph structure as the encoder. The extensive experiments reveal that our proposed model obtains significant promotion and presents excellent generalizability in large-scale problems with up to 1000 jobs. Compared to the state-of-the-art reinforcement learning method, our model's network parameters are reduced to only 37\% of theirs, and the solution gap of our model towards the expert solutions decreases from 6.8\% to 1.3\% on average. The code is available at: \url{https://github.com/longkangli/PFSS-IL}.  ( 2 min )
    QCM-SGM+: Improved Quantized Compressed Sensing With Score-Based Generative Models. (arXiv:2302.00919v3 [eess.SP] UPDATED)
    In practical compressed sensing (CS), the obtained measurements typically necessitate quantization to a limited number of bits prior to transmission or storage. This nonlinear quantization process poses significant recovery challenges, particularly with extreme coarse quantization such as 1-bit. Recently, an efficient algorithm called QCS-SGM was proposed for quantized CS (QCS) which utilizes score-based generative models (SGM) as an implicit prior. Due to the adeptness of SGM in capturing the intricate structures of natural signals, QCS-SGM substantially outperforms previous QCS methods. However, QCS-SGM is constrained to (approximately) row-orthogonal sensing matrices as the computation of the likelihood score becomes intractable otherwise. To address this limitation, we introduce an advanced variant of QCS-SGM, termed QCS-SGM+, capable of handling general matrices effectively. The key idea is a Bayesian inference perspective on the likelihood score computation, wherein expectation propagation is employed for its approximate computation. Extensive experiments are conducted, demonstrating the substantial superiority of QCS-SGM+ over QCS-SGM for general sensing matrices beyond mere row-orthogonality.  ( 2 min )
    Impact of Redundancy on Resilience in Distributed Optimization and Learning. (arXiv:2211.08622v2 [cs.DC] UPDATED)
    This report considers the problem of resilient distributed optimization and stochastic learning in a server-based architecture. The system comprises a server and multiple agents, where each agent has its own local cost function. The agents collaborate with the server to find a minimum of the aggregate of the local cost functions. In the context of stochastic learning, the local cost of an agent is the loss function computed over the data at that agent. In this report, we consider this problem in a system wherein some of the agents may be Byzantine faulty and some of the agents may be slow (also called stragglers). In this setting, we investigate the conditions under which it is possible to obtain an "approximate" solution to the above problem. In particular, we introduce the notion of $(f, r; \epsilon)$-resilience to characterize how well the true solution is approximated in the presence of up to $f$ Byzantine faulty agents, and up to $r$ slow agents (or stragglers) -- smaller $\epsilon$ represents a better approximation. We also introduce a measure named $(f, r; \epsilon)$-redundancy to characterize the redundancy in the cost functions of the agents. Greater redundancy allows for a better approximation when solving the problem of aggregate cost minimization. In this report, we constructively show (both theoretically and empirically) that $(f, r; \mathcal{O}(\epsilon))$-resilience can indeed be achieved in practice, given that the local cost functions are sufficiently redundant.  ( 3 min )
    Fast sampling from constrained spaces using the Metropolis-adjusted Mirror Langevin Algorithm. (arXiv:2312.08823v1 [stat.CO])
    We propose a new method called the Metropolis-adjusted Mirror Langevin algorithm for approximate sampling from distributions whose support is a compact and convex set. This algorithm adds an accept-reject filter to the Markov chain induced by a single step of the mirror Langevin algorithm (Zhang et al., 2020), which is a basic discretisation of the mirror Langevin dynamics. Due to the inclusion of this filter, our method is unbiased relative to the target, while known discretisations of the mirror Langevin dynamics including the mirror Langevin algorithm have an asymptotic bias. We give upper bounds for the mixing time of the proposed algorithm when the potential is relatively smooth, convex, and Lipschitz with respect to a self-concordant mirror function. As a consequence of the reversibility of the Markov chain induced by the algorithm, we obtain an exponentially better dependence on the error tolerance for approximate sampling. We also present numerical experiments that corroborate our theoretical findings.  ( 2 min )
    A Framework for Exploring Federated Community Detection. (arXiv:2312.09023v1 [cs.LG])
    Federated Learning is machine learning in the context of a network of clients whilst maintaining data residency and/or privacy constraints. Community detection is the unsupervised discovery of clusters of nodes within graph-structured data. The intersection of these two fields uncovers much opportunity, but also challenge. For example, it adds complexity due to missing connectivity information between privately held graphs. In this work, we explore the potential of federated community detection by conducting initial experiments across a range of existing datasets that showcase the gap in performance introduced by the distributed data. We demonstrate that isolated models would benefit from collaboration establishing a framework for investigating challenges within this domain. The intricacies of these research frontiers are discussed alongside proposed solutions to these issues.  ( 2 min )
    CAT: A Causally Graph Attention Network for Trimming Heterophilic Graph. (arXiv:2312.08672v1 [cs.LG])
    Local Attention-guided Message Passing Mechanism (LAMP) adopted in Graph Attention Networks (GATs) is designed to adaptively learn the importance of neighboring nodes for better local aggregation on the graph, which can bring the representations of similar neighbors closer effectively, thus showing stronger discrimination ability. However, existing GATs suffer from a significant discrimination ability decline in heterophilic graphs because the high proportion of dissimilar neighbors can weaken the self-attention of the central node, jointly resulting in the deviation of the central node from similar nodes in the representation space. This kind of effect generated by neighboring nodes is called the Distraction Effect (DE) in this paper. To estimate and weaken the DE of neighboring nodes, we propose a Causally graph Attention network for Trimming heterophilic graph (CAT). To estimate the DE, since the DE are generated through two paths (grab the attention assigned to neighbors and reduce the self-attention of the central node), we use Total Effect to model DE, which is a kind of causal estimand and can be estimated from intervened data; To weaken the DE, we identify the neighbors with the highest DE (we call them Distraction Neighbors) and remove them. We adopt three representative GATs as the base model within the proposed CAT framework and conduct experiments on seven heterophilic datasets in three different sizes. Comparative experiments show that CAT can improve the node classification accuracy of all base GAT models. Ablation experiments and visualization further validate the enhancement of discrimination ability brought by CAT. The source code is available at https://github.com/GeoX-Lab/CAT.  ( 3 min )
    SpeedUpNet: A Plug-and-Play Hyper-Network for Accelerating Text-to-Image Diffusion Models. (arXiv:2312.08887v1 [cs.CV])
    Text-to-image diffusion models (SD) exhibit significant advancements while requiring extensive computational resources. Though many acceleration methods have been proposed, they suffer from generation quality degradation or extra training cost generalizing to new fine-tuned models. To address these limitations, we propose a novel and universal Stable-Diffusion (SD) acceleration module called SpeedUpNet(SUN). SUN can be directly plugged into various fine-tuned SD models without extra training. This technique utilizes cross-attention layers to learn the relative offsets in the generated image results between negative and positive prompts achieving classifier-free guidance distillation with negative prompts controllable, and introduces a Multi-Step Consistency (MSC) loss to ensure a harmonious balance between reducing inference steps and maintaining consistency in the generated output. Consequently, SUN significantly reduces the number of inference steps to just 4 steps and eliminates the need for classifier-free guidance. It leads to an overall speedup of more than 10 times for SD models compared to the state-of-the-art 25-step DPM-solver++, and offers two extra advantages: (1) classifier-free guidance distillation with controllable negative prompts and (2) seamless integration into various fine-tuned Stable-Diffusion models without training. The effectiveness of the SUN has been verified through extensive experimentation. Project Page: https://williechai.github.io/speedup-plugin-for-stable-diffusions.github.io  ( 2 min )
    Uncertainty in GNN Learning Evaluations: A Comparison Between Measures for Quantifying Randomness in GNN Community Detection. (arXiv:2312.09015v1 [cs.LG])
    (1) The enhanced capability of Graph Neural Networks (GNNs) in unsupervised community detection of clustered nodes is attributed to their capacity to encode both the connectivity and feature information spaces of graphs. The identification of latent communities holds practical significance in various domains, from social networks to genomics. Current real-world performance benchmarks are perplexing due to the multitude of decisions influencing GNN evaluations for this task. (2) Three metrics are compared to assess the consistency of algorithm rankings in the presence of randomness. The consistency and quality of performance between the results under a hyperparameter optimisation with the default hyperparameters is evaluated. (3) The results compare hyperparameter optimisation with default hyperparameters, revealing a significant performance loss when neglecting hyperparameter investigation. A comparison of metrics indicates that ties in ranks can substantially alter the quantification of randomness. (4) Ensuring adherence to the same evaluation criteria may result in notable differences in the reported performance of methods for this task. The $W$ Randomness coefficient, based on the Wasserstein distance, is identified as providing the most robust assessment of randomness.  ( 3 min )
    Leveraging Diffusion-Based Image Variations for Robust Training on Poisoned Data. (arXiv:2310.06372v2 [cs.CR] UPDATED)
    Backdoor attacks pose a serious security threat for training neural networks as they surreptitiously introduce hidden functionalities into a model. Such backdoors remain silent during inference on clean inputs, evading detection due to inconspicuous behavior. However, once a specific trigger pattern appears in the input data, the backdoor activates, causing the model to execute its concealed function. Detecting such poisoned samples within vast datasets is virtually impossible through manual inspection. To address this challenge, we propose a novel approach that enables model training on potentially poisoned datasets by utilizing the power of recent diffusion models. Specifically, we create synthetic variations of all training samples, leveraging the inherent resilience of diffusion models to potential trigger patterns in the data. By combining this generative approach with knowledge distillation, we produce student models that maintain their general performance on the task while exhibiting robust resistance to backdoor triggers.
    Defenses in Adversarial Machine Learning: A Survey. (arXiv:2312.08890v1 [cs.CV])
    Adversarial phenomenon has been widely observed in machine learning (ML) systems, especially in those using deep neural networks, describing that ML systems may produce inconsistent and incomprehensible predictions with humans at some particular cases. This phenomenon poses a serious security threat to the practical application of ML systems, and several advanced attack paradigms have been developed to explore it, mainly including backdoor attacks, weight attacks, and adversarial examples. For each individual attack paradigm, various defense paradigms have been developed to improve the model robustness against the corresponding attack paradigm. However, due to the independence and diversity of these defense paradigms, it is difficult to examine the overall robustness of an ML system against different kinds of attacks.This survey aims to build a systematic review of all existing defense paradigms from a unified perspective. Specifically, from the life-cycle perspective, we factorize a complete machine learning system into five stages, including pre-training, training, post-training, deployment, and inference stages, respectively. Then, we present a clear taxonomy to categorize and review representative defense methods at each individual stage. The unified perspective and presented taxonomies not only facilitate the analysis of the mechanism of each defense paradigm but also help us to understand connections and differences among different defense paradigms, which may inspire future research to develop more advanced, comprehensive defenses.  ( 2 min )
    Language Models Represent Space and Time. (arXiv:2310.02207v2 [cs.LG] UPDATED)
    The capabilities of large language models (LLMs) have sparked debate over whether such systems just learn an enormous collection of superficial statistics or a coherent model of the data generation process -- a world model. We find preliminary evidence for the latter by analyzing the learned representations of three spatial datasets (world, US, NYC places) and three temporal datasets (historical figures, artworks, news headlines) in the Llama-2 family of models. We discover that LLMs learn linear representations of space and time across multiple scales. These representations are robust to prompting variations and unified across different entity types (e.g. cities and landmarks). In addition, we identify individual ``space neurons'' and ``time neurons'' that reliably encode spatial and temporal coordinates. While further investigation is needed, our results suggest modern LLMs learn rich spatiotemporal representations of the real world and possess basic ingredients of a world model.
    Double Equivariance for Inductive Link Prediction for Both New Nodes and New Relation Types. (arXiv:2302.01313v7 [cs.LG] UPDATED)
    The task of inductive link prediction in knowledge graphs (KGs) generally focuses on test predictions with solely new nodes but not both new nodes and new relation types. In this work, we formally define the concept of double permutation-equivariant representations that are equivariant to permutations of both node identities and edge relation types. We then show how double-equivariant architectures are able to self-supervise pre-train on distinct KG domains and zero-shot predict links on a new KG domain (with completely new entities and new relation types). We also introduce the concept of distributionally double equivariant positional embeddings designed to perform the same task. Finally, we empirically demonstrate the capability of the proposed models against baselines on a set of novel real-world benchmarks. More interestingly, we show that self-supervised pre-training on more KG domains increases the zero-shot ability of our model to predict on new relation types over new entities on unseen KG domains.
    DualCoOp++: Fast and Effective Adaptation to Multi-Label Recognition with Limited Annotations. (arXiv:2308.01890v2 [cs.CV] UPDATED)
    Multi-label image recognition in the low-label regime is a task of great challenge and practical significance. Previous works have focused on learning the alignment between textual and visual spaces to compensate for limited image labels, yet may suffer from reduced accuracy due to the scarcity of high-quality multi-label annotations. In this research, we leverage the powerful alignment between textual and visual features pretrained with millions of auxiliary image-text pairs. We introduce an efficient and effective framework called Evidence-guided Dual Context Optimization (DualCoOp++), which serves as a unified approach for addressing partial-label and zero-shot multi-label recognition. In DualCoOp++ we separately encode evidential, positive, and negative contexts for target classes as parametric components of the linguistic input (i.e., prompts). The evidential context aims to discover all the related visual content for the target class, and serves as guidance to aggregate positive and negative contexts from the spatial domain of the image, enabling better distinguishment between similar categories. Additionally, we introduce a Winner-Take-All module that promotes inter-class interaction during training, while avoiding the need for extra parameters and costs. As DualCoOp++ imposes minimal additional learnable overhead on the pretrained vision-language framework, it enables rapid adaptation to multi-label recognition tasks with limited annotations and even unseen classes. Experiments on standard multi-label recognition benchmarks across two challenging low-label settings demonstrate the superior performance of our approach compared to state-of-the-art methods.
    String Diagrams with Factorized Densities. (arXiv:2305.02506v5 [cs.PL] CROSS LISTED)
    A growing body of research on probabilistic programs and causal models has highlighted the need to reason compositionally about model classes that extend directed graphical models. Both probabilistic programs and causal models define a joint probability density over a set of random variables, and exhibit sparse structure that can be used to reason about causation and conditional independence. This work builds on recent work on Markov categories of probabilistic mappings to define a category whose morphisms combine a joint density, factorized over each sample space, with a deterministic mapping from samples to return values. This is a step towards closing the gap between recent category-theoretic descriptions of probability measures, and the operational definitions of factorized densities that are commonly employed in probabilistic programming and causal inference.
    Neural Network Field Theories: Non-Gaussianity, Actions, and Locality. (arXiv:2307.03223v2 [hep-th] UPDATED)
    Both the path integral measure in field theory and ensembles of neural networks describe distributions over functions. When the central limit theorem can be applied in the infinite-width (infinite-$N$) limit, the ensemble of networks corresponds to a free field theory. Although an expansion in $1/N$ corresponds to interactions in the field theory, others, such as in a small breaking of the statistical independence of network parameters, can also lead to interacting theories. These other expansions can be advantageous over the $1/N$-expansion, for example by improved behavior with respect to the universal approximation theorem. Given the connected correlators of a field theory, one can systematically reconstruct the action order-by-order in the expansion parameter, using a new Feynman diagram prescription whose vertices are the connected correlators. This method is motivated by the Edgeworth expansion and allows one to derive actions for neural network field theories. Conversely, the correspondence allows one to engineer architectures realizing a given field theory by representing action deformations as deformations of neural network parameter densities. As an example, $\phi^4$ theory is realized as an infinite-$N$ neural network field theory.
    Beyond U: Making Diffusion Models Faster & Lighter. (arXiv:2310.20092v2 [cs.LG] UPDATED)
    Diffusion models are a family of generative models that yield record-breaking performance in tasks such as image synthesis, video generation, and molecule design. Despite their capabilities, their efficiency, especially in the reverse denoising process, remains a challenge due to slow convergence rates and high computational costs. In this work, we introduce an approach that leverages continuous dynamical systems to design a novel denoising network for diffusion models that is more parameter-efficient, exhibits faster convergence, and demonstrates increased noise robustness. Experimenting with denoising probabilistic diffusion models, our framework operates with approximately a quarter of the parameters and $\sim 30\%$ of the Floating Point Operations (FLOPs) compared to standard U-Nets in Denoising Diffusion Probabilistic Models (DDPMs). Furthermore, our model is faster in inference than the baseline models when measured in equal conditions while converging to better quality solutions.
    PANDA: Architecture-Level Power Evaluation by Unifying Analytical and Machine Learning Solutions. (arXiv:2312.08994v1 [cs.LG])
    Power efficiency is a critical design objective in modern microprocessor design. To evaluate the impact of architectural-level design decisions, an accurate yet efficient architecture-level power model is desired. However, widely adopted data-independent analytical power models like McPAT and Wattch have been criticized for their unreliable accuracy. While some machine learning (ML) methods have been proposed for architecture-level power modeling, they rely on sufficient known designs for training and perform poorly when the number of available designs is limited, which is typically the case in realistic scenarios. In this work, we derive a general formulation that unifies existing architecture-level power models. Based on the formulation, we propose PANDA, an innovative architecture-level solution that combines the advantages of analytical and ML power models. It achieves unprecedented high accuracy on unknown new designs even when there are very limited designs for training, which is a common challenge in practice. Besides being an excellent power model, it can predict area, performance, and energy accurately. PANDA further supports power prediction for unknown new technology nodes. In our experiments, besides validating the superior performance and the wide range of functionalities of PANDA, we also propose an application scenario, where PANDA proves to identify high-performance design configurations given a power constraint.  ( 2 min )
    Multi-Modal Learning-based Reconstruction of High-Resolution Spatial Wind Speed Fields. (arXiv:2312.08933v1 [cs.LG])
    Wind speed at sea surface is a key quantity for a variety of scientific applications and human activities. Due to the non-linearity of the phenomenon, a complete description of such variable is made infeasible on both the small scale and large spatial extents. Methods relying on Data Assimilation techniques, despite being the state-of-the-art for Numerical Weather Prediction, can not provide the reconstructions with a spatial resolution that can compete with satellite imagery. In this work we propose a framework based on Variational Data Assimilation and Deep Learning concepts. This framework is applied to recover rich-in-time, high-resolution information on sea surface wind speed. We design our experiments using synthetic wind data and different sampling schemes for high-resolution and low-resolution versions of original data to emulate the real-world scenario of spatio-temporally heterogeneous observations. Extensive numerical experiments are performed to assess systematically the impact of low and high-resolution wind fields and in-situ observations on the model reconstruction performance. We show that in-situ observations with richer temporal resolution represent an added value in terms of the model reconstruction performance. We show how a multi-modal approach, that explicitly informs the model about the heterogeneity of the available observations, can improve the reconstruction task by exploiting the complementary information in spatial and local point-wise data. To conclude, we propose an analysis to test the robustness of the chosen framework against phase delay and amplitude biases in low-resolution data and against interruptions of in-situ observations supply at evaluation time
    Automated Sizing and Training of Efficient Deep Autoencoders using Second Order Algorithms. (arXiv:2308.06221v2 [cs.LG] UPDATED)
    We propose a multi-step training method for designing generalized linear classifiers. First, an initial multi-class linear classifier is found through regression. Then validation error is minimized by pruning of unnecessary inputs. Simultaneously, desired outputs are improved via a method similar to the Ho-Kashyap rule. Next, the output discriminants are scaled to be net functions of sigmoidal output units in a generalized linear classifier. We then develop a family of batch training algorithm for the multi layer perceptron that optimizes its hidden layer size and number of training epochs. Next, we combine pruning with a growing approach. Later, the input units are scaled to be the net function of the sigmoidal output units that are then feed into as input to the MLP. We then propose resulting improvements in each of the deep learning blocks thereby improving the overall performance of the deep architecture. We discuss the principles and formulation regarding learning algorithms for deep autoencoders. We investigate several problems in deep autoencoders networks including training issues, the theoretical, mathematical and experimental justification that the networks are linear, optimizing the number of hidden units in each layer and determining the depth of the deep learning model. A direct implication of the current work is the ability to construct fast deep learning models using desktop level computational resources. This, in our opinion, promotes our design philosophy of building small but powerful algorithms. Performance gains are demonstrated at each step. Using widely available datasets, the final network's ten fold testing error is shown to be less than that of several other linear, generalized linear classifiers, multi layer perceptron and deep learners reported in the literature.
    Discovering Symmetry Breaking in Physical Systems with Relaxed Group Convolution. (arXiv:2310.02299v4 [cs.LG] UPDATED)
    Finding symmetry breaking is essential for understanding the fundamental changes in the behaviors and properties of physical systems, from microscopic particle interactions to macroscopic phenomena like fluid dynamics and cosmic structures. Relaxed group convolution emerges as a solution for instances when physical systems without perfect symmetries and perfectly equivariant models are restrictive. In this paper, we provide both theoretical and empirical evidence that this flexible convolution technique allows the model to maintain the highest level of equivariance that is consistent with data and discover the subtle symmetry-breaking factors in various physical systems. We employ various relaxed group convolution architectures to uncover various symmetry-breaking factors in different physical systems, including the phase transition of crystal structure, the isotropy and homogeneity breaking in turbulence, and the time-reversal symmetry breaking in pendulum systems.
    Concealing Sensitive Samples against Gradient Leakage in Federated Learning. (arXiv:2209.05724v2 [cs.LG] UPDATED)
    Federated Learning (FL) is a distributed learning paradigm that enhances users privacy by eliminating the need for clients to share raw, private data with the server. Despite the success, recent studies expose the vulnerability of FL to model inversion attacks, where adversaries reconstruct users private data via eavesdropping on the shared gradient information. We hypothesize that a key factor in the success of such attacks is the low entanglement among gradients per data within the batch during stochastic optimization. This creates a vulnerability that an adversary can exploit to reconstruct the sensitive data. Building upon this insight, we present a simple, yet effective defense strategy that obfuscates the gradients of the sensitive data with concealed samples. To achieve this, we propose synthesizing concealed samples to mimic the sensitive data at the gradient level while ensuring their visual dissimilarity from the actual sensitive data. Compared to the previous art, our empirical evaluations suggest that the proposed technique provides the strongest protection while simultaneously maintaining the FL performance.
    Amide Proton Transfer (APT) imaging in tumor with a machine learning approach using partially synthetic data. (arXiv:2311.01683v2 [physics.med-ph] UPDATED)
    Machine learning (ML) has been increasingly used to quantify chemical exchange saturation transfer (CEST) effect. ML models are typically trained using either measured data or fully simulated data. However, training with measured data often lacks sufficient training data, while training with fully simulated data may introduce bias due to limited simulations pools. This study introduces a new platform that combines simulated and measured components to generate partially synthetic CEST data, and to evaluate its feasibility for training ML models to predict amide proton transfer (APT) effect. Partially synthetic CEST signals were created using an inverse summation of APT effects from simulations and the other components from measurements. Training data were generated by varying APT simulation parameters and applying scaling factors to adjust the measured components, achieving a balance between simulation flexibility and fidelity. First, tissue-mimicking CEST signals along with ground truth information were created using multiple-pool model simulations to validate this method. Second, an ML model was trained individually on partially synthetic data, in vivo data, and fully simulated data, to predict APT effect in rat brains bearing 9L tumors. Experiments on tissue-mimicking data suggest that the ML method using the partially synthetic data is accurate in predicting APT. In vivo experiments suggest that our method provides more accurate and robust prediction than the training using in vivo data and fully synthetic data. Partially synthetic CEST data can address the challenges in conventional ML methods.
    The impact of memory on learning sequence-to-sequence tasks. (arXiv:2205.14683v2 [cs.LG] UPDATED)
    The recent success of neural networks in natural language processing has drawn renewed attention to learning sequence-to-sequence (seq2seq) tasks. While there exists a rich literature that studies classification and regression tasks using solvable models of neural networks, seq2seq tasks have not yet been studied from this perspective. Here, we propose a simple model for a seq2seq task that has the advantage of providing explicit control over the degree of memory, or non-Markovianity, in the sequences -- the stochastic switching-Ornstein-Uhlenbeck (SSOU) model. We introduce a measure of non-Markovianity to quantify the amount of memory in the sequences. For a minimal auto-regressive (AR) learning model trained on this task, we identify two learning regimes corresponding to distinct phases in the stationary state of the SSOU process. These phases emerge from the interplay between two different time scales that govern the sequence statistics. Moreover, we observe that while increasing the integration window of the AR model always improves performance, albeit with diminishing returns, increasing the non-Markovianity of the input sequences can improve or degrade its performance. Finally, we perform experiments with recurrent and convolutional neural networks that show that our observations carry over to more complicated neural network architectures.
    Prediction of the evolution of the nuclear reactor core parameters using artificial neural network. (arXiv:2304.10337v2 [cs.LG] UPDATED)
    A nuclear reactor based on MIT BEAVRS benchmark was used as a typical power generating Pressurized Water Reactor (PWR). The PARCS v3.2 nodal-diffusion core simulator was used as a full-core reactor physics solver to emulate the operation of a reactor and to generate training, and validation data for the ANN. The ANN was implemented with dedicated Python 3.8 code with Google's TensorFlow 2.0 library. The effort was based to a large extent on the process of appropriate automatic transformation of data generated by PARCS simulator, which was later used in the process of the ANN development. Various methods that allow obtaining better accuracy of the ANN predicted results were studied, such as trying different ANN architectures to find the optimal number of neurons in the hidden layers of the network. Results were later compared with the architectures proposed in the literature. For the selected best architecture predictions were made for different core parameters and their dependence on core loading patterns. In this study, a special focus was put on the prediction of the fuel cycle length for a given core loading pattern, as it can be considered one of the targets for plant economic operation. For instance, the length of a single fuel cycle depending on the initial core loading pattern was predicted with very good accuracy (>99%). This work contributes to the exploration of the usefulness of neural networks in solving nuclear reactor design problems. Thanks to the application of ANN, designers can avoid using an excessive amount of core simulator runs and more rapidly explore the space of possible solutions before performing more detailed design considerations.
    Object Recognition from Scientific Document based on Compartment Refinement Framework. (arXiv:2312.09038v1 [cs.CV])
    With the rapid development of the internet in the past decade, it has become increasingly important to extract valuable information from vast resources efficiently, which is crucial for establishing a comprehensive digital ecosystem, particularly in the context of research surveys and comprehension. The foundation of these tasks focuses on accurate extraction and deep mining of data from scientific documents, which are essential for building a robust data infrastructure. However, parsing raw data or extracting data from complex scientific documents have been ongoing challenges. Current data extraction methods for scientific documents typically use rule-based (RB) or machine learning (ML) approaches. However, using rule-based methods can incur high coding costs for articles with intricate typesetting. Conversely, relying solely on machine learning methods necessitates annotation work for complex content types within the scientific document, which can be costly. Additionally, few studies have thoroughly defined and explored the hierarchical layout within scientific documents. The lack of a comprehensive definition of the internal structure and elements of the documents indirectly impacts the accuracy of text classification and object recognition tasks. From the perspective of analyzing the standard layout and typesetting used in the specified publication, we propose a new document layout analysis framework called CTBR(Compartment & Text Blocks Refinement). Firstly, we define scientific documents into hierarchical divisions: base domain, compartment, and text blocks. Next, we conduct an in-depth exploration and classification of the meanings of text blocks. Finally, we utilize the results of text block classification to implement object recognition within scientific documents based on rule-based compartment segmentation.  ( 3 min )
    Math-Shepherd: A Label-Free Step-by-Step Verifier for LLMs in Mathematical Reasoning. (arXiv:2312.08935v1 [cs.AI])
    Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks. However, even the most advanced open-source LLMs, such as the LLaMA family models, still face challenges when it comes to accurately solving complex multi-step mathematical problems. In this paper, we present an innovative process-oriented math verifier called \textbf{Math-Shepherd}, which assigns a reward score to each step of the LLM's outputs on math problems. The training of Math-Shepherd is achieved using automatically constructed process-wise supervision data, breaking the bottleneck of heavy reliance on manual annotation in existing work. With the guidance of Math-Shepherd, a series of open-source LLMs demonstrate exceptional performance. Among them, DeepSeek 67B \citep{DeepSeek-llm} stands out by achieving accuracy rates of 93.3\% on the GSM8K dataset and 48.1\% on the MATH dataset, without external enhancement such as tool usage. Our Math-Shepherd also outperforms the self-consistency method and other existing verification models. We believe that automatic process supervision holds significant potential for the future evolution of LLMs.  ( 2 min )
    Stochastic Optimal Control Matching. (arXiv:2312.02027v2 [math.OC] UPDATED)
    Stochastic optimal control, which has the goal of driving the behavior of noisy systems, is broadly applicable in science, engineering and artificial intelligence. Our work introduces Stochastic Optimal Control Matching (SOCM), a novel Iterative Diffusion Optimization (IDO) technique for stochastic optimal control that stems from the same philosophy as the conditional score matching loss for diffusion models. That is, the control is learned via a least squares problem by trying to fit a matching vector field. The training loss, which is closely connected to the cross-entropy loss, is optimized with respect to both the control function and a family of reparameterization matrices which appear in the matching vector field. The optimization with respect to the reparameterization matrices aims at minimizing the variance of the matching vector field. Experimentally, our algorithm achieves lower error than all the existing IDO techniques for stochastic optimal control for three out of four control problems, in some cases by an order of magnitude. The key idea underlying SOCM is the path-wise reparameterization trick, a novel technique that is of independent interest, e.g., for generative modeling. Code at https://github.com/facebookresearch/SOC-matching
    Managing Temporal Resolution in Continuous Value Estimation: A Fundamental Trade-off. (arXiv:2212.08949v2 [cs.LG] UPDATED)
    A default assumption in reinforcement learning (RL) and optimal control is that observations arrive at discrete time points on a fixed clock cycle. Yet, many applications involve continuous-time systems where the time discretization, in principle, can be managed. The impact of time discretization on RL methods has not been fully characterized in existing theory, but a more detailed analysis of its effect could reveal opportunities for improving data-efficiency. We address this gap by analyzing Monte-Carlo policy evaluation for LQR systems and uncover a fundamental trade-off between approximation and statistical error in value estimation. Importantly, these two errors behave differently to time discretization, leading to an optimal choice of temporal resolution for a given data budget. These findings show that managing the temporal resolution can provably improve policy evaluation efficiency in LQR systems with finite data. Empirically, we demonstrate the trade-off in numerical simulations of LQR instances and standard RL benchmarks for non-linear continuous control.
    DCLP: Neural Architecture Predictor with Curriculum Contrastive Learning. (arXiv:2302.13020v2 [cs.LG] UPDATED)
    Neural predictors have shown great potential in the evaluation process of neural architecture search (NAS). However, current predictor-based approaches overlook the fact that training a predictor necessitates a considerable number of trained neural networks as the labeled training set, which is costly to obtain. Therefore, the critical issue in utilizing predictors for NAS is to train a high-performance predictor using as few trained neural networks as possible. Although some methods attempt to address this problem through unsupervised learning, they often result in inaccurate predictions. We argue that the unsupervised tasks intended for the common graph data are too challenging for neural networks, causing unsupervised training to be susceptible to performance crashes in NAS. To address this issue, we propose a Curricumum-guided Contrastive Learning framework for neural Predictor (DCLP). Our method simplifies the contrastive task by designing a novel curriculum to enhance the stability of unlabeled training data distribution during contrastive training. Specifically, we propose a scheduler that ranks the training data according to the contrastive difficulty of each data and then inputs them to the contrastive learner in order. This approach concentrates the training data distribution and makes contrastive training more efficient. By using our method, the contrastive learner incrementally learns feature representations via unsupervised data on a smooth learning curve, avoiding performance crashes that may occur with excessively variable training data distributions. We experimentally demonstrate that DCLP has high accuracy and efficiency compared with existing predictors, and shows promising potential to discover superior architectures in various search spaces when combined with search strategies. Our code is available at: https://github.com/Zhengsh123/DCLP.  ( 3 min )
    High-Dimensional Bayesian Optimisation with Large-Scale Constraints -- An Application to Aeroelastic Tailoring. (arXiv:2312.08891v1 [cs.CE])
    Design optimisation potentially leads to lightweight aircraft structures with lower environmental impact. Due to the high number of design variables and constraints, these problems are ordinarily solved using gradient-based optimisation methods, leading to a local solution in the design space while the global space is neglected. Bayesian Optimisation is a promising path towards sample-efficient, global optimisation based on probabilistic surrogate models. While Bayesian optimisation methods have demonstrated their strength for problems with a low number of design variables, the scalability to high-dimensional problems while incorporating large-scale constraints is still lacking. Especially in aeroelastic tailoring where directional stiffness properties are embodied into the structural design of aircraft, to control aeroelastic deformations and to increase the aerodynamic and structural performance, the safe operation of the system needs to be ensured by involving constraints resulting from different analysis disciplines. Hence, a global design space search becomes even more challenging. The present study attempts to tackle the problem by using high-dimensional Bayesian Optimisation in combination with a dimensionality reduction approach to solve the optimisation problem occurring in aeroelastic tailoring, presenting a novel approach for high-dimensional problems with large-scale constraints. Experiments on well-known benchmark cases with black-box constraints show that the proposed approach can incorporate large-scale constraints.  ( 2 min )
    Learning Differentiable Particle Filter on the Fly. (arXiv:2312.05955v2 [cs.LG] UPDATED)
    Differentiable particle filters are an emerging class of sequential Bayesian inference techniques that use neural networks to construct components in state space models. Existing approaches are mostly based on offline supervised training strategies. This leads to the delay of the model deployment and the obtained filters are susceptible to distribution shift of test-time data. In this paper, we propose an online learning framework for differentiable particle filters so that model parameters can be updated as data arrive. The technical constraint is that there is no known ground truth state information in the online inference setting. We address this by adopting an unsupervised loss to construct the online model updating procedure, which involves a sequence of filtering operations for online maximum likelihood-based parameter estimation. We empirically evaluate the effectiveness of the proposed method, and compare it with supervised learning methods in simulation settings including a multivariate linear Gaussian state-space model and a simulated object tracking experiment.
    CSGNN: Conquering Noisy Node labels via Dynamic Class-wise Selection. (arXiv:2311.11473v2 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) have emerged as a powerful tool for representation learning on graphs, but they often suffer from overfitting and label noise issues, especially when the data is scarce or imbalanced. Different from the paradigm of previous methods that rely on single-node confidence, in this paper, we introduce a novel Class-wise Selection for Graph Neural Networks, dubbed CSGNN, which employs a neighbor-aggregated latent space to adaptively select reliable nodes across different classes. Specifically, 1) to tackle the class imbalance issue, we introduce a dynamic class-wise selection mechanism, leveraging the clustering technique to identify clean nodes based on the neighbor-aggregated confidences. In this way, our approach can avoid the pitfalls of biased sampling which is common with global threshold techniques. 2) To alleviate the problem of noisy labels, built on the concept of the memorization effect, CSGNN prioritizes learning from clean nodes before noisy ones, thereby iteratively enhancing model performance while mitigating label noise. Through extensive experiments, we demonstrate that CSGNN outperforms state-of-the-art methods in terms of both effectiveness and robustness.
    Subspace Identification for Multi-Source Domain Adaptation. (arXiv:2310.04723v2 [cs.LG] UPDATED)
    Multi-source domain adaptation (MSDA) methods aim to transfer knowledge from multiple labeled source domains to an unlabeled target domain. Although current methods achieve target joint distribution identifiability by enforcing minimal changes across domains, they often necessitate stringent conditions, such as an adequate number of domains, monotonic transformation of latent variables, and invariant label distributions. These requirements are challenging to satisfy in real-world applications. To mitigate the need for these strict assumptions, we propose a subspace identification theory that guarantees the disentanglement of domain-invariant and domain-specific variables under less restrictive constraints regarding domain numbers and transformation properties, thereby facilitating domain adaptation by minimizing the impact of domain shifts on invariant variables. Based on this theory, we develop a Subspace Identification Guarantee (SIG) model that leverages variational inference. Furthermore, the SIG model incorporates class-aware conditional alignment to accommodate target shifts where label distributions change with the domains. Experimental results demonstrate that our SIG model outperforms existing MSDA techniques on various benchmark datasets, highlighting its effectiveness in real-world applications.
    LSTM Network Analysis of Vehicle-Type Fatalities on Great Britain's Roads. (arXiv:2312.08948v1 [cs.LG])
    This study harnesses the predictive capabilities of Long Short-Term Memory (LSTM) networks to analyse and predict road traffic accidents in Great Britain. It addresses the challenge of traffic accident forecasting, which is paramount for devising effective preventive measures. We utilised an extensive dataset encompassing reported collisions, casualties, and vehicles involvements from 1926 to 2022, provided by the Department for Transport (DfT). The data underwent stringent processing to rectify missing values and normalise features, ensuring robust LSTM network input.
    AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models. (arXiv:2310.15140v2 [cs.CR] UPDATED)
    Safety alignment of Large Language Models (LLMs) can be compromised with manual jailbreak attacks and (automatic) adversarial attacks. Recent studies suggest that defending against these attacks is possible: adversarial attacks generate unlimited but unreadable gibberish prompts, detectable by perplexity-based filters; manual jailbreak attacks craft readable prompts, but their limited number due to the necessity of human creativity allows for easy blocking. In this paper, we show that these solutions may be too optimistic. We introduce AutoDAN, an interpretable, gradient-based adversarial attack that merges the strengths of both attack types. Guided by the dual goals of jailbreak and readability, AutoDAN optimizes and generates tokens one by one from left to right, resulting in readable prompts that bypass perplexity filters while maintaining high attack success rates. Notably, these prompts, generated from scratch using gradients, are interpretable and diverse, with emerging strategies commonly seen in manual jailbreak attacks. They also generalize to unforeseen harmful behaviors and transfer to black-box LLMs better than their unreadable counterparts when using limited training data or a single proxy model. Furthermore, we show the versatility of AutoDAN by automatically leaking system prompts using a customized objective. Our work offers a new way to red-team LLMs and understand jailbreak mechanisms via interpretability.
    A Survey on Knowledge Editing of Neural Networks. (arXiv:2310.19704v2 [cs.LG] UPDATED)
    Deep neural networks are becoming increasingly pervasive in academia and industry, matching and surpassing human performance on a wide variety of fields and related tasks. However, just as humans, even the largest artificial neural networks make mistakes, and once-correct predictions can become invalid as the world progresses in time. Augmenting datasets with samples that account for mistakes or up-to-date information has become a common workaround in practical applications. However, the well-known phenomenon of catastrophic forgetting poses a challenge in achieving precise changes in the implicitly memorized knowledge of neural network parameters, often requiring a full model re-training to achieve desired behaviors. That is expensive, unreliable, and incompatible with the current trend of large self-supervised pre-training, making it necessary to find more efficient and effective methods for adapting neural network models to changing data. To address this need, knowledge editing is emerging as a novel area of research that aims to enable reliable, data-efficient, and fast changes to a pre-trained target model, without affecting model behaviors on previously learned tasks. In this survey, we provide a brief review of this recent artificial intelligence field of research. We first introduce the problem of editing neural networks, formalize it in a common framework and differentiate it from more notorious branches of research such as continuous learning. Next, we provide a review of the most relevant knowledge editing approaches and datasets proposed so far, grouping works under four different families: regularization techniques, meta-learning, direct model editing, and architectural strategies. Finally, we outline some intersections with other fields of research and potential directions for future works.
    ForceGen: End-to-end de novo protein generation based on nonlinear mechanical unfolding responses using a language diffusion model. (arXiv:2310.10605v2 [cond-mat.mtrl-sci] UPDATED)
    Through evolution, nature has presented a set of remarkable protein materials, including elastins, silks, keratins and collagens with superior mechanical performances that play crucial roles in mechanobiology. However, going beyond natural designs to discover proteins that meet specified mechanical properties remains challenging. Here we report a generative model that predicts protein designs to meet complex nonlinear mechanical property-design objectives. Our model leverages deep knowledge on protein sequences from a pre-trained protein language model and maps mechanical unfolding responses to create novel proteins. Via full-atom molecular simulations for direct validation, we demonstrate that the designed proteins are novel, and fulfill the targeted mechanical properties, including unfolding energy and mechanical strength, as well as the detailed unfolding force-separation curves. Our model offers rapid pathways to explore the enormous mechanobiological protein sequence space unconstrained by biological synthesis, using mechanical features as target to enable the discovery of protein materials with superior mechanical properties.
    Lagrangian Flow Networks for Conservation Laws. (arXiv:2305.16846v2 [cs.LG] UPDATED)
    We introduce Lagrangian Flow Networks (LFlows) for modeling fluid densities and velocities continuously in space and time. By construction, the proposed LFlows satisfy the continuity equation, a PDE describing mass conservation in its differentiable form. Our model is based on the insight that solutions to the continuity equation can be expressed as time-dependent density transformations via differentiable and invertible maps. This follows from classical theory of the existence and uniqueness of Lagrangian flows for smooth vector fields. Hence, we model fluid densities by transforming a base density with parameterized diffeomorphisms conditioned on time. The key benefit compared to methods relying on numerical ODE solvers or PINNs is that the analytic expression of the velocity is always consistent with changes in density. Furthermore, we require neither expensive numerical solvers, nor additional penalties to enforce the PDE. LFlows show higher predictive accuracy in density modeling tasks compared to competing models in 2D and 3D, while being computationally efficient. As a real-world application, we model bird migration based on sparse weather radar measurements.
    AutoST: Training-free Neural Architecture Search for Spiking Transformers. (arXiv:2307.00293v2 [cs.NE] UPDATED)
    Spiking Transformers have gained considerable attention because they achieve both the energy efficiency of Spiking Neural Networks (SNNs) and the high capacity of Transformers. However, the existing Spiking Transformer architectures, derived from Artificial Neural Networks (ANNs), exhibit a notable architectural gap, resulting in suboptimal performance compared to their ANN counterparts. Manually discovering optimal architectures is time-consuming. To address these limitations, we introduce AutoST, a training-free NAS method for Spiking Transformers, to rapidly identify high-performance Spiking Transformer architectures. Unlike existing training-free NAS methods, which struggle with the non-differentiability and high sparsity inherent in SNNs, we propose to utilize Floating-Point Operations (FLOPs) as a performance metric, which is independent of model computations and training dynamics, leading to a stronger correlation with performance. Our extensive experiments show that AutoST models outperform state-of-the-art manually or automatically designed SNN architectures on static and neuromorphic datasets. Full code, model, and data are released for reproduction.
    Unravel Anomalies: An End-to-end Seasonal-Trend Decomposition Approach for Time Series Anomaly Detection. (arXiv:2310.00268v2 [cs.LG] UPDATED)
    Traditional Time-series Anomaly Detection (TAD) methods often struggle with the composite nature of complex time-series data and a diverse array of anomalies. We introduce TADNet, an end-to-end TAD model that leverages Seasonal-Trend Decomposition to link various types of anomalies to specific decomposition components, thereby simplifying the analysis of complex time-series and enhancing detection performance. Our training methodology, which includes pre-training on a synthetic dataset followed by fine-tuning, strikes a balance between effective decomposition and precise anomaly detection. Experimental validation on real-world datasets confirms TADNet's state-of-the-art performance across a diverse range of anomalies.
    ReCLIP: Refine Contrastive Language Image Pre-Training with Source Free Domain Adaptation. (arXiv:2308.03793v2 [cs.CV] UPDATED)
    Large-scale Pre-Training Vision-Language Model such as CLIP has demonstrated outstanding performance in zero-shot classification, e.g. achieving 76.3% top-1 accuracy on ImageNet without seeing any example, which leads to potential benefits to many tasks that have no labeled data. However, while applying CLIP to a downstream target domain, the presence of visual and text domain gaps and cross-modality misalignment can greatly impact the model performance. To address such challenges, we propose ReCLIP, the first source-free domain adaptation method for vision-language models, which does not require any source data or target labeled data. ReCLIP first learns a projection space to mitigate the misaligned visual-text embeddings and learns pseudo labels, and then deploys cross-modality self-training with the pseudo labels, to update visual and text encoders, refine labels and reduce domain gaps and misalignments iteratively. With extensive experiments, we demonstrate ReCLIP reduces the average error rate of CLIP from 30.17% to 25.06% on 22 image classification benchmarks. Code available at https://github.com/michiganleon/ReCLIP_WACV.
    Neural Structure Fields with Application to Crystal Structure Autoencoders. (arXiv:2212.13120v2 [cond-mat.mtrl-sci] UPDATED)
    Representing crystal structures of materials to facilitate determining them via neural networks is crucial for enabling machine-learning applications involving crystal structure estimation. Among these applications, the inverse design of materials can contribute to explore materials with desired properties without relying on luck or serendipity. We propose neural structure fields (NeSF) as an accurate and practical approach for representing crystal structures using neural networks. Inspired by the concepts of vector fields in physics and implicit neural representations in computer vision, the proposed NeSF considers a crystal structure as a continuous field rather than as a discrete set of atoms. Unlike existing grid-based discretized spatial representations, the NeSF overcomes the tradeoff between spatial resolution and computational complexity and can represent any crystal structure. We propose an autoencoder of crystal structures that can recover various crystal structures, such as those of perovskite structure materials and cuprate superconductors. Extensive quantitative results demonstrate the superior performance of the NeSF compared with the existing grid-based approach.
    Inference on Optimal Dynamic Policies via Softmax Approximation. (arXiv:2303.04416v3 [econ.EM] UPDATED)
    Estimating optimal dynamic policies from offline data is a fundamental problem in dynamic decision making. In the context of causal inference, the problem is known as estimating the optimal dynamic treatment regime. Even though there exists a plethora of methods for estimation, constructing confidence intervals for the value of the optimal regime and structural parameters associated with it is inherently harder, as it involves non-linear and non-differentiable functionals of unknown quantities that need to be estimated. Prior work resorted to sub-sample approaches that can deteriorate the quality of the estimate. We show that a simple soft-max approximation to the optimal treatment regime, for an appropriately fast growing temperature parameter, can achieve valid inference on the truly optimal regime. We illustrate our result for a two-period optimal dynamic regime, though our approach should directly extend to the finite horizon case. Our work combines techniques from semi-parametric inference and $g$-estimation, together with an appropriate triangular array central limit theorem, as well as a novel analysis of the asymptotic influence and asymptotic bias of softmax approximations.
    Dynamic control of self-assembly of quasicrystalline structures through reinforcement learning. (arXiv:2309.06869v2 [cond-mat.soft] UPDATED)
    We propose reinforcement learning to control the dynamical self-assembly of the dodecagonal quasicrystal (DDQC) from patchy particles. The patchy particles have anisotropic interactions with other particles and form DDQC. However, their structures at steady states are significantly influenced by the kinetic pathways of their structural formation. We estimate the best policy of temperature control trained by the Q-learning method and demonstrate that we can generate DDQC with few defects using the estimated policy. The temperature schedule obtained by reinforcement learning can reproduce the desired structure more efficiently than the conventional pre-fixed temperature schedule, such as annealing. To clarify the success of the learning, we also analyse a simple model describing the kinetics of structural changes through the motion in a triple-well potential. We have found that reinforcement learning autonomously discovers the critical temperature at which structural fluctuations enhance the chance of forming a globally stable state. The estimated policy guides the system toward the critical temperature to assist the formation of DDQC.
    Robust-MBDL: A Robust Multi-branch Deep Learning Based Model for Remaining Useful Life Prediction and Operational Condition Identification of Rotating Machines. (arXiv:2309.06157v2 [cs.LG] UPDATED)
    In this paper, a Robust Multi-branch Deep learning-based system for remaining useful life (RUL) prediction and condition operations (CO) identification of rotating machines is proposed. In particular, the proposed system comprises main components: (1) an LSTM-Autoencoder to denoise the vibration data; (2) a feature extraction to generate time-domain, frequency-domain, and time-frequency based features from the denoised data; (3) a novel and robust multi-branch deep learning network architecture to exploit the multiple features. The performance of our proposed system was evaluated and compared to the state-of-the-art systems on two benchmark datasets of XJTU-SY and PRONOSTIA. The experimental results prove that our proposed system outperforms the state-of-the-art systems and presents potential for real-life applications on bearing machines.
    Robot Learning with Sensorimotor Pre-training. (arXiv:2306.10007v2 [cs.RO] UPDATED)
    We present a self-supervised sensorimotor pre-training approach for robotics. Our model, called RPT, is a Transformer that operates on sequences of sensorimotor tokens. Given a sequence of camera images, proprioceptive robot states, and actions, we encode the sequence into tokens, mask out a subset, and train a model to predict the missing content from the rest. We hypothesize that if a robot can predict the masked-out content it will have acquired a good model of the physical world that can enable it to act. RPT is designed to operate on latent visual representations which makes prediction tractable, enables scaling to larger models, and allows fast inference on a real robot. To evaluate our approach, we collected a dataset of 20,000 real-world trajectories over 9 months using a combination of motion planning and grasping algorithms. We find that sensorimotor pre-training consistently outperforms training from scratch, has favorable scaling properties, and enables transfer across different tasks, environments, and robots.
    GNNX-BENCH: Unravelling the Utility of Perturbation-based GNN Explainers through In-depth Benchmarking. (arXiv:2310.01794v2 [cs.LG] UPDATED)
    Numerous explainability methods have been proposed to shed light on the inner workings of GNNs. Despite the inclusion of empirical evaluations in all the proposed algorithms, the interrogative aspects of these evaluations lack diversity. As a result, various facets of explainability pertaining to GNNs, such as a comparative analysis of counterfactual reasoners, their stability to variational factors such as different GNN architectures, noise, stochasticity in non-convex loss surfaces, feasibility amidst domain constraints, and so forth, have yet to be formally investigated. Motivated by this need, we present a benchmarking study on perturbation-based explainability methods for GNNs, aiming to systematically evaluate and compare a wide range of explainability techniques. Among the key findings of our study, we identify the Pareto-optimal methods that exhibit superior efficacy and stability in the presence of noise. Nonetheless, our study reveals that all algorithms are affected by stability issues when faced with noisy data. Furthermore, we have established that the current generation of counterfactual explainers often fails to provide feasible recourses due to violations of topological constraints encoded by domain-specific considerations. Overall, this benchmarking study empowers stakeholders in the field of GNNs with a comprehensive understanding of the state-of-the-art explainability methods, potential research problems for further enhancement, and the implications of their application in real-world scenarios.
    An Optimal Algorithm for the Real-Valued Combinatorial Pure Exploration of Multi-Armed Bandit. (arXiv:2306.09202v2 [cs.LG] UPDATED)
    We study the real-valued combinatorial pure exploration problem in the stochastic multi-armed bandit (R-CPE-MAB). We study the case where the size of the action set is polynomial with respect to the number of arms. In such a case, the R-CPE-MAB can be seen as a special case of the so-called transductive linear bandits. Existing methods in the R-CPE-MAB and transductive linear bandits have a gap of problem-dependent constant terms and logarithmic terms between the upper and lower bounds of the sample complexity, respectively. We close these gaps by proposing an algorithm named the combinatorial gap-based exploration (CombGapE) algorithm, whose sample complexity upper bound matches the lower bound. Finally, we numerically show that the CombGapE algorithm outperforms existing methods significantly.
    Data Portraits: Recording Foundation Model Training Data. (arXiv:2303.03919v2 [cs.LG] UPDATED)
    Foundation models are trained on increasingly immense and opaque datasets. Even while these models are now key in AI system building, it can be difficult to answer the straightforward question: has the model already encountered a given example during training? We therefore propose a widespread adoption of Data Portraits: artifacts that record training data and allow for downstream inspection. First we outline the properties of such an artifact and discuss how existing solutions can be used to increase transparency. We then propose and implement a solution based on data sketching, stressing fast and space efficient querying. Using our tools, we document a popular language modeling corpus (The Pile) and a recently released code modeling dataset (The Stack). We show that our solution enables answering questions about test set leakage and model plagiarism. Our tool is lightweight and fast, costing only 3% of the dataset size in overhead. We release a live interface of our tools at https://dataportraits.org/ and call on dataset and model creators to release Data Portraits as a complement to current documentation practices.
    Doubly Robust Estimator for Off-Policy Evaluation with Large Action Spaces. (arXiv:2308.03443v3 [stat.ML] UPDATED)
    We study Off-Policy Evaluation (OPE) in contextual bandit settings with large action spaces. The benchmark estimators suffer from severe bias and variance tradeoffs. Parametric approaches suffer from bias due to difficulty specifying the correct model, whereas ones with importance weight suffer from variance. To overcome these limitations, Marginalized Inverse Propensity Scoring (MIPS) was proposed to mitigate the estimator's variance via embeddings of an action. Nevertheless, MIPS is unbiased under the no direct effect, which assumes that the action embedding completely mediates the effect of an action on a reward. To overcome the dependency on these unrealistic assumptions, we propose a Marginalized Doubly Robust (MDR) estimator. Theoretical analysis shows that the proposed estimator is unbiased under weaker assumptions than MIPS while reducing the variance against MIPS. The empirical experiment verifies the supremacy of MDR against existing estimators with large action spaces.
    Simplifying Subgraph Representation Learning for Scalable Link Prediction. (arXiv:2301.12562v3 [cs.LG] UPDATED)
    Link prediction on graphs is a fundamental problem. Subgraph representation learning approaches (SGRLs), by transforming link prediction to graph classification on the subgraphs around the links, have achieved state-of-the-art performance in link prediction. However, SGRLs are computationally expensive, and not scalable to large-scale graphs due to expensive subgraph-level operations. To unlock the scalability of SGRLs, we propose a new class of SGRLs, that we call Scalable Simplified SGRL (S3GRL). Aimed at faster training and inference, S3GRL simplifies the message passing and aggregation operations in each link's subgraph. S3GRL, as a scalability framework, accommodates various subgraph sampling strategies and diffusion operators to emulate computationally-expensive SGRLs. We propose multiple instances of S3GRL and empirically study them on small to large-scale graphs. Our extensive experiments demonstrate that the proposed S3GRL models scale up SGRLs without significant performance compromise (even with considerable gains in some cases), while offering substantially lower computational footprints (e.g., multi-fold inference and training speedup).
    An Isolation-Aware Online Virtual Network Embedding via Deep Reinforcement Learning. (arXiv:2211.14158v3 [cs.NI] UPDATED)
    Virtualization technologies are the foundation of modern ICT infrastructure, enabling service providers to create dedicated virtual networks (VNs) that can support a wide range of smart city applications. These VNs continuously generate massive amounts of data, necessitating stringent reliability and security requirements. In virtualized network environments, however, multiple VNs may coexist on the same physical infrastructure and, if not properly isolated, may interfere with or provide unauthorized access to one another. The former causes performance degradation, while the latter compromises the security of VNs. Service assurance for infrastructure providers becomes significantly more complicated when a specific VN violates the isolation requirement. In an effort to address the isolation issue, this paper proposes isolation during virtual network embedding (VNE), the procedure of allocating VNs onto physical infrastructure. We define a simple abstracted concept of isolation levels to capture the variations in isolation requirements and then formulate isolation-aware VNE as an optimization problem with resource and isolation constraints. A deep reinforcement learning (DRL)-based VNE algorithm ISO-DRL_VNE, is proposed that considers resource and isolation constraints and is compared to the existing three state-of-the-art algorithms: NodeRank, Global Resource Capacity (GRC), and Mote-Carlo Tree Search (MCTS). Evaluation results show that the ISO-DRL_VNE algorithm outperforms others in acceptance ratio, long-term average revenue, and long-term average revenue-to-cost ratio by 6%, 13%, and 15%.
    GSHOT: Few-shot Generative Modeling of Labeled Graphs. (arXiv:2306.03480v2 [cs.LG] UPDATED)
    Deep graph generative modeling has gained enormous attraction in recent years due to its impressive ability to directly learn the underlying hidden graph distribution. Despite their initial success, these techniques, like much of the existing deep generative methods, require a large number of training samples to learn a good model. Unfortunately, large number of training samples may not always be available in scenarios such as drug discovery for rare diseases. At the same time, recent advances in few-shot learning have opened door to applications where available training data is limited. In this work, we introduce the hitherto unexplored paradigm of few-shot graph generative modeling. Towards this, we develop GSHOT, a meta-learning based framework for few-shot labeled graph generative modeling. GSHOT learns to transfer meta-knowledge from similar auxiliary graph datasets. Utilizing these prior experiences, GSHOT quickly adapts to an unseen graph dataset through self-paced fine-tuning. Through extensive experiments on datasets from diverse domains having limited training samples, we establish that GSHOT generates graphs of superior fidelity compared to existing baselines.
    Amicable Aid: Perturbing Images to Improve Classification Performance. (arXiv:2112.04720v4 [cs.CV] UPDATED)
    While adversarial perturbation of images to attack deep image classification models pose serious security concerns in practice, this paper suggests a novel paradigm where the concept of image perturbation can benefit classification performance, which we call amicable aid. We show that by taking the opposite search direction of perturbation, an image can be modified to yield higher classification confidence and even a misclassified image can be made correctly classified. This can be also achieved with a large amount of perturbation by which the image is made unrecognizable by human eyes. The mechanism of the amicable aid is explained in the viewpoint of the underlying natural image manifold. Furthermore, we investigate the universal amicable aid, i.e., a fixed perturbation can be applied to multiple images to improve their classification results. While it is challenging to find such perturbations, we show that making the decision boundary as perpendicular to the image manifold as possible via training with modified data is effective to obtain a model for which universal amicable perturbations are more easily found.
    Real-World Humanoid Locomotion with Reinforcement Learning. (arXiv:2303.03381v2 [cs.RO] UPDATED)
    Humanoid robots that can autonomously operate in diverse environments have the potential to help address labour shortages in factories, assist elderly at homes, and colonize new planets. While classical controllers for humanoid robots have shown impressive results in a number of settings, they are challenging to generalize and adapt to new environments. Here, we present a fully learning-based approach for real-world humanoid locomotion. Our controller is a causal transformer that takes the history of proprioceptive observations and actions as input and predicts the next action. We hypothesize that the observation-action history contains useful information about the world that a powerful transformer model can use to adapt its behavior in-context, without updating its weights. We train our model with large-scale model-free reinforcement learning on an ensemble of randomized environments in simulation and deploy it to the real world zero-shot. Our controller can walk over various outdoor terrains, is robust to external disturbances, and can adapt in context.
    Prophet: Prompting Large Language Models with Complementary Answer Heuristics for Knowledge-based Visual Question Answering. (arXiv:2303.01903v3 [cs.CV] UPDATED)
    Knowledge-based visual question answering (VQA) requires external knowledge beyond the image to answer the question. Early studies retrieve required knowledge from explicit knowledge bases (KBs), which often introduces irrelevant information to the question, hence restricting the performance of their models. Recent works have resorted to using a powerful large language model (LLM) as an implicit knowledge engine to acquire the necessary knowledge for answering. Despite the encouraging results achieved by these methods, we argue that they have not fully activated the capacity of the blind LLM as the provided textual input is insufficient to depict the required visual information to answer the question. In this paper, we present Prophet -- a conceptually simple, flexible, and general framework designed to prompt LLM with answer heuristics for knowledge-based VQA. Specifically, we first train a vanilla VQA model on a specific knowledge-based VQA dataset without external knowledge. After that, we extract two types of complementary answer heuristics from the VQA model: answer candidates and answer-aware examples. Finally, the two types of answer heuristics are jointly encoded into a formatted prompt to facilitate the LLM's understanding of both the image and question, thus generating a more accurate answer. By incorporating the state-of-the-art LLM GPT-3, Prophet significantly outperforms existing state-of-the-art methods on four challenging knowledge-based VQA datasets. To demonstrate the generality of our approach, we instantiate Prophet with the combinations of different VQA models (i.e., both discriminative and generative ones) and different LLMs (i.e., both commercial and open-source ones).
    AVSegFormer: Audio-Visual Segmentation with Transformer. (arXiv:2307.01146v3 [cs.CV] UPDATED)
    The combination of audio and vision has long been a topic of interest in the multi-modal community. Recently, a new audio-visual segmentation (AVS) task has been introduced, aiming to locate and segment the sounding objects in a given video. This task demands audio-driven pixel-level scene understanding for the first time, posing significant challenges. In this paper, we propose AVSegFormer, a novel framework for AVS tasks that leverages the transformer architecture. Specifically, we introduce audio queries and learnable queries into the transformer decoder, enabling the network to selectively attend to interested visual features. Besides, we present an audio-visual mixer, which can dynamically adjust visual features by amplifying relevant and suppressing irrelevant spatial channels. Additionally, we devise an intermediate mask loss to enhance the supervision of the decoder, encouraging the network to produce more accurate intermediate predictions. Extensive experiments demonstrate that AVSegFormer achieves state-of-the-art results on the AVS benchmark. The code is available at https://github.com/vvvb-github/AVSegFormer.
    An overview of differentiable particle filters for data-adaptive sequential Bayesian inference. (arXiv:2302.09639v2 [cs.LG] UPDATED)
    By approximating posterior distributions with weighted samples, particle filters (PFs) provide an efficient mechanism for solving non-linear sequential state estimation problems. While the effectiveness of particle filters has been recognised in various applications, their performance relies on the knowledge of dynamic models and measurement models, as well as the construction of effective proposal distributions. An emerging trend involves constructing components of particle filters using neural networks and optimising them by gradient descent, and such data-adaptive particle filtering approaches are often called differentiable particle filters. Due to the expressiveness of neural networks, differentiable particle filters are a promising computational tool for performing inference on sequential data in complex, high-dimensional tasks, such as vision-based robot localisation. In this paper, we review recent advances in differentiable particle filters and their applications. We place special emphasis on different design choices for key components of differentiable particle filters, including dynamic models, measurement models, proposal distributions, optimisation objectives, and differentiable resampling techniques.
    Conformalised data synthesis with statistical quality guarantees. (arXiv:2312.08999v1 [cs.LG])
    With the proliferation of ever more complicated Deep Learning architectures, data synthesis is a highly promising technique to address the demand of data-hungry models. However, reliably assessing the quality of a 'synthesiser' model's output is an open research question with significant associated risks for high-stake domains. To address this challenge, we have designed a unique confident data synthesis algorithm that introduces statistical confidence guarantees through a novel extension of the Conformal Prediction framework. We support our proposed algorithm with theoretical proofs and an extensive empirical evaluation of five benchmark datasets. To show our approach's versatility on ubiquitous real-world challenges, the datasets were carefully selected for their variety of difficult characteristics: low sample count, class imbalance and non-separability, and privacy-sensitive data. In all trials, training sets extended with our confident synthesised data performed at least as well as the original, and frequently significantly improved Deep Learning performance by up to +65% F1-score.  ( 2 min )
    Predicting the Initial Conditions of the Universe using a Deterministic Neural Network. (arXiv:2303.13056v2 [astro-ph.CO] UPDATED)
    Finding the initial conditions that led to the current state of the universe is challenging because it involves searching over an intractable input space of initial conditions, along with modeling their evolution via tools such as N-body simulations which are computationally expensive. Recently, deep learning has emerged as a surrogate for N-body simulations by directly learning the mapping between the linear input of an N-body simulation and the final nonlinear output from the simulation, significantly accelerating the forward modeling. However, this still does not reduce the search space for initial conditions. In this work, we pioneer the use of a deterministic convolutional neural network for learning the reverse mapping and show that it accurately recovers the initial linear displacement field over a wide range of scales ($<1$-$2\%$ error up to nearly $k\simeq0.8$-$0.9 \text{ Mpc}^{-1}h$), despite the one-to-many mapping of the inverse problem (due to the divergent backward trajectories at smaller scales). Specifically, we train a V-Net architecture, which outputs the linear displacement of an N-body simulation, given the nonlinear displacement at redshift $z=0$ and the cosmological parameters. The results of our method suggest that a simple deterministic neural network is sufficient for accurately approximating the initial linear states, potentially obviating the need for the more complex and computationally demanding backward modeling methods that were recently proposed.  ( 3 min )
    Entity-Augmented Code Generation. (arXiv:2312.08976v1 [cs.SE])
    The current state-of-the-art large language models (LLMs) are effective in generating high-quality text and encapsulating a broad spectrum of world knowledge. However, these models often hallucinate during generation and are not designed to utilize external information sources. To enable requests to the external knowledge bases, also called knowledge grounding, retrieval-augmented LLMs were introduced. For now, their applications have largely involved Open Domain Question Answering, Abstractive Question Answering, and such. In this paper, we broaden the scope of retrieval-augmented LLMs by venturing into a new task - code generation using external entities. For this task, we collect and publish a new dataset for project-level code generation, where the model should reuse functions defined in the project during generation. As we show, existing retrieval-augmented LLMs fail to assign relevance scores between similar entity names, and to mitigate it, they expand entity names with description context and append it to the input. In practice, due to the limited context size they can not accommodate the indefinitely large context of the whole project. To solve this issue, we propose a novel end-to-end trainable architecture with an scalable entity retriever injected directly into the LLM decoder. We demonstrate that our model can outperform common baselines in several scenarios, including project-level code generation, as well as Bash and SQL scripting.  ( 2 min )
    Adaptive Shortcut Debiasing for Online Continual Learning. (arXiv:2312.08677v1 [cs.LG])
    We propose a novel framework DropTop that suppresses the shortcut bias in online continual learning (OCL) while being adaptive to the varying degree of the shortcut bias incurred by continuously changing environment. By the observed high-attention property of the shortcut bias, highly-activated features are considered candidates for debiasing. More importantly, resolving the limitation of the online environment where prior knowledge and auxiliary data are not ready, two novel techniques -- feature map fusion and adaptive intensity shifting -- enable us to automatically determine the appropriate level and proportion of the candidate shortcut features to be dropped. Extensive experiments on five benchmark datasets demonstrate that, when combined with various OCL algorithms, DropTop increases the average accuracy by up to 10.4% and decreases the forgetting by up to 63.2%.  ( 2 min )
    Global Rewards in Multi-Agent Deep Reinforcement Learning for Autonomous Mobility on Demand Systems. (arXiv:2312.08884v1 [cs.LG])
    We study vehicle dispatching in autonomous mobility on demand (AMoD) systems, where a central operator assigns vehicles to customer requests or rejects these with the aim of maximizing its total profit. Recent approaches use multi-agent deep reinforcement learning (MADRL) to realize scalable yet performant algorithms, but train agents based on local rewards, which distorts the reward signal with respect to the system-wide profit, leading to lower performance. We therefore propose a novel global-rewards-based MADRL algorithm for vehicle dispatching in AMoD systems, which resolves so far existing goal conflicts between the trained agents and the operator by assigning rewards to agents leveraging a counterfactual baseline. Our algorithm shows statistically significant improvements across various settings on real-world data compared to state-of-the-art MADRL algorithms with local rewards. We further provide a structural analysis which shows that the utilization of global rewards can improve implicit vehicle balancing and demand forecasting abilities. Our code is available at https://github.com/tumBAIS/GR-MADRL-AMoD.  ( 2 min )
    Personalized Path Recourse. (arXiv:2312.08724v1 [cs.LG])
    This paper introduces Personalized Path Recourse, a novel method that generates recourse paths for an agent. The objective is to achieve desired goals (e.g., better outcomes compared to the agent's original paths of action), while ensuring a high similarity to the agent's original paths and being personalized to the agent. Personalization refers to the extent to which the new path is tailored to the agent's observed behavior patterns from their policy function. We train a personalized recourse agent to generate such personalized paths, which are obtained using reward functions that consider the goal, similarity, and personalization. The proposed method is applicable to both reinforcement learning and supervised learning settings for correcting or improving sequences of actions or sequences of data to achieve a pre-determined goal. The method is evaluated in various settings and demonstrates promising results.  ( 2 min )
    BiPFT: Binary Pre-trained Foundation Transformer with Low-rank Estimation of Binarization Residual Polynomials. (arXiv:2312.08937v1 [cs.LG])
    Pretrained foundation models offer substantial benefits for a wide range of downstream tasks, which can be one of the most potential techniques to access artificial general intelligence. However, scaling up foundation transformers for maximal task-agnostic knowledge has brought about computational challenges, especially on resource-limited devices such as mobiles. This work proposes the first Binary Pretrained Foundation Transformer (BiPFT) for natural language understanding (NLU) tasks, which remarkably saves 56 times operations and 28 times memory. In contrast to previous task-specific binary transformers, BiPFT exhibits a substantial enhancement in the learning capabilities of binary neural networks (BNNs), promoting BNNs into the era of pre-training. Benefiting from extensive pretraining data, we further propose a data-driven binarization method. Specifically, we first analyze the binarization error in self-attention operations and derive the polynomials of binarization error. To simulate full-precision self-attention, we define binarization error as binarization residual polynomials, and then introduce low-rank estimators to model these polynomials. Extensive experiments validate the effectiveness of BiPFTs, surpassing task-specific baseline by 15.4% average performance on the GLUE benchmark. BiPFT also demonstrates improved robustness to hyperparameter changes, improved optimization efficiency, and reduced reliance on downstream distillation, which consequently generalize on various NLU tasks and simplify the downstream pipeline of BNNs. Our code and pretrained models are publicly available at https://github.com/Xingrun-Xing/BiPFT.  ( 2 min )
    Reconstruction of Sound Field through Diffusion Models. (arXiv:2312.08821v1 [eess.AS])
    Reconstructing the sound field in a room is an important task for several applications, such as sound control and augmented (AR) or virtual reality (VR). In this paper, we propose a data-driven generative model for reconstructing the magnitude of acoustic fields in rooms with a focus on the modal frequency range. We introduce, for the first time, the use of a conditional Denoising Diffusion Probabilistic Model (DDPM) trained in order to reconstruct the sound field (SF-Diff) over an extended domain. The architecture is devised in order to be conditioned on a set of limited available measurements at different frequencies and generate the sound field in target, unknown, locations. The results show that SF-Diff is able to provide accurate reconstructions, outperforming a state-of-the-art baseline based on kernel interpolation.  ( 2 min )
    LiFT: Unsupervised Reinforcement Learning with Foundation Models as Teachers. (arXiv:2312.08958v1 [cs.LG])
    We propose a framework that leverages foundation models as teachers, guiding a reinforcement learning agent to acquire semantically meaningful behavior without human feedback. In our framework, the agent receives task instructions grounded in a training environment from large language models. Then, a vision-language model guides the agent in learning the multi-task language-conditioned policy by providing reward feedback. We demonstrate that our method can learn semantically meaningful skills in a challenging open-ended MineDojo environment while prior unsupervised skill discovery methods struggle. Additionally, we discuss observed challenges of using off-the-shelf foundation models as teachers and our efforts to address them.  ( 2 min )
    Weighted Ensemble Models Are Strong Continual Learners. (arXiv:2312.08977v1 [cs.LG])
    In this work, we study the problem of continual learning (CL) where the goal is to learn a model on a sequence of tasks, such that the data from the previous tasks becomes unavailable while learning on the current task data. CL is essentially a balancing act between being able to learn on the new task (i.e., plasticity) and maintaining the performance on the previously learned concepts (i.e., stability). With an aim to address the stability-plasticity trade-off, we propose to perform weight-ensembling of the model parameters of the previous and current task. This weight-ensembled model, which we call Continual Model Averaging (or CoMA), attains high accuracy on the current task by leveraging plasticity, while not deviating too far from the previous weight configuration, ensuring stability. We also propose an improved variant of CoMA, named Continual Fisher-weighted Model Averaging (or CoFiMA), that selectively weighs each parameter in the weight ensemble by leveraging the Fisher information of the weights of the model. Both the variants are conceptually simple, easy to implement, and effective in attaining state-of-the-art performance on several standard CL benchmarks.  ( 2 min )
    Forbidden Facts: An Investigation of Competing Objectives in Llama-2. (arXiv:2312.08793v1 [cs.LG])
    LLMs often face competing pressures (for example helpfulness vs. harmlessness). To understand how models resolve such conflicts, we study Llama-2-chat models on the forbidden fact task. Specifically, we instruct Llama-2 to truthfully complete a factual recall statement while forbidding it from saying the correct answer. This often makes the model give incorrect answers. We decompose Llama-2 into 1000+ components, and rank each one with respect to how useful it is for forbidding the correct answer. We find that in aggregate, around 35 components are enough to reliably implement the full suppression behavior. However, these components are fairly heterogeneous and many operate using faulty heuristics. We discover that one of these heuristics can be exploited via a manually designed adversarial attack which we call The California Attack. Our results highlight some roadblocks standing in the way of being able to successfully interpret advanced ML systems. Project website available at https://forbiddenfacts.github.io .  ( 2 min )
    What's Next? Predicting Hamiltonian Dynamics from Discrete Observations of a Vector Field. (arXiv:2312.08944v1 [cs.LG])
    We present several methods for predicting the dynamics of Hamiltonian systems from discrete observations of their vector field. Each method is either informed or uninformed of the Hamiltonian property. We empirically and comparatively evaluate the methods and observe that information that the system is Hamiltonian can be effectively informed, and that different methods strike different trade-offs between efficiency and effectiveness for different dynamical systems.  ( 2 min )
    ERASE: Error-Resilient Representation Learning on Graphs for Label Noise Tolerance. (arXiv:2312.08852v1 [cs.LG])
    Deep learning has achieved remarkable success in graph-related tasks, yet this accomplishment heavily relies on large-scale high-quality annotated datasets. However, acquiring such datasets can be cost-prohibitive, leading to the practical use of labels obtained from economically efficient sources such as web searches and user tags. Unfortunately, these labels often come with noise, compromising the generalization performance of deep networks. To tackle this challenge and enhance the robustness of deep learning models against label noise in graph-based tasks, we propose a method called ERASE (Error-Resilient representation learning on graphs for lAbel noiSe tolerancE). The core idea of ERASE is to learn representations with error tolerance by maximizing coding rate reduction. Particularly, we introduce a decoupled label propagation method for learning representations. Before training, noisy labels are pre-corrected through structural denoising. During training, ERASE combines prototype pseudo-labels with propagated denoised labels and updates representations with error resilience, which significantly improves the generalization performance in node classification. The proposed method allows us to more effectively withstand errors caused by mislabeled nodes, thereby strengthening the robustness of deep networks in handling noisy graph data. Extensive experimental results show that our method can outperform multiple baselines with clear margins in broad noise levels and enjoy great scalability. Codes are released at https://github.com/eraseai/erase.  ( 2 min )
    Context-PEFT: Efficient Multi-Modal, Multi-Task Fine-Tuning. (arXiv:2312.08900v1 [cs.LG])
    This paper introduces a novel Parameter-Efficient Fine-Tuning (PEFT) framework for multi-modal, multi-task transfer learning with pre-trained language models. PEFT techniques such as LoRA, BitFit and IA3 have demonstrated comparable performance to full fine-tuning of pre-trained models for specific downstream tasks, all while demanding significantly fewer trainable parameters and reduced GPU memory consumption. However, in the context of multi-modal fine-tuning, the need for architectural modifications or full fine-tuning often becomes apparent. To address this we propose Context-PEFT, which learns different groups of adaptor parameters based on the token's domain. This approach enables LoRA-like weight injection without requiring additional architectural changes. Our method is evaluated on the COCO captioning task, where it outperforms full fine-tuning under similar data constraints while simultaneously offering a substantially more parameter-efficient and computationally economical solution.  ( 2 min )
    Toward General-Purpose Robots via Foundation Models: A Survey and Meta-Analysis. (arXiv:2312.08782v1 [cs.RO])
    Building general-purpose robots that can operate seamlessly, in any environment, with any object, and utilizing various skills to complete diverse tasks has been a long-standing goal in Artificial Intelligence. Unfortunately, however, most existing robotic systems have been constrained - having been designed for specific tasks, trained on specific datasets, and deployed within specific environments. These systems usually require extensively-labeled data, rely on task-specific models, have numerous generalization issues when deployed in real-world scenarios, and struggle to remain robust to distribution shifts. Motivated by the impressive open-set performance and content generation capabilities of web-scale, large-capacity pre-trained models (i.e., foundation models) in research fields such as Natural Language Processing (NLP) and Computer Vision (CV), we devote this survey to exploring (i) how these existing foundation models from NLP and CV can be applied to the field of robotics, and also exploring (ii) what a robotics-specific foundation model would look like. We begin by providing an overview of what constitutes a conventional robotic system and the fundamental barriers to making it universally applicable. Next, we establish a taxonomy to discuss current work exploring ways to leverage existing foundation models for robotics and develop ones catered to robotics. Finally, we discuss key challenges and promising future directions in using foundation models for enabling general-purpose robotic systems. We encourage readers to view our ``living`` GitHub repository of resources, including papers reviewed in this survey as well as related projects and repositories for developing foundation models for robotics.  ( 3 min )
    Performance evaluation of matrix factorization for fMRI data. (arXiv:2312.08809v1 [q-bio.NC])
    In the study of the brain, there is a hypothesis that sparse coding is realized in information representation of external stimuli, which is experimentally confirmed for visual stimulus recently. However, unlike the specific functional region in the brain, sparse coding in information processing in the whole brain has not been clarified sufficiently. In this study, we investigate the validity of sparse coding in the whole human brain by applying various matrix factorization methods to functional magnetic resonance imaging data of neural activities in the whole human brain. The result suggests sparse coding hypothesis in information representation in the whole human brain, because extracted features from sparse MF method, SparsePCA or MOD under high sparsity setting, or approximate sparse MF method, FastICA, can classify external visual stimuli more accurately than non-sparse MF method or sparse MF method under low sparsity setting.  ( 2 min )
    Incomplete Contrastive Multi-View Clustering with High-Confidence Guiding. (arXiv:2312.08697v1 [cs.CV])
    Incomplete multi-view clustering becomes an important research problem, since multi-view data with missing values are ubiquitous in real-world applications. Although great efforts have been made for incomplete multi-view clustering, there are still some challenges: 1) most existing methods didn't make full use of multi-view information to deal with missing values; 2) most methods just employ the consistent information within multi-view data but ignore the complementary information; 3) For the existing incomplete multi-view clustering methods, incomplete multi-view representation learning and clustering are treated as independent processes, which leads to performance gap. In this work, we proposed a novel Incomplete Contrastive Multi-View Clustering method with high-confidence guiding (ICMVC). Firstly, we proposed a multi-view consistency relation transfer plus graph convolutional network to tackle missing values problem. Secondly, instance-level attention fusion and high-confidence guiding are proposed to exploit the complementary information while instance-level contrastive learning for latent representation is designed to employ the consistent information. Thirdly, an end-to-end framework is proposed to integrate multi-view missing values handling, multi-view representation learning and clustering assignment for joint optimization. Experiments compared with state-of-the-art approaches demonstrated the effectiveness and superiority of our method. Our code is publicly available at https://github.com/liunian-Jay/ICMVC.  ( 2 min )
    Managing the unknown: a survey on Open Set Recognition and tangential areas. (arXiv:2312.08785v1 [cs.LG])
    In real-world scenarios classification models are often required to perform robustly when predicting samples belonging to classes that have not appeared during its training stage. Open Set Recognition addresses this issue by devising models capable of detecting unknown classes from samples arriving during the testing phase, while maintaining a good level of performance in the classification of samples belonging to known classes. This review comprehensively overviews the recent literature related to Open Set Recognition, identifying common practices, limitations, and connections of this field with other machine learning research areas, such as continual learning, out-of-distribution detection, novelty detection, and uncertainty estimation. Our work also uncovers open problems and suggests several research directions that may motivate and articulate future efforts towards more safe Artificial Intelligence methods.  ( 2 min )
    EAT: Towards Long-Tailed Out-of-Distribution Detection. (arXiv:2312.08939v1 [cs.LG])
    Despite recent advancements in out-of-distribution (OOD) detection, most current studies assume a class-balanced in-distribution training dataset, which is rarely the case in real-world scenarios. This paper addresses the challenging task of long-tailed OOD detection, where the in-distribution data follows a long-tailed class distribution. The main difficulty lies in distinguishing OOD data from samples belonging to the tail classes, as the ability of a classifier to detect OOD instances is not strongly correlated with its accuracy on the in-distribution classes. To overcome this issue, we propose two simple ideas: (1) Expanding the in-distribution class space by introducing multiple abstention classes. This approach allows us to build a detector with clear decision boundaries by training on OOD data using virtual labels. (2) Augmenting the context-limited tail classes by overlaying images onto the context-rich OOD data. This technique encourages the model to pay more attention to the discriminative features of the tail classes. We provide a clue for separating in-distribution and OOD data by analyzing gradient noise. Through extensive experiments, we demonstrate that our method outperforms the current state-of-the-art on various benchmark datasets. Moreover, our method can be used as an add-on for existing long-tail learning approaches, significantly enhancing their OOD detection performance. Code is available at: https://github.com/Stomach-ache/Long-Tailed-OOD-Detection .  ( 2 min )
    Unraveling Key Factors of Knowledge Distillation. (arXiv:2312.08585v1 [cs.CL])
    Knowledge distillation, a technique for model compression and performance enhancement, has gained significant traction in Neural Machine Translation (NMT). However, existing research primarily focuses on empirical applications, and there is a lack of comprehensive understanding of how student model capacity, data complexity, and decoding strategies collectively influence distillation effectiveness. Addressing this gap, our study conducts an in-depth investigation into these factors, particularly focusing on their interplay in word-level and sequence-level distillation within NMT. Through extensive experimentation across datasets like IWSLT13 En$\rightarrow$Fr, IWSLT14 En$\rightarrow$De, and others, we empirically validate hypotheses related to the impact of these factors on knowledge distillation. Our research not only elucidates the significant influence of model capacity, data complexity, and decoding strategies on distillation effectiveness but also introduces a novel, optimized distillation approach. This approach, when applied to the IWSLT14 de$\rightarrow$en translation task, achieves state-of-the-art performance, demonstrating its practical efficacy in advancing the field of NMT.  ( 2 min )
    Explainable Equivariant Neural Networks for Particle Physics: PELICAN. (arXiv:2307.16506v3 [hep-ph] UPDATED)
    PELICAN is a novel permutation equivariant and Lorentz invariant or covariant aggregator network designed to overcome common limitations found in architectures applied to particle physics problems. Compared to many approaches that use non-specialized architectures that neglect underlying physics principles and require very large numbers of parameters, PELICAN employs a fundamentally symmetry group-based architecture that demonstrates benefits in terms of reduced complexity, increased interpretability, and raw performance. We present a comprehensive study of the PELICAN algorithm architecture in the context of both tagging (classification) and reconstructing (regression) Lorentz-boosted top quarks, including the difficult task of specifically identifying and measuring the $W$-boson inside the dense environment of the Lorentz-boosted top-quark hadronic final state. We also extend the application of PELICAN to the tasks of identifying quark-initiated vs.~gluon-initiated jets, and a multi-class identification across five separate target categories of jets. When tested on the standard task of Lorentz-boosted top-quark tagging, PELICAN outperforms existing competitors with much lower model complexity and high sample efficiency. On the less common and more complex task of 4-momentum regression, PELICAN also outperforms hand-crafted, non-machine learning algorithms. We discuss the implications of symmetry-restricted architectures for the wider field of machine learning for physics.
    Guided Distillation for Semi-Supervised Instance Segmentation. (arXiv:2308.02668v2 [cs.CV] UPDATED)
    Although instance segmentation methods have improved considerably, the dominant paradigm is to rely on fully-annotated training images, which are tedious to obtain. To alleviate this reliance, and boost results, semi-supervised approaches leverage unlabeled data as an additional training signal that limits overfitting to the labeled samples. In this context, we present novel design choices to significantly improve teacher-student distillation models. In particular, we (i) improve the distillation approach by introducing a novel "guided burn-in" stage, and (ii) evaluate different instance segmentation architectures, as well as backbone networks and pre-training strategies. Contrary to previous work which uses only supervised data for the burn-in period of the student model, we also use guidance of the teacher model to exploit unlabeled data in the burn-in period. Our improved distillation approach leads to substantial improvements over previous state-of-the-art results. For example, on the Cityscapes dataset we improve mask-AP from 23.7 to 33.9 when using labels for 10\% of images, and on the COCO dataset we improve mask-AP from 18.3 to 34.1 when using labels for only 1\% of the training data.
    Diffusion Model in Causal Inference with Unmeasured Confounders. (arXiv:2308.03669v4 [cs.LG] UPDATED)
    We study how to extend the use of the diffusion model to answer the causal question from the observational data under the existence of unmeasured confounders. In Pearl's framework of using a Directed Acyclic Graph (DAG) to capture the causal intervention, a Diffusion-based Causal Model (DCM) was proposed incorporating the diffusion model to answer the causal questions more accurately, assuming that all of the confounders are observed. However, unmeasured confounders in practice exist, which hinders DCM from being applicable. To alleviate this limitation of DCM, we propose an extended model called Backdoor Criterion based DCM (BDCM), whose idea is rooted in the Backdoor criterion to find the variables in DAG to be included in the decoding process of the diffusion model so that we can extend DCM to the case with unmeasured confounders. Synthetic data experiment demonstrates that our proposed model captures the counterfactual distribution more precisely than DCM under the unmeasured confounders.
    DeepSaDe: Learning Neural Networks that Guarantee Domain Constraint Satisfaction. (arXiv:2303.01141v3 [cs.LG] UPDATED)
    As machine learning models, specifically neural networks, are becoming increasingly popular, there are concerns regarding their trustworthiness, specially in safety-critical applications, e.g. actions of an autonomous vehicle must be safe. There are approaches that can train neural networks where such domain requirements are enforced as constraints, but they either cannot guarantee that the constraint will be satisfied by all possible predictions (even on unseen data) or they are limited in the type of constraints that can be enforced. In this paper, we present an approach to train neural networks which can enforce a wide variety of constraints and guarantee that the constraint is satisfied by all possible predictions. The approach builds on earlier work where learning linear models is formulated as a constraint satisfaction problem (CSP). To make this idea applicable to neural networks, two crucial new elements are added: constraint propagation over the network layers, and weight updates based on a mix of gradient descent and CSP solving. Evaluation on various machine learning tasks demonstrates that our approach is flexible enough to enforce a wide variety of domain constraints and is able to guarantee them in neural networks.
    Diverse super-resolution with pretrained deep hiererarchical VAEs. (arXiv:2205.10347v3 [cs.CV] UPDATED)
    We investigate the problem of producing diverse solutions to an image super-resolution problem. From a probabilistic perspective, this can be done by sampling from the posterior distribution of an inverse problem, which requires the definition of a prior distribution on the high-resolution images. In this work, we propose to use a pretrained hierarchical variational autoencoder (HVAE) as a prior. We train a lightweight stochastic encoder to encode low-resolution images in the latent space of a pretrained HVAE. At inference, we combine the low-resolution encoder and the pretrained generative model to super-resolve an image. We demonstrate on the task of face super-resolution that our method provides an advantageous trade-off between the computational efficiency of conditional normalizing flows techniques and the sample quality of diffusion based methods.
    Holistic chemical evaluation reveals pitfalls in reaction prediction models. (arXiv:2312.09004v1 [physics.chem-ph])
    The prediction of chemical reactions has gained significant interest within the machine learning community in recent years, owing to its complexity and crucial applications in chemistry. However, model evaluation for this task has been mostly limited to simple metrics like top-k accuracy, which obfuscates fine details of a model's limitations. Inspired by progress in other fields, we propose a new assessment scheme that builds on top of current approaches, steering towards a more holistic evaluation. We introduce the following key components for this goal: CHORISO, a curated dataset along with multiple tailored splits to recreate chemically relevant scenarios, and a collection of metrics that provide a holistic view of a model's advantages and limitations. Application of this method to state-of-the-art models reveals important differences on sensitive fronts, especially stereoselectivity and chemical out-of-distribution generalization. Our work paves the way towards robust prediction models that can ultimately accelerate chemical discovery.
    FedSSA: Semantic Similarity-based Aggregation for Efficient Model-Heterogeneous Personalized Federated Learning. (arXiv:2312.09006v1 [cs.LG])
    Federated learning (FL) is a privacy-preserving collaboratively machine learning paradigm. Traditional FL requires all data owners (a.k.a. FL clients) to train the same local model. This design is not well-suited for scenarios involving data and/or system heterogeneity. Model-Heterogeneous Personalized FL (MHPFL) has emerged to address this challenge. Existing MHPFL approaches often rely on having a public dataset with the same nature of the learning task, or incur high computation and communication costs. To address these limitations, we propose the Federated Semantic Similarity Aggregation (FedSSA) approach, which splits each client's model into a heterogeneous (structure-different) feature extractor and a homogeneous (structure-same) classification header. It performs local-to-global knowledge transfer via semantic similarity-based header parameter aggregation. In addition, global-to-local knowledge transfer is achieved via an adaptive parameter stabilization strategy which fuses the seen-class parameters of historical local headers with that of the latest global header for each client. In this way, FedSSA does not rely on public datasets, while only requiring partial header parameter transmission (thereby saving costs). Theoretical analysis proves the convergence of FedSSA. Extensive experiments demonstrate that FedSSA achieves up to $3.62 \times\%$ higher accuracy, $15.54$ times higher communication efficiency, and $15.52 \times$ higher computational efficiency compared to 7 state-of-the-art MHPFL baselines.
    Knowledge-Driven Modulation of Neural Networks with Attention Mechanism for Next Activity Prediction. (arXiv:2312.08847v1 [cs.AI])
    Predictive Process Monitoring (PPM) aims at leveraging historic process execution data to predict how ongoing executions will continue up to their completion. In recent years, PPM techniques for the prediction of the next activities have matured significantly, mainly thanks to the use of Neural Networks (NNs) as a predictor. While their performance is difficult to beat in the general case, there are specific situations where background process knowledge can be helpful. Such knowledge can be leveraged for improving the quality of predictions for exceptional process executions or when the process changes due to a concept drift. In this paper, we present a Symbolic[Neuro] system that leverages background knowledge expressed in terms of a procedural process model to offset the under-sampling in the training data. More specifically, we make predictions using NNs with attention mechanism, an emerging technology in the NN field. The system has been tested on several real-life logs showing an improvement in the performance of the prediction task.
    Evaluating Large Language Models for Health-related Queries with Presuppositions. (arXiv:2312.08800v1 [cs.CL])
    As corporations rush to integrate large language models (LLMs) to their search offerings, it is critical that they provide factually accurate information that is robust to any presuppositions that a user may express. In this work, we introduce UPHILL, a dataset consisting of health-related queries with varying degrees of presuppositions. Using UPHILL, we evaluate the factual accuracy and consistency of InstructGPT, ChatGPT, and BingChat models. We find that while model responses rarely disagree with true health claims (posed as questions), they often fail to challenge false claims: responses from InstructGPT agree with 32% of the false claims, ChatGPT 26% and BingChat 23%. As we increase the extent of presupposition in input queries, the responses from InstructGPT and ChatGPT agree with the claim considerably more often, regardless of its veracity. Responses from BingChat, which rely on retrieved webpages, are not as susceptible. Given the moderate factual accuracy, and the inability of models to consistently correct false assumptions, our work calls for a careful assessment of current LLMs for use in high-stakes scenarios.
    MaxK-GNN: Towards Theoretical Speed Limits for Accelerating Graph Neural Networks Training. (arXiv:2312.08656v1 [cs.LG])
    In the acceleration of deep neural network training, the GPU has become the mainstream platform. GPUs face substantial challenges on GNNs, such as workload imbalance and memory access irregularities, leading to underutilized hardware. Existing solutions such as PyG, DGL with cuSPARSE, and GNNAdvisor frameworks partially address these challenges but memory traffic is still significant. We argue that drastic performance improvements can only be achieved by the vertical optimization of algorithm and system innovations, rather than treating the speedup optimization as an "after-thought" (i.e., (i) given a GNN algorithm, designing an accelerator, or (ii) given hardware, mainly optimizing the GNN algorithm). In this paper, we present MaxK-GNN, an advanced high-performance GPU training system integrating algorithm and system innovation. (i) We introduce the MaxK nonlinearity and provide a theoretical analysis of MaxK nonlinearity as a universal approximator, and present the Compressed Balanced Sparse Row (CBSR) format, designed to store the data and index of the feature matrix after nonlinearity; (ii) We design a coalescing enhanced forward computation with row-wise product-based SpGEMM Kernel using CBSR for input feature matrix fetching and strategic placement of a sparse output accumulation buffer in shared memory; (iii) We develop an optimized backward computation with outer product-based and SSpMM Kernel. We conduct extensive evaluations of MaxK-GNN and report the end-to-end system run-time. Experiments show that MaxK-GNN system could approach the theoretical speedup limit according to Amdahl's law. We achieve comparable accuracy to SOTA GNNs, but at a significantly increased speed: 3.22/4.24 times speedup (vs. theoretical limits, 5.52/7.27 times) on Reddit compared to DGL and GNNAdvisor implementations.  ( 3 min )
    Offshore Wind Plant Instance Segmentation Using Sentinel-1 Time Series, GIS, and Semantic Segmentation Models. (arXiv:2312.08773v1 [cs.CV])
    Offshore wind farms represent a renewable energy source with a significant global growth trend, and their monitoring is strategic for territorial and environmental planning. This study's primary objective is to detect offshore wind plants at an instance level using semantic segmentation models and Sentinel-1 time series. The secondary objectives are: (a) to develop a database consisting of labeled data and S-1 time series; (b) to compare the performance of five deep semantic segmentation architectures (U-Net, U-Net++, Feature Pyramid Network - FPN, DeepLabv3+, and LinkNet); (c) develop a novel augmentation strategy that shuffles the positions of the images within the time series; (d) investigate different dimensions of time series intervals (1, 5, 10, and 15 images); and (e) evaluate the semantic-to-instance conversion procedure. LinkNet was the top-performing model, followed by U-Net++ and U-Net, while FPN and DeepLabv3+ presented the worst results. The evaluation of semantic segmentation models reveals enhanced Intersection over Union (IoU) (25%) and F-score metrics (18%) with the augmentation of time series images. The study showcases the augmentation strategy's capability to mitigate biases and precisely detect invariant targets. Furthermore, the conversion from semantic to instance segmentation demonstrates its efficacy in accurately isolating individual instances within classified regions - simplifying training data and reducing annotation effort and complexity.  ( 3 min )
    How to Raise a Robot -- A Case for Neuro-Symbolic AI in Constrained Task Planning for Humanoid Assistive Robots. (arXiv:2312.08820v1 [cs.RO])
    Humanoid robots will be able to assist humans in their daily life, in particular due to their versatile action capabilities. However, while these robots need a certain degree of autonomy to learn and explore, they also should respect various constraints, for access control and beyond. We explore the novel field of incorporating privacy, security, and access control constraints with robot task planning approaches. We report preliminary results on the classical symbolic approach, deep-learned neural networks, and modern ideas using large language models as knowledge base. From analyzing their trade-offs, we conclude that a hybrid approach is necessary, and thereby present a new use case for the emerging field of neuro-symbolic artificial intelligence.  ( 2 min )
    LGD-GCN: Local and Global Disentangled Graph Convolutional Networks. (arXiv:2104.11893v3 [cs.LG] UPDATED)
    Disentangled Graph Convolutional Network (DisenGCN) is an encouraging framework to disentangle the latent factors arising in a real-world graph. However, it relies on disentangling information heavily from a local range (i.e., a node and its 1-hop neighbors), while the local information in many cases can be uneven and incomplete, hindering the interpretabiliy power and model performance of DisenGCN. In this paper\footnote{This paper is a lighter version of \href{https://jingweio.github.io/assets/pdf/tnnls22.pdf}{"Learning Disentangled Graph Convolutional Networks Locally and Globally"} where the results and analysis have been reworked substantially. Digital Object Identifier \url{https://doi.org/10.1109/TNNLS.2022.3195336}.}, we introduce a novel Local and Global Disentangled Graph Convolutional Network (LGD-GCN) to capture both local and global information for graph disentanglement. LGD-GCN performs a statistical mixture modeling to derive a factor-aware latent continuous space, and then constructs different structures w.r.t. different factors from the revealed space. In this way, the global factor-specific information can be efficiently and selectively encoded via a message passing along these built structures, strengthening the intra-factor consistency. We also propose a novel diversity promoting regularizer employed with the latent space modeling, to encourage inter-factor diversity. Evaluations of the proposed LGD-GCN on the synthetic and real-world datasets show a better interpretability and improved performance in node classification over the existing competitive models. Code is available at \url{https://github.com/jingweio/LGD-GCN}.  ( 3 min )
    Automated detection of Zika and dengue in Aedes aegypti using neural spiking analysis. (arXiv:2312.08654v1 [cs.LG])
    Mosquito-borne diseases present considerable risks to the health of both animals and humans. Aedes aegypti mosquitoes are the primary vectors for numerous medically important viruses such as dengue, Zika, yellow fever, and chikungunya. To characterize this mosquito neural activity, it is essential to classify the generated electrical spikes. However, no open-source neural spike classification method is currently available for mosquitoes. Our work presented in this paper provides an innovative artificial intelligence-based method to classify the neural spikes in uninfected, dengue-infected, and Zika-infected mosquitoes. Aiming for outstanding performance, the method employs a fusion of normalization, feature importance, and dimension reduction for the preprocessing and combines convolutional neural network and extra gradient boosting (XGBoost) for classification. The method uses the electrical spiking activity data of mosquito neurons recorded by microelectrode array technology. We used data from 0, 1, 2, 3, and 7 days post-infection, containing over 15 million samples, to analyze the method's performance. The performance of the proposed method was evaluated using accuracy, precision, recall, and the F1 scores. The results obtained from the method highlight its remarkable performance in differentiating infected vs uninfected mosquito samples, achieving an average of 98.1%. The performance was also compared with 6 other machine learning algorithms to further assess the method's capability. The method outperformed all other machine learning algorithms' performance. Overall, this research serves as an efficient method to classify the neural spikes of Aedes aegypti mosquitoes and can assist in unraveling the complex interactions between pathogens and mosquitoes.  ( 3 min )
    RdimKD: Generic Distillation Paradigm by Dimensionality Reduction. (arXiv:2312.08700v1 [cs.LG])
    Knowledge Distillation (KD) emerges as one of the most promising compression technologies to run advanced deep neural networks on resource-limited devices. In order to train a small network (student) under the guidance of a large network (teacher), the intuitive method is regularizing the feature maps or logits of the student using the teacher's information. However, existing methods either over-restrict the student to learn all information from the teacher, which lead to some bad local minimum, or use various fancy and elaborate modules to process and align features, which are complex and lack generality. In this work, we proposed an abstract and general paradigm for the KD task, referred to as DIMensionality Reduction KD (RdimKD), which solely relies on dimensionality reduction, with a very minor modification to naive L2 loss. RdimKD straightforwardly utilizes a projection matrix to project both the teacher's and student's feature maps onto a low-dimensional subspace, which are then optimized during training. RdimKD achieves the goal in the simplest way that not only does the student get valuable information from the teacher, but it also ensures sufficient flexibility to adapt to the student's low-capacity reality. Our extensive empirical findings indicate the effectiveness of RdimKD across various learning tasks and diverse network architectures.  ( 2 min )
    Consistent and Asymptotically Unbiased Estimation of Proper Calibration Errors. (arXiv:2312.08589v1 [cs.LG])
    Proper scoring rules evaluate the quality of probabilistic predictions, playing an essential role in the pursuit of accurate and well-calibrated models. Every proper score decomposes into two fundamental components -- proper calibration error and refinement -- utilizing a Bregman divergence. While uncertainty calibration has gained significant attention, current literature lacks a general estimator for these quantities with known statistical properties. To address this gap, we propose a method that allows consistent, and asymptotically unbiased estimation of all proper calibration errors and refinement terms. In particular, we introduce Kullback--Leibler calibration error, induced by the commonly used cross-entropy loss. As part of our results, we prove the relation between refinement and f-divergences, which implies information monotonicity in neural networks, regardless of which proper scoring rule is optimized. Our experiments validate empirically the claimed properties of the proposed estimator and suggest that the selection of a post-hoc calibration method should be determined by the particular calibration error of interest.  ( 2 min )
    Localization with Reconfigurable Intelligent Surface: An Active Sensing Approach. (arXiv:2312.09002v1 [cs.IT])
    This paper addresses an uplink localization problem in which a base station (BS) aims to locate a remote user with the help of reconfigurable intelligent surfaces (RISs). We propose a strategy in which the user transmits pilots sequentially and the BS adaptively adjusts the sensing vectors, including the BS beamforming vector and multiple RIS reflection coefficients based on the observations already made, to eventually produce an estimated user position. This is a challenging active sensing problem for which finding an optimal solution involves searching through a complicated functional space whose dimension increases with the number of measurements. We show that the long short-term memory (LSTM) network can be used to exploit the latent temporal correlation between measurements to automatically construct scalable state vectors. Subsequently, the state vector is mapped to the sensing vectors for the next time frame via a deep neural network (DNN). A final DNN is used to map the state vector to the estimated user position. Numerical result illustrates the advantage of the active sensing design as compared to non-active sensing methods. The proposed solution produces interpretable results and is generalizable in the number of sensing stages. Remarkably, we show that a network with one BS and multiple RISs can outperform a comparable setting with multiple BSs.  ( 2 min )
    Domain Prompt Learning with Quaternion Networks. (arXiv:2312.08878v1 [cs.CV])
    Prompt learning has emerged as an effective and data-efficient technique in large Vision-Language Models (VLMs). However, when adapting VLMs to specialized domains such as remote sensing and medical imaging, domain prompt learning remains underexplored. While large-scale domain-specific foundation models can help tackle this challenge, their concentration on a single vision level makes it challenging to prompt both vision and language modalities. To overcome this, we propose to leverage domain-specific knowledge from domain-specific foundation models to transfer the robust recognition ability of VLMs from generalized to specialized domains, using quaternion networks. Specifically, the proposed method involves using domain-specific vision features from domain-specific foundation models to guide the transformation of generalized contextual embeddings from the language branch into a specialized space within the quaternion networks. Moreover, we present a hierarchical approach that generates vision prompt features by analyzing intermodal relationships between hierarchical language prompt features and domain-specific vision features. In this way, quaternion networks can effectively mine the intermodal relationships in the specific domain, facilitating domain-specific vision-language contrastive learning. Extensive experiments on domain-specific datasets show that our proposed method achieves new state-of-the-art results in prompt learning.  ( 2 min )
    A Cyber-Physical Architecture for Microgrids based on Deep learning and LORA Technology. (arXiv:2312.08818v1 [cs.LG])
    This paper proposes a cyber-physical architecture for the secured social operation of isolated hybrid microgrids (HMGs). On the physical side of the proposed architecture, an optimal scheduling scheme considering various renewable energy sources (RESs) and fossil fuel-based distributed generation units (DGs) is proposed. Regarding the cyber layer of MGs, a wireless architecture based on low range wide area (LORA) technology is introduced for advanced metering infrastructure (AMI) in smart electricity grids. In the proposed architecture, the LORA data frame is described in detail and designed for the application of smart meters considering DGs and ac-dc converters. Additionally, since the cyber layer of smart grids is highly vulnerable to cyber-attacks, t1his paper proposes a deep-learning-based cyber-attack detection model (CADM) based on bidirectional long short-term memory (BLSTM) and sequential hypothesis testing (SHT) to detect false data injection attacks (FDIA) on the smart meters within AMI. The performance of the proposed energy management architecture is evaluated using the IEEE 33-bus test system. In order to investigate the effect of FDIA on the isolated HMGs and highlight the interactions between the cyber layer and physical layer, an FDIA is launched against the test system. The results showed that a successful attack can highly damage the system and cause widespread load shedding. Also, the performance of the proposed CADM is examined using a real-world dataset. Results prove the effectiveness of the proposed CADM in detecting the attacks using only two samples.  ( 3 min )
    Diffusion-C: Unveiling the Generative Challenges of Diffusion Models through Corrupted Data. (arXiv:2312.08843v1 [cs.LG])
    In our contemporary academic inquiry, we present "Diffusion-C," a foundational methodology to analyze the generative restrictions of Diffusion Models, particularly those akin to GANs, DDPM, and DDIM. By employing input visual data that has been subjected to a myriad of corruption modalities and intensities, we elucidate the performance characteristics of those Diffusion Models. The noise component takes center stage in our analysis, hypothesized to be a pivotal element influencing the mechanics of deep learning systems. In our rigorous expedition utilizing Diffusion-C, we have discerned the following critical observations: (I) Within the milieu of generative models under the Diffusion taxonomy, DDPM emerges as a paragon, consistently exhibiting superior performance metrics. (II) Within the vast spectrum of corruption frameworks, the fog and fractal corruptions notably undermine the functional robustness of both DDPM and DDIM. (III) The vulnerability of Diffusion Models to these particular corruptions is significantly influenced by topological and statistical similarities, particularly concerning the alignment between mean and variance. This scholarly work highlights Diffusion-C's core understandings regarding the impacts of various corruptions, setting the stage for future research endeavors in the realm of generative models.  ( 2 min )
    Learning a Low-Rank Feature Representation: Achieving Better Trade-Off between Stability and Plasticity in Continual Learning. (arXiv:2312.08740v1 [cs.LG])
    In continual learning, networks confront a trade-off between stability and plasticity when trained on a sequence of tasks. To bolster plasticity without sacrificing stability, we propose a novel training algorithm called LRFR. This approach optimizes network parameters in the null space of the past tasks' feature representation matrix to guarantee the stability. Concurrently, we judiciously select only a subset of neurons in each layer of the network while training individual tasks to learn the past tasks' feature representation matrix in low-rank. This increases the null space dimension when designing network parameters for subsequent tasks, thereby enhancing the plasticity. Using CIFAR-100 and TinyImageNet as benchmark datasets for continual learning, the proposed approach consistently outperforms state-of-the-art methods.  ( 2 min )
    HAROOD: Human Activity Classification and Out-of-Distribution Detection with Short-Range FMCW Radar. (arXiv:2312.08894v1 [cs.CV])
    We propose HAROOD as a short-range FMCW radar-based human activity classifier and out-of-distribution (OOD) detector. It aims to classify human sitting, standing, and walking activities and to detect any other moving or stationary object as OOD. We introduce a two-stage network. The first stage is trained with a novel loss function that includes intermediate reconstruction loss, intermediate contrastive loss, and triplet loss. The second stage uses the first stage's output as its input and is trained with cross-entropy loss. It creates a simple classifier that performs the activity classification. On our dataset collected by 60 GHz short-range FMCW radar, we achieve an average classification accuracy of 96.51%. Also, we achieve an average AUROC of 95.04% as an OOD detector. Additionally, our extensive evaluations demonstrate the superiority of HAROOD over the state-of-the-art OOD detection methods in terms of standard OOD detection metrics.  ( 2 min )
    Deep Learning-Based Cyber-Attack Detection Model for Smart Grids. (arXiv:2312.08810v1 [cs.LG])
    In this paper, a novel artificial intelligence-based cyber-attack detection model for smart grids is developed to stop data integrity cyber-attacks (DIAs) on the received load data by supervisory control and data acquisition (SCADA). In the proposed model, first the load data is forecasted using a regression model and after processing stage, the processed data is clustered using the unsupervised learning method. In this work, in order to achieve the best performance, three load forecasting methods (i.e. extra tree regression (ETR), long short-term memory (LSTM) and bidirectional long short-term memory (BiLSTM)) are utilized as regression models and their performance is compared. For clustering and outlying detection, the covariance elliptic envelope (EE) is employed as an unsupervised learning method. To examine the proposed model, the hourly load data of the power company of the city of Johor in Malaysia is employed and Two common DIAs, which are DIAs targeting economic loss and DIAs targeting blackouts, are used to evaluate the accuracy of detection methods in several scenarios. The simulation results show that the proposed EE-BiLSTM method can perform more robust and accurate compared to the other two methods.  ( 2 min )
    Temporal-Spatial Entropy Balancing for Causal Continuous Treatment-Effect Estimation. (arXiv:2312.08670v1 [stat.ME])
    In the field of intracity freight transportation, changes in order volume are significantly influenced by temporal and spatial factors. When building subsidy and pricing strategies, predicting the causal effects of these strategies on order volume is crucial. In the process of calculating causal effects, confounding variables can have an impact. Traditional methods to control confounding variables handle data from a holistic perspective, which cannot ensure the precision of causal effects in specific temporal and spatial dimensions. However, temporal and spatial dimensions are extremely critical in the logistics field, and this limitation may directly affect the precision of subsidy and pricing strategies. To address these issues, this study proposes a technique based on flexible temporal-spatial grid partitioning. Furthermore, based on the flexible grid partitioning technique, we further propose a continuous entropy balancing method in the temporal-spatial domain, which named TS-EBCT (Temporal-Spatial Entropy Balancing for Causal Continue Treatments). The method proposed in this paper has been tested on two simulation datasets and two real datasets, all of which have achieved excellent performance. In fact, after applying the TS-EBCT method to the intracity freight transportation field, the prediction accuracy of the causal effect has been significantly improved. It brings good business benefits to the company's subsidy and pricing strategies.  ( 3 min )
    Read Between the Layers: Leveraging Intra-Layer Representations for Rehearsal-Free Continual Learning with Pre-Trained Models. (arXiv:2312.08888v1 [cs.LG])
    We address the Continual Learning (CL) problem, where a model has to learn a sequence of tasks from non-stationary distributions while preserving prior knowledge as it encounters new experiences. With the advancement of foundation models, CL research has shifted focus from the initial learning-from-scratch paradigm to the use of generic features from large-scale pre-training. However, existing approaches to CL with pre-trained models only focus on separating the class-specific features from the final representation layer and neglect the power of intermediate representations that capture low- and mid-level features naturally more invariant to domain shifts. In this work, we propose LayUP, a new class-prototype-based approach to continual learning that leverages second-order feature statistics from multiple intermediate layers of a pre-trained network. Our method is conceptually simple, does not require any replay buffer, and works out of the box with any foundation model. LayUP improves over the state-of-the-art on four of the seven class-incremental learning settings at a considerably reduced memory and computational footprint compared with the next best baseline. Our results demonstrate that fully exhausting the representational capacities of pre-trained models in CL goes far beyond their final embeddings.  ( 2 min )
    Learning Safety Constraints From Demonstration Using One-Class Decision Trees. (arXiv:2312.08837v1 [cs.LG])
    The alignment of autonomous agents with human values is a pivotal challenge when deploying these agents within physical environments, where safety is an important concern. However, defining the agent's objective as a reward and/or cost function is inherently complex and prone to human errors. In response to this challenge, we present a novel approach that leverages one-class decision trees to facilitate learning from expert demonstrations. These decision trees provide a foundation for representing a set of constraints pertinent to the given environment as a logical formula in disjunctive normal form. The learned constraints are subsequently employed within an oracle constrained reinforcement learning framework, enabling the acquisition of a safe policy. In contrast to other methods, our approach offers an interpretable representation of the constraints, a vital feature in safety-critical environments. To validate the effectiveness of our proposed method, we conduct experiments in synthetic benchmark domains and a realistic driving environment.  ( 2 min )
    Improve Robustness of Reinforcement Learning against Observation Perturbations via $l_\infty$ Lipschitz Policy Networks. (arXiv:2312.08751v1 [cs.LG])
    Deep Reinforcement Learning (DRL) has achieved remarkable advances in sequential decision tasks. However, recent works have revealed that DRL agents are susceptible to slight perturbations in observations. This vulnerability raises concerns regarding the effectiveness and robustness of deploying such agents in real-world applications. In this work, we propose a novel robust reinforcement learning method called SortRL, which improves the robustness of DRL policies against observation perturbations from the perspective of the network architecture. We employ a novel architecture for the policy network that incorporates global $l_\infty$ Lipschitz continuity and provide a convenient method to enhance policy robustness based on the output margin. Besides, a training framework is designed for SortRL, which solves given tasks while maintaining robustness against $l_\infty$ bounded perturbations on the observations. Several experiments are conducted to evaluate the effectiveness of our method, including classic control tasks and video games. The results demonstrate that SortRL achieves state-of-the-art robustness performance against different perturbation strength.  ( 2 min )
    StemGen: A music generation model that listens. (arXiv:2312.08723v1 [cs.SD])
    End-to-end generation of musical audio using deep learning techniques has seen an explosion of activity recently. However, most models concentrate on generating fully mixed music in response to abstract conditioning information. In this work, we present an alternative paradigm for producing music generation models that can listen and respond to musical context. We describe how such a model can be constructed using a non-autoregressive, transformer-based model architecture and present a number of novel architectural and sampling improvements. We train the described architecture on both an open-source and a proprietary dataset. We evaluate the produced models using standard quality metrics and a new approach based on music information retrieval descriptors. The resulting model reaches the audio quality of state-of-the-art text-conditioned models, as well as exhibiting strong musical coherence with its context.  ( 2 min )
    May the Noise be with you: Adversarial Training without Adversarial Examples. (arXiv:2312.08877v1 [cs.LG])
    In this paper, we investigate the following question: Can we obtain adversarially-trained models without training on adversarial examples? Our intuition is that training a model with inherent stochasticity, i.e., optimizing the parameters by minimizing a stochastic loss function, yields a robust expectation function that is non-stochastic. In contrast to related methods that introduce noise at the input level, our proposed approach incorporates inherent stochasticity by embedding Gaussian noise within the layers of the NN model at training time. We model the propagation of noise through the layers, introducing a closed-form stochastic loss function that encapsulates a noise variance parameter. Additionally, we contribute a formalized noise-aware gradient, enabling the optimization of model parameters while accounting for stochasticity. Our experimental results confirm that the expectation model of a stochastic architecture trained on benign distribution is adversarially robust. Interestingly, we find that the impact of the applied Gaussian noise's standard deviation on both robustness and baseline accuracy closely mirrors the impact of the noise magnitude employed in adversarial training. Our work contributes adversarially trained networks using a completely different approach, with empirically similar robustness to adversarial training.  ( 2 min )
    Real-time Autonomous Control of a Continuous Macroscopic Process as Demonstrated by Plastic Forming. (arXiv:2312.08658v1 [cond-mat.soft])
    To meet the demands for more adaptable and expedient approaches to augment both research and manufacturing, we report an autonomous system using real-time in-situ characterization and an autonomous, decision-making processer based on an active learning algorithm. This system was applied to a plastic film forming system to highlight its efficiency and accuracy in determining the process conditions for specified target film dimensions, importantly, without any human intervention. Application of this system towards nine distinct film dimensions demonstrated the system ability to quickly determine the appropriate and stable process conditions (average 11 characterization-adjustment iterations, 19 minutes) and the ability to avoid traps, such as repetitive over-correction. Furthermore, comparison of the achieved film dimensions to the target values showed a high accuracy (R2 = 0.87, 0.90) for film width and thickness, respectively. In addition, the use of an active learning algorithm afforded our system to proceed optimization with zero initial training data, which was unavailable due to the complex relationships between the control factors (material supply rate, applied force, material viscosity) within the plastic forming process. As our system is intrinsically general and can be applied to any most material processes, these results have significant implications in accelerating both research and industrial processes.  ( 3 min )
    Solving Dense Linear Systems Faster than via Preconditioning. (arXiv:2312.08893v1 [cs.DS])
    We give a stochastic optimization algorithm that solves a dense $n\times n$ real-valued linear system $Ax=b$, returning $\tilde x$ such that $\|A\tilde x-b\|\leq \epsilon\|b\|$ in time: $$\tilde O((n^2+nk^{\omega-1})\log1/\epsilon),$$ where $k$ is the number of singular values of $A$ larger than $O(1)$ times its smallest positive singular value, $\omega < 2.372$ is the matrix multiplication exponent, and $\tilde O$ hides a poly-logarithmic in $n$ factor. When $k=O(n^{1-\theta})$ (namely, $A$ has a flat-tailed spectrum, e.g., due to noisy data or regularization), this improves on both the cost of solving the system directly, as well as on the cost of preconditioning an iterative method such as conjugate gradient. In particular, our algorithm has an $\tilde O(n^2)$ runtime when $k=O(n^{0.729})$. We further adapt this result to sparse positive semidefinite matrices and least squares regression. Our main algorithm can be viewed as a randomized block coordinate descent method, where the key challenge is simultaneously ensuring good convergence and fast per-iteration time. In our analysis, we use theory of majorization for elementary symmetric polynomials to establish a sharp convergence guarantee when coordinate blocks are sampled using a determinantal point process. We then use a Markov chain coupling argument to show that similar convergence can be attained with a cheaper sampling scheme, and accelerate the block coordinate descent update via matrix sketching.  ( 2 min )
    Gradient Informed Proximal Policy Optimization. (arXiv:2312.08710v1 [cs.LG])
    We introduce a novel policy learning method that integrates analytical gradients from differentiable environments with the Proximal Policy Optimization (PPO) algorithm. To incorporate analytical gradients into the PPO framework, we introduce the concept of an {\alpha}-policy that stands as a locally superior policy. By adaptively modifying the {\alpha} value, we can effectively manage the influence of analytical policy gradients during learning. To this end, we suggest metrics for assessing the variance and bias of analytical gradients, reducing dependence on these gradients when high variance or bias is detected. Our proposed approach outperforms baseline algorithms in various scenarios, such as function optimization, physics simulations, and traffic control environments. Our code can be found online: https://github.com/SonSang/gippo.  ( 2 min )
    TiMix: Text-aware Image Mixing for Effective Vision-Language Pre-training. (arXiv:2312.08846v1 [cs.LG])
    Self-supervised Multi-modal Contrastive Learning (SMCL) remarkably advances modern Vision-Language Pre-training (VLP) models by aligning visual and linguistic modalities. Due to noises in web-harvested text-image pairs, however, scaling up training data volume in SMCL presents considerable obstacles in terms of computational cost and data inefficiency. To improve data efficiency in VLP, we propose Text-aware Image Mixing (TiMix), which integrates mix-based data augmentation techniques into SMCL, yielding significant performance improvements without significantly increasing computational overhead. We provide a theoretical analysis of TiMixfrom a mutual information (MI) perspective, showing that mixed data samples for cross-modal contrastive learning implicitly serve as a regularizer for the contrastive loss. The experimental results demonstrate that TiMix exhibits a comparable performance on downstream tasks, even with a reduced amount of training data and shorter training time, when benchmarked against existing methods. This work empirically and theoretically demonstrates the potential of data mixing for data-efficient and computationally viable VLP, benefiting broader VLP model adoption in practical scenarios.  ( 2 min )
    Detection and Defense of Unlearnable Examples. (arXiv:2312.08898v1 [cs.LG])
    Privacy preserving has become increasingly critical with the emergence of social media. Unlearnable examples have been proposed to avoid leaking personal information on the Internet by degrading generalization abilities of deep learning models. However, our study reveals that unlearnable examples are easily detectable. We provide theoretical results on linear separability of certain unlearnable poisoned dataset and simple network based detection methods that can identify all existing unlearnable examples, as demonstrated by extensive experiments. Detectability of unlearnable examples with simple networks motivates us to design a novel defense method. We propose using stronger data augmentations coupled with adversarial noises generated by simple networks, to degrade the detectability and thus provide effective defense against unlearnable examples with a lower cost. Adversarial training with large budgets is a widely-used defense method on unlearnable examples. We establish quantitative criteria between the poison and adversarial budgets which determine the existence of robust unlearnable examples or the failure of the adversarial defense.  ( 2 min )
    Estimating calibration error under label shift without labels. (arXiv:2312.08586v1 [cs.LG])
    In the face of dataset shift, model calibration plays a pivotal role in ensuring the reliability of machine learning systems. Calibration error (CE) is an indicator of the alignment between the predicted probabilities and the classifier accuracy. While prior works have delved into the implications of dataset shift on calibration, existing CE estimators assume access to labels from the target domain, which are often unavailable in practice, i.e., when the model is deployed and used. This work addresses such challenging scenario, and proposes a novel CE estimator under label shift, which is characterized by changes in the marginal label distribution $p(Y)$, while keeping the conditional $p(X|Y)$ constant between the source and target distributions. Our contribution is an approach, which, by leveraging importance re-weighting of the labeled source distribution, provides consistent and asymptotically unbiased CE estimation with respect to the shifted target distribution. Empirical results across diverse real-world datasets, under various conditions and label-shift intensities, demonstrate the effectiveness and reliability of the proposed estimator.  ( 2 min )
    Transferring climate change knowledge. (arXiv:2309.14780v2 [physics.ao-ph] UPDATED)
    Accurate climate projections are required for climate adaptation and mitigation. Earth system model simulations, used to project climate change, inherently make approximations in their representation of small-scale physical processes, such as the formation of clouds, that are at the root of the uncertainties in global mean temperature's response to increased greenhouse gas concentrations. Several approaches have been developed to use historical observations to constrain future projections and reduce uncertainties in climate projections and climate feedbacks. Yet those methods cannot capture the non-linear complexity inherent in the climate system. Using a Transfer Learning approach, we show that Machine Learning, in particular Deep Neural Networks, can be used to optimally leverage and merge the knowledge gained from Earth system model simulations and historical observations to more accurately project global surface temperature fields in the 21st century. We reach a reduction in the 5-95% uncertainty range of global surface air temperature in 2081-2098 of up to 56% and 52% - across the Shared Socioeconomic Pathways considered - with respect to state-of-the-art approaches and the Sixth Assessment Report from the Intergovernmental Panel on Climate Change, respectively. We give evidence that our novel method provides narrower multi-model uncertainty together with more accurate climate projections, urgently required for climate adaptation.  ( 2 min )
    Turning Waste into Wealth: Leveraging Low-Quality Samples for Enhancing Continuous Conditional Generative Adversarial Networks. (arXiv:2308.10273v2 [cs.CV] UPDATED)
    Continuous Conditional Generative Adversarial Networks (CcGANs) enable generative modeling conditional on continuous scalar variables (termed regression labels). However, they can produce subpar fake images due to limited training data. Although Negative Data Augmentation (NDA) effectively enhances unconditional and class-conditional GANs by introducing anomalies into real training images, guiding the GANs away from low-quality outputs, its impact on CcGANs is limited, as it fails to replicate negative samples that may occur during the CcGAN sampling. We present a novel NDA approach called Dual-NDA specifically tailored for CcGANs to address this problem. Dual-NDA employs two types of negative samples: visually unrealistic images generated from a pre-trained CcGAN and label-inconsistent images created by manipulating real images' labels. Leveraging these negative samples, we introduce a novel discriminator objective alongside a modified CcGAN training algorithm. Empirical analysis on UTKFace and Steering Angle reveals that Dual-NDA consistently enhances the visual fidelity and label consistency of fake images generated by CcGANs, exhibiting a substantial performance gain over the vanilla NDA. Moreover, by applying Dual-NDA, CcGANs demonstrate a remarkable advancement beyond the capabilities of state-of-the-art conditional GANs and diffusion models, establishing a new pinnacle of performance. Our codes can be found at https://github.com/UBCDingXin/Dual-NDA.  ( 2 min )
    SABLE: Secure And Byzantine robust LEarning. (arXiv:2309.05395v4 [cs.LG] UPDATED)
    Due to the widespread availability of data, machine learning (ML) algorithms are increasingly being implemented in distributed topologies, wherein various nodes collaborate to train ML models via the coordination of a central server. However, distributed learning approaches face significant vulnerabilities, primarily stemming from two potential threats. Firstly, the presence of Byzantine nodes poses a risk of corrupting the learning process by transmitting inaccurate information to the server. Secondly, a curious server may compromise the privacy of individual nodes, sometimes reconstructing the entirety of the nodes' data. Homomorphic encryption (HE) has emerged as a leading security measure to preserve privacy in distributed learning under non-Byzantine scenarios. However, the extensive computational demands of HE, particularly for high-dimensional ML models, have deterred attempts to design purely homomorphic operators for non-linear robust aggregators. This paper introduces SABLE, the first homomorphic and Byzantine robust distributed learning algorithm. SABLE leverages HTS, a novel and efficient homomorphic operator implementing the prominent coordinate-wise trimmed mean robust aggregator. Designing HTS enables us to implement HMED, a novel homomorphic median aggregator. Extensive experiments on standard ML tasks demonstrate that SABLE achieves practical execution times while maintaining an ML accuracy comparable to its non-private counterpart.  ( 2 min )
    Ever Evolving Evaluator (EV3): Towards Flexible and Reliable Meta-Optimization for Knowledge Distillation. (arXiv:2310.18893v2 [cs.LG] UPDATED)
    We introduce EV3, a novel meta-optimization framework designed to efficiently train scalable machine learning models through an intuitive explore-assess-adapt protocol. In each iteration of EV3, we explore various model parameter updates, assess them using pertinent evaluation methods, and then adapt the model based on the optimal updates and previous progress history. EV3 offers substantial flexibility without imposing stringent constraints like differentiability on the key objectives relevant to the tasks of interest, allowing for exploratory updates with intentionally-biased gradients and through a diversity of losses and optimizers. Additionally, the assessment phase provides reliable safety controls to ensure robust generalization, and can dynamically prioritize tasks in scenarios with multiple objectives. With inspiration drawn from evolutionary algorithms, meta-learning, and neural architecture search, we investigate an application of EV3 to knowledge distillation. Our experimental results illustrate EV3's capability to safely explore the modeling landscape, while hinting at its potential applicability across numerous domains due to its inherent flexibility and adaptability. Finally, we provide a JAX implementation of EV3, along with source code for experiments, available at: https://github.com/google-research/google-research/tree/master/ev3.  ( 2 min )
    Transformers Implement Functional Gradient Descent to Learn Non-Linear Functions In Context. (arXiv:2312.06528v2 [cs.LG] UPDATED)
    Many neural network architectures have been shown to be Turing Complete, and can thus implement arbitrary algorithms. However, Transformers are unique in that they can implement gradient-based learning algorithms \emph{under simple parameter configurations}. A line of recent work shows that linear Transformers naturally learn to implement gradient descent (GD) when trained on a linear regression in-context learning task. But the linearity assumption (either in the Transformer architecture or in the learning task) is far from realistic settings where non-linear activations crucially enable Transformers to learn complicated non-linear functions. In this paper, we provide theoretical and empirical evidence that non-linear Transformers can, and \emph{in fact do}, learn to implement learning algorithms to learn non-linear functions in context. Our results apply to a broad class of combinations of non-linear architectures, and non-linear in-context learning tasks. Interestingly, we show that the optimal choice of non-linear activation depends in a natural way on the non-linearity of the learning task.  ( 2 min )
    Visual Prompting Upgrades Neural Network Sparsification: A Data-Model Perspective. (arXiv:2312.01397v2 [cs.CV] UPDATED)
    The rapid development of large-scale deep learning models questions the affordability of hardware platforms, which necessitates the pruning to reduce their computational and memory footprints. Sparse neural networks as the product, have demonstrated numerous favorable benefits like low complexity, undamaged generalization, etc. Most of the prominent pruning strategies are invented from a model-centric perspective, focusing on searching and preserving crucial weights by analyzing network topologies. However, the role of data and its interplay with model-centric pruning has remained relatively unexplored. In this research, we introduce a novel data-model co-design perspective: to promote superior weight sparsity by learning important model topology and adequate input data in a synergetic manner. Specifically, customized Visual Prompts are mounted to upgrade neural Network sparsification in our proposed VPNs framework. As a pioneering effort, this paper conducts systematic investigations about the impact of different visual prompts on model pruning and suggests an effective joint optimization approach. Extensive experiments with 3 network architectures and 8 datasets evidence the substantial performance improvements from VPNs over existing start-of-the-art pruning algorithms. Furthermore, we find that subnetworks discovered by VPNs from pre-trained models enjoy better transferability across diverse downstream scenarios. These insights shed light on new promising possibilities of data-model co-designs for vision model sparsification.  ( 2 min )
    A Dual Convolutional Neural Network Pipeline for Melanoma Diagnostics and Prognostics. (arXiv:2312.08766v1 [eess.IV])
    Melanoma is a type of cancer that begins in the cells controlling the pigment of the skin, and it is often referred to as the most dangerous skin cancer. Diagnosing melanoma can be time-consuming, and a recent increase in melanoma incidents indicates a growing demand for a more efficient diagnostic process. This paper presents a pipeline for melanoma diagnostics, leveraging two convolutional neural networks, a diagnosis, and a prognosis model. The diagnostic model is responsible for localizing malignant patches across whole slide images and delivering a patient-level diagnosis as malignant or benign. Further, the prognosis model utilizes the diagnostic model's output to provide a patient-level prognosis as good or bad. The full pipeline has an F1 score of 0.79 when tested on data from the same distribution as it was trained on.  ( 2 min )
    Mitigating Label Bias in Machine Learning: Fairness through Confident Learning. (arXiv:2312.08749v1 [cs.LG])
    Discrimination can occur when the underlying unbiased labels are overwritten by an agent with potential bias, resulting in biased datasets that unfairly harm specific groups and cause classifiers to inherit these biases. In this paper, we demonstrate that despite only having access to the biased labels, it is possible to eliminate bias by filtering the fairest instances within the framework of confident learning. In the context of confident learning, low self-confidence usually indicates potential label errors; however, this is not always the case. Instances, particularly those from underrepresented groups, might exhibit low confidence scores for reasons other than labeling errors. To address this limitation, our approach employs truncation of the confidence score and extends the confidence interval of the probabilistic threshold. Additionally, we incorporate with co-teaching paradigm for providing a more robust and reliable selection of fair instances and effectively mitigating the adverse effects of biased labels. Through extensive experimentation and evaluation of various datasets, we demonstrate the efficacy of our approach in promoting fairness and reducing the impact of label bias in machine learning models.  ( 2 min )
    A Comparative Analysis of Fine-Tuned LLMs and Few-Shot Learning of LLMs for Financial Sentiment Analysis. (arXiv:2312.08725v1 [cs.LG])
    Financial sentiment analysis plays a crucial role in uncovering latent patterns and detecting emerging trends, enabling individuals to make well-informed decisions that may yield substantial advantages within the constantly changing realm of finance. Recently, Large Language Models (LLMs) have demonstrated their effectiveness in diverse domains, showcasing remarkable capabilities even in zero-shot and few-shot in-context learning for various Natural Language Processing (NLP) tasks. Nevertheless, their potential and applicability in the context of financial sentiment analysis have not been thoroughly explored yet. To bridge this gap, we employ two approaches: in-context learning (with a focus on gpt-3.5-turbo model) and fine-tuning LLMs on a finance-domain dataset. Given the computational costs associated with fine-tuning LLMs with large parameter sizes, our focus lies on smaller LLMs, spanning from 250M to 3B parameters for fine-tuning. We then compare the performances with state-of-the-art results to evaluate their effectiveness in the finance-domain. Our results demonstrate that fine-tuned smaller LLMs can achieve comparable performance to state-of-the-art fine-tuned LLMs, even with models having fewer parameters and a smaller training dataset. Additionally, the zero-shot and one-shot performance of LLMs produces comparable results with fine-tuned smaller LLMs and state-of-the-art outcomes. Furthermore, our analysis demonstrates that there is no observed enhancement in performance for finance-domain sentiment analysis when the number of shots for in-context learning is increased.  ( 2 min )
    Simplicial Representation Learning with Neural $k$-forms. (arXiv:2312.08515v1 [cs.LG])
    Geometric deep learning extends deep learning to incorporate information about the geometry and topology data, especially in complex domains like graphs. Despite the popularity of message passing in this field, it has limitations such as the need for graph rewiring, ambiguity in interpreting data, and over-smoothing. In this paper, we take a different approach, focusing on leveraging geometric information from simplicial complexes embedded in $\mathbb{R}^n$ using node coordinates. We use differential k-forms in \mathbb{R}^n to create representations of simplices, offering interpretability and geometric consistency without message passing. This approach also enables us to apply differential geometry tools and achieve universal approximation. Our method is efficient, versatile, and applicable to various input complexes, including graphs, simplicial complexes, and cell complexes. It outperforms existing message passing neural networks in harnessing information from geometrical graphs with node features serving as coordinates.  ( 2 min )
    Privacy Amplification by Iteration for ADMM with (Strongly) Convex Objective Functions. (arXiv:2312.08685v1 [cs.LG])
    We examine a private ADMM variant for (strongly) convex objectives which is a primal-dual iterative method. Each iteration has a user with a private function used to update the primal variable, masked by Gaussian noise for local privacy, without directly adding noise to the dual variable. Privacy amplification by iteration explores if noises from later iterations can enhance the privacy guarantee when releasing final variables after the last iteration. Cyffers et al. [ICML 2023] explored privacy amplification by iteration for the proximal ADMM variant, where a user's entire private function is accessed and noise is added to the primal variable. In contrast, we examine a private ADMM variant requiring just one gradient access to a user's function, but both primal and dual variables must be passed between successive iterations. To apply Balle et al.'s [NeurIPS 2019] coupling framework to the gradient ADMM variant, we tackle technical challenges with novel ideas. First, we address the non-expansive mapping issue in ADMM iterations by using a customized norm. Second, because the dual variables are not masked with any noise directly, their privacy guarantees are achieved by treating two consecutive noisy ADMM iterations as a Markov operator. Our main result is that the privacy guarantee for the gradient ADMM variant can be amplified proportionally to the number of iterations. For strongly convex objective functions, this amplification exponentially increases with the number of iterations. These amplification results align with the previously studied special case of stochastic gradient descent.  ( 3 min )
    Uplifting the Expressive Power of Graph Neural Networks through Graph Partitioning. (arXiv:2312.08671v1 [cs.LG])
    Graph Neural Networks (GNNs) have paved its way for being a cornerstone in graph related learning tasks. From a theoretical perspective, the expressive power of GNNs is primarily characterised according to their ability to distinguish non-isomorphic graphs. It is a well-known fact that most of the conventional GNNs are upper-bounded by Weisfeiler-Lehman graph isomorphism test (1-WL). In this work, we study the expressive power of graph neural networks through the lens of graph partitioning. This follows from our observation that permutation invariant graph partitioning enables a powerful way of exploring structural interactions among vertex sets and subgraphs, and can help uplifting the expressive power of GNNs efficiently. Based on this, we first establish a theoretical connection between graph partitioning and graph isomorphism. Then we introduce a novel GNN architecture, namely Graph Partitioning Neural Networks (GPNNs). We theoretically analyse how a graph partitioning scheme and different kinds of structural interactions relate to the k-WL hierarchy. Empirically, we demonstrate its superior performance over existing GNN models in a variety of graph benchmark tasks.  ( 2 min )
    Graph Network Surrogate Model for Subsurface Flow Optimization. (arXiv:2312.08625v1 [physics.geo-ph])
    The optimization of well locations and controls is an important step in the design of subsurface flow operations such as oil production or geological CO2 storage. These optimization problems can be computationally expensive, however, as many potential candidate solutions must be evaluated. In this study, we propose a graph network surrogate model (GNSM) for optimizing well placement and controls. The GNSM transforms the flow model into a computational graph that involves an encoding-processing-decoding architecture. Separate networks are constructed to provide global predictions for the pressure and saturation state variables. Model performance is enhanced through the inclusion of the single-phase steady-state pressure solution as a feature. A multistage multistep strategy is used for training. The trained GNSM is applied to predict flow responses in a 2D unstructured model of a channelized reservoir. Results are presented for a large set of test cases, in which five injection wells and five production wells are placed randomly throughout the model, with a random control variable (bottom-hole pressure) assigned to each well. Median relative error in pressure and saturation for 300 such test cases is 1-2%. The ability of the trained GNSM to provide accurate predictions for a new (geologically similar) permeability realization is demonstrated. Finally, the trained GNSM is used to optimize well locations and controls with a differential evolution algorithm. GNSM-based optimization results are comparable to those from simulation-based optimization, with a runtime speedup of a factor of 36. Much larger speedups are expected if the method is used for robust optimization, in which each candidate solution is evaluated on multiple geological models.  ( 3 min )
    Beyond Accuracy: Automated De-Identification of Large Real-World Clinical Text Datasets. (arXiv:2312.08495v1 [cs.CL])
    Recent research advances achieve human-level accuracy for de-identifying free-text clinical notes on research datasets, but gaps remain in reproducing this in large real-world settings. This paper summarizes lessons learned from building a system used to de-identify over one billion real clinical notes, in a fully automated way, that was independently certified by multiple organizations for production use. A fully automated solution requires a very high level of accuracy that does not require manual review. A hybrid context-based model architecture is described, which outperforms a Named Entity Recogniton (NER) - only model by 10% on the i2b2-2014 benchmark. The proposed system makes 50%, 475%, and 575% fewer errors than the comparable AWS, Azure, and GCP services respectively while also outperforming ChatGPT by 33%. It exceeds 98% coverage of sensitive data across 7 European languages, without a need for fine tuning. A second set of described models enable data obfuscation -- replacing sensitive data with random surrogates -- while retaining name, date, gender, clinical, and format consistency. Both the practical need and the solution architecture that provides for reliable & linked anonymized documents are described.  ( 2 min )
    KDAS3: Knowledge distillation via Attention Supervision, and Symmetrical structure guiding for Polyp Segmentation. (arXiv:2312.08555v1 [eess.IV])
    Polyp segmentation, a contentious issue in medical imaging, has seen numerous proposed methods aimed at improving the quality of segmented masks. Currently, state-of-the-art techniques yield impressive results. However, the sheer size of these models poses challenges for practical industry applications. To address this, we present a Knowledge Distillation framework, incorporating attention supervision and the symmetrical guiding method. This framework is designed to facilitate knowledge transfer from a teacher model to a more compact student model with fewer parameters. Our experimental evaluation of the framework assesses its effectiveness in enabling the student model to acquire knowledge from the teacher efficiently. Additionally, our method serves to prevent the student model from incorporating redundant features that could lead to inaccurate predictions. Consequently, our method, boasting approximately 5 million parameters, achieves competitive results comparable to the state-of-the-art approaches. The implementation can be found at: https://github.com/huyquoctrinh/KDAS3  ( 2 min )
    Best practices for machine learning in antibody discovery and development. (arXiv:2312.08470v1 [q-bio.BM])
    Over the past 40 years, the discovery and development of therapeutic antibodies to treat disease has become common practice. However, as therapeutic antibody constructs are becoming more sophisticated (e.g., multi-specifics), conventional approaches to optimisation are increasingly inefficient. Machine learning (ML) promises to open up an in silico route to antibody discovery and help accelerate the development of drug products using a reduced number of experiments and hence cost. Over the past few years, we have observed rapid developments in the field of ML-guided antibody discovery and development (D&D). However, many of the results are difficult to compare or hard to assess for utility by other experts in the field due to the high diversity in the datasets and evaluation techniques and metrics that are across industry and academia. This limitation of the literature curtails the broad adoption of ML across the industry and slows down overall progress in the field, highlighting the need to develop standards and guidelines that may help improve the reproducibility of ML models across different research groups. To address these challenges, we set out in this perspective to critically review current practices, explain common pitfalls, and clearly define a set of method development and evaluation guidelines that can be applied to different types of ML-based techniques for therapeutic antibody D&D. Specifically, we address in an end-to-end analysis, challenges associated with all aspects of the ML process and recommend a set of best practices for each stage.  ( 2 min )
    Verification of Neural Reachable Tubes via Scenario Optimization and Conformal Prediction. (arXiv:2312.08604v1 [cs.RO])
    Learning-based approaches for controlling safety-critical systems are rapidly growing in popularity; thus, it is important to assure their performance and safety. Hamilton-Jacobi (HJ) reachability analysis is a popular formal verification tool for providing such guarantees, since it can handle general nonlinear system dynamics, bounded adversarial system disturbances, and state and input constraints. However, its computational and memory complexity scales exponentially with the state dimension, making it intractable for large-scale systems. To overcome this challenge, neural approaches, such as DeepReach, have been used to synthesize reachable tubes and safety controllers for high-dimensional systems. However, verifying these neural reachable tubes remains challenging. In this work, we propose two verification methods, based on robust scenario optimization and conformal prediction, to provide probabilistic safety guarantees for neural reachable tubes. Our methods allow a direct trade-off between resilience to outlier errors in the neural tube, which are inevitable in a learning-based approach, and the strength of the probabilistic safety guarantee. Furthermore, we show that split conformal prediction, a widely used method in the machine learning community for uncertainty quantification, reduces to a scenario-based approach, making the two methods equivalent not only for verification of neural reachable tubes but also more generally. To our knowledge, our proof is the first in the literature to show a strong relationship between conformal prediction and scenario optimization. Finally, we propose an outlier-adjusted verification approach that uses the error distribution in neural reachable tubes to recover greater safe volumes. We demonstrate the efficacy of the proposed approaches for the high-dimensional problems of multi-vehicle collision avoidance and rocket landing with no-go zones.  ( 3 min )
    Harmonics of Learning: Universal Fourier Features Emerge in Invariant Networks. (arXiv:2312.08550v1 [cs.LG])
    In this work, we formally prove that, under certain conditions, if a neural network is invariant to a finite group then its weights recover the Fourier transform on that group. This provides a mathematical explanation for the emergence of Fourier features -- a ubiquitous phenomenon in both biological and artificial learning systems. The results hold even for non-commutative groups, in which case the Fourier transform encodes all the irreducible unitary group representations. Our findings have consequences for the problem of symmetry discovery. Specifically, we demonstrate that the algebraic structure of an unknown group can be recovered from the weights of a network that is at least approximately invariant within certain bounds. Overall, this work contributes to a foundation for an algebraic learning theory of invariant neural network representations.  ( 2 min )
    World Models via Policy-Guided Trajectory Diffusion. (arXiv:2312.08533v1 [cs.LG])
    World models are a powerful tool for developing intelligent agents. By predicting the outcome of a sequence of actions, world models enable policies to be optimised via on-policy reinforcement learning (RL) using synthetic data, i.e. in ``in imagination''. Existing world models are autoregressive, and interleave predicting the next state with sampling the next action from the policy. Thus, the prediction error inevitably compounds as the trajectory length grows. In this work, we propose a novel world modelling approach that is not autoregressive and generates entire on-policy trajectories via a single pass through a diffusion model. Our approach, Policy-Guided Trajectory Diffusion (PolyGRAD), leverages a denoising model in addition to the gradient of the action distribution of the policy to diffuse a trajectory of initially random states and actions into an on-policy synthetic trajectory. We analyse the capabilities of our approach and demonstrate that it obtains competitive prediction errors to state-of-the-art autoregressive baselines. PolyGRAD also enables performant policies to be trained via on-policy RL in imagination. We believe that PolyGRAD introduces a promising paradigm for world modelling with many possible extensions to explore in future work.  ( 2 min )
    Identifying Planetary Names in Astronomy Papers: A Multi-Step Approach. (arXiv:2312.08579v1 [cs.CL])
    The automatic identification of planetary feature names in astronomy publications presents numerous challenges. These features include craters, defined as roughly circular depressions resulting from impact or volcanic activity; dorsas, which are elongate raised structures or wrinkle ridges; and lacus, small irregular patches of dark, smooth material on the Moon, referred to as "lake" (Planetary Names Working Group, n.d.). Many feature names overlap with places or people's names that they are named after, for example, Syria, Tempe, Einstein, and Sagan, to name a few (U.S. Geological Survey, n.d.). Some feature names have been used in many contexts, for instance, Apollo, which can refer to mission, program, sample, astronaut, seismic, seismometers, core, era, data, collection, instrument, and station, in addition to the crater on the Moon. Some feature names can appear in the text as adjectives, like the lunar craters Black, Green, and White. Some feature names in other contexts serve as directions, like craters West and South on the Moon. Additionally, some features share identical names across different celestial bodies, requiring disambiguation, such as the Adams crater, which exists on both the Moon and Mars. We present a multi-step pipeline combining rule-based filtering, statistical relevance analysis, part-of-speech (POS) tagging, named entity recognition (NER) model, hybrid keyword harvesting, knowledge graph (KG) matching, and inference with a locally installed large language model (LLM) to reliably identify planetary names despite these challenges. When evaluated on a dataset of astronomy papers from the Astrophysics Data System (ADS), this methodology achieves an F1-score over 0.97 in disambiguating planetary feature names.  ( 3 min )
    Omega-Regular Decision Processes. (arXiv:2312.08602v1 [cs.LO])
    Regular decision processes (RDPs) are a subclass of non-Markovian decision processes where the transition and reward functions are guarded by some regular property of the past (a lookback). While RDPs enable intuitive and succinct representation of non-Markovian decision processes, their expressive power coincides with finite-state Markov decision processes (MDPs). We introduce omega-regular decision processes (ODPs) where the non-Markovian aspect of the transition and reward functions are extended to an omega-regular lookahead over the system evolution. Semantically, these lookaheads can be considered as promises made by the decision maker or the learning agent about her future behavior. In particular, we assume that, if the promised lookaheads are not met, then the payoff to the decision maker is $\bot$ (least desirable payoff), overriding any rewards collected by the decision maker. We enable optimization and learning for ODPs under the discounted-reward objective by reducing them to lexicographic optimization and learning over finite MDPs. We present experimental results demonstrating the effectiveness of the proposed reduction.  ( 2 min )
    Occupancy Detection Based on Electricity Consumption. (arXiv:2312.08535v1 [cs.LG])
    This article presents a new methodology for extracting intervals when a home is vacant from low-frequency electricity consumption data. The approach combines multiple algorithms, including change point detection, classification, period detection, and periodic spikes retrieval. It shows encouraging results on both simulated and real consumption curves. This approach offers practical insights for optimizing energy use and holds potential benefits for residential consumers and utility companies in terms of energy cost reduction and sustainability. Further research is needed to enhance its applicability in diverse settings and with larger datasets.  ( 2 min )
    Explainable AI in Grassland Monitoring: Enhancing Model Performance and Domain Adaptability. (arXiv:2312.08408v1 [cs.LG])
    Grasslands are known for their high biodiversity and ability to provide multiple ecosystem services. Challenges in automating the identification of indicator plants are key obstacles to large-scale grassland monitoring. These challenges stem from the scarcity of extensive datasets, the distributional shifts between generic and grassland-specific datasets, and the inherent opacity of deep learning models. This paper delves into the latter two challenges, with a specific focus on transfer learning and eXplainable Artificial Intelligence (XAI) approaches to grassland monitoring, highlighting the novelty of XAI in this domain. We analyze various transfer learning methods to bridge the distributional gaps between generic and grassland-specific datasets. Additionally, we showcase how explainable AI techniques can unveil the model's domain adaptation capabilities, employing quantitative assessments to evaluate the model's proficiency in accurately centering relevant input features around the object of interest. This research contributes valuable insights for enhancing model performance through transfer learning and measuring domain adaptability with explainable AI, showing significant promise for broader applications within the agricultural community.  ( 2 min )
    Towards Inductive Robustness: Distilling and Fostering Wave-induced Resonance in Transductive GCNs Against Graph Adversarial Attacks. (arXiv:2312.08651v1 [cs.LG])
    Graph neural networks (GNNs) have recently been shown to be vulnerable to adversarial attacks, where slight perturbations in the graph structure can lead to erroneous predictions. However, current robust models for defending against such attacks inherit the transductive limitations of graph convolutional networks (GCNs). As a result, they are constrained by fixed structures and do not naturally generalize to unseen nodes. Here, we discover that transductive GCNs inherently possess a distillable robustness, achieved through a wave-induced resonance process. Based on this, we foster this resonance to facilitate inductive and robust learning. Specifically, we first prove that the signal formed by GCN-driven message passing (MP) is equivalent to the edge-based Laplacian wave, where, within a wave system, resonance can naturally emerge between the signal and its transmitting medium. This resonance provides inherent resistance to malicious perturbations inflicted on the signal system. We then prove that merely three MP iterations within GCNs can induce signal resonance between nodes and edges, manifesting as a coupling between nodes and their distillable surrounding local subgraph. Consequently, we present Graph Resonance-fostering Network (GRN) to foster this resonance via learning node representations from their distilled resonating subgraphs. By capturing the edge-transmitted signals within this subgraph and integrating them with the node signal, GRN embeds these combined signals into the central node's representation. This node-wise embedding approach allows for generalization to unseen nodes. We validate our theoretical findings with experiments, and demonstrate that GRN generalizes robustness to unseen nodes, whilst maintaining state-of-the-art classification accuracy on perturbed graphs.  ( 3 min )
    Fair Active Learning in Low-Data Regimes. (arXiv:2312.08559v1 [cs.LG])
    In critical machine learning applications, ensuring fairness is essential to avoid perpetuating social inequities. In this work, we address the challenges of reducing bias and improving accuracy in data-scarce environments, where the cost of collecting labeled data prohibits the use of large, labeled datasets. In such settings, active learning promises to maximize marginal accuracy gains of small amounts of labeled data. However, existing applications of active learning for fairness fail to deliver on this, typically requiring large labeled datasets, or failing to ensure the desired fairness tolerance is met on the population distribution. To address such limitations, we introduce an innovative active learning framework that combines an exploration procedure inspired by posterior sampling with a fair classification subroutine. We demonstrate that this framework performs effectively in very data-scarce regimes, maximizing accuracy while satisfying fairness constraints with high probability. We evaluate our proposed approach using well-established real-world benchmark datasets and compare it against state-of-the-art methods, demonstrating its effectiveness in producing fair models, and improvement over existing methods.  ( 2 min )
    Connectivity Oracles for Predictable Vertex Failures. (arXiv:2312.08489v1 [cs.DS])
    The problem of designing connectivity oracles supporting vertex failures is one of the basic data structures problems for undirected graphs. It is already well understood: previous works [Duan--Pettie STOC'10; Long--Saranurak FOCS'22] achieve query time linear in the number of failed vertices, and it is conditionally optimal as long as we require preprocessing time polynomial in the size of the graph and update time polynomial in the number of failed vertices. We revisit this problem in the paradigm of algorithms with predictions: we ask if the query time can be improved if the set of failed vertices can be predicted beforehand up to a small number of errors. More specifically, we design a data structure that, given a graph $G=(V,E)$ and a set of vertices predicted to fail $\widehat{D} \subseteq V$ of size $d=|\widehat{D}|$, preprocesses it in time $\tilde{O}(d|E|)$ and then can receive an update given as the symmetric difference between the predicted and the actual set of failed vertices $\widehat{D} \triangle D = (\widehat{D} \setminus D) \cup (D \setminus \widehat{D})$ of size $\eta = |\widehat{D} \triangle D|$, process it in time $\tilde{O}(\eta^4)$, and after that answer connectivity queries in $G \setminus D$ in time $O(\eta)$. Viewed from another perspective, our data structure provides an improvement over the state of the art for the \emph{fully dynamic subgraph connectivity problem} in the \emph{sensitivity setting} [Henzinger--Neumann ESA'16]. We argue that the preprocessing time and query time of our data structure are conditionally optimal under standard fine-grained complexity assumptions.  ( 2 min )
    Space-Time Approximation with Shallow Neural Networks in Fourier Lebesgue spaces. (arXiv:2312.08461v1 [cs.LG])
    Approximation capabilities of shallow neural networks (SNNs) form an integral part in understanding the properties of deep neural networks (DNNs). In the study of these approximation capabilities some very popular classes of target functions are the so-called spectral Barron spaces. This spaces are of special interest when it comes to the approximation of partial differential equation (PDE) solutions. It has been shown that the solution of certain static PDEs will lie in some spectral Barron space. In order to alleviate the limitation to static PDEs and include a time-domain that might have a different regularity than the space domain, we extend the notion of spectral Barron spaces to anisotropic weighted Fourier-Lebesgue spaces. In doing so, we consider target functions that have two blocks of variables, among which each block is allowed to have different decay and integrability properties. For these target functions we first study the inclusion of anisotropic weighted Fourier-Lebesgue spaces in the Bochner-Sobolev spaces. With that we can now also measure the approximation error in terms of an anisotropic Sobolev norm, namely the Bochner-Sobolev norm. We use this observation in a second step where we establish a bound on the approximation rate for functions from the anisotropic weighted Fourier-Lebesgue spaces and approximation via SNNs in the Bochner-Sobolev norm.  ( 2 min )
    Deep Learning with Physics Priors as Generalized Regularizers. (arXiv:2312.08678v1 [cs.LG])
    In various scientific and engineering applications, there is typically an approximate model of the underlying complex system, even though it contains both aleatoric and epistemic uncertainties. In this paper, we present a principled method to incorporate these approximate models as physics priors in modeling, to prevent overfitting and enhancing the generalization capabilities of the trained models. Utilizing the structural risk minimization (SRM) inductive principle pioneered by Vapnik, this approach structures the physics priors into generalized regularizers. The experimental results demonstrate that our method achieves up to two orders of magnitude of improvement in testing accuracy.  ( 2 min )
    EmbAu: A Novel Technique to Embed Audio Data Using Shuffled Frog Leaping Algorithm. (arXiv:2312.08417v1 [cs.CR])
    The aim of steganographic algorithms is to identify the appropriate pixel positions in the host or cover image, where bits of sensitive information can be concealed for data encryption. Work is being done to improve the capacity to integrate sensitive information and to maintain the visual appearance of the steganographic image. Consequently, steganography is a challenging research area. In our currently proposed image steganographic technique, we used the Shuffled Frog Leaping Algorithm (SFLA) to determine the order of pixels by which sensitive information can be placed in the cover image. To achieve greater embedding capacity, pixels from the spatial domain of the cover image are carefully chosen and used for placing the sensitive data. Bolstered via image steganography, the final image after embedding is resistant to steganalytic attacks. The SFLA algorithm serves in the optimal pixels selection of any colored (RGB) cover image for secret bit embedding. Using the fitness function, the SFLA benefits by reaching a minimum cost value in an acceptable amount of time. The pixels for embedding are meticulously chosen to minimize the host image's distortion upon embedding. Moreover, an effort has been taken to make the detection of embedded data in the steganographic image a formidable challenge. Due to the enormous need for audio data encryption in the current world, we feel that our suggested method has significant potential in real-world applications. In this paper, we propose and compare our strategy to existing steganographic methods.  ( 3 min )
    Contractive error feedback for gradient compression. (arXiv:2312.08538v1 [cs.LG])
    On-device memory concerns in distributed deep learning have become severe due to (i) the growth of model size in multi-GPU training, and (ii) the wide adoption of deep neural networks for federated learning on IoT devices which have limited storage. In such settings, communication efficient optimization methods are attractive alternatives, however they still struggle with memory issues. To tackle these challenges, we propose an communication efficient method called contractive error feedback (ConEF). As opposed to SGD with error-feedback (EFSGD) that inefficiently manages memory, ConEF obtains the sweet spot of convergence and memory usage, and achieves communication efficiency by leveraging biased and all-reducable gradient compression. We empirically validate ConEF on various learning tasks that include image classification, language modeling, and machine translation and observe that ConEF saves 80\% - 90\% of the extra memory in EFSGD with almost no loss on test performance, while also achieving 1.3x - 5x speedup of SGD. Through our work, we also demonstrate the feasibility and convergence of ConEF to clear up the theoretical barrier of integrating ConEF to popular memory efficient frameworks such as ZeRO-3.  ( 2 min )
    Scalable Ensemble-based Detection Method against Adversarial Attacks for speaker verification. (arXiv:2312.08622v1 [eess.AS])
    Automatic speaker verification (ASV) is highly susceptible to adversarial attacks. Purification modules are usually adopted as a pre-processing to mitigate adversarial noise. However, they are commonly implemented across diverse experimental settings, rendering direct comparisons challenging. This paper comprehensively compares mainstream purification techniques in a unified framework. We find these methods often face a trade-off between user experience and security, as they struggle to simultaneously maintain genuine sample performance and reduce adversarial perturbations. To address this challenge, some efforts have extended purification modules to encompass detection capabilities, aiming to alleviate the trade-off. However, advanced purification modules will always come into the stage to surpass previous detection method. As a result, we further propose an easy-to-follow ensemble approach that integrates advanced purification modules for detection, achieving state-of-the-art (SOTA) performance in countering adversarial noise. Our ensemble method has great potential due to its compatibility with future advanced purification techniques.  ( 2 min )
    Deep learning-based estimation of time-dependent parameters in Markov models with application to nonlinear regression and SDEs. (arXiv:2312.08493v1 [stat.ML])
    We present a novel deep learning method for estimating time-dependent parameters in Markov processes through discrete sampling. Departing from conventional machine learning, our approach reframes parameter approximation as an optimization problem using the maximum likelihood approach. Experimental validation focuses on parameter estimation in multivariate regression and stochastic differential equations (SDEs). Theoretical results show that the real solution is close to SDE with parameters approximated using our neural network-derived under specific conditions. Our work contributes to SDE-based model parameter estimation, offering a versatile tool for diverse fields.  ( 2 min )
    Automatic Bug Detection in Games using LSTM Networks. (arXiv:2312.08418v1 [cs.LG])
    We introduced a new framework to detect perceptual bugs using a Long Short-Term Memory (LSTM) network, which detects bugs in video games as anomalies. The detected buggy frames are then clustered to determine the category of the occurred bug. The framework was evaluated on two First Person Shooter (FPS) games. Results show the effectiveness of the framework.  ( 2 min )
    Cooperative Learning for Cost-Adaptive Inference. (arXiv:2312.08532v1 [cs.LG])
    We propose a cooperative training framework for deep neural network architectures that enables the runtime network depths to change to satisfy dynamic computing resource requirements. In our framework, the number of layers participating in computation can be chosen dynamically to meet performance-cost trade-offs at inference runtime. Our method trains two Teammate nets and a Leader net, and two sets of Teammate sub-networks with various depths through knowledge distillation. The Teammate nets derive sub-networks and transfer knowledge to them, and to each other, while the Leader net guides Teammate nets to ensure accuracy. The approach trains the framework atomically at once instead of individually training various sizes of models; in a sense, the various-sized networks are all trained at once, in a "package deal." The proposed framework is not tied to any specific architecture but can incorporate any existing models/architectures, therefore it can maintain stable results and is insensitive to the size of a dataset's feature map. Compared with other related approaches, it provides comparable accuracy to its full network while various sizes of models are available.  ( 2 min )
    Markov Decision Processes with Noisy State Observation. (arXiv:2312.08536v1 [cs.LG])
    This paper addresses the challenge of a particular class of noisy state observations in Markov Decision Processes (MDPs), a common issue in various real-world applications. We focus on modeling this uncertainty through a confusion matrix that captures the probabilities of misidentifying the true state. Our primary goal is to estimate the inherent measurement noise, and to this end, we propose two novel algorithmic approaches. The first, the method of second-order repetitive actions, is designed for efficient noise estimation within a finite time window, providing identifiable conditions for system analysis. The second approach comprises a family of Bayesian algorithms, which we thoroughly analyze and compare in terms of performance and limitations. We substantiate our theoretical findings with simulations, demonstrating the effectiveness of our methods in different scenarios, particularly highlighting their behavior in environments with varying stationary distributions. Our work advances the understanding of reinforcement learning in noisy environments, offering robust techniques for more accurate state estimation in MDPs.  ( 2 min )
    The Relative Value of Prediction in Algorithmic Decision Making. (arXiv:2312.08511v1 [cs.CY])
    Algorithmic predictions are increasingly used to inform the allocations of goods and interventions in the public sphere. In these domains, predictions serve as a means to an end. They provide stakeholders with insights into likelihood of future events as a means to improve decision making quality, and enhance social welfare. However, if maximizing welfare is the ultimate goal, prediction is only a small piece of the puzzle. There are various other policy levers a social planner might pursue in order to improve bottom-line outcomes, such as expanding access to available goods, or increasing the effect sizes of interventions. Given this broad range of design decisions, a basic question to ask is: What is the relative value of prediction in algorithmic decision making? How do the improvements in welfare arising from better predictions compare to those of other policy levers? The goal of our work is to initiate the formal study of these questions. Our main results are theoretical in nature. We identify simple, sharp conditions determining the relative value of prediction vis-\`a-vis expanding access, within several statistical models that are popular amongst quantitative social scientists. Furthermore, we illustrate how these theoretical insights may be used to guide the design of algorithmic decision making systems in practice.  ( 2 min )
    Revisiting the Last-Iterate Convergence of Stochastic Gradient Methods. (arXiv:2312.08531v1 [cs.LG])
    In the past several years, the convergence of the last iterate of the Stochastic Gradient Descent (SGD) algorithm has triggered people's interest due to its good performance in practice but lack of theoretical understanding. For Lipschitz and convex functions, different works have established the optimal $O(\log(1/\delta)\log T/\sqrt{T})$ or $O(\sqrt{\log(1/\delta)/T})$ high-probability convergence rates for the final iterate, where $T$ is the time horizon and $\delta$ is the failure probability. However, to prove these bounds, all the existing works are limited to compact domains or require almost surely bounded noises. It is natural to ask whether the last iterate of SGD can still guarantee the optimal convergence rate but without these two restrictive assumptions. Besides this important question, there are still lots of theoretical problems lacking an answer. For example, compared with the last iterate convergence of SGD for non-smooth problems, only few results for smooth optimization have yet been developed. Additionally, the existing results are all limited to a non-composite objective and the standard Euclidean norm. It still remains unclear whether the last-iterate convergence can be provably extended to wider composite optimization and non-Euclidean norms. In this work, to address the issues mentioned above, we revisit the last-iterate convergence of stochastic gradient methods and provide the first unified way to prove the convergence rates both in expectation and in high probability to accommodate general domains, composite objectives, non-Euclidean norms, Lipschitz conditions, smoothness and (strong) convexity simultaneously. Additionally, we extend our analysis to obtain the last-iterate convergence under heavy-tailed noises.  ( 3 min )
    Universal Approximation Property of Random Neural Networks. (arXiv:2312.08410v1 [cs.LG])
    In this paper, we study random neural networks which are single-hidden-layer feedforward neural networks whose weights and biases are randomly initialized. After this random initialization, only the linear readout needs to be trained, which can be performed efficiently, e.g., by the least squares method. By viewing random neural networks as Banach space-valued random variables, we prove their universal approximation properties within suitable Bochner spaces. Hereby, the corresponding Banach space can be more general than the space of continuous functions over a compact subset of a Euclidean space, namely, e.g., an $L^p$-space or a Sobolev space, where the latter includes the approximation of the derivatives. Moreover, we derive some approximation rates and develop an explicit algorithm to learn a deterministic function by a random neural network. In addition, we provide a full error analysis and study when random neural networks overcome the curse of dimensionality in the sense that the training costs scale at most polynomially in the input and output dimension. Furthermore, we show in two numerical examples the empirical advantages of random neural networks compared to fully trained deterministic neural networks.  ( 2 min )
    ConFormer: A Novel Collection of Deep Learning Models to Assist Cardiologists in the Assessment of Cardiac Function. (arXiv:2312.08567v1 [eess.IV])
    Cardiovascular diseases, particularly heart failure, are a leading cause of death globally. The early detection of heart failure through routine echocardiogram screenings is often impeded by the high cost and labor-intensive nature of these procedures, a barrier that can mean the difference between life and death. This paper presents ConFormer, a novel deep learning model designed to automate the estimation of Ejection Fraction (EF) and Left Ventricular Wall Thickness from echocardiograms. The implementation of ConFormer has the potential to enhance preventative cardiology by enabling cost-effective, accessible, and comprehensive heart health monitoring, thereby saving countless lives. The source code is available at https://github.com/Aether111/ConFormer.  ( 2 min )
    Principled Weight Initialization for Hypernetworks. (arXiv:2312.08399v1 [cs.LG])
    Hypernetworks are meta neural networks that generate weights for a main neural network in an end-to-end differentiable manner. Despite extensive applications ranging from multi-task learning to Bayesian deep learning, the problem of optimizing hypernetworks has not been studied to date. We observe that classical weight initialization methods like Glorot & Bengio (2010) and He et al. (2015), when applied directly on a hypernet, fail to produce weights for the mainnet in the correct scale. We develop principled techniques for weight initialization in hypernets, and show that they lead to more stable mainnet weights, lower training loss, and faster convergence.  ( 2 min )
    PerMod: Perceptually Grounded Voice Modification with Latent Diffusion Models. (arXiv:2312.08494v1 [cs.SD])
    Perceptual modification of voice is an elusive goal. While non-experts can modify an image or sentence perceptually with available tools, it is not clear how to similarly modify speech along perceptual axes. Voice conversion does make it possible to convert one voice to another, but these modifications are handled by black box models, and the specifics of what perceptual qualities to modify and how to modify them are unclear. Towards allowing greater perceptual control over voice, we introduce PerMod, a conditional latent diffusion model that takes in an input voice and a perceptual qualities vector, and produces a voice with the matching perceptual qualities. Unlike prior work, PerMod generates a new voice corresponding to specific perceptual modifications. Evaluating perceptual quality vectors with RMSE from both human and predicted labels, we demonstrate that PerMod produces voices with the desired perceptual qualities for typical voices, but performs poorly on atypical voices.  ( 2 min )
    MotherNet: A Foundational Hypernetwork for Tabular Classification. (arXiv:2312.08598v1 [cs.LG])
    The advent of Foundation Models is transforming machine learning across many modalities (e.g., language, images, videos) with prompt engineering replacing training in many settings. Recent work on tabular data (e.g., TabPFN) hints at a similar opportunity to build Foundation Models for classification for numerical data. In this paper, we go one step further and propose a hypernetwork architecture that we call MotherNet, trained on millions of classification tasks, that, once prompted with a never-seen-before training set generates the weights of a trained ``child'' neural-network. Like other Foundation Models, MotherNet replaces training on specific datasets with in-context learning through a single forward pass. In contrast to existing hypernetworks that were either task-specific or trained for relatively constraint multi-task settings, MotherNet is trained to generate networks to perform multiclass classification on arbitrary tabular datasets without any dataset specific gradient descent. The child network generated by MotherNet using in-context learning outperforms neural networks trained using gradient descent on small datasets, and is competitive with predictions by TabPFN and standard ML methods like Gradient Boosting. Unlike a direct application of transformer models like TabPFN, MotherNet generated networks are highly efficient at inference time. This methodology opens up a new approach to building predictive models on tabular data that is both efficient and robust, without any dataset-specific training.  ( 2 min )
    LDM$^2$: A Large Decision Model Imitating Human Cognition with Dynamic Memory Enhancement. (arXiv:2312.08402v1 [cs.LG])
    With the rapid development of large language models (LLMs), it is highly demanded that LLMs can be adopted to make decisions to enable the artificial general intelligence. Most approaches leverage manually crafted examples to prompt the LLMs to imitate the decision process of human. However, designing optimal prompts is difficult and the patterned prompts can hardly be generalized to more complex environments. In this paper, we propose a novel model named Large Decision Model with Memory (LDM$^2$), which leverages a dynamic memory mechanism to construct dynamic prompts, guiding the LLMs in making proper decisions according to the faced state. LDM$^2$ consists of two stages: memory formation and memory refinement. In the former stage, human behaviors are decomposed into state-action tuples utilizing the powerful summarizing ability of LLMs. Then, these tuples are stored in the memory, whose indices are generated by the LLMs, to facilitate the retrieval of the most relevant subset of memorized tuples based on the current state. In the latter stage, our LDM$^2$ employs tree exploration to discover more suitable decision processes and enrich the memory by adding valuable state-action tuples. The dynamic circle of exploration and memory enhancement provides LDM$^2$ a better understanding of the global environment. Extensive experiments conducted in two interactive environments have shown that our LDM$^2$ outperforms the baselines in terms of both score and success rate, which demonstrates its effectiveness.  ( 3 min )
    Taking it further: leveraging pseudo labels for field delineation across label-scarce smallholder regions. (arXiv:2312.08384v1 [cs.CV])
    Transfer learning allows for resource-efficient geographic transfer of pre-trained field delineation models. However, the scarcity of labeled data for complex and dynamic smallholder landscapes, particularly in Sub-Saharan Africa, remains a major bottleneck for large-area field delineation. This study explores opportunities of using sparse field delineation pseudo labels for fine-tuning models across geographies and sensor characteristics. We build on a FracTAL ResUNet trained for crop field delineation in India (median field size of 0.24 ha) and use this pre-trained model to generate pseudo labels in Mozambique (median field size of 0.06 ha). We designed multiple pseudo label selection strategies and compared the quantities, area properties, seasonal distribution, and spatial agreement of the pseudo labels against human-annotated training labels (n = 1,512). We then used the human-annotated labels and the pseudo labels for model fine-tuning and compared predictions against human field annotations (n = 2,199). Our results indicate i) a good baseline performance of the pre-trained model in both field delineation and field size estimation, and ii) the added value of regional fine-tuning with performance improvements in nearly all experiments. Moreover, we found iii) substantial performance increases when using only pseudo labels (up to 77% of the IoU increases and 68% of the RMSE decreases obtained by human labels), and iv) additional performance increases when complementing human annotations with pseudo labels. Pseudo labels can be efficiently generated at scale and thus facilitate domain adaptation in label-scarce settings. The workflow presented here is a stepping stone for overcoming the persisting data gaps in heterogeneous smallholder agriculture of Sub-Saharan Africa, where labels are commonly scarce.  ( 3 min )
    auto-sktime: Automated Time Series Forecasting. (arXiv:2312.08528v1 [cs.LG])
    In today's data-driven landscape, time series forecasting is pivotal in decision-making across various sectors. Yet, the proliferation of more diverse time series data, coupled with the expanding landscape of available forecasting methods, poses significant challenges for forecasters. To meet the growing demand for efficient forecasting, we introduce auto-sktime, a novel framework for automated time series forecasting. The proposed framework uses the power of automated machine learning (AutoML) techniques to automate the creation of the entire forecasting pipeline. The framework employs Bayesian optimization, to automatically construct pipelines from statistical, machine learning (ML) and deep neural network (DNN) models. Furthermore, we propose three essential improvements to adapt AutoML to time series data: First, pipeline templates to account for the different supported forecasting models. Second, a novel warm-starting technique to start the optimization from prior optimization runs. Third, we adapt multi-fidelity optimizations to make them applicable to a search space containing statistical, ML and DNN models. Experimental results on 64 diverse real-world time series datasets demonstrate the effectiveness and efficiency of the framework, outperforming traditional methods while requiring minimal human involvement.  ( 2 min )
    Personalized Decision Supports based on Theory of Mind Modeling and Explainable Reinforcement Learning. (arXiv:2312.08397v1 [cs.LG])
    In this paper, we propose a novel personalized decision support system that combines Theory of Mind (ToM) modeling and explainable Reinforcement Learning (XRL) to provide effective and interpretable interventions. Our method leverages DRL to provide expert action recommendations while incorporating ToM modeling to understand users' mental states and predict their future actions, enabling appropriate timing for intervention. To explain interventions, we use counterfactual explanations based on RL's feature importance and users' ToM model structure. Our proposed system generates accurate and personalized interventions that are easily interpretable by end-users. We demonstrate the effectiveness of our approach through a series of crowd-sourcing experiments in a simulated team decision-making task, where our system outperforms control baselines in terms of task performance. Our proposed approach is agnostic to task environment and RL model structure, therefore has the potential to be generalized to a wide range of applications.  ( 2 min )
    Exploring Graph Based Approaches for Author Name Disambiguation. (arXiv:2312.08388v1 [cs.SI])
    In many applications, such as scientific literature management, researcher search, social network analysis and etc, Name Disambiguation (aiming at disambiguating WhoIsWho) has been a challenging problem. In addition, the growth of scientific literature makes the problem more difficult and urgent. Although name disambiguation has been extensively studied in academia and industry, the problem has not been solved well due to the clutter of data and the complexity of the same name scenario. In this work, we aim to explore models that can perform the task of name disambiguation using the network structure that is intrinsic to the problem and present an analysis of the models.  ( 2 min )
    AutoNumerics-Zero: Automated Discovery of State-of-the-Art Mathematical Functions. (arXiv:2312.08472v1 [cs.NE])
    Computers calculate transcendental functions by approximating them through the composition of a few limited-precision instructions. For example, an exponential can be calculated with a Taylor series. These approximation methods were developed over the centuries by mathematicians, who emphasized the attainability of arbitrary precision. Computers, however, operate on few limited precision types, such as the popular float32. In this study, we show that when aiming for limited precision, existing approximation methods can be outperformed by programs automatically discovered from scratch by a simple evolutionary algorithm. In particular, over real numbers, our method can approximate the exponential function reaching orders of magnitude more precision for a given number of operations when compared to previous approaches. More practically, over float32 numbers and constrained to less than 1 ULP of error, the same method attains a speedup over baselines by generating code that triggers better XLA/LLVM compilation paths. In other words, in both cases, evolution searched a vast space of possible programs, without knowledge of mathematics, to discover previously unknown optimized approximations to high precision, for the first time. We also give evidence that these results extend beyond the exponential. The ubiquity of transcendental functions suggests that our method has the potential to reduce the cost of scientific computing applications.  ( 2 min )
    Privacy Constrained Fairness Estimation for Decision Trees. (arXiv:2312.08413v1 [cs.LG])
    The protection of sensitive data becomes more vital, as data increases in value and potency. Furthermore, the pressure increases from regulators and society on model developers to make their Artificial Intelligence (AI) models non-discriminatory. To boot, there is a need for interpretable, transparent AI models for high-stakes tasks. In general, measuring the fairness of any AI model requires the sensitive attributes of the individuals in the dataset, thus raising privacy concerns. In this work, the trade-offs between fairness, privacy and interpretability are further explored. We specifically examine the Statistical Parity (SP) of Decision Trees (DTs) with Differential Privacy (DP), that are each popular methods in their respective subfield. We propose a novel method, dubbed Privacy-Aware Fairness Estimation of Rules (PAFER), that can estimate SP in a DP-aware manner for DTs. DP, making use of a third-party legal entity that securely holds this sensitive data, guarantees privacy by adding noise to the sensitive data. We experimentally compare several DP mechanisms. We show that using the Laplacian mechanism, the method is able to estimate SP with low error while guaranteeing the privacy of the individuals in the dataset with high certainty. We further show experimentally and theoretically that the method performs better for DTs that humans generally find easier to interpret.  ( 2 min )
    Accelerating Meta-Learning by Sharing Gradients. (arXiv:2312.08398v1 [cs.LG])
    The success of gradient-based meta-learning is primarily attributed to its ability to leverage related tasks to learn task-invariant information. However, the absence of interactions between different tasks in the inner loop leads to task-specific over-fitting in the initial phase of meta-training. While this is eventually corrected by the presence of these interactions in the outer loop, it comes at a significant cost of slower meta-learning. To address this limitation, we explicitly encode task relatedness via an inner loop regularization mechanism inspired by multi-task learning. Our algorithm shares gradient information from previously encountered tasks as well as concurrent tasks in the same task batch, and scales their contribution with meta-learned parameters. We show using two popular few-shot classification datasets that gradient sharing enables meta-learning under bigger inner loop learning rates and can accelerate the meta-training process by up to 134%.  ( 2 min )
    Balanced and Deterministic Weight-sharing Helps Network Performance. (arXiv:2312.08401v1 [cs.LG])
    Weight-sharing plays a significant role in the success of many deep neural networks, by increasing memory efficiency and incorporating useful inductive priors about the problem into the network. But understanding how weight-sharing can be used effectively in general is a topic that has not been studied extensively. Chen et al. [2015] proposed HashedNets, which augments a multi-layer perceptron with a hash table, as a method for neural network compression. We generalize this method into a framework (ArbNets) that allows for efficient arbitrary weight-sharing, and use it to study the role of weight-sharing in neural networks. We show that common neural networks can be expressed as ArbNets with different hash functions. We also present two novel hash functions, the Dirichlet hash and the Neighborhood hash, and use them to demonstrate experimentally that balanced and deterministic weight-sharing helps with the performance of a neural network.  ( 2 min )
    An Explainable Machine Learning Framework for the Accurate Diagnosis of Ovarian Cancer. (arXiv:2312.08381v1 [cs.LG])
    Ovarian cancer (OC) is one of the most prevalent types of cancer in women. Early and accurate diagnosis is crucial for the survival of the patients. However, the majority of women are diagnosed in advanced stages due to the lack of effective biomarkers and accurate screening tools. While previous studies sought a common biomarker, our study suggests different biomarkers for the premenopausal and postmenopausal populations. This can provide a new perspective in the search for novel predictors for the effective diagnosis of OC. Lack of explainability is one major limitation of current AI systems. The stochastic nature of the ML algorithms raises concerns about the reliability of the system as it is difficult to interpret the reasons behind the decisions. To increase the trustworthiness and accountability of the diagnostic system as well as to provide transparency and explanations behind the predictions, explainable AI has been incorporated into the ML framework. SHAP is employed to quantify the contributions of the selected biomarkers and determine the most discriminative features. A hybrid decision support system has been established that can eliminate the bottlenecks caused by the black-box nature of the ML algorithms providing a safe and trustworthy AI tool. The diagnostic accuracy obtained from the proposed system outperforms the existing methods as well as the state-of-the-art ROMA algorithm by a substantial margin which signifies its potential to be an effective tool in the differential diagnosis of OC.  ( 3 min )
    Improving age prediction: Utilizing LSTM-based dynamic forecasting for data augmentation in multivariate time series analysis. (arXiv:2312.08383v1 [cs.LG])
    The high dimensionality and complexity of neuroimaging data necessitate large datasets to develop robust and high-performing deep learning models. However, the neuroimaging field is notably hampered by the scarcity of such datasets. In this work, we proposed a data augmentation and validation framework that utilizes dynamic forecasting with Long Short-Term Memory (LSTM) networks to enrich datasets. We extended multivariate time series data by predicting the time courses of independent component networks (ICNs) in both one-step and recursive configurations. The effectiveness of these augmented datasets was then compared with the original data using various deep learning models designed for chronological age prediction tasks. The results suggest that our approach improves model performance, providing a robust solution to overcome the challenges presented by the limited size of neuroimaging datasets.  ( 2 min )
    ALGNet: Attention Light Graph Memory Network for Medical Recommendation System. (arXiv:2312.08377v1 [cs.AI])
    Medication recommendation is a vital task for improving patient care and reducing adverse events. However, existing methods often fail to capture the complex and dynamic relationships among patient medical records, drug efficacy and safety, and drug-drug interactions (DDI). In this paper, we propose ALGNet, a novel model that leverages light graph convolutional networks (LGCN) and augmentation memory networks (AMN) to enhance medication recommendation. LGCN can efficiently encode the patient records and the DDI graph into low-dimensional embeddings, while AMN can augment the patient representation with external knowledge from a memory module. We evaluate our model on the MIMIC-III dataset and show that it outperforms several baselines in terms of recommendation accuracy and DDI avoidance. We also conduct an ablation study to analyze the effects of different components of our model. Our results demonstrate that ALGNet can achieve superior performance with less computation and more interpretability. The implementation of this paper can be found at: https://github.com/huyquoctrinh/ALGNet.  ( 2 min )
  • Open

    What does self-attention learn from Masked Language Modelling?. (arXiv:2304.07235v2 [cond-mat.dis-nn] UPDATED)
    Transformers are neural networks which revolutionised natural language processing and machine learning. They process sequences of inputs, like words, using a mechanism called self-attention, which is trained via masked language modelling (MLM). In MLM, a word is randomly masked in an input sequence, and the network is trained to predict the missing word. Despite the practical success of transformers, it remains unclear what type of data distribution self-attention can learn efficiently. Here, we show analytically that if one decouples the treatment of word positions and embeddings, a single layer of self-attention learns the conditionals of a generalised Potts model with interactions between sites and Potts colours. Moreover, we show that training this neural network is exactly equivalent to solving the inverse Potts problem by the so-called pseudo-likelihood method, well known in statistical physics. Using this mapping, we compute the generalisation error of self-attention in a model scenario analytically using the replica method.  ( 2 min )
    A Large Deviations Perspective on Policy Gradient Algorithms. (arXiv:2311.07411v2 [math.OC] UPDATED)
    Motivated by policy gradient methods in the context of reinforcement learning, we derive the first large deviation rate function for the iterates generated by stochastic gradient descent for possibly non-convex objectives satisfying a Polyak-Lojasiewicz condition. Leveraging the contraction principle from large deviations theory, we illustrate the potential of this result by showing how convergence properties of policy gradient with a softmax parametrization and an entropy regularized objective can be naturally extended to a wide spectrum of other policy parametrizations.  ( 2 min )
    Lagrangian Flow Networks for Conservation Laws. (arXiv:2305.16846v2 [cs.LG] UPDATED)
    We introduce Lagrangian Flow Networks (LFlows) for modeling fluid densities and velocities continuously in space and time. By construction, the proposed LFlows satisfy the continuity equation, a PDE describing mass conservation in its differentiable form. Our model is based on the insight that solutions to the continuity equation can be expressed as time-dependent density transformations via differentiable and invertible maps. This follows from classical theory of the existence and uniqueness of Lagrangian flows for smooth vector fields. Hence, we model fluid densities by transforming a base density with parameterized diffeomorphisms conditioned on time. The key benefit compared to methods relying on numerical ODE solvers or PINNs is that the analytic expression of the velocity is always consistent with changes in density. Furthermore, we require neither expensive numerical solvers, nor additional penalties to enforce the PDE. LFlows show higher predictive accuracy in density modeling tasks compared to competing models in 2D and 3D, while being computationally efficient. As a real-world application, we model bird migration based on sparse weather radar measurements.  ( 2 min )
    Fair Active Learning in Low-Data Regimes. (arXiv:2312.08559v1 [cs.LG])
    In critical machine learning applications, ensuring fairness is essential to avoid perpetuating social inequities. In this work, we address the challenges of reducing bias and improving accuracy in data-scarce environments, where the cost of collecting labeled data prohibits the use of large, labeled datasets. In such settings, active learning promises to maximize marginal accuracy gains of small amounts of labeled data. However, existing applications of active learning for fairness fail to deliver on this, typically requiring large labeled datasets, or failing to ensure the desired fairness tolerance is met on the population distribution. To address such limitations, we introduce an innovative active learning framework that combines an exploration procedure inspired by posterior sampling with a fair classification subroutine. We demonstrate that this framework performs effectively in very data-scarce regimes, maximizing accuracy while satisfying fairness constraints with high probability. We evaluate our proposed approach using well-established real-world benchmark datasets and compare it against state-of-the-art methods, demonstrating its effectiveness in producing fair models, and improvement over existing methods.  ( 2 min )
    Doubly Robust Estimator for Off-Policy Evaluation with Large Action Spaces. (arXiv:2308.03443v3 [stat.ML] UPDATED)
    We study Off-Policy Evaluation (OPE) in contextual bandit settings with large action spaces. The benchmark estimators suffer from severe bias and variance tradeoffs. Parametric approaches suffer from bias due to difficulty specifying the correct model, whereas ones with importance weight suffer from variance. To overcome these limitations, Marginalized Inverse Propensity Scoring (MIPS) was proposed to mitigate the estimator's variance via embeddings of an action. Nevertheless, MIPS is unbiased under the no direct effect, which assumes that the action embedding completely mediates the effect of an action on a reward. To overcome the dependency on these unrealistic assumptions, we propose a Marginalized Doubly Robust (MDR) estimator. Theoretical analysis shows that the proposed estimator is unbiased under weaker assumptions than MIPS while reducing the variance against MIPS. The empirical experiment verifies the supremacy of MDR against existing estimators with large action spaces.  ( 2 min )
    Subspace Identification for Multi-Source Domain Adaptation. (arXiv:2310.04723v2 [cs.LG] UPDATED)
    Multi-source domain adaptation (MSDA) methods aim to transfer knowledge from multiple labeled source domains to an unlabeled target domain. Although current methods achieve target joint distribution identifiability by enforcing minimal changes across domains, they often necessitate stringent conditions, such as an adequate number of domains, monotonic transformation of latent variables, and invariant label distributions. These requirements are challenging to satisfy in real-world applications. To mitigate the need for these strict assumptions, we propose a subspace identification theory that guarantees the disentanglement of domain-invariant and domain-specific variables under less restrictive constraints regarding domain numbers and transformation properties, thereby facilitating domain adaptation by minimizing the impact of domain shifts on invariant variables. Based on this theory, we develop a Subspace Identification Guarantee (SIG) model that leverages variational inference. Furthermore, the SIG model incorporates class-aware conditional alignment to accommodate target shifts where label distributions change with the domains. Experimental results demonstrate that our SIG model outperforms existing MSDA techniques on various benchmark datasets, highlighting its effectiveness in real-world applications.  ( 2 min )
    Distributed Stochastic Optimization under a General Variance Condition. (arXiv:2301.12677v3 [math.OC] UPDATED)
    Distributed stochastic optimization has drawn great attention recently due to its effectiveness in solving large-scale machine learning problems. Though numerous algorithms have been proposed and successfully applied to general practical problems, their theoretical guarantees mainly rely on certain boundedness conditions on the stochastic gradients, varying from uniform boundedness to the relaxed growth condition. In addition, how to characterize the data heterogeneity among the agents and its impacts on the algorithmic performance remains challenging. In light of such motivations, we revisit the classical Federated Averaging (FedAvg) algorithm (McMahan et al., 2017) as well as the more recent SCAFFOLD method (Karimireddy et al., 2020) for solving the distributed stochastic optimization problem and establish the convergence results under only a mild variance condition on the stochastic gradients for smooth nonconvex objective functions. Almost sure convergence to a stationary point is also established under the condition. Moreover, we discuss a more informative measurement for data heterogeneity as well as its implications.  ( 2 min )
    Physics-informed neural networks for pathloss prediction. (arXiv:2211.12986v2 [stat.ML] UPDATED)
    This paper introduces a physics-informed machine learning approach for pathloss prediction. This is achieved by including in the training phase simultaneously (i) physical dependencies between spatial loss field and (ii) measured pathloss values in the field. It is shown that the solution to a proposed learning problem improves generalization and prediction quality with a small number of neural network layers and parameters. The latter leads to fast inference times which are favorable for downstream tasks such as localization. Moreover, the physics-informed formulation allows training and prediction with a small amount of training data which makes it appealing for a wide range of practical pathloss prediction scenarios.  ( 2 min )
    Managing Temporal Resolution in Continuous Value Estimation: A Fundamental Trade-off. (arXiv:2212.08949v2 [cs.LG] UPDATED)
    A default assumption in reinforcement learning (RL) and optimal control is that observations arrive at discrete time points on a fixed clock cycle. Yet, many applications involve continuous-time systems where the time discretization, in principle, can be managed. The impact of time discretization on RL methods has not been fully characterized in existing theory, but a more detailed analysis of its effect could reveal opportunities for improving data-efficiency. We address this gap by analyzing Monte-Carlo policy evaluation for LQR systems and uncover a fundamental trade-off between approximation and statistical error in value estimation. Importantly, these two errors behave differently to time discretization, leading to an optimal choice of temporal resolution for a given data budget. These findings show that managing the temporal resolution can provably improve policy evaluation efficiency in LQR systems with finite data. Empirically, we demonstrate the trade-off in numerical simulations of LQR instances and standard RL benchmarks for non-linear continuous control.  ( 2 min )
    Featuring Koopman Mode Decomposition. (arXiv:2312.09146v1 [math.DS])
    This article introduces an advanced Koopman mode decomposition (KMD) technique -- coined Featurized Koopman Mode Decomposition (FKMD) -- that uses time embedding and Mahalanobis scaling to enhance analysis and prediction of high dimensional dynamical systems. The time embedding expands the observation space to better capture underlying manifold structure, while the Mahalanobis scaling, applied to kernel or random Fourier features, adjusts observations based on the system's dynamics. This aids in featurizing KMD in cases where good features are not a priori known. We show that our method improves KMD predictions for a high dimensional Lorenz attractor and for a cell signaling problem from cancer research.  ( 2 min )
    Diffusion Model in Causal Inference with Unmeasured Confounders. (arXiv:2308.03669v4 [cs.LG] UPDATED)
    We study how to extend the use of the diffusion model to answer the causal question from the observational data under the existence of unmeasured confounders. In Pearl's framework of using a Directed Acyclic Graph (DAG) to capture the causal intervention, a Diffusion-based Causal Model (DCM) was proposed incorporating the diffusion model to answer the causal questions more accurately, assuming that all of the confounders are observed. However, unmeasured confounders in practice exist, which hinders DCM from being applicable. To alleviate this limitation of DCM, we propose an extended model called Backdoor Criterion based DCM (BDCM), whose idea is rooted in the Backdoor criterion to find the variables in DAG to be included in the decoding process of the diffusion model so that we can extend DCM to the case with unmeasured confounders. Synthetic data experiment demonstrates that our proposed model captures the counterfactual distribution more precisely than DCM under the unmeasured confounders.  ( 2 min )
    Gaussian Process Regression under Computational and Epistemic Misspecification. (arXiv:2312.09225v1 [math.NA])
    Gaussian process regression is a classical kernel method for function estimation and data interpolation. In large data applications, computational costs can be reduced using low-rank or sparse approximations of the kernel. This paper investigates the effect of such kernel approximations on the interpolation error. We introduce a unified framework to analyze Gaussian process regression under important classes of computational misspecification: Karhunen-Lo\`eve expansions that result in low-rank kernel approximations, multiscale wavelet expansions that induce sparsity in the covariance matrix, and finite element representations that induce sparsity in the precision matrix. Our theory also accounts for epistemic misspecification in the choice of kernel parameters.  ( 2 min )
    Let's do the time-warp-attend: Learning topological invariants of dynamical systems. (arXiv:2312.09234v1 [cs.LG])
    Dynamical systems across the sciences, from electrical circuits to ecological networks, undergo qualitative and often catastrophic changes in behavior, called bifurcations, when their underlying parameters cross a threshold. Existing methods predict oncoming catastrophes in individual systems but are primarily time-series-based and struggle both to categorize qualitative dynamical regimes across diverse systems and to generalize to real data. To address this challenge, we propose a data-driven, physically-informed deep-learning framework for classifying dynamical regimes and characterizing bifurcation boundaries based on the extraction of topologically invariant features. We focus on the paradigmatic case of the supercritical Hopf bifurcation, which is used to model periodic dynamics across a wide range of applications. Our convolutional attention method is trained with data augmentations that encourage the learning of topological invariants which can be used to detect bifurcation boundaries in unseen systems and to design models of biological systems like oscillatory gene regulatory networks. We further demonstrate our method's use in analyzing real data by recovering distinct proliferation and differentiation dynamics along pancreatic endocrinogenesis trajectory in gene expression space based on single-cell data. Our method provides valuable insights into the qualitative, long-term behavior of a wide range of dynamical systems, and can detect bifurcations or catastrophic transitions in large-scale physical and biological systems.  ( 2 min )
    The impact of memory on learning sequence-to-sequence tasks. (arXiv:2205.14683v2 [cs.LG] UPDATED)
    The recent success of neural networks in natural language processing has drawn renewed attention to learning sequence-to-sequence (seq2seq) tasks. While there exists a rich literature that studies classification and regression tasks using solvable models of neural networks, seq2seq tasks have not yet been studied from this perspective. Here, we propose a simple model for a seq2seq task that has the advantage of providing explicit control over the degree of memory, or non-Markovianity, in the sequences -- the stochastic switching-Ornstein-Uhlenbeck (SSOU) model. We introduce a measure of non-Markovianity to quantify the amount of memory in the sequences. For a minimal auto-regressive (AR) learning model trained on this task, we identify two learning regimes corresponding to distinct phases in the stationary state of the SSOU process. These phases emerge from the interplay between two different time scales that govern the sequence statistics. Moreover, we observe that while increasing the integration window of the AR model always improves performance, albeit with diminishing returns, increasing the non-Markovianity of the input sequences can improve or degrade its performance. Finally, we perform experiments with recurrent and convolutional neural networks that show that our observations carry over to more complicated neural network architectures.  ( 2 min )
    Symmetry Breaking and Equivariant Neural Networks. (arXiv:2312.09016v1 [cs.LG])
    Using symmetry as an inductive bias in deep learning has been proven to be a principled approach for sample-efficient model design. However, the relationship between symmetry and the imperative for equivariance in neural networks is not always obvious. Here, we analyze a key limitation that arises in equivariant functions: their incapacity to break symmetry at the level of individual data samples. In response, we introduce a novel notion of 'relaxed equivariance' that circumvents this limitation. We further demonstrate how to incorporate this relaxation into equivariant multilayer perceptrons (E-MLPs), offering an alternative to the noise-injection method. The relevance of symmetry breaking is then discussed in various application domains: physics, graph representation learning, combinatorial optimization and equivariant decoding.  ( 2 min )
    Fair Clustering: A Causal Perspective. (arXiv:2312.09061v1 [stat.ML])
    Clustering algorithms may unintentionally propagate or intensify existing disparities, leading to unfair representations or biased decision-making. Current fair clustering methods rely on notions of fairness that do not capture any information on the underlying causal mechanisms. We show that optimising for non-causal fairness notions can paradoxically induce direct discriminatory effects from a causal standpoint. We present a clustering approach that incorporates causal fairness metrics to provide a more nuanced approach to fairness in unsupervised learning. Our approach enables the specification of the causal fairness metrics that should be minimised. We demonstrate the efficacy of our methodology using datasets known to harbour unfair biases.  ( 2 min )
    Estimating calibration error under label shift without labels. (arXiv:2312.08586v1 [cs.LG])
    In the face of dataset shift, model calibration plays a pivotal role in ensuring the reliability of machine learning systems. Calibration error (CE) is an indicator of the alignment between the predicted probabilities and the classifier accuracy. While prior works have delved into the implications of dataset shift on calibration, existing CE estimators assume access to labels from the target domain, which are often unavailable in practice, i.e., when the model is deployed and used. This work addresses such challenging scenario, and proposes a novel CE estimator under label shift, which is characterized by changes in the marginal label distribution $p(Y)$, while keeping the conditional $p(X|Y)$ constant between the source and target distributions. Our contribution is an approach, which, by leveraging importance re-weighting of the labeled source distribution, provides consistent and asymptotically unbiased CE estimation with respect to the shifted target distribution. Empirical results across diverse real-world datasets, under various conditions and label-shift intensities, demonstrate the effectiveness and reliability of the proposed estimator.  ( 2 min )
    Fast Sampling via De-randomization for Discrete Diffusion Models. (arXiv:2312.09193v1 [cs.LG])
    Diffusion models have emerged as powerful tools for high-quality data generation, such as image generation. Despite its success in continuous spaces, discrete diffusion models, which apply to domains such as texts and natural languages, remain under-studied and often suffer from slow generation speed. In this paper, we propose a novel de-randomized diffusion process, which leads to an accelerated algorithm for discrete diffusion models. Our technique significantly reduces the number of function evaluations (i.e., calls to the neural network), making the sampling process much faster. Furthermore, we introduce a continuous-time (i.e., infinite-step) sampling algorithm that can provide even better sample qualities than its discrete-time (finite-step) counterpart. Extensive experiments on natural language generation and machine translation tasks demonstrate the superior performance of our method in terms of both generation speed and sample quality over existing methods for discrete diffusion models.  ( 2 min )
    Using Surprise Index for Competency Assessment in Autonomous Decision-Making. (arXiv:2312.09033v1 [cs.RO])
    This paper considers the problem of evaluating an autonomous system's competency in performing a task, particularly when working in dynamic and uncertain environments. The inherent opacity of machine learning models, from the perspective of the user, often described as a `black box', poses a challenge. To overcome this, we propose using a measure called the Surprise index, which leverages available measurement data to quantify whether the dynamic system performs as expected. We show that the surprise index can be computed in closed form for dynamic systems when observed evidence in a probabilistic model if the joint distribution for that evidence follows a multivariate Gaussian marginal distribution. We then apply it to a nonlinear spacecraft maneuver problem, where actions are chosen by a reinforcement learning agent and show it can indicate how well the trajectory follows the required orbit.  ( 2 min )
    Knowledge-Driven Modulation of Neural Networks with Attention Mechanism for Next Activity Prediction. (arXiv:2312.08847v1 [cs.AI])
    Predictive Process Monitoring (PPM) aims at leveraging historic process execution data to predict how ongoing executions will continue up to their completion. In recent years, PPM techniques for the prediction of the next activities have matured significantly, mainly thanks to the use of Neural Networks (NNs) as a predictor. While their performance is difficult to beat in the general case, there are specific situations where background process knowledge can be helpful. Such knowledge can be leveraged for improving the quality of predictions for exceptional process executions or when the process changes due to a concept drift. In this paper, we present a Symbolic[Neuro] system that leverages background knowledge expressed in terms of a procedural process model to offset the under-sampling in the training data. More specifically, we make predictions using NNs with attention mechanism, an emerging technology in the NN field. The system has been tested on several real-life logs showing an improvement in the performance of the prediction task.  ( 2 min )
    Conformalised data synthesis with statistical quality guarantees. (arXiv:2312.08999v1 [cs.LG])
    With the proliferation of ever more complicated Deep Learning architectures, data synthesis is a highly promising technique to address the demand of data-hungry models. However, reliably assessing the quality of a 'synthesiser' model's output is an open research question with significant associated risks for high-stake domains. To address this challenge, we have designed a unique confident data synthesis algorithm that introduces statistical confidence guarantees through a novel extension of the Conformal Prediction framework. We support our proposed algorithm with theoretical proofs and an extensive empirical evaluation of five benchmark datasets. To show our approach's versatility on ubiquitous real-world challenges, the datasets were carefully selected for their variety of difficult characteristics: low sample count, class imbalance and non-separability, and privacy-sensitive data. In all trials, training sets extended with our confident synthesised data performed at least as well as the original, and frequently significantly improved Deep Learning performance by up to +65% F1-score.  ( 2 min )
    Does provable absence of barren plateaus imply classical simulability? Or, why we need to rethink variational quantum computing. (arXiv:2312.09121v1 [quant-ph])
    A large amount of effort has recently been put into understanding the barren plateau phenomenon. In this perspective article, we face the increasingly loud elephant in the room and ask a question that has been hinted at by many but not explicitly addressed: Can the structure that allows one to avoid barren plateaus also be leveraged to efficiently simulate the loss classically? We present strong evidence that commonly used models with provable absence of barren plateaus are also classically simulable, provided that one can collect some classical data from quantum devices during an initial data acquisition phase. This follows from the observation that barren plateaus result from a curse of dimensionality, and that current approaches for solving them end up encoding the problem into some small, classically simulable, subspaces. This sheds serious doubt on the non-classicality of the information processing capabilities of parametrized quantum circuits for barren plateau-free landscapes and on the possibility of superpolynomial advantages from running them on quantum hardware. We end by discussing caveats in our arguments, the role of smart initializations, and by highlighting new opportunities that our perspective raises.  ( 3 min )
    Fast sampling from constrained spaces using the Metropolis-adjusted Mirror Langevin Algorithm. (arXiv:2312.08823v1 [stat.CO])
    We propose a new method called the Metropolis-adjusted Mirror Langevin algorithm for approximate sampling from distributions whose support is a compact and convex set. This algorithm adds an accept-reject filter to the Markov chain induced by a single step of the mirror Langevin algorithm (Zhang et al., 2020), which is a basic discretisation of the mirror Langevin dynamics. Due to the inclusion of this filter, our method is unbiased relative to the target, while known discretisations of the mirror Langevin dynamics including the mirror Langevin algorithm have an asymptotic bias. We give upper bounds for the mixing time of the proposed algorithm when the potential is relatively smooth, convex, and Lipschitz with respect to a self-concordant mirror function. As a consequence of the reversibility of the Markov chain induced by the algorithm, we obtain an exponentially better dependence on the error tolerance for approximate sampling. We also present numerical experiments that corroborate our theoretical findings.  ( 2 min )
    Consistent and Asymptotically Unbiased Estimation of Proper Calibration Errors. (arXiv:2312.08589v1 [cs.LG])
    Proper scoring rules evaluate the quality of probabilistic predictions, playing an essential role in the pursuit of accurate and well-calibrated models. Every proper score decomposes into two fundamental components -- proper calibration error and refinement -- utilizing a Bregman divergence. While uncertainty calibration has gained significant attention, current literature lacks a general estimator for these quantities with known statistical properties. To address this gap, we propose a method that allows consistent, and asymptotically unbiased estimation of all proper calibration errors and refinement terms. In particular, we introduce Kullback--Leibler calibration error, induced by the commonly used cross-entropy loss. As part of our results, we prove the relation between refinement and f-divergences, which implies information monotonicity in neural networks, regardless of which proper scoring rule is optimized. Our experiments validate empirically the claimed properties of the proposed estimator and suggest that the selection of a post-hoc calibration method should be determined by the particular calibration error of interest.  ( 2 min )
    ReCoRe: Regularized Contrastive Representation Learning of World Model. (arXiv:2312.09056v1 [cs.LG])
    While recent model-free Reinforcement Learning (RL) methods have demonstrated human-level effectiveness in gaming environments, their success in everyday tasks like visual navigation has been limited, particularly under significant appearance variations. This limitation arises from (i) poor sample efficiency and (ii) over-fitting to training scenarios. To address these challenges, we present a world model that learns invariant features using (i) contrastive unsupervised learning and (ii) an intervention-invariant regularizer. Learning an explicit representation of the world dynamics i.e. a world model, improves sample efficiency while contrastive learning implicitly enforces learning of invariant features, which improves generalization. However, the naive integration of contrastive loss to world models fails due to a lack of supervisory signals to the visual encoder, as world-model-based RL methods independently optimize representation learning and agent policy. To overcome this issue, we propose an intervention-invariant regularizer in the form of an auxiliary task such as depth prediction, image denoising, etc., that explicitly enforces invariance to style-interventions. Our method outperforms current state-of-the-art model-based and model-free RL methods and significantly on out-of-distribution point navigation task evaluated on the iGibson benchmark. We further demonstrate that our approach, with only visual observations, outperforms recent language-guided foundation models for point navigation, which is essential for deployment on robots with limited computation capabilities. Finally, we demonstrate that our proposed model excels at the sim-to-real transfer of its perception module on Gibson benchmark.  ( 3 min )
    Performance evaluation of matrix factorization for fMRI data. (arXiv:2312.08809v1 [q-bio.NC])
    In the study of the brain, there is a hypothesis that sparse coding is realized in information representation of external stimuli, which is experimentally confirmed for visual stimulus recently. However, unlike the specific functional region in the brain, sparse coding in information processing in the whole brain has not been clarified sufficiently. In this study, we investigate the validity of sparse coding in the whole human brain by applying various matrix factorization methods to functional magnetic resonance imaging data of neural activities in the whole human brain. The result suggests sparse coding hypothesis in information representation in the whole human brain, because extracted features from sparse MF method, SparsePCA or MOD under high sparsity setting, or approximate sparse MF method, FastICA, can classify external visual stimuli more accurately than non-sparse MF method or sparse MF method under low sparsity setting.  ( 2 min )
    Deep learning-based estimation of time-dependent parameters in Markov models with application to nonlinear regression and SDEs. (arXiv:2312.08493v1 [stat.ML])
    We present a novel deep learning method for estimating time-dependent parameters in Markov processes through discrete sampling. Departing from conventional machine learning, our approach reframes parameter approximation as an optimization problem using the maximum likelihood approach. Experimental validation focuses on parameter estimation in multivariate regression and stochastic differential equations (SDEs). Theoretical results show that the real solution is close to SDE with parameters approximated using our neural network-derived under specific conditions. Our work contributes to SDE-based model parameter estimation, offering a versatile tool for diverse fields.  ( 2 min )
    Space-Time Approximation with Shallow Neural Networks in Fourier Lebesgue spaces. (arXiv:2312.08461v1 [cs.LG])
    Approximation capabilities of shallow neural networks (SNNs) form an integral part in understanding the properties of deep neural networks (DNNs). In the study of these approximation capabilities some very popular classes of target functions are the so-called spectral Barron spaces. This spaces are of special interest when it comes to the approximation of partial differential equation (PDE) solutions. It has been shown that the solution of certain static PDEs will lie in some spectral Barron space. In order to alleviate the limitation to static PDEs and include a time-domain that might have a different regularity than the space domain, we extend the notion of spectral Barron spaces to anisotropic weighted Fourier-Lebesgue spaces. In doing so, we consider target functions that have two blocks of variables, among which each block is allowed to have different decay and integrability properties. For these target functions we first study the inclusion of anisotropic weighted Fourier-Lebesgue spaces in the Bochner-Sobolev spaces. With that we can now also measure the approximation error in terms of an anisotropic Sobolev norm, namely the Bochner-Sobolev norm. We use this observation in a second step where we establish a bound on the approximation rate for functions from the anisotropic weighted Fourier-Lebesgue spaces and approximation via SNNs in the Bochner-Sobolev norm.  ( 2 min )
    The Relative Value of Prediction in Algorithmic Decision Making. (arXiv:2312.08511v1 [cs.CY])
    Algorithmic predictions are increasingly used to inform the allocations of goods and interventions in the public sphere. In these domains, predictions serve as a means to an end. They provide stakeholders with insights into likelihood of future events as a means to improve decision making quality, and enhance social welfare. However, if maximizing welfare is the ultimate goal, prediction is only a small piece of the puzzle. There are various other policy levers a social planner might pursue in order to improve bottom-line outcomes, such as expanding access to available goods, or increasing the effect sizes of interventions. Given this broad range of design decisions, a basic question to ask is: What is the relative value of prediction in algorithmic decision making? How do the improvements in welfare arising from better predictions compare to those of other policy levers? The goal of our work is to initiate the formal study of these questions. Our main results are theoretical in nature. We identify simple, sharp conditions determining the relative value of prediction vis-\`a-vis expanding access, within several statistical models that are popular amongst quantitative social scientists. Furthermore, we illustrate how these theoretical insights may be used to guide the design of algorithmic decision making systems in practice.  ( 2 min )
    ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric Strategy for Diverse Generative Tasks. (arXiv:2312.08583v1 [cs.CL])
    This study examines 4-bit quantization methods like GPTQ in large language models (LLMs), highlighting GPTQ's overfitting and limited enhancement in Zero-Shot tasks. While prior works merely focusing on zero-shot measurement, we extend task scope to more generative categories such as code generation and abstractive summarization, in which we found that INT4 quantization can significantly underperform. However, simply shifting to higher precision formats like FP6 has been particularly challenging, thus overlooked, due to poor performance caused by the lack of sophisticated integration and system acceleration strategies on current AI hardware. Our results show that FP6, even with a coarse-grain quantization scheme, performs robustly across various algorithms and tasks, demonstrating its superiority in accuracy and versatility. Notably, with the FP6 quantization, \codestar-15B model performs comparably to its FP16 counterpart in code generation, and for smaller models like the 406M it closely matches their baselines in summarization. Neither can be achieved by INT4. To better accommodate various AI hardware and achieve the best system performance, we propose a novel 4+2 design for FP6 to achieve similar latency to the state-of-the-art INT4 fine-grain quantization. With our design, FP6 can become a promising solution to the current 4-bit quantization methods used in LLMs.  ( 2 min )
    Revisiting the Last-Iterate Convergence of Stochastic Gradient Methods. (arXiv:2312.08531v1 [cs.LG])
    In the past several years, the convergence of the last iterate of the Stochastic Gradient Descent (SGD) algorithm has triggered people's interest due to its good performance in practice but lack of theoretical understanding. For Lipschitz and convex functions, different works have established the optimal $O(\log(1/\delta)\log T/\sqrt{T})$ or $O(\sqrt{\log(1/\delta)/T})$ high-probability convergence rates for the final iterate, where $T$ is the time horizon and $\delta$ is the failure probability. However, to prove these bounds, all the existing works are limited to compact domains or require almost surely bounded noises. It is natural to ask whether the last iterate of SGD can still guarantee the optimal convergence rate but without these two restrictive assumptions. Besides this important question, there are still lots of theoretical problems lacking an answer. For example, compared with the last iterate convergence of SGD for non-smooth problems, only few results for smooth optimization have yet been developed. Additionally, the existing results are all limited to a non-composite objective and the standard Euclidean norm. It still remains unclear whether the last-iterate convergence can be provably extended to wider composite optimization and non-Euclidean norms. In this work, to address the issues mentioned above, we revisit the last-iterate convergence of stochastic gradient methods and provide the first unified way to prove the convergence rates both in expectation and in high probability to accommodate general domains, composite objectives, non-Euclidean norms, Lipschitz conditions, smoothness and (strong) convexity simultaneously. Additionally, we extend our analysis to obtain the last-iterate convergence under heavy-tailed noises.  ( 3 min )
    Universal Approximation Property of Random Neural Networks. (arXiv:2312.08410v1 [cs.LG])
    In this paper, we study random neural networks which are single-hidden-layer feedforward neural networks whose weights and biases are randomly initialized. After this random initialization, only the linear readout needs to be trained, which can be performed efficiently, e.g., by the least squares method. By viewing random neural networks as Banach space-valued random variables, we prove their universal approximation properties within suitable Bochner spaces. Hereby, the corresponding Banach space can be more general than the space of continuous functions over a compact subset of a Euclidean space, namely, e.g., an $L^p$-space or a Sobolev space, where the latter includes the approximation of the derivatives. Moreover, we derive some approximation rates and develop an explicit algorithm to learn a deterministic function by a random neural network. In addition, we provide a full error analysis and study when random neural networks overcome the curse of dimensionality in the sense that the training costs scale at most polynomially in the input and output dimension. Furthermore, we show in two numerical examples the empirical advantages of random neural networks compared to fully trained deterministic neural networks.  ( 2 min )

  • Open

    Advancements in machine learning for machine learning
    Posted by Phitchaya Mangpo Phothilimthana, Staff Research Scientist, Google DeepMind, and Bryan Perozzi, Senior Staff Research Scientist, Google Research With the recent and accelerated advances in machine learning (ML), machines can understand natural language, engage in conversations, draw images, create videos and more. Modern ML models are programmed and trained using ML programming frameworks, such as TensorFlow, JAX, PyTorch, among many others. These libraries provide high-level instructions to ML practitioners, such as linear algebra operations (e.g., matrix multiplication, convolution, etc.) and neural network layers (e.g., 2D convolution layers, transformer layers). Importantly, practitioners need not worry about how to make their models run efficiently on hardware because an…  ( 93 min )
    StyleDrop: Text-to-image generation in any style
    Posted by Kihyuk Sohn and Dilip Krishnan, Research Scientists, Google Research Text-to-image models trained on large volumes of image-text pairs have enabled the creation of rich and diverse images encompassing many genres and themes. Moreover, popular styles such as “anime” or “steampunk”, when added to the input text prompt, may translate to specific visual outputs. While many efforts have been put into prompt engineering, a wide range of styles are simply hard to describe in text form due to the nuances of color schemes, illumination, and other characteristics. As an example, “watercolor painting” may refer to various styles, and using a text prompt that simply says “watercolor painting style” may either result in one specific style or an unpredictable mix of several. When we …  ( 92 min )
  • Open

    What is the point of a public key fingerprint?
    Public key cryptography uses two keys: a private key and a public key. The nature of these keys depends on the encryption scheme, such as whether one is using RSA, ECC, or some other method, but you can think of a key as a long number. A key may be a short list of numbers, […] What is the point of a public key fingerprint? first appeared on John D. Cook.  ( 5 min )
  • Open

    Beyond LLMs and Trillion-Parameter Models
    These days, it is as if AI is just about GenAI (generative AI), LLMs (large language models) and very large models. It has eclipsed computer vision, voice AI and everything else. Part of the success of trillion-parameter models is that they are over-parametrized. That is, many different parameter combinations lead to good enough solutions. In… Read More »Beyond LLMs and Trillion-Parameter Models The post Beyond LLMs and Trillion-Parameter Models appeared first on Data Science Central.  ( 22 min )
  • Open

    Use Amazon DocumentDB to build no-code machine learning solutions in Amazon SageMaker Canvas
    We are excited to announce the launch of Amazon DocumentDB (with MongoDB compatibility) integration with Amazon SageMaker Canvas, allowing Amazon DocumentDB customers to build and use generative AI and machine learning (ML) solutions without writing code. Amazon DocumentDB is a fully managed native JSON document database that makes it straightforward and cost-effective to operate critical […]  ( 9 min )
  • Open

    Image recognition accuracy: An unseen challenge confounding today’s AI
    “Minimum viewing time” benchmark gauges image recognition complexity for AI systems by measuring the time needed for accurate human identification.  ( 11 min )
    Computational model captures the elusive transition states of chemical reactions
    Using generative AI, MIT chemists created a model that can predict the structures formed when a chemical reaction reaches its point of no return.  ( 9 min )
  • Open

    Discretization-Induced Dirichlet Posterior for Robust Uncertainty Quantification on Regression. (arXiv:2308.09065v2 [cs.CV] UPDATED)
    Uncertainty quantification is critical for deploying deep neural networks (DNNs) in real-world applications. An Auxiliary Uncertainty Estimator (AuxUE) is one of the most effective means to estimate the uncertainty of the main task prediction without modifying the main task model. To be considered robust, an AuxUE must be capable of maintaining its performance and triggering higher uncertainties while encountering Out-of-Distribution (OOD) inputs, i.e., to provide robust aleatoric and epistemic uncertainty. However, for vision regression tasks, current AuxUE designs are mainly adopted for aleatoric uncertainty estimates, and AuxUE robustness has not been explored. In this work, we propose a generalized AuxUE scheme for more robust uncertainty quantification on regression tasks. Concretely, to achieve a more robust aleatoric uncertainty estimation, different distribution assumptions are considered for heteroscedastic noise, and Laplace distribution is finally chosen to approximate the prediction error. For epistemic uncertainty, we propose a novel solution named Discretization-Induced Dirichlet pOsterior (DIDO), which models the Dirichlet posterior on the discretized prediction error. Extensive experiments on age estimation, monocular depth estimation, and super-resolution tasks show that our proposed method can provide robust uncertainty estimates in the face of noisy inputs and that it can be scalable to both image-level and pixel-wise tasks. Code is available at https://github.com/ENSTA-U2IS/DIDO .  ( 3 min )
    Ensemble Reinforcement Learning: A Survey. (arXiv:2303.02618v3 [cs.LG] UPDATED)
    Reinforcement Learning (RL) has emerged as a highly effective technique for addressing various scientific and applied problems. Despite its success, certain complex tasks remain challenging to be addressed solely with a single model and algorithm. In response, ensemble reinforcement learning (ERL), a promising approach that combines the benefits of both RL and ensemble learning (EL), has gained widespread popularity. ERL leverages multiple models or training algorithms to comprehensively explore the problem space and possesses strong generalization capabilities. In this study, we present a comprehensive survey on ERL to provide readers with an overview of recent advances and challenges in the field. Firstly, we provide an introduction to the background and motivation for ERL. Secondly, we conduct a detailed analysis of strategies such as model selection and combination that have been successfully implemented in ERL. Subsequently, we explore the application of ERL, summarize the datasets, and analyze the algorithms employed. Finally, we outline several open questions and discuss future research directions of ERL. By offering guidance for future scientific research and engineering applications, this survey significantly contributes to the advancement of ERL.  ( 2 min )
    Reinforcement Learning for Generative AI: State of the Art, Opportunities and Open Research Challenges. (arXiv:2308.00031v2 [cs.LG] UPDATED)
    Generative Artificial Intelligence (AI) is one of the most exciting developments in Computer Science of the last decade. At the same time, Reinforcement Learning (RL) has emerged as a very successful paradigm for a variety of machine learning tasks. In this survey, we discuss the state of the art, opportunities and open research questions in applying RL to generative AI. In particular, we will discuss three types of applications, namely, RL as an alternative way for generation without specified objectives; as a way for generating outputs while concurrently maximizing an objective function; and, finally, as a way of embedding desired characteristics, which cannot be easily captured by means of an objective function, into the generative process. We conclude the survey with an in-depth discussion of the opportunities and challenges in this fascinating emerging area.  ( 2 min )
    Levenshtein Distance Embedding with Poisson Regression for DNA Storage. (arXiv:2312.07931v1 [cs.LG])
    Efficient computation or approximation of Levenshtein distance, a widely-used metric for evaluating sequence similarity, has attracted significant attention with the emergence of DNA storage and other biological applications. Sequence embedding, which maps Levenshtein distance to a conventional distance between embedding vectors, has emerged as a promising solution. In this paper, a novel neural network-based sequence embedding technique using Poisson regression is proposed. We first provide a theoretical analysis of the impact of embedding dimension on model performance and present a criterion for selecting an appropriate embedding dimension. Under this embedding dimension, the Poisson regression is introduced by assuming the Levenshtein distance between sequences of fixed length following a Poisson distribution, which naturally aligns with the definition of Levenshtein distance. Moreover, from the perspective of the distribution of embedding distances, Poisson regression approximates the negative log likelihood of the chi-squared distribution and offers advancements in removing the skewness. Through comprehensive experiments on real DNA storage data, we demonstrate the superior performance of the proposed method compared to state-of-the-art approaches.  ( 2 min )
    Kimad: Adaptive Gradient Compression with Bandwidth Awareness. (arXiv:2312.08053v1 [cs.LG])
    In distributed training, communication often emerges as a bottleneck. In response, we introduce Kimad, a solution that offers adaptive gradient compression. By consistently monitoring bandwidth, Kimad refines compression ratios to match specific neural network layer requirements. Our exhaustive tests and proofs confirm Kimad's outstanding performance, establishing it as a benchmark in adaptive compression for distributed deep learning.  ( 2 min )
    Exploiting Machine Unlearning for Backdoor Attacks in Deep Learning System. (arXiv:2310.10659v2 [cs.CR] UPDATED)
    In recent years, the security issues of artificial intelligence have become increasingly prominent due to the rapid development of deep learning research and applications. Backdoor attack is an attack targeting the vulnerability of deep learning models, where hidden backdoors are activated by triggers embedded by the attacker, thereby outputting malicious predictions that may not align with the intended output for a given input. In this work, we propose a novel black-box backdoor attack based on machine unlearning. The attacker first augments the training set with carefully designed samples, including poison and mitigation data, to train a `benign' model. Then, the attacker posts unlearning requests for the mitigation samples to remove the impact of relevant data on the model, gradually activating the hidden backdoor. Since backdoors are implanted during the iterative unlearning process, it significantly increases the computational overhead of existing defense methods for backdoor detection or mitigation. To address this new security threat, we proposes two methods for detecting or mitigating such malicious unlearning requests. We conduct the experiment in both exact unlearning and approximate unlearning (i.e., SISA) settings. Experimental results indicate that: 1) our attack approach can successfully implant backdoor into the model, and sharding increases the difficult of attack; 2) our detection algorithms are effective in identifying the mitigation samples, while sharding reduces the effectiveness of our detection algorithms.  ( 2 min )
    Coreset selection can accelerate quantum machine learning models with provable generalization. (arXiv:2309.10441v2 [quant-ph] UPDATED)
    Quantum neural networks (QNNs) and quantum kernels stand as prominent figures in the realm of quantum machine learning, poised to leverage the nascent capabilities of near-term quantum computers to surmount classical machine learning challenges. Nonetheless, the training efficiency challenge poses a limitation on both QNNs and quantum kernels, curbing their efficacy when applied to extensive datasets. To confront this concern, we present a unified approach: coreset selection, aimed at expediting the training of QNNs and quantum kernels by distilling a judicious subset from the original training dataset. Furthermore, we analyze the generalization error bounds of QNNs and quantum kernels when trained on such coresets, unveiling the comparable performance with those training on the complete original dataset. Through systematic numerical simulations, we illuminate the potential of coreset selection in expediting tasks encompassing synthetic data classification, identification of quantum correlations, and quantum compiling. Our work offers a useful way to improve diverse quantum machine learning models with a theoretical guarantee while reducing the training cost.  ( 2 min )
    On the verification of Embeddings using Hybrid Markov Logic. (arXiv:2312.08287v1 [cs.LG])
    The standard approach to verify representations learned by Deep Neural Networks is to use them in specific tasks such as classification or regression, and measure their performance based on accuracy in such tasks. However, in many cases, we would want to verify more complex properties of a learned representation. To do this, we propose a framework based on a probabilistic first-order language, namely, Hybrid Markov Logic Networks (HMLNs) where we specify properties over embeddings mixed with symbolic domain knowledge. We present an approach to learn parameters for the properties within this framework. Further, we develop a verification method to test embeddings in this framework by encoding this task as a Mixed Integer Linear Program for which we can leverage existing state-of-the-art solvers. We illustrate verification in Graph Neural Networks, Deep Knowledge Tracing and Intelligent Tutoring Systems to demonstrate the generality of our approach.  ( 2 min )
    Universal Adversarial Framework to Improve Adversarial Robustness for Diabetic Retinopathy Detection. (arXiv:2312.08193v1 [eess.IV])
    Diabetic Retinopathy (DR) is a prevalent illness associated with Diabetes which, if left untreated, can result in irreversible blindness. Deep Learning based systems are gradually being introduced as automated support for clinical diagnosis. Since healthcare has always been an extremely important domain demanding error-free performance, any adversaries could pose a big threat to the applicability of such systems. In this work, we use Universal Adversarial Perturbations (UAPs) to quantify the vulnerability of Medical Deep Neural Networks (DNNs) for detecting DR. To the best of our knowledge, this is the very first attempt that works on attacking complete fine-grained classification of DR images using various UAPs. Also, as a part of this work, we use UAPs to fine-tune the trained models to defend against adversarial samples. We experiment on several models and observe that the performance of such models towards unseen adversarial attacks gets boosted on average by $3.41$ Cohen-kappa value and maximum by $31.92$ Cohen-kappa value. The performance degradation on normal data upon ensembling the fine-tuned models was found to be statistically insignificant using t-test, highlighting the benefits of UAP-based adversarial fine-tuning.  ( 2 min )
    Accelerating the Global Aggregation of Local Explanations. (arXiv:2312.07991v1 [cs.LG])
    Local explanation methods highlight the input tokens that have a considerable impact on the outcome of classifying the document at hand. For example, the Anchor algorithm applies a statistical analysis of the sensitivity of the classifier to changes in the token. Aggregating local explanations over a dataset provides a global explanation of the model. Such aggregation aims to detect words with the most impact, giving valuable insights about the model, like what it has learned in training and which adversarial examples expose its weaknesses. However, standard aggregation methods bear a high computational cost: a na\"ive implementation applies a costly algorithm to each token of each document, and hence, it is infeasible for a simple user running in the scope of a short analysis session. % We devise techniques for accelerating the global aggregation of the Anchor algorithm. Specifically, our goal is to compute a set of top-$k$ words with the highest global impact according to different aggregation functions. Some of our techniques are lossless and some are lossy. We show that for a very mild loss of quality, we are able to accelerate the computation by up to 30$\times$, reducing the computation from hours to minutes. We also devise and study a probabilistic model that accounts for noise in the Anchor algorithm and diminishes the bias toward words that are frequent yet low in impact.  ( 2 min )
    Towards Optimal Statistical Watermarking. (arXiv:2312.07930v1 [cs.LG])
    We study statistical watermarking by formulating it as a hypothesis testing problem, a general framework which subsumes all previous statistical watermarking methods. Key to our formulation is a coupling of the output tokens and the rejection region, realized by pseudo-random generators in practice, that allows non-trivial trade-off between the Type I error and Type II error. We characterize the Uniformly Most Powerful (UMP) watermark in this context. In the most common scenario where the output is a sequence of $n$ tokens, we establish matching upper and lower bounds on the number of i.i.d. tokens required to guarantee small Type I and Type II errors. Our rate scales as $\Theta(h^{-1} \log (1/h))$ with respect to the average entropy per token $h$ and thus greatly improves the $O(h^{-2})$ rate in the previous works. For scenarios where the detector lacks knowledge of the model's distribution, we introduce the concept of model-agnostic watermarking and establish the minimax bounds for the resultant increase in Type II error. Moreover, we formulate the robust watermarking problem where user is allowed to perform a class of perturbation on the generated texts, and characterize the optimal type II error of robust UMP tests via a linear programming problem. To the best of our knowledge, this is the first systematic statistical treatment on the watermarking problem with near-optimal rates in the i.i.d. setting, and might be of interest for future works.  ( 3 min )
    Differentially private projection-depth-based medians. (arXiv:2312.07792v1 [math.ST])
    We develop $(\epsilon,\delta)$-differentially private projection-depth-based medians using the propose-test-release (PTR) and exponential mechanisms. Under general conditions on the input parameters and the population measure, (e.g. we do not assume any moment bounds), we quantify the probability the test in PTR fails, as well as the cost of privacy via finite sample deviation bounds. We demonstrate our main result on the canonical projection-depth-based median. In the Gaussian setting, we show that the resulting deviation bound matches the known lower bound for private Gaussian mean estimation, up to a polynomial function of the condition number of the covariance matrix. In the Cauchy setting, we show that the ``outlier error amplification'' effect resulting from the heavy tails outweighs the cost of privacy. This result is then verified via numerical simulations. Additionally, we present results on general PTR mechanisms and a uniform concentration result on the projected spacings of order statistics.  ( 2 min )
    SVInvNet: A Densely Connected Encoder-Decoder Architecture for Seismic Velocity Inversion. (arXiv:2312.08194v1 [cs.LG])
    This study presents a deep learning-based approach to seismic velocity inversion problem, focusing on both noisy and noiseless training datasets of varying sizes. Our Seismic Velocity Inversion Network (SVInvNet) introduces a novel architecture that contains a multi-connection encoder-decoder structure enhanced with dense blocks. This design is specifically tuned to effectively process complex information, crucial for addressing the challenges of non-linear seismic velocity inversion. For training and testing, we created diverse seismic velocity models, including multi-layered, faulty, and salt dome categories. We also investigated how different kinds of ambient noise, both coherent and stochastic, and the size of the training dataset affect learning outcomes. SVInvNet is trained on datasets ranging from 750 to 6,000 samples and is tested using a large benchmark dataset of 12,000 samples. Despite its fewer parameters compared to the baseline, SVInvNet achieves superior performance with this dataset. The outcomes of the SVInvNet are additionally compared to those of the Full Waveform Inversion (FWI) method. The comparative analysis clearly reveals the effectiveness of the proposed model.  ( 2 min )
    A New Perspective On Denoising Based On Optimal Transport. (arXiv:2312.08135v1 [math.ST])
    In the standard formulation of the denoising problem, one is given a probabilistic model relating a latent variable $\Theta \in \Omega \subset \mathbb{R}^m \; (m\ge 1)$ and an observation $Z \in \mathbb{R}^d$ according to: $Z \mid \Theta \sim p(\cdot\mid \Theta)$ and $\Theta \sim G^*$, and the goal is to construct a map to recover the latent variable from the observation. The posterior mean, a natural candidate for estimating $\Theta$ from $Z$, attains the minimum Bayes risk (under the squared error loss) but at the expense of over-shrinking the $Z$, and in general may fail to capture the geometric features of the prior distribution $G^*$ (e.g., low dimensionality, discreteness, sparsity, etc.). To rectify these drawbacks, in this paper we take a new perspective on this denoising problem that is inspired by optimal transport (OT) theory and use it to propose a new OT-based denoiser at the population level setting. We rigorously prove that, under general assumptions on the model, our OT-based denoiser is well-defined and unique, and is closely connected to solutions to a Monge OT problem. We then prove that, under appropriate identifiability assumptions on the model, our OT-based denoiser can be recovered solely from information of the marginal distribution of $Z$ and the posterior mean of the model, after solving a linear relaxation problem over a suitable space of couplings that is reminiscent of a standard multimarginal OT (MOT) problem. In particular, thanks to Tweedie's formula, when the likelihood model $\{ p(\cdot \mid \theta) \}_{\theta \in \Omega}$ is an exponential family of distributions, the OT-based denoiser can be recovered solely from the marginal distribution of $Z$. In general, our family of OT-like relaxations is of interest in its own right and for the denoising problem suggests alternative numerical methods inspired by the rich literature on computational OT.  ( 3 min )
    The Effective Horizon Explains Deep RL Performance in Stochastic Environments. (arXiv:2312.08369v1 [stat.ML])
    Reinforcement learning (RL) theory has largely focused on proving minimax sample complexity bounds. These require strategic exploration algorithms that use relatively limited function classes for representing the policy or value function. Our goal is to explain why deep RL algorithms often perform well in practice, despite using random exploration and much more expressive function classes like neural networks. Our work arrives at an explanation by showing that many stochastic MDPs can be solved by performing only a few steps of value iteration on the random policy's Q function and then acting greedily. When this is true, we find that it is possible to separate the exploration and learning components of RL, making it much easier to analyze. We introduce a new RL algorithm, SQIRL, that iteratively learns a near-optimal policy by exploring randomly to collect rollouts and then performing a limited number of steps of fitted-Q iteration over those rollouts. Any regression algorithm that satisfies basic in-distribution generalization properties can be used in SQIRL to efficiently solve common MDPs. This can explain why deep RL works neural networks, since it is empirically established that neural networks generalize well in-distribution. Furthermore, SQIRL explains why random exploration works well in practice, since we show many environments can be solved by estimating the random policy's Q-function and then applying zero or a few steps of value iteration. We leverage SQIRL to derive instance-dependent sample complexity bounds for RL that are exponential only in an "effective horizon" of lookahead and on the complexity of the class used for function approximation. Empirically, we also find that SQIRL performance strongly correlates with PPO and DQN performance in a variety of stochastic environments, supporting that our theoretical analysis is predictive of practical performance.  ( 3 min )
    Semi-Supervised Segmentation of Functional Tissue Units at the Cellular Level. (arXiv:2305.02148v2 [eess.IV] UPDATED)
    We present a new method for functional tissue unit segmentation at the cellular level, which utilizes the latest deep learning semantic segmentation approaches together with domain adaptation and semi-supervised learning techniques. This approach allows for minimizing the domain gap, class imbalance, and captures settings influence between HPA and HubMAP datasets. The presented approach achieves comparable with state-of-the-art-result in functional tissue unit segmentation at the cellular level. The source code is available at https://github.com/VSydorskyy/hubmap_2022_htt_solution  ( 2 min )
    Large Human Language Models: A Need and the Challenges. (arXiv:2312.07751v1 [cs.CL])
    As research in human-centered NLP advances, there is a growing recognition of the importance of incorporating human and social factors into NLP models. At the same time, our NLP systems have become heavily reliant on LLMs, most of which do not model authors. To build NLP systems that can truly understand human language, we must better integrate human contexts into LLMs. This brings to the fore a range of design considerations and challenges in terms of what human aspects to capture, how to represent them, and what modeling strategies to pursue. To address these, we advocate for three positions toward creating large human language models (LHLMs) using concepts from psychological and behavioral sciences: First, LM training should include the human context. Second, LHLMs should recognize that people are more than their group(s). Third, LHLMs should be able to account for the dynamic and temporally-dependent nature of the human context. We refer to relevant advances and present open challenges that need to be addressed and their possible solutions in realizing these goals.  ( 2 min )
    Individualized Deepfake Detection Exploiting Traces Due to Double Neural-Network Operations. (arXiv:2312.08034v1 [eess.IV])
    In today's digital landscape, journalists urgently require tools to verify the authenticity of facial images and videos depicting specific public figures before incorporating them into news stories. Existing deepfake detectors are not optimized for this detection task when an image is associated with a specific and identifiable individual. This study focuses on the deepfake detection of facial images of individual public figures. We propose to condition the proposed detector on the identity of the identified individual given the advantages revealed by our theory-driven simulations. While most detectors in the literature rely on perceptible or imperceptible artifacts present in deepfake facial images, we demonstrate that the detection performance can be improved by exploiting the idempotency property of neural networks. In our approach, the training process involves double neural-network operations where we pass an authentic image through a deepfake simulating network twice. Experimental results show that the proposed method improves the area under the curve (AUC) from 0.92 to 0.94 and reduces its standard deviation by 17\%. For evaluating the detection performance of individual public figures, a facial image dataset with individuals' names is required, a criterion not met by the current deepfake datasets. To address this, we curated a dataset comprising 32k images featuring 45 public figures, which we intend to release to the public after the paper is published.  ( 2 min )
    Learning Nash Equilibria in Zero-Sum Markov Games: A Single Time-scale Algorithm Under Weak Reachability. (arXiv:2312.08008v1 [cs.GT])
    We consider decentralized learning for zero-sum games, where players only see their payoff information and are agnostic to actions and payoffs of the opponent. Previous works demonstrated convergence to a Nash equilibrium in this setting using double time-scale algorithms under strong reachability assumptions. We address the open problem of achieving an approximate Nash equilibrium efficiently with an uncoupled and single time-scale algorithm under weaker conditions. Our contribution is a rational and convergent algorithm, utilizing Tsallis-entropy regularization in a value-iteration-based approach. The algorithm learns an approximate Nash equilibrium in polynomial time, requiring only the existence of a policy pair that induces an irreducible and aperiodic Markov chain, thus considerably weakening past assumptions. Our analysis leverages negative drift inequalities and introduces novel properties of Tsallis entropy that are of independent interest.  ( 2 min )
    GLARE: A Dataset for Traffic Sign Detection in Sun Glare. (arXiv:2209.08716v2 [cs.CV] UPDATED)
    Real-time machine learning object detection algorithms are often found within autonomous vehicle technology and depend on quality datasets. It is essential that these algorithms work correctly in everyday conditions as well as under strong sun glare. Reports indicate glare is one of the two most prominent environment-related reasons for crashes. However, existing datasets, such as the Laboratory for Intelligent & Safe Automobiles Traffic Sign (LISA) Dataset and the German Traffic Sign Recognition Benchmark, do not reflect the existence of sun glare at all. This paper presents the GLARE (GLARE is available at: https://github.com/NicholasCG/GLARE_Dataset ) traffic sign dataset: a collection of images with U.S-based traffic signs under heavy visual interference by sunlight. GLARE contains 2,157 images of traffic signs with sun glare, pulled from 33 videos of dashcam footage of roads in the United States. It provides an essential enrichment to the widely used LISA Traffic Sign dataset. Our experimental study shows that although several state-of-the-art baseline architectures have demonstrated good performance on traffic sign detection in conditions without sun glare in the past, they performed poorly when tested against GLARE (e.g., average mAP0.5:0.95 of 19.4). We also notice that current architectures have better detection when trained on images of traffic signs in sun glare performance (e.g., average mAP0.5:0.95 of 39.6), and perform best when trained on a mixture of conditions (e.g., average mAP0.5:0.95 of 42.3).
    TwinLiteNet: An Efficient and Lightweight Model for Driveable Area and Lane Segmentation in Self-Driving Cars. (arXiv:2307.10705v5 [cs.CV] UPDATED)
    Semantic segmentation is a common task in autonomous driving to understand the surrounding environment. Driveable Area Segmentation and Lane Detection are particularly important for safe and efficient navigation on the road. However, original semantic segmentation models are computationally expensive and require high-end hardware, which is not feasible for embedded systems in autonomous vehicles. This paper proposes a lightweight model for the driveable area and lane line segmentation. TwinLiteNet is designed cheaply but achieves accurate and efficient segmentation results. We evaluate TwinLiteNet on the BDD100K dataset and compare it with modern models. Experimental results show that our TwinLiteNet performs similarly to existing approaches, requiring significantly fewer computational resources. Specifically, TwinLiteNet achieves a mIoU score of 91.3% for the Drivable Area task and 31.08% IoU for the Lane Detection task with only 0.4 million parameters and achieves 415 FPS on GPU RTX A5000. Furthermore, TwinLiteNet can run in real-time on embedded devices with limited computing power, especially since it achieves 60FPS on Jetson Xavier NX, making it an ideal solution for self-driving vehicles. Code is available: url{https://github.com/chequanghuy/TwinLiteNet}.
    Partial Symmetry Detection for 3D Geometry using Contrastive Learning with Geodesic Point Cloud Patches. (arXiv:2312.08230v1 [cs.CV])
    Symmetry detection, especially partial and extrinsic symmetry, is essential for various downstream tasks, like 3D geometry completion, segmentation, compression and structure-aware shape encoding or generation. In order to detect partial extrinsic symmetries, we propose to learn rotation, reflection, translation and scale invariant local shape features for geodesic point cloud patches via contrastive learning, which are robust across multiple classes and generalize over different datasets. We show that our approach is able to extract multiple valid solutions for this ambiguous problem. Furthermore, we introduce a novel benchmark test for partial extrinsic symmetry detection to evaluate our method. Lastly, we incorporate the detected symmetries together with a region growing algorithm to demonstrate a downstream task with the goal of computing symmetry-aware partitions of 3D shapes. To our knowledge, we are the first to propose a self-supervised data-driven method for partial extrinsic symmetry detection.
    Secure Deep Reinforcement Learning for Dynamic Resource Allocation in Wireless MEC Networks. (arXiv:2312.08016v1 [cs.LG])
    This paper proposes a blockchain-secured deep reinforcement learning (BC-DRL) optimization framework for {data management and} resource allocation in decentralized {wireless mobile edge computing (MEC)} networks. In our framework, {we design a low-latency reputation-based proof-of-stake (RPoS) consensus protocol to select highly reliable blockchain-enabled BSs to securely store MEC user requests and prevent data tampering attacks.} {We formulate the MEC resource allocation optimization as a constrained Markov decision process that balances minimum processing latency and denial-of-service (DoS) probability}. {We use the MEC aggregated features as the DRL input to significantly reduce the high-dimensionality input of the remaining service processing time for individual MEC requests. Our designed constrained DRL effectively attains the optimal resource allocations that are adapted to the dynamic DoS requirements. We provide extensive simulation results and analysis to} validate that our BC-DRL framework achieves higher security, reliability, and resource utilization efficiency than benchmark blockchain consensus protocols and {MEC} resource allocation algorithms.  ( 2 min )
    Kunyu: A High-Performing Global Weather Model Beyond Regression Losses. (arXiv:2312.08264v1 [eess.SP])
    Over the past year, data-driven global weather forecasting has emerged as a new alternative to traditional numerical weather prediction. This innovative approach yields forecasts of comparable accuracy at a tiny fraction of computational costs. Regrettably, as far as I know, existing models exclusively rely on regression losses, producing forecasts with substantial blurring. Such blurring, although compromises practicality, enjoys an unfair advantage on evaluation metrics. In this paper, I present Kunyu, a global data-driven weather forecasting model which delivers accurate predictions across a comprehensive array of atmospheric variables at 0.35{\deg} resolution. With both regression and adversarial losses integrated in its training framework, Kunyu generates forecasts with enhanced clarity and realism. Its performance outpaces even ECMWF HRES in some aspects such as the estimation of anomaly extremes, while remaining competitive with ECMWF HRES on evaluation metrics such as RMSE and ACC. Kunyu is an important step forward in closing the utility gap between numerical and data-driven weather prediction.
    The Blessing of Heterogeneity in Federated Q-Learning: Linear Speedup and Beyond. (arXiv:2305.10697v2 [cs.LG] UPDATED)
    When the data used for reinforcement learning (RL) are collected by multiple agents in a distributed manner, federated versions of RL algorithms allow collaborative learning without the need for agents to share their local data. In this paper, we consider federated Q-learning, which aims to learn an optimal Q-function by periodically aggregating local Q-estimates trained on local data alone. Focusing on infinite-horizon tabular Markov decision processes, we provide sample complexity guarantees for both the synchronous and asynchronous variants of federated Q-learning. In both cases, our bounds exhibit a linear speedup with respect to the number of agents and near-optimal dependencies on other salient problem parameters. In the asynchronous setting, existing analyses of federated Q-learning, which adopt an equally weighted averaging of local Q-estimates, require that every agent covers the entire state-action space. In contrast, our improved sample complexity scales inverse proportionally to the minimum entry of the average stationary state-action occupancy distribution of all agents, thus only requiring the agents to collectively cover the entire state-action space, unveiling the blessing of heterogeneity in enabling collaborative learning by relaxing the coverage requirement of the single-agent case. However, its sample complexity still suffers when the local trajectories are highly heterogeneous. In response, we propose a novel federated Q-learning algorithm with importance averaging, giving larger weights to more frequently visited state-action pairs, which achieves a robust linear speedup as if all trajectories are centrally processed, regardless of the heterogeneity of local behavior policies.
    An Incentive Mechanism for Federated Learning Based on Multiple Resource Exchange. (arXiv:2312.08096v1 [cs.LG])
    Federated Learning (FL) is a distributed machine learning paradigm that addresses privacy concerns in machine learning and still guarantees high test accuracy. However, achieving the necessary accuracy by having all clients participate in FL is impractical, given the constraints of client local computing resource. In this paper, we introduce a multi-user collaborative computing framework, categorizing users into two roles: model owners (MOs) and data owner (DOs). Without resorting to monetary incentives, an MO can encourage more DOs to join in FL by allowing the DOs to offload extra local computing tasks to the MO for execution. This exchange of "data" for "computing resources" streamlines the incentives for clients to engage more effectively in FL. We formulate the interaction between MO and DOs as an optimization problem, and the objective is to effectively utilize the communication and computing resource of the MO and DOs to minimize the time to complete an FL task. The proposed problem is a mixed integer nonlinear programming (MINLP) with high computational complexity. We first decompose it into two distinct subproblems, namely the client selection problem and the resource allocation problem to segregate the integer variables from the continuous variables. Then, an effective iterative algorithm is proposed to solve problem. Simulation results demonstrate that the proposed collaborative computing framework can achieve an accuracy of more than 95\% while minimizing the overall time to complete an FL task.
    Rewiring with Positional Encodings for Graph Neural Networks. (arXiv:2201.12674v4 [cs.LG] UPDATED)
    Several recent works use positional encodings to extend the receptive fields of graph neural network (GNN) layers equipped with attention mechanisms. These techniques, however, extend receptive fields to the complete graph, at substantial computational cost and risking a change in the inductive biases of conventional GNNs, or require complex architecture adjustments. As a conservative alternative, we use positional encodings to expand receptive fields to $r$-hop neighborhoods. More specifically, our method augments the input graph with additional nodes/edges and uses positional encodings as node and/or edge features. We thus modify graphs before inputting them to a downstream GNN model, instead of modifying the model itself. This makes our method model-agnostic, i.e., compatible with any of the existing GNN architectures. We also provide examples of positional encodings that are lossless with a one-to-one map between the original and the modified graphs. We demonstrate that extending receptive fields via positional encodings and a virtual fully-connected node significantly improves GNN performance and alleviates over-squashing using small $r$. We obtain improvements on a variety of models and datasets and reach competitive performance using traditional GNNs or graph Transformers.
    Mixed moving average field guided learning for spatio-temporal data. (arXiv:2301.00736v3 [stat.ML] UPDATED)
    Influenced mixed moving average fields are a versatile modeling class for spatio-temporal data. However, their predictive distribution is not generally known. Under this modeling assumption, we define a novel spatio-temporal embedding and a theory-guided machine learning approach that employs a generalized Bayesian algorithm to make ensemble forecasts. We employ Lipschitz predictors and determine fixed-time and any-time PAC Bayesian bounds in the batch learning setting. Performing causal forecast is a highlight of our methodology as its potential application to data with spatial and temporal short and long-range dependence. We then test the performance of our learning methodology by using linear predictors and data sets simulated from a spatio-temporal Ornstein-Uhlenbeck process.
    CBQ: Cross-Block Quantization for Large Language Models. (arXiv:2312.07950v1 [cs.LG])
    Post-training quantization (PTQ) has driven attention to producing efficient large language models (LLMs) with ultra-low costs. Since hand-craft quantization parameters lead to low performance in low-bit quantization, recent methods optimize the quantization parameters through block-wise reconstruction between the floating-point and quantized models. However, these methods suffer from two challenges: accumulated errors from independent one-by-one block quantization and reconstruction difficulties from extreme weight and activation outliers. To address these two challenges, we propose CBQ, a cross-block reconstruction-based PTQ method for LLMs. To reduce error accumulation, we introduce a cross-block dependency with the aid of a homologous reconstruction scheme to build the long-range dependency between adjacent multi-blocks with overlapping. To reduce reconstruction difficulty, we design a coarse-to-fine pre-processing (CFP) to truncate weight outliers and dynamically scale activation outliers before optimization, and an adaptive rounding scheme, called LoRA-Rounding, with two low-rank learnable matrixes to further rectify weight quantization errors. Extensive experiments demonstrate that: (1) CBQ pushes both activation and weight quantization to low-bit settings W4A4, W4A8, and W2A16. (2) CBQ achieves better performance than the existing state-of-the-art methods on various LLMs and benchmark datasets.
    TERM Model: Tensor Ring Mixture Model for Density Estimation. (arXiv:2312.08075v1 [cs.LG])
    Efficient probability density estimation is a core challenge in statistical machine learning. Tensor-based probabilistic graph methods address interpretability and stability concerns encountered in neural network approaches. However, a substantial number of potential tensor permutations can lead to a tensor network with the same structure but varying expressive capabilities. In this paper, we take tensor ring decomposition for density estimator, which significantly reduces the number of permutation candidates while enhancing expressive capability compared with existing used decompositions. Additionally, a mixture model that incorporates multiple permutation candidates with adaptive weights is further designed, resulting in increased expressive flexibility and comprehensiveness. Different from the prevailing directions of tensor network structure/permutation search, our approach provides a new viewpoint inspired by ensemble learning. This approach acknowledges that suboptimal permutations can offer distinctive information besides that of optimal permutations. Experiments show the superiority of the proposed approach in estimating probability density for moderately dimensional datasets and sampling to capture intricate details.
    SLJP: Semantic Extraction based Legal Judgment Prediction. (arXiv:2312.07979v1 [cs.CL])
    Legal Judgment Prediction (LJP) is a judicial assistance system that recommends the legal components such as applicable statues, prison term and penalty term by analyzing the given input case document. Indian legal system is in the need of technical assistance such as artificial intelligence to solve the crores of pending cases in various courts for years and its being increased day to day. Most of the existing Indian models did not adequately concentrate on the semantics embedded in the fact description (FD) that impacts the decision. The proposed semantic extraction based LJP (SLJP) model provides the advantages of pretrained transformers for complex unstructured legal case document understanding and to generate embeddings. The model draws the in-depth semantics of the given FD at multiple levels i.e., chunk and case document level by following the divide and conquer approach. It creates the concise view of the given fact description using the extracted semantics as per the original court case document structure and predicts judgment using attention mechanism. We tested the model performance on two available Indian datasets Indian Legal Documents corpus (ILDC) and Indian Legal Statue Identification (ILSI) and got promising results. Also shown the highest performance and less performance degradation for increased epochs than base models on ILDC dataset.
    Combinatorial Stochastic-Greedy Bandit. (arXiv:2312.08057v1 [cs.LG])
    We propose a novel combinatorial stochastic-greedy bandit (SGB) algorithm for combinatorial multi-armed bandit problems when no extra information other than the joint reward of the selected set of $n$ arms at each time step $t\in [T]$ is observed. SGB adopts an optimized stochastic-explore-then-commit approach and is specifically designed for scenarios with a large set of base arms. Unlike existing methods that explore the entire set of unselected base arms during each selection step, our SGB algorithm samples only an optimized proportion of unselected arms and selects actions from this subset. We prove that our algorithm achieves a $(1-1/e)$-regret bound of $\mathcal{O}(n^{\frac{1}{3}} k^{\frac{2}{3}} T^{\frac{2}{3}} \log(T)^{\frac{2}{3}})$ for monotone stochastic submodular rewards, which outperforms the state-of-the-art in terms of the cardinality constraint $k$. Furthermore, we empirically evaluate the performance of our algorithm in the context of online constrained social influence maximization. Our results demonstrate that our proposed approach consistently outperforms the other algorithms, increasing the performance gap as $k$ grows.
    Saturn: An Optimized Data System for Large Model Deep Learning Workloads. (arXiv:2309.01226v2 [cs.LG] UPDATED)
    Large language models such as GPT-3 & ChatGPT have transformed deep learning (DL), powering applications that have captured the public's imagination. These models are rapidly being adopted across domains for analytics on various modalities, often by finetuning pre-trained base models. Such models need multiple GPUs due to both their size and computational load, driving the development of a bevy of "model parallelism" techniques & tools. Navigating such parallelism choices, however, is a new burden for end users of DL such as data scientists, domain scientists, etc. who may lack the necessary systems knowhow. The need for model selection, which leads to many models to train due to hyper-parameter tuning or layer-wise finetuning, compounds the situation with two more burdens: resource apportioning and scheduling. In this work, we tackle these three burdens for DL users in a unified manner by formalizing them as a joint problem that we call SPASE: Select a Parallelism, Allocate resources, and SchedulE. We propose a new information system architecture to tackle the SPASE problem holistically, representing a key step toward enabling wider adoption of large DL models. We devise an extensible template for existing parallelism schemes and combine it with an automated empirical profiler for runtime estimation. We then formulate SPASE as an MILP. We find that direct use of an MILP-solver is significantly more effective than several baseline heuristics. We optimize the system runtime further with an introspective scheduling approach. We implement all these techniques into a new data system we call Saturn. Experiments with benchmark DL workloads show that Saturn achieves 39-49% lower model selection runtimes than typical current DL practice.
    On the Dynamics Under the Unhinged Loss and Beyond. (arXiv:2312.07841v1 [cs.LG])
    Recent works have studied implicit biases in deep learning, especially the behavior of last-layer features and classifier weights. However, they usually need to simplify the intermediate dynamics under gradient flow or gradient descent due to the intractability of loss functions and model architectures. In this paper, we introduce the unhinged loss, a concise loss function, that offers more mathematical opportunities to analyze the closed-form dynamics while requiring as few simplifications or assumptions as possible. The unhinged loss allows for considering more practical techniques, such as time-vary learning rates and feature normalization. Based on the layer-peeled model that views last-layer features as free optimization variables, we conduct a thorough analysis in the unconstrained, regularized, and spherical constrained cases, as well as the case where the neural tangent kernel remains invariant. To bridge the performance of the unhinged loss to that of Cross-Entropy (CE), we investigate the scenario of fixing classifier weights with a specific structure, (e.g., a simplex equiangular tight frame). Our analysis shows that these dynamics converge exponentially fast to a solution depending on the initialization of features and classifier weights. These theoretical results not only offer valuable insights, including explicit feature regularization and rescaled learning rates for enhancing practical training with the unhinged loss, but also extend their applicability to other loss functions. Finally, we empirically demonstrate these theoretical results and insights through extensive experiments.
    Incremental hierarchical text clustering methods: a review. (arXiv:2312.07769v1 [cs.LG])
    The growth in Internet usage has contributed to a large volume of continuously available data, and has created the need for automatic and efficient organization of the data. In this context, text clustering techniques are significant because they aim to organize documents according to their characteristics. More specifically, hierarchical and incremental clustering techniques can organize dynamic data in a hierarchical form, thus guaranteeing that this organization is updated and its exploration is facilitated. Based on the relevance and contemporary nature of the field, this study aims to analyze various hierarchical and incremental clustering techniques; the main contribution of this research is the organization and comparison of the techniques used by studies published between 2010 and 2018 that aimed to texts documents clustering. We describe the principal concepts related to the challenge and the different characteristics of these published works in order to provide a better understanding of the research in this field.
    OpenVoice: Versatile Instant Voice Cloning. (arXiv:2312.01479v2 [cs.SD] UPDATED)
    We introduce OpenVoice, a versatile voice cloning approach that requires only a short audio clip from the reference speaker to replicate their voice and generate speech in multiple languages. OpenVoice represents a significant advancement in addressing the following open challenges in the field: 1) Flexible Voice Style Control. OpenVoice enables granular control over voice styles, including emotion, accent, rhythm, pauses, and intonation, in addition to replicating the tone color of the reference speaker. The voice styles are not directly copied from and constrained by the style of the reference speaker. Previous approaches lacked the ability to flexibly manipulate voice styles after cloning. 2) Zero-Shot Cross-Lingual Voice Cloning. OpenVoice achieves zero-shot cross-lingual voice cloning for languages not included in the massive-speaker training set. Unlike previous approaches, which typically require extensive massive-speaker multi-lingual (MSML) dataset for all languages, OpenVoice can clone voices into a new language without any massive-speaker training data for that language. OpenVoice is also computationally efficient, costing tens of times less than commercially available APIs that offer even inferior performance. To foster further research in the field, we have made the source code and trained model publicly accessible. We also provide qualitative results in our demo website. Prior to its public release, our internal version of OpenVoice was used tens of millions of times by users worldwide between May and October 2023, serving as the backend of MyShell.
    Minimax-optimal estimation for sparse multi-reference alignment with collision-free signals. (arXiv:2312.07839v1 [math.ST])
    The Multi-Reference Alignment (MRA) problem aims at the recovery of an unknown signal from repeated observations under the latent action of a group of cyclic isometries, in the presence of additive noise of high intensity $\sigma$. It is a more tractable version of the celebrated cryo EM model. In the crucial high noise regime, it is known that its sample complexity scales as $\sigma^6$. Recent investigations have shown that for the practically significant setting of sparse signals, the sample complexity of the maximum likelihood estimator asymptotically scales with the noise level as $\sigma^4$. In this work, we investigate minimax optimality for signal estimation under the MRA model for so-called collision-free signals. In particular, this signal class covers the setting of generic signals of dilute sparsity (wherein the support size $s=O(L^{1/3})$, where $L$ is the ambient dimension. We demonstrate that the minimax optimal rate of estimation in for the sparse MRA problem in this setting is $\sigma^2/\sqrt{n}$, where $n$ is the sample size. In particular, this widely generalizes the sample complexity asymptotics for the restricted MLE in this setting, establishing it as the statistically optimal estimator. Finally, we demonstrate a concentration inequality for the restricted MLE on its deviations from the ground truth.
    Characteristic Circuits. (arXiv:2312.07790v1 [cs.LG])
    In many real-world scenarios, it is crucial to be able to reliably and efficiently reason under uncertainty while capturing complex relationships in data. Probabilistic circuits (PCs), a prominent family of tractable probabilistic models, offer a remedy to this challenge by composing simple, tractable distributions into a high-dimensional probability distribution. However, learning PCs on heterogeneous data is challenging and densities of some parametric distributions are not available in closed form, limiting their potential use. We introduce characteristic circuits (CCs), a family of tractable probabilistic models providing a unified formalization of distributions over heterogeneous data in the spectral domain. The one-to-one relationship between characteristic functions and probability measures enables us to learn high-dimensional distributions on heterogeneous data domains and facilitates efficient probabilistic inference even when no closed-form density function is available. We show that the structure and parameters of CCs can be learned efficiently from the data and find that CCs outperform state-of-the-art density estimators for heterogeneous data domains on common benchmark data sets.
    MedYOLO: A Medical Image Object Detection Framework. (arXiv:2312.07729v1 [eess.IV])
    Artificial intelligence-enhanced identification of organs, lesions, and other structures in medical imaging is typically done using convolutional neural networks (CNNs) designed to make voxel-accurate segmentations of the region of interest. However, the labels required to train these CNNs are time-consuming to generate and require attention from subject matter experts to ensure quality. For tasks where voxel-level precision is not required, object detection models offer a viable alternative that can reduce annotation effort. Despite this potential application, there are few options for general purpose object detection frameworks available for 3-D medical imaging. We report on MedYOLO, a 3-D object detection framework using the one-shot detection method of the YOLO family of models and designed for use with medical imaging. We tested this model on four different datasets: BRaTS, LIDC, an abdominal organ Computed Tomography (CT) dataset, and an ECG-gated heart CT dataset. We found our models achieve high performance on commonly present medium and large-sized structures such as the heart, liver, and pancreas even without hyperparameter tuning. However, the models struggle with very small or rarely present structures.
    Robust and Performance Incentivizing Algorithms for Multi-Armed Bandits with Strategic Agents. (arXiv:2312.07929v1 [cs.GT])
    We consider a variant of the stochastic multi-armed bandit problem. Specifically, the arms are strategic agents who can improve their rewards or absorb them. The utility of an agent increases if she is pulled more or absorbs more of her rewards but decreases if she spends more effort improving her rewards. Agents have heterogeneous properties, specifically having different means and able to improve their rewards up to different levels. Further, a non-empty subset of agents are ''honest'' and in the worst case always give their rewards without absorbing any part. The principal wishes to obtain a high revenue (cumulative reward) by designing a mechanism that incentives top level performance at equilibrium. At the same time, the principal wishes to be robust and obtain revenue at least at the level of the honest agent with the highest mean in case of non-equilibrium behaviour. We identify a class of MAB algorithms which we call performance incentivizing which satisfy a collection of properties and show that they lead to mechanisms that incentivize top level performance at equilibrium and are robust under any strategy profile. Interestingly, we show that UCB is an example of such a MAB algorithm. Further, in the case where the top performance level is unknown we show that combining second price auction ideas with performance incentivizing algorithms achieves performance at least at the second top level while also being robust.
    Accelerate Multi-Agent Reinforcement Learning in Zero-Sum Games with Subgame Curriculum Learning. (arXiv:2310.04796v2 [cs.LG] UPDATED)
    Learning Nash equilibrium (NE) in complex zero-sum games with multi-agent reinforcement learning (MARL) can be extremely computationally expensive. Curriculum learning is an effective way to accelerate learning, but an under-explored dimension for generating a curriculum is the difficulty-to-learn of the subgames -- games induced by starting from a specific state. In this work, we present a novel subgame curriculum learning framework for zero-sum games. It adopts an adaptive initial state distribution by resetting agents to some previously visited states where they can quickly learn to improve performance. Building upon this framework, we derive a subgame selection metric that approximates the squared distance to NE values and further adopt a particle-based state sampler for subgame generation. Integrating these techniques leads to our new algorithm, Subgame Automatic Curriculum Learning (SACL), which is a realization of the subgame curriculum learning framework. SACL can be combined with any MARL algorithm such as MAPPO. Experiments in the particle-world environment and Google Research Football environment show SACL produces much stronger policies than baselines. In the challenging hide-and-seek quadrant environment, SACL produces all four emergent stages and uses only half the samples of MAPPO with self-play. The project website is at https://sites.google.com/view/sacl-rl.
    XLB: A Differentiable Massively Parallel Lattice Boltzmann Library in Python. (arXiv:2311.16080v2 [physics.comp-ph] UPDATED)
    The lattice Boltzmann method (LBM) has emerged as a prominent technique for solving fluid dynamics problems due to its algorithmic potential for computational scalability. We introduce XLB library, a Python-based differentiable LBM library based on the JAX platform. The architecture of XLB is predicated upon ensuring accessibility, extensibility, and computational performance, enabling scaling effectively across CPU, TPU, multi-GPU, and distributed multi-GPU or TPU systems. The library can be readily augmented with novel boundary conditions, collision models, or multi-physics simulation capabilities. XLB's differentiability and data structure is compatible with the extensive JAX-based machine learning ecosystem, enabling it to address physics-based machine learning, optimization, and inverse problems. XLB has been successfully scaled to handle simulations with billions of cells, achieving giga-scale lattice updates per second. XLB is released under the permissive Apache-2.0 license and is available on GitHub at https://github.com/Autodesk/XLB.
    IDKM: Memory Efficient Neural Network Quantization via Implicit, Differentiable $k$-Means. (arXiv:2312.07759v1 [cs.LG])
    Compressing large neural networks with minimal performance loss is crucial to enabling their deployment on edge devices. (Cho et al., 2022) proposed a weight quantization method that uses an attention-based clustering algorithm called differentiable $k$-means (DKM). Despite achieving state-of-the-art results, DKM's performance is constrained by its heavy memory dependency. We propose an implicit, differentiable $k$-means algorithm (IDKM), which eliminates the major memory restriction of DKM. Let $t$ be the number of $k$-means iterations, $m$ be the number of weight-vectors, and $b$ be the number of bits per cluster address. IDKM reduces the overall memory complexity of a single $k$-means layer from $\mathcal{O}(t \cdot m \cdot 2^b)$ to $\mathcal{O}( m \cdot 2^b)$. We also introduce a variant, IDKM with Jacobian-Free-Backpropagation (IDKM-JFB), for which the time complexity of the gradient calculation is independent of $t$ as well. We provide a proof of concept of our methods by showing that, under the same settings, IDKM achieves comparable performance to DKM with less compute time and less memory. We also use IDKM and IDKM-JFB to quantize a large neural network, Resnet18, on hardware where DKM cannot train at all.
    Meta-learning to Calibrate Gaussian Processes with Deep Kernels for Regression Uncertainty Estimation. (arXiv:2312.07952v1 [stat.ML])
    Although Gaussian processes (GPs) with deep kernels have been successfully used for meta-learning in regression tasks, its uncertainty estimation performance can be poor. We propose a meta-learning method for calibrating deep kernel GPs for improving regression uncertainty estimation performance with a limited number of training data. The proposed method meta-learns how to calibrate uncertainty using data from various tasks by minimizing the test expected calibration error, and uses the knowledge for unseen tasks. We design our model such that the adaptation and calibration for each task can be performed without iterative procedures, which enables effective meta-learning. In particular, a task-specific uncalibrated output distribution is modeled by a GP with a task-shared encoder network, and it is transformed to a calibrated one using a cumulative density function of a task-specific Gaussian mixture model (GMM). By integrating the GP and GMM into our neural network-based model, we can meta-learn model parameters in an end-to-end fashion. Our experiments demonstrate that the proposed method improves uncertainty estimation performance while keeping high regression performance compared with the existing methods using real-world datasets in few-shot settings.
    Adapting Self-Supervised Representations to Multi-Domain Setups. (arXiv:2309.03999v2 [cs.CV] UPDATED)
    Current state-of-the-art self-supervised approaches, are effective when trained on individual domains but show limited generalization on unseen domains. We observe that these models poorly generalize even when trained on a mixture of domains, making them unsuitable to be deployed under diverse real-world setups. We therefore propose a general-purpose, lightweight Domain Disentanglement Module (DDM) that can be plugged into any self-supervised encoder to effectively perform representation learning on multiple, diverse domains with or without shared classes. During pre-training according to a self-supervised loss, DDM enforces a disentanglement in the representation space by splitting it into a domain-variant and a domain-invariant portion. When domain labels are not available, DDM uses a robust clustering approach to discover pseudo-domains. We show that pre-training with DDM can show up to 3.5% improvement in linear probing accuracy on state-of-the-art self-supervised models including SimCLR, MoCo, BYOL, DINO, SimSiam and Barlow Twins on multi-domain benchmarks including PACS, DomainNet and WILDS. Models trained with DDM show significantly improved generalization (7.4%) to unseen domains compared to baselines. Therefore, DDM can efficiently adapt self-supervised encoders to provide high-quality, generalizable representations for diverse multi-domain data.
    Optimizing accuracy and diversity: a multi-task approach to forecast combinations. (arXiv:2310.20545v2 [cs.LG] UPDATED)
    Forecast combination involves using multiple forecasts to create a single, more accurate prediction. Recently, feature-based forecasting has been employed to either select the most appropriate forecasting models or to optimize the weights of their combination. In this paper, we present a multi-task optimization paradigm that focuses on solving both problems simultaneously and enriches current operational research approaches to forecasting. In essence, it incorporates an additional learning and optimization task into the standard feature-based forecasting approach, focusing on the identification of an optimal set of forecasting methods. During the training phase, an optimization model with linear constraints and quadratic objective function is employed to identify accurate and diverse methods for each time series. Moreover, within the training phase, a neural network is used to learn the behavior of that optimization model. Once training is completed the candidate set of methods is identified using the network. The proposed approach elicits the essential role of diversity in feature-based forecasting and highlights the interplay between model combination and model selection when optimizing forecasting ensembles. Experimental results on a large set of series from the M4 competition dataset show that our proposal enhances point forecast accuracy compared to state-of-the-art methods.
    PMET: Precise Model Editing in a Transformer. (arXiv:2308.08742v3 [cs.CL] UPDATED)
    Model editing techniques modify a minor proportion of knowledge in Large Language Models (LLMs) at a relatively low cost, which have demonstrated notable success. Existing methods assume Transformer Layer (TL) hidden states are values of key-value memories of the Feed-Forward Network (FFN). They usually optimize the TL hidden states to memorize target knowledge and use it to update the weights of the FFN in LLMs. However, the information flow of TL hidden states comes from three parts: Multi-Head Self-Attention (MHSA), FFN, and residual connections. Existing methods neglect the fact that the TL hidden states contains information not specifically required for FFN. Consequently, the performance of model editing decreases. To achieve more precise model editing, we analyze hidden states of MHSA and FFN, finding that MHSA encodes certain general knowledge extraction patterns. This implies that MHSA weights do not require updating when new knowledge is introduced. Based on above findings, we introduce PMET, which simultaneously optimizes Transformer Component (TC, namely MHSA and FFN) hidden states, while only using the optimized TC hidden states of FFN to precisely update FFN weights. Our experiments demonstrate that PMET exhibits state-of-the-art performance on both the COUNTERFACT and zsRE datasets. Our ablation experiments substantiate the effectiveness of our enhancements, further reinforcing the finding that the MHSA encodes certain general knowledge extraction patterns and indicating its storage of a small amount of factual knowledge. Our code is available at https://github.com/xpq-tech/PMET.
    LLQL: Logistic Likelihood Q-Learning for Reinforcement Learning. (arXiv:2307.02345v4 [cs.LG] UPDATED)
    Modern reinforcement learning (RL) can be categorized into online and offline variants. As a pivotal aspect of both online and offline RL, current research on the Bellman equation revolves primarily around optimization techniques and performance enhancement rather than exploring the inherent structural properties of the Bellman error, such as its distribution characteristics. This study investigates the distribution of the Bellman approximation error through iterative exploration of the Bellman equation with the observation that the Bellman error approximately follows the Logistic distribution. Based on this, we proposed the utilization of the Logistic maximum likelihood function (LLoss) as an alternative to the commonly used mean squared error (MSELoss) that assumes a Normal distribution for Bellman errors. We validated the hypotheses through extensive numerical experiments across diverse online and offline environments. In particular, we applied the Logistic correction to loss functions in various RL baseline methods and observed that the results with LLoss consistently outperformed the MSE counterparts. We also conducted the Kolmogorov-Smirnov tests to confirm the reliability of the Logistic distribution. Moreover, our theory connects the Bellman error to the proportional reward scaling phenomenon by providing a distribution-based analysis. Furthermore, we applied the bias-variance decomposition for sampling from the Logistic distribution. The theoretical and empirical insights of this study lay a valuable foundation for future investigations and enhancements centered on the distribution of Bellman error.
    Generalized Graph Prompt: Toward a Unification of Pre-Training and Downstream Tasks on Graphs. (arXiv:2311.15317v2 [cs.LG] UPDATED)
    Graph neural networks have emerged as a powerful tool for graph representation learning, but their performance heavily relies on abundant task-specific supervision. To reduce labeling requirement, the "pre-train, prompt" paradigms have become increasingly common. However, existing study of prompting on graphs is limited, lacking a universal treatment to appeal to different downstream tasks. In this paper, we propose GraphPrompt, a novel pre-training and prompting framework on graphs. GraphPrompt not only unifies pre-training and downstream tasks into a common task template but also employs a learnable prompt to assist a downstream task in locating the most relevant knowledge from the pre-trained model in a task-specific manner. To further enhance GraphPrompt in these two stages, we extend it into GraphPrompt+ with two major enhancements. First, we generalize several popular graph pre-training tasks beyond simple link prediction to broaden the compatibility with our task template. Second, we propose a more generalized prompt design that incorporates a series of prompt vectors within every layer of the pre-trained graph encoder, in order to capitalize on the hierarchical information across different layers beyond just the readout layer. Finally, we conduct extensive experiments on five public datasets to evaluate and analyze GraphPrompt and GraphPrompt+.
    Venn: Resource Management Across Federated Learning Jobs. (arXiv:2312.08298v1 [cs.DC])
    In recent years, federated learning (FL) has emerged as a promising approach for machine learning (ML) and data science across distributed edge devices. With the increasing popularity of FL, resource contention between multiple FL jobs training on the same device population is increasing as well. Scheduling edge resources among multiple FL jobs is different from GPU scheduling for cloud ML because of the ephemeral nature and planetary scale of participating devices as well as the overlapping resource requirements of diverse FL jobs. Existing resource managers for FL jobs opt for random assignment of devices to FL jobs for simplicity and scalability, which leads to poor performance. In this paper, we present Venn, an FL resource manager, that efficiently schedules ephemeral, heterogeneous devices among many FL jobs, with the goal of reducing their average job completion time (JCT). Venn formulates the Intersection Resource Scheduling (IRS) problem to identify complex resource contention among multiple FL jobs. Then, Venn proposes a contention-aware scheduling heuristic to minimize the average scheduling delay. Furthermore, it proposes a resource-aware device-to-job matching heuristic that focuses on optimizing response collection time by mitigating stragglers. Our evaluation shows that, compared to the state-of-the-art FL resource managers, Venn improves the average JCT by up to 1.88X.
    The Choice of Noninformative Priors for Thompson Sampling in Multiparameter Bandit Models. (arXiv:2302.14407v2 [cs.LG] UPDATED)
    Thompson sampling (TS) has been known for its outstanding empirical performance supported by theoretical guarantees across various reward models in the classical stochastic multi-armed bandit problems. Nonetheless, its optimality is often restricted to specific priors due to the common observation that TS is fairly insensitive to the choice of the prior when it comes to asymptotic regret bounds. However, when the model contains multiple parameters, the optimality of TS highly depends on the choice of priors, which casts doubt on the generalizability of previous findings to other models. To address this gap, this study explores the impact of selecting noninformative priors, offering insights into the performance of TS when dealing with new models that lack theoretical understanding. We first extend the regret analysis of TS to the model of uniform distributions with unknown supports, which would be the simplest non-regular model. Our findings reveal that changing noninformative priors can significantly affect the expected regret, aligning with previously known results in other multiparameter bandit models. Although the uniform prior is shown to be optimal, we highlight the inherent limitation of its optimality, which is limited to specific parameterizations and emphasizes the significance of the invariance property of priors. In light of this limitation, we propose a slightly modified TS-based policy, called TS with Truncation (TS-T), which can achieve the asymptotic optimality for the Gaussian models and the uniform models by using the reference prior and the Jeffreys prior that are invariant under one-to-one reparameterizations. This policy provides an alternative approach to achieving optimality by employing fine-tuned truncation, which would be much easier than hunting for optimal priors in practice.
    A Hitchhiker's Guide to Geometric GNNs for 3D Atomic Systems. (arXiv:2312.07511v1 [cs.LG] CROSS LISTED)
    Recent advances in computational modelling of atomic systems, spanning molecules, proteins, and materials, represent them as geometric graphs with atoms embedded as nodes in 3D Euclidean space. In these graphs, the geometric attributes transform according to the inherent physical symmetries of 3D atomic systems, including rotations and translations in Euclidean space, as well as node permutations. In recent years, Geometric Graph Neural Networks have emerged as the preferred machine learning architecture powering applications ranging from protein structure prediction to molecular simulations and material generation. Their specificity lies in the inductive biases they leverage -- such as physical symmetries and chemical properties -- to learn informative representations of these geometric graphs. In this opinionated paper, we provide a comprehensive and self-contained overview of the field of Geometric GNNs for 3D atomic systems. We cover fundamental background material and introduce a pedagogical taxonomy of Geometric GNN architectures:(1) invariant networks, (2) equivariant networks in Cartesian basis, (3) equivariant networks in spherical basis, and (4) unconstrained networks. Additionally, we outline key datasets and application areas and suggest future research directions. The objective of this work is to present a structured perspective on the field, making it accessible to newcomers and aiding practitioners in gaining an intuition for its mathematical abstractions.
    PromptBench: A Unified Library for Evaluation of Large Language Models. (arXiv:2312.07910v1 [cs.AI])
    The evaluation of large language models (LLMs) is crucial to assess their performance and mitigate potential security risks. In this paper, we introduce PromptBench, a unified library to evaluate LLMs. It consists of several key components that are easily used and extended by researchers: prompt construction, prompt engineering, dataset and model loading, adversarial prompt attack, dynamic evaluation protocols, and analysis tools. PromptBench is designed to be an open, general, and flexible codebase for research purposes that can facilitate original study in creating new benchmarks, deploying downstream applications, and designing new evaluation protocols. The code is available at: https://github.com/microsoft/promptbench and will be continuously supported.
    EquiReact: An equivariant neural network for chemical reactions. (arXiv:2312.08307v1 [physics.chem-ph])
    Equivariant neural networks have considerably improved the accuracy and data-efficiency of predictions of molecular properties. Building on this success, we introduce EquiReact, an equivariant neural network to infer properties of chemical reactions, built from three-dimensional structures of reactants and products. We illustrate its competitive performance on the prediction of activation barriers on the GDB7-22-TS, Cyclo-23-TS and Proparg-21-TS datasets with different regimes according to the inclusion of atom-mapping information. We show that, compared to state-of-the-art models for reaction property prediction, EquiReact offers: (i) a flexible model with reduced sensitivity between atom-mapping regimes, (ii) better extrapolation capabilities to unseen chemistries, (iii) impressive prediction errors for datasets exhibiting subtle variations in three-dimensional geometries of reactants/products, (iv) reduced sensitivity to geometry quality and (iv) excellent data efficiency.
    Measuring Self-Supervised Representation Quality for Downstream Classification using Discriminative Features. (arXiv:2203.01881v6 [cs.LG] UPDATED)
    Self-supervised learning (SSL) has shown impressive results in downstream classification tasks. However, there is limited work in understanding their failure modes and interpreting their learned representations. In this paper, we study the representation space of state-of-the-art self-supervised models including SimCLR, SwaV, MoCo, BYOL, DINO, SimSiam, VICReg and Barlow Twins. Without the use of class label information, we discover discriminative features that correspond to unique physical attributes in images, present mostly in correctly-classified representations. Using these features, we can compress the representation space by up to 40% without significantly affecting linear classification performance. We then propose Self-Supervised Representation Quality Score (or Q-Score), an unsupervised score that can reliably predict if a given sample is likely to be mis-classified during linear evaluation, achieving AUPRC of 91.45 on ImageNet-100 and 78.78 on ImageNet-1K. Q-Score can also be used as a regularization term on pre-trained encoders to remedy low-quality representations. Fine-tuning with Q-Score regularization can boost the linear probing accuracy of SSL models by up to 5.8% on ImageNet-100 and 3.7% on ImageNet-1K compared to their baselines. Finally, using gradient heatmaps and Salient ImageNet masks, we define a metric to quantify the interpretability of each representation. We show that discriminative features are strongly correlated to core attributes and, enhancing these features through Q-score regularization makes SSL representations more interpretable.
    Graph Harmony: Denoising and Nuclear-Norm Wasserstein Adaptation for Enhanced Domain Transfer in Graph-Structured Data. (arXiv:2301.12361v2 [cs.LG] UPDATED)
    Graph-structured data can be found in numerous domains, yet the scarcity of labeled instances hinders its effective utilization of deep learning in many scenarios. Traditional unsupervised domain adaptation (UDA) strategies for graphs primarily hinge on adversarial learning and pseudo-labeling. These approaches fail to effectively leverage graph discriminative features, leading to class mismatching and unreliable label quality. To navigate these obstacles, we develop the Denoising and Nuclear-Norm Wasserstein Adaptation Network (DNAN). DNAN employs the Nuclear-norm Wasserstein discrepancy (NWD), which can simultaneously achieve domain alignment and class distinguishment. DANA also integrates a denoising mechanism via a variational graph autoencoder that mitigates data noise. This denoising mechanism helps capture essential features of both source and target domains, improving the robustness of the domain adaptation process. Our comprehensive experiments demonstrate that DNAN outperforms state-of-the-art methods on standard UDA benchmarks for graph classification.
    Toward Discretization-Consistent Closure Schemes for Large Eddy Simulation Using Reinforcement Learning. (arXiv:2309.06260v2 [physics.flu-dyn] UPDATED)
    This study proposes a novel method for developing discretization-consistent closure schemes for implicitly filtered Large Eddy Simulation (LES). Here, the induced filter kernel, and thus the closure terms, are determined by the properties of the grid and the discretization operator, leading to additional computational subgrid terms that are generally unknown in a priori analysis. In this work, the task of adapting the coefficients of LES closure models is thus framed as a Markov decision process and solved in an a posteriori manner with Reinforcement Learning (RL). This optimization framework is applied to both explicit and implicit closure models. The explicit model is based on an element-local eddy viscosity model. The optimized model is found to adapt its induced viscosity within discontinuous Galerkin (DG) methods to homogenize the dissipation within an element by adding more viscosity near its center. For the implicit modeling, RL is applied to identify an optimal blending strategy for a hybrid DG and Finite Volume (FV) scheme. The resulting optimized discretization yields more accurate results in LES than either the pure DG or FV method and renders itself as a viable modeling ansatz that could initiate a novel class of high-order schemes for compressible turbulence by combining turbulence modeling with shock capturing in a single framework. All newly derived models achieve accurate results that either match or outperform traditional models for different discretizations and resolutions. Overall, the results demonstrate that the proposed RL optimization can provide discretization-consistent closures that could reduce the uncertainty in implicitly filtered LES.
    Norm Tweaking: High-performance Low-bit Quantization of Large Language Models. (arXiv:2309.02784v2 [cs.LG] UPDATED)
    As the size of large language models (LLMs) continues to grow, model compression without sacrificing accuracy has become a crucial challenge for deployment. While some quantization methods, such as GPTQ, have made progress in achieving acceptable 4-bit weight-only quantization, attempts at lower-bit quantization often result in severe performance degradation. In this paper, we introduce a technique called norm tweaking, which can be used as a plugin in current PTQ methods to achieve high precision while being cost-efficient. Our approach is inspired by the observation that rectifying the quantized activation distribution to match its float counterpart can readily restore accuracy for LLMs. To achieve this, we carefully design a tweaking strategy that includes calibration data generation and channel-wise distance constraint to update the weights of normalization layers for better generalization. We conduct extensive experiments on various datasets using several open-sourced LLMs. Our method demonstrates significant improvements in both weight-only quantization and joint quantization of weights and activations, surpassing existing PTQ methods. On GLM-130B and OPT-66B, our method even achieves the same level of accuracy at 2-bit quantization as their float ones. Our simple and effective approach makes it more practical for real-world applications.
    Improving search relevance of Azure Cognitive Search by Bayesian optimization. (arXiv:2312.08021v1 [cs.IR])
    Azure Cognitive Search (ACS) has emerged as a major contender in "Search as a Service" cloud products in recent years. However, one of the major challenges for ACS users is to improve the relevance of the search results for their specific usecases. In this paper, we propose a novel method to find the optimal ACS configuration that maximizes search relevance for a specific usecase (product search, document search...) The proposed solution improves key online marketplace metrics such as click through rates (CTR) by formulating the search relevance problem as hyperparameter tuning. We have observed significant improvements in real-world search call to action (CTA) rate in multiple marketplaces by introducing optimized weights generated from the proposed approach.
    Breaking the Silence: the Threats of Using LLMs in Software Engineering. (arXiv:2312.08055v1 [cs.SE])
    Large Language Models (LLMs) have gained considerable traction within the Software Engineering (SE) community, impacting various SE tasks from code completion to test generation, from program repair to code summarization. Despite their promise, researchers must still be careful as numerous intricate factors can influence the outcomes of experiments involving LLMs. This paper initiates an open discussion on potential threats to the validity of LLM-based research including issues such as closed-source models, possible data leakage between LLM training data and research evaluation, and the reproducibility of LLM-based findings. In response, this paper proposes a set of guidelines tailored for SE researchers and Language Model (LM) providers to mitigate these concerns. The implications of the guidelines are illustrated using existing good practices followed by LLM providers and a practical example for SE researchers in the context of test case generation.
    Leveraging sparse and shared feature activations for disentangled representation learning. (arXiv:2304.07939v3 [cs.LG] UPDATED)
    Recovering the latent factors of variation of high dimensional data has so far focused on simple synthetic settings. Mostly building on unsupervised and weakly-supervised objectives, prior work missed out on the positive implications for representation learning on real world data. In this work, we propose to leverage knowledge extracted from a diversified set of supervised tasks to learn a common disentangled representation. Assuming each supervised task only depends on an unknown subset of the factors of variation, we disentangle the feature space of a supervised multi-task model, with features activating sparsely across different tasks and information being shared as appropriate. Importantly, we never directly observe the factors of variations but establish that access to multiple tasks is sufficient for identifiability under sufficiency and minimality assumptions. We validate our approach on six real world distribution shift benchmarks, and different data modalities (images, text), demonstrating how disentangled representations can be transferred to real settings.
    Efficient Representation of the Activation Space in Deep Neural Networks. (arXiv:2312.08143v1 [cs.LG])
    The representations of the activation space of deep neural networks (DNNs) are widely utilized for tasks like natural language processing, anomaly detection and speech recognition. Due to the diverse nature of these tasks and the large size of DNNs, an efficient and task-independent representation of activations becomes crucial. Empirical p-values have been used to quantify the relative strength of an observed node activation compared to activations created by already-known inputs. Nonetheless, keeping raw data for these calculations increases memory resource consumption and raises privacy concerns. To this end, we propose a model-agnostic framework for creating representations of activations in DNNs using node-specific histograms to compute p-values of observed activations without retaining already-known inputs. Our proposed approach demonstrates promising potential when validated with multiple network architectures across various downstream tasks and compared with the kernel density estimates and brute-force empirical baselines. In addition, the framework reduces memory usage by 30% with up to 4 times faster p-value computing time while maintaining state of-the-art detection power in downstream tasks such as the detection of adversarial attacks and synthesized content. Moreover, as we do not persist raw data at inference time, we could potentially reduce susceptibility to attacks and privacy issues.
    Invariant Graph Transformer. (arXiv:2312.07859v1 [cs.LG])
    Rationale discovery is defined as finding a subset of the input data that maximally supports the prediction of downstream tasks. In graph machine learning context, graph rationale is defined to locate the critical subgraph in the given graph topology, which fundamentally determines the prediction results. In contrast to the rationale subgraph, the remaining subgraph is named the environment subgraph. Graph rationalization can enhance the model performance as the mapping between the graph rationale and prediction label is viewed as invariant, by assumption. To ensure the discriminative power of the extracted rationale subgraphs, a key technique named "intervention" is applied. The core idea of intervention is that given any changing environment subgraphs, the semantics from the rationale subgraph is invariant, which guarantees the correct prediction result. However, most, if not all, of the existing rationalization works on graph data develop their intervention strategies on the graph level, which is coarse-grained. In this paper, we propose well-tailored intervention strategies on graph data. Our idea is driven by the development of Transformer models, whose self-attention module provides rich interactions between input nodes. Based on the self-attention module, our proposed invariant graph Transformer (IGT) can achieve fine-grained, more specifically, node-level and virtual node-level intervention. Our comprehensive experiments involve 7 real-world datasets, and the proposed IGT shows significant performance advantages compared to 13 baseline methods.
    On the Stability of Iterative Retraining of Generative Models on their own Data. (arXiv:2310.00429v3 [cs.LG] UPDATED)
    Deep generative models have made tremendous progress in modeling complex data, often exhibiting generation quality that surpasses a typical human's ability to discern the authenticity of samples. Undeniably, a key driver of this success is enabled by the massive amounts of web-scale data consumed by these models. Due to these models' striking performance and ease of availability, the web will inevitably be increasingly populated with synthetic content. Such a fact directly implies that future iterations of generative models must contend with the reality that their training is curated from both clean data and artificially generated data from past models. In this paper, we develop a framework to rigorously study the impact of training generative models on mixed datasets (of real and synthetic data) on their stability. We first prove the stability of iterative training under the condition that the initial generative models approximate the data distribution well enough and the proportion of clean training data (w.r.t. synthetic data) is large enough. We empirically validate our theory on both synthetic and natural images by iteratively training normalizing flows and state-of-the-art diffusion models on CIFAR10 and FFHQ.
    Curriculum-Enhanced Residual Soft An-Isotropic Normalization for Over-smoothness in Deep GNNs. (arXiv:2312.08221v1 [cs.LG])
    Despite Graph neural networks' significant performance gain over many classic techniques in various graph-related downstream tasks, their successes are restricted in shallow models due to over-smoothness and the difficulties of optimizations among many other issues. In this paper, to alleviate the over-smoothing issue, we propose a soft graph normalization method to preserve the diversities of node embeddings and prevent indiscrimination due to possible over-closeness. Combined with residual connections, we analyze the reason why the method can effectively capture the knowledge in both input graph structures and node features even with deep networks. Additionally, inspired by Curriculum Learning that learns easy examples before the hard ones, we propose a novel label-smoothing-based learning framework to enhance the optimization of deep GNNs, which iteratively smooths labels in an auxiliary graph and constructs many gradual non-smooth tasks for extracting increasingly complex knowledge and gradually discriminating nodes from coarse to fine. The method arguably reduces the risk of overfitting and generalizes better results. Finally, extensive experiments are carried out to demonstrate the effectiveness and potential of the proposed model and learning framework through comparison with twelve existing baselines including the state-of-the-art methods on twelve real-world node classification benchmarks.
    Contrast and Clustering: Learning Neighborhood Pair Representation for Source-free Domain Adaptation. (arXiv:2301.13428v4 [cs.CV] UPDATED)
    Unsupervised domain adaptation uses source data from different distributions to solve the problem of classifying data from unlabeled target domains. However, conventional methods require access to source data, which often raise concerns about data privacy. In this paper, we consider a more practical but challenging setting where the source domain data is unavailable and the target domain data is unlabeled. Specifically, we address the domain discrepancy problem from the perspective of contrastive learning. The key idea of our work is to learn a domain-invariant feature by 1) performing clustering directly in the original feature space with nearest neighbors; 2) constructing truly hard negative pairs by extended neighbors without introducing additional computational complexity; and 3) combining noise-contrastive estimation theory to gain computational advantage. We conduct careful ablation studies and extensive experiments on three common benchmarks: VisDA, Office-Home, and Office-31. The results demonstrate the superiority of our methods compared with other state-of-the-art works.
    Efficient Training of Energy-Based Models Using Jarzynski Equality. (arXiv:2305.19414v2 [cs.LG] UPDATED)
    Energy-based models (EBMs) are generative models inspired by statistical physics with a wide range of applications in unsupervised learning. Their performance is best measured by the cross-entropy (CE) of the model distribution relative to the data distribution. Using the CE as the objective for training is however challenging because the computation of its gradient with respect to the model parameters requires sampling the model distribution. Here we show how results for nonequilibrium thermodynamics based on Jarzynski equality together with tools from sequential Monte-Carlo sampling can be used to perform this computation efficiently and avoid the uncontrolled approximations made using the standard contrastive divergence algorithm. Specifically, we introduce a modification of the unadjusted Langevin algorithm (ULA) in which each walker acquires a weight that enables the estimation of the gradient of the cross-entropy at any step during GD, thereby bypassing sampling biases induced by slow mixing of ULA. We illustrate these results with numerical experiments on Gaussian mixture distributions as well as the MNIST dataset. We show that the proposed approach outperforms methods based on the contrastive divergence algorithm in all the considered situations.
    Traffic Signal Control Using Lightweight Transformers: An Offline-to-Online RL Approach. (arXiv:2312.07795v1 [cs.LG])
    Efficient traffic signal control is critical for reducing traffic congestion and improving overall transportation efficiency. The dynamic nature of traffic flow has prompted researchers to explore Reinforcement Learning (RL) for traffic signal control (TSC). Compared with traditional methods, RL-based solutions have shown preferable performance. However, the application of RL-based traffic signal controllers in the real world is limited by the low sample efficiency and high computational requirements of these solutions. In this work, we propose DTLight, a simple yet powerful lightweight Decision Transformer-based TSC method that can learn policy from easily accessible offline datasets. DTLight novelly leverages knowledge distillation to learn a lightweight controller from a well-trained larger teacher model to reduce implementation computation. Additionally, it integrates adapter modules to mitigate the expenses associated with fine-tuning, which makes DTLight practical for online adaptation with minimal computation and only a few fine-tuning steps during real deployment. Moreover, DTLight is further enhanced to be more applicable to real-world TSC problems. Extensive experiments on synthetic and real-world scenarios show that DTLight pre-trained purely on offline datasets can outperform state-of-the-art online RL-based methods in most scenarios. Experiment results also show that online fine-tuning further improves the performance of DTLight by up to 42.6% over the best online RL baseline methods. In this work, we also introduce Datasets specifically designed for TSC with offline RL (referred to as DTRL). Our datasets and code are publicly available.
    Ultra Low Complexity Deep Learning Based Noise Suppression. (arXiv:2312.08132v1 [eess.AS])
    This paper introduces an innovative method for reducing the computational complexity of deep neural networks in real-time speech enhancement on resource-constrained devices. The proposed approach utilizes a two-stage processing framework, employing channelwise feature reorientation to reduce the computational load of convolutional operations. By combining this with a modified power law compression technique for enhanced perceptual quality, this approach achieves noise suppression performance comparable to state-of-the-art methods with significantly less computational requirements. Notably, our algorithm exhibits 3 to 4 times less computational complexity and memory usage than prior state-of-the-art approaches.
    Robust MRI Reconstruction by Smoothed Unrolling (SMUG). (arXiv:2312.07784v1 [eess.IV])
    As the popularity of deep learning (DL) in the field of magnetic resonance imaging (MRI) continues to rise, recent research has indicated that DL-based MRI reconstruction models might be excessively sensitive to minor input disturbances, including worst-case additive perturbations. This sensitivity often leads to unstable, aliased images. This raises the question of how to devise DL techniques for MRI reconstruction that can be robust to train-test variations. To address this problem, we propose a novel image reconstruction framework, termed Smoothed Unrolling (SMUG), which advances a deep unrolling-based MRI reconstruction model using a randomized smoothing (RS)-based robust learning approach. RS, which improves the tolerance of a model against input noises, has been widely used in the design of adversarial defense approaches for image classification tasks. Yet, we find that the conventional design that applies RS to the entire DL-based MRI model is ineffective. In this paper, we show that SMUG and its variants address the above issue by customizing the RS process based on the unrolling architecture of a DL-based MRI reconstruction model. Compared to the vanilla RS approach, we show that SMUG improves the robustness of MRI reconstruction with respect to a diverse set of instability sources, including worst-case and random noise perturbations to input measurements, varying measurement sampling rates, and different numbers of unrolling steps. Furthermore, we theoretically analyze the robustness of our method in the presence of perturbations.
    Hybrid Sample Synthesis-based Debiasing of Classifier in Limited Data Setting. (arXiv:2312.08288v1 [cs.CV])
    Deep learning models are known to suffer from the problem of bias, and researchers have been exploring methods to address this issue. However, most of these methods require prior knowledge of the bias and are not always practical. In this paper, we focus on a more practical setting with no prior information about the bias. Generally, in this setting, there are a large number of bias-aligned samples that cause the model to produce biased predictions and a few bias-conflicting samples that do not conform to the bias. If the training data is limited, the influence of the bias-aligned samples may become even stronger on the model predictions, and we experimentally demonstrate that existing debiasing techniques suffer severely in such cases. In this paper, we examine the effects of unknown bias in small dataset regimes and present a novel approach to mitigate this issue. The proposed approach directly addresses the issue of the extremely low occurrence of bias-conflicting samples in limited data settings through the synthesis of hybrid samples that can be used to reduce the effect of bias. We perform extensive experiments on several benchmark datasets and experimentally demonstrate the effectiveness of our proposed approach in addressing any unknown bias in the presence of limited data. Specifically, our approach outperforms the vanilla, LfF, LDD, and DebiAN debiasing methods by absolute margins of 10.39%, 9.08%, 8.07%, and 9.67% when only 10% of the Corrupted CIFAR-10 Type 1 dataset is available with a bias-conflicting sample ratio of 0.05.
    Transferable Adversarial Robustness for Categorical Data via Universal Robust Embeddings. (arXiv:2306.04064v2 [cs.LG] UPDATED)
    Research on adversarial robustness is primarily focused on image and text data. Yet, many scenarios in which lack of robustness can result in serious risks, such as fraud detection, medical diagnosis, or recommender systems often do not rely on images or text but instead on tabular data. Adversarial robustness in tabular data poses two serious challenges. First, tabular datasets often contain categorical features, and therefore cannot be tackled directly with existing optimization procedures. Second, in the tabular domain, algorithms that are not based on deep networks are widely used and offer great performance, but algorithms to enhance robustness are tailored to neural networks (e.g. adversarial training). In this paper, we tackle both challenges. We present a method that allows us to train adversarially robust deep networks for tabular data and to transfer this robustness to other classifiers via universal robust embeddings tailored to categorical data. These embeddings, created using a bilevel alternating minimization framework, can be transferred to boosted trees or random forests making them robust without the need for adversarial training while preserving their high accuracy on tabular data. We show that our methods outperform existing techniques within a practical threat model suitable for tabular data.
    Active learning with biased non-response to label requests. (arXiv:2312.08150v1 [cs.LG])
    Active learning can improve the efficiency of training prediction models by identifying the most informative new labels to acquire. However, non-response to label requests can impact active learning's effectiveness in real-world contexts. We conceptualise this degradation by considering the type of non-response present in the data, demonstrating that biased non-response is particularly detrimental to model performance. We argue that this sort of non-response is particularly likely in contexts where the labelling process, by nature, relies on user interactions. To mitigate the impact of biased non-response, we propose a cost-based correction to the sampling strategy--the Upper Confidence Bound of the Expected Utility (UCB-EU)--that can, plausibly, be applied to any active learning algorithm. Through experiments, we demonstrate that our method successfully reduces the harm from labelling non-response in many settings. However, we also characterise settings where the non-response bias in the annotations remains detrimental under UCB-EU for particular sampling methods and data generating processes. Finally, we evaluate our method on a real-world dataset from e-commerce platform Taobao. We show that UCB-EU yields substantial performance improvements to conversion models that are trained on clicked impressions. Most generally, this research serves to both better conceptualise the interplay between types of non-response and model improvements via active learning, and to provide a practical, easy to implement correction that helps mitigate model degradation.
    On the fast convergence of minibatch heavy ball momentum. (arXiv:2206.07553v4 [cs.LG] UPDATED)
    Simple stochastic momentum methods are widely used in machine learning optimization, but their good practical performance is at odds with an absence of theoretical guarantees of acceleration in the literature. In this work, we aim to close the gap between theory and practice by showing that stochastic heavy ball momentum retains the fast linear rate of (deterministic) heavy ball momentum on quadratic optimization problems, at least when minibatching with a sufficiently large batch size. The algorithm we study can be interpreted as an accelerated randomized Kaczmarz algorithm with minibatching and heavy ball momentum. The analysis relies on carefully decomposing the momentum transition matrix, and using new spectral norm concentration bounds for products of independent random matrices. We provide numerical illustrations demonstrating that our bounds are reasonably sharp.
    Acting in Delayed Environments with Non-Stationary Markov Policies. (arXiv:2101.11992v4 [cs.LG] UPDATED)
    The standard Markov Decision Process (MDP) formulation hinges on the assumption that an action is executed immediately after it was chosen. However, assuming it is often unrealistic and can lead to catastrophic failures in applications such as robotic manipulation, cloud computing, and finance. We introduce a framework for learning and planning in MDPs where the decision-maker commits actions that are executed with a delay of $m$ steps. The brute-force state augmentation baseline where the state is concatenated to the last $m$ committed actions suffers from an exponential complexity in $m$, as we show for policy iteration. We then prove that with execution delay, deterministic Markov policies in the original state-space are sufficient for attaining maximal reward, but need to be non-stationary. As for stationary Markov policies, we show they are sub-optimal in general. Consequently, we devise a non-stationary Q-learning style model-based algorithm that solves delayed execution tasks without resorting to state-augmentation. Experiments on tabular, physical, and Atari domains reveal that it converges quickly to high performance even for substantial delays, while standard approaches that either ignore the delay or rely on state-augmentation struggle or fail due to divergence. The code is available at github.com/galdl/rl_delay_basic and github.com/galdl/rl_delay_atari.
    $\rho$-Diffusion: A diffusion-based density estimation framework for computational physics. (arXiv:2312.08153v1 [physics.comp-ph])
    In physics, density $\rho(\cdot)$ is a fundamentally important scalar function to model, since it describes a scalar field or a probability density function that governs a physical process. Modeling $\rho(\cdot)$ typically scales poorly with parameter space, however, and quickly becomes prohibitively difficult and computationally expensive. One promising avenue to bypass this is to leverage the capabilities of denoising diffusion models often used in high-fidelity image generation to parameterize $\rho(\cdot)$ from existing scientific data, from which new samples can be trivially sampled from. In this paper, we propose $\rho$-Diffusion, an implementation of denoising diffusion probabilistic models for multidimensional density estimation in physics, which is currently in active development and, from our results, performs well on physically motivated 2D and 3D density functions. Moreover, we propose a novel hashing technique that allows $\rho$-Diffusion to be conditioned by arbitrary amounts of physical parameters of interest.
    AmbientFlow: Invertible generative models from incomplete, noisy measurements. (arXiv:2309.04856v2 [cs.LG] UPDATED)
    Generative models have gained popularity for their potential applications in imaging science, such as image reconstruction, posterior sampling and data sharing. Flow-based generative models are particularly attractive due to their ability to tractably provide exact density estimates along with fast, inexpensive and diverse samples. Training such models, however, requires a large, high quality dataset of objects. In applications such as computed imaging, it is often difficult to acquire such data due to requirements such as long acquisition time or high radiation dose, while acquiring noisy or partially observed measurements of these objects is more feasible. In this work, we propose AmbientFlow, a framework for learning flow-based generative models directly from noisy and incomplete data. Using variational Bayesian methods, a novel framework for establishing flow-based generative models from noisy, incomplete data is proposed. Extensive numerical studies demonstrate the effectiveness of AmbientFlow in learning the object distribution. The utility of AmbientFlow in a downstream inference task of image reconstruction is demonstrated.
    GraphGuard: Detecting and Counteracting Training Data Misuse in Graph Neural Networks. (arXiv:2312.07861v1 [cs.LG])
    The emergence of Graph Neural Networks (GNNs) in graph data analysis and their deployment on Machine Learning as a Service platforms have raised critical concerns about data misuse during model training. This situation is further exacerbated due to the lack of transparency in local training processes, potentially leading to the unauthorized accumulation of large volumes of graph data, thereby infringing on the intellectual property rights of data owners. Existing methodologies often address either data misuse detection or mitigation, and are primarily designed for local GNN models rather than cloud-based MLaaS platforms. These limitations call for an effective and comprehensive solution that detects and mitigates data misuse without requiring exact training data while respecting the proprietary nature of such data. This paper introduces a pioneering approach called GraphGuard, to tackle these challenges. We propose a training-data-free method that not only detects graph data misuse but also mitigates its impact via targeted unlearning, all without relying on the original training data. Our innovative misuse detection technique employs membership inference with radioactive data, enhancing the distinguishability between member and non-member data distributions. For mitigation, we utilize synthetic graphs that emulate the characteristics previously learned by the target model, enabling effective unlearning even in the absence of exact graph data. We conduct comprehensive experiments utilizing four real-world graph datasets to demonstrate the efficacy of GraphGuard in both detection and unlearning. We show that GraphGuard attains a near-perfect detection rate of approximately 100% across these datasets with various GNN models. In addition, it performs unlearning by eliminating the impact of the unlearned graph with a marginal decrease in accuracy (less than 5%).
    AI Competitions and Benchmarks: towards impactful challenges with post-challenge papers, benchmarks and other dissemination actions. (arXiv:2312.06036v2 [cs.LG] UPDATED)
    Organising an AI challenge does not end with the final event. The long-lasting impact also needs to be organised. This chapter covers the various activities after the challenge is formally finished. The target audience of different post-challenge activities is identified. The various outputs of the challenge are listed with the means to collect them. The main part of the chapter is a template for a typical post-challenge paper, including possible graphs as well as advice on how to turn the challenge into a long-lasting benchmark.
    Explainable Trajectory Representation through Dictionary Learning. (arXiv:2312.08052v1 [cs.LG])
    Trajectory representation learning on a network enhances our understanding of vehicular traffic patterns and benefits numerous downstream applications. Existing approaches using classic machine learning or deep learning embed trajectories as dense vectors, which lack interpretability and are inefficient to store and analyze in downstream tasks. In this paper, an explainable trajectory representation learning framework through dictionary learning is proposed. Given a collection of trajectories on a network, it extracts a compact dictionary of commonly used subpaths called "pathlets", which optimally reconstruct each trajectory by simple concatenations. The resulting representation is naturally sparse and encodes strong spatial semantics. Theoretical analysis of our proposed algorithm is conducted to provide a probabilistic bound on the estimation error of the optimal dictionary. A hierarchical dictionary learning scheme is also proposed to ensure the algorithm's scalability on large networks, leading to a multi-scale trajectory representation. Our framework is evaluated on two large-scale real-world taxi datasets. Compared to previous work, the dictionary learned by our method is more compact and has better reconstruction rate for new trajectories. We also demonstrate the promising performance of this method in downstream tasks including trip time prediction task and data compression.
    Smoothed Differential Privacy. (arXiv:2107.01559v4 [cs.CR] UPDATED)
    Differential privacy (DP) is a widely-accepted and widely-applied notion of privacy based on worst-case analysis. Often, DP classifies most mechanisms without additive noise as non-private (Dwork et al., 2014). Thus, additive noises are added to improve privacy (to achieve DP). However, in many real-world applications, adding additive noise is undesirable (Bagdasaryan et al., 2019) and sometimes prohibited (Liu et al., 2020). In this paper, we propose a natural extension of DP following the worst average-case idea behind the celebrated smoothed analysis (Spielman & Teng, May 2004). Our notion, smoothed DP, can effectively measure the privacy leakage of mechanisms without additive noises under realistic settings. We prove that any discrete mechanism with sampling procedures is more private than what DP predicts, while many continuous mechanisms with sampling procedures are still non-private under smoothed DP. In addition, we prove several desirable properties of smoothed DP, including composition, robustness to post-processing, and distribution reduction. Based on those properties, we propose an efficient algorithm to calculate the privacy parameters for smoothed DP. Experimentally, we verify that, according to smoothed DP, the discrete sampling mechanisms are private in real-world elections, and some discrete neural networks can be private without adding any additive noise. We believe that these results contribute to the theoretical foundation of realistic privacy measures beyond worst-case analysis.
    PUG: Photorealistic and Semantically Controllable Synthetic Data for Representation Learning. (arXiv:2308.03977v2 [cs.CV] UPDATED)
    Synthetic image datasets offer unmatched advantages for designing and evaluating deep neural networks: they make it possible to (i) render as many data samples as needed, (ii) precisely control each scene and yield granular ground truth labels (and captions), (iii) precisely control distribution shifts between training and testing to isolate variables of interest for sound experimentation. Despite such promise, the use of synthetic image data is still limited -- and often played down -- mainly due to their lack of realism. Most works therefore rely on datasets of real images, which have often been scraped from public images on the internet, and may have issues with regards to privacy, bias, and copyright, while offering little control over how objects precisely appear. In this work, we present a path to democratize the use of photorealistic synthetic data: we develop a new generation of interactive environments for representation learning research, that offer both controllability and realism. We use the Unreal Engine, a powerful game engine well known in the entertainment industry, to produce PUG (Photorealistic Unreal Graphics) environments and datasets for representation learning. In this paper, we demonstrate the potential of PUG to enable more rigorous evaluations of vision models.
    Time Series Diffusion Method: A Denoising Diffusion Probabilistic Model for Vibration Signal Generation. (arXiv:2312.07981v1 [cs.LG])
    Diffusion models have demonstrated robust data generation capabilities in various research fields. In this paper, a Time Series Diffusion Method (TSDM) is proposed for vibration signal generation, leveraging the foundational principles of diffusion models. The TSDM uses an improved U-net architecture with attention block to effectively segment and extract features from one-dimensional time series data. It operates based on forward diffusion and reverse denoising processes for time-series generation. Experimental validation is conducted using single-frequency, multi-frequency datasets, and bearing fault datasets. The results show that TSDM can accurately generate the single-frequency and multi-frequency features in the time series and retain the basic frequency features for the diffusion generation results of the bearing fault series. Finally, TSDM is applied to the small sample fault diagnosis of three public bearing fault datasets, and the results show that the accuracy of small sample fault diagnosis of the three datasets is improved by 32.380%, 18.355% and 9.298% at most, respectively
    EZ-CLIP: Efficient Zeroshot Video Action Recognition. (arXiv:2312.08010v1 [cs.CV])
    Recent advancements in large-scale pre-training of visual-language models on paired image-text data have demonstrated impressive generalization capabilities for zero-shot tasks. Building on this success, efforts have been made to adapt these image-based visual-language models, such as CLIP, for videos extending their zero-shot capabilities to the video domain. While these adaptations have shown promising results, they come at a significant computational cost and struggle with effectively modeling the crucial temporal aspects inherent to the video domain. In this study, we present EZ-CLIP, a simple and efficient adaptation of CLIP that addresses these challenges. EZ-CLIP leverages temporal visual prompting for seamless temporal adaptation, requiring no fundamental alterations to the core CLIP architecture while preserving its remarkable generalization abilities. Moreover, we introduce a novel learning objective that guides the temporal visual prompts to focus on capturing motion, thereby enhancing its learning capabilities from video data. We conducted extensive experiments on five different benchmark datasets, thoroughly evaluating EZ-CLIP for zero-shot learning and base-to-novel video action recognition, and also demonstrating its potential for few-shot generalization.Impressively, with a mere 5.2 million learnable parameters (as opposed to the 71.1 million in the prior best model), EZ-CLIP can be efficiently trained on a single GPU, outperforming existing approaches in several evaluations.
    On the Second-Order Convergence of Biased Policy Gradient Algorithms. (arXiv:2311.02546v2 [cs.LG] UPDATED)
    Since the objective functions of reinforcement learning problems are typically highly nonconvex, we seek guarantees that these algorithms escape saddle points and arrive at second-order stationary points. Existing results only consider vanilla policy gradient algorithms with unbiased gradient estimators, but practical implementations under the infinite-horizon discounted reward setting are biased due to finite-horizon sampling. Moreover, actor-critic methods, whose second-order convergence has not yet been established, are also biased due to the critic approximation of the value function. We provide a novel second-order analysis of biased policy gradient methods, including the vanilla gradient estimator computed from Monte-Carlo sampling of trajectories as well as the double-loop actor-critic algorithm, where in the inner loop the the critic parameter improves the approximation of the value function via TD(0) learning. Separately, we also establish the convergence of TD(0) on Markov chains irrespective of initial state distribution.
    Attributing Learned Concepts in Neural Networks to Training Data. (arXiv:2310.03149v3 [cs.LG] UPDATED)
    By now there is substantial evidence that deep learning models learn certain human-interpretable features as part of their internal representations of data. As having the right (or wrong) concepts is critical to trustworthy machine learning systems, it is natural to ask which inputs from the model's original training set were most important for learning a concept at a given layer. To answer this, we combine data attribution methods with methods for probing the concepts learned by a model. Training network and probe ensembles for two concept datasets on a range of network layers, we use the recently developed TRAK method for large-scale data attribution. We find some evidence for convergence, where removing the 10,000 top attributing images for a concept and retraining the model does not change the location of the concept in the network nor the probing sparsity of the concept. This suggests that rather than being highly dependent on a few specific examples, the features that inform the development of a concept are spread in a more diffuse manner across its exemplars, implying robustness in concept formation.
    Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models. (arXiv:2312.06585v2 [cs.LG] UPDATED)
    Fine-tuning language models~(LMs) on human-generated data remains a prevalent practice. However, the performance of such models is often limited by the quantity and diversity of high-quality human data. In this paper, we explore whether we can go beyond human data on tasks where we have access to scalar feedback, for example, on math problems where one can verify correctness. To do so, we investigate a simple self-training method based on expectation-maximization, which we call ReST$^{EM}$, where we (1) generate samples from the model and filter them using binary feedback, (2) fine-tune the model on these samples, and (3) repeat this process a few times. Testing on advanced MATH reasoning and APPS coding benchmarks using PaLM-2 models, we find that ReST$^{EM}$ scales favorably with model size and significantly surpasses fine-tuning only on human data. Overall, our findings suggest self-training with feedback can substantially reduce dependence on human-generated data.
    Fast Machine Unlearning Without Retraining Through Selective Synaptic Dampening. (arXiv:2308.07707v2 [cs.LG] UPDATED)
    Machine unlearning, the ability for a machine learning model to forget, is becoming increasingly important to comply with data privacy regulations, as well as to remove harmful, manipulated, or outdated information. The key challenge lies in forgetting specific information while protecting model performance on the remaining data. While current state-of-the-art methods perform well, they typically require some level of retraining over the retained data, in order to protect or restore model performance. This adds computational overhead and mandates that the training data remain available and accessible, which may not be feasible. In contrast, other methods employ a retrain-free paradigm, however, these approaches are prohibitively computationally expensive and do not perform on par with their retrain-based counterparts. We present Selective Synaptic Dampening (SSD), a novel two-step, post hoc, retrain-free approach to machine unlearning which is fast, performant, and does not require long-term storage of the training data. First, SSD uses the Fisher information matrix of the training and forgetting data to select parameters that are disproportionately important to the forget set. Second, SSD induces forgetting by dampening these parameters proportional to their relative importance to the forget set with respect to the wider training data. We evaluate our method against several existing unlearning methods in a range of experiments using ResNet18 and Vision Transformer. Results show that the performance of SSD is competitive with retrain-based post hoc methods, demonstrating the viability of retrain-free post hoc unlearning approaches.
    Unsupervised Protein-Ligand Binding Energy Prediction via Neural Euler's Rotation Equation. (arXiv:2301.10814v2 [q-bio.BM] UPDATED)
    Protein-ligand binding prediction is a fundamental problem in AI-driven drug discovery. Prior work focused on supervised learning methods using a large set of binding affinity data for small molecules, but it is hard to apply the same strategy to other drug classes like antibodies as labelled data is limited. In this paper, we explore unsupervised approaches and reformulate binding energy prediction as a generative modeling task. Specifically, we train an energy-based model on a set of unlabelled protein-ligand complexes using SE(3) denoising score matching and interpret its log-likelihood as binding affinity. Our key contribution is a new equivariant rotation prediction network called Neural Euler's Rotation Equations (NERE) for SE(3) score matching. It predicts a rotation by modeling the force and torque between protein and ligand atoms, where the force is defined as the gradient of an energy function with respect to atom coordinates. We evaluate NERE on protein-ligand and antibody-antigen binding affinity prediction benchmarks. Our model outperforms all unsupervised baselines (physics-based and statistical potentials) and matches supervised learning methods in the antibody case.
    Crystal-GFN: sampling crystals with desirable properties and constraints. (arXiv:2310.04925v2 [cs.LG] UPDATED)
    Accelerating material discovery holds the potential to greatly help mitigate the climate crisis. Discovering new solid-state materials such as electrocatalysts, super-ionic conductors or photovoltaic materials can have a crucial impact, for instance, in improving the efficiency of renewable energy production and storage. In this paper, we introduce Crystal-GFN, a generative model of crystal structures that sequentially samples structural properties of crystalline materials, namely the space group, composition and lattice parameters. This domain-inspired approach enables the flexible incorporation of physical and structural hard constraints, as well as the use of any available predictive model of a desired physicochemical property as an objective function. To design stable materials, one must target the candidates with the lowest formation energy. Here, we use as objective the formation energy per atom of a crystal structure predicted by a new proxy machine learning model trained on MatBench. The results demonstrate that Crystal-GFN is able to sample highly diverse crystals with low (median -3.1 eV/atom) predicted formation energy.
    Distributed Inference and Fine-tuning of Large Language Models Over The Internet. (arXiv:2312.08361v1 [cs.LG])
    Large language models (LLMs) are useful in many NLP tasks and become more capable with size, with the best open-source models having over 50 billion parameters. However, using these 50B+ models requires high-end hardware, making them inaccessible to most researchers. In this work, we investigate methods for cost-efficient inference and fine-tuning of LLMs, comparing local and distributed strategies. We observe that a large enough model (50B+) can run efficiently even on geodistributed devices in a consumer-grade network. This could allow running LLM efficiently by pooling together idle compute resources of multiple research groups and volunteers. We address two open problems: (1) how to perform inference and fine-tuning reliably if any device can disconnect abruptly and (2) how to partition LLMs between devices with uneven hardware, joining and leaving at will. In order to do that, we develop special fault-tolerant inference algorithms and load-balancing protocols that automatically assign devices to maximize the total system throughput. We showcase these algorithms in Petals - a decentralized system that runs Llama 2 (70B) and BLOOM (176B) over the Internet up to 10x faster than offloading for interactive generation. We evaluate the performance of our system in simulated conditions and a real-world setup spanning two continents.
    Direct Preference Optimization: Your Language Model is Secretly a Reward Model. (arXiv:2305.18290v2 [cs.LG] UPDATED)
    While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing methods for gaining such steerability collect human labels of the relative quality of model generations and fine-tune the unsupervised LM to align with these preferences, often with reinforcement learning from human feedback (RLHF). However, RLHF is a complex and often unstable procedure, first fitting a reward model that reflects the human preferences, and then fine-tuning the large unsupervised LM using reinforcement learning to maximize this estimated reward without drifting too far from the original model. In this paper we introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form, allowing us to solve the standard RLHF problem with only a simple classification loss. The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight, eliminating the need for sampling from the LM during fine-tuning or performing significant hyperparameter tuning. Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods. Notably, fine-tuning with DPO exceeds PPO-based RLHF in ability to control sentiment of generations, and matches or improves response quality in summarization and single-turn dialogue while being substantially simpler to implement and train.
    Domain Generalization with Fourier Transform and Soft Thresholding. (arXiv:2309.09866v3 [eess.IV] UPDATED)
    Domain generalization aims to train models on multiple source domains so that they can generalize well to unseen target domains. Among many domain generalization methods, Fourier-transform-based domain generalization methods have gained popularity primarily because they exploit the power of Fourier transformation to capture essential patterns and regularities in the data, making the model more robust to domain shifts. The mainstream Fourier-transform-based domain generalization swaps the Fourier amplitude spectrum while preserving the phase spectrum between the source and the target images. However, it neglects background interference in the amplitude spectrum. To overcome this limitation, we introduce a soft-thresholding function in the Fourier domain. We apply this newly designed algorithm to retinal fundus image segmentation, which is important for diagnosing ocular diseases but the neural network's performance can degrade across different sources due to domain shifts. The proposed technique basically enhances fundus image augmentation by eliminating small values in the Fourier domain and providing better generalization. The innovative nature of the soft thresholding fused with Fourier-transform-based domain generalization improves neural network models' performance by reducing the target images' background interference significantly. Experiments on public data validate our approach's effectiveness over conventional and state-of-the-art methods with superior segmentation metrics.
    Semantic Text-to-Face GAN -ST^2FG. (arXiv:2107.10756v4 [cs.CV] UPDATED)
    Faces generated using generative adversarial networks (GANs) have reached unprecedented realism. These faces, also known as "Deep Fakes", appear as realistic photographs with very little pixel-level distortions. While some work has enabled the training of models that lead to the generation of specific properties of the subject, generating a facial image based on a natural language description has not been fully explored. For security and criminal identification, the ability to provide a GAN-based system that works like a sketch artist would be incredibly useful. In this paper, we present a novel approach to generate facial images from semantic text descriptions. The learned model is provided with a text description and an outline of the type of face, which the model uses to sketch the features. Our models are trained using an Affine Combination Module (ACM) mechanism to combine the text embedding from BERT and the GAN latent space using a self-attention matrix. This avoids the loss of features due to inadequate "attention", which may happen if text embedding and latent vector are simply concatenated. Our approach is capable of generating images that are very accurately aligned to the exhaustive textual descriptions of faces with many fine detail features of the face and helps in generating better images. The proposed method is also capable of making incremental changes to a previously generated image if it is provided with additional textual descriptions or sentences.
    Graph schemas as abstractions for transfer learning, inference, and planning. (arXiv:2302.07350v2 [cs.AI] UPDATED)
    Transferring latent structure from one environment or problem to another is a mechanism by which humans and animals generalize with very little data. Inspired by cognitive and neurobiological insights, we propose graph schemas as a mechanism of abstraction for transfer learning. Graph schemas start with latent graph learning where perceptually aliased observations are disambiguated in the latent space using contextual information. Latent graph learning is also emerging as a new computational model of the hippocampus to explain map learning and transitive inference. Our insight is that a latent graph can be treated as a flexible template -- a schema -- that models concepts and behaviors, with slots that bind groups of latent nodes to the specific observations or groundings. By treating learned latent graphs (schemas) as prior knowledge, new environments can be quickly learned as compositions of schemas and their newly learned bindings. We evaluate graph schemas on two previously published challenging tasks: the memory & planning game and one-shot StreetLearn, which are designed to test rapid task solving in novel environments. Graph schemas can be learned in far fewer episodes than previous baselines, and can model and plan in a few steps in novel variations of these tasks. We also demonstrate learning, matching, and reusing graph schemas in more challenging 2D and 3D environments with extensive perceptual aliasing and size variations, and show how different schemas can be composed to model larger and more complex environments. To summarize, our main contribution is a unified system, inspired and grounded in cognitive science, that facilitates rapid transfer learning of new environments using schemas via map-induction and composition that handles perceptual aliasing.
    Accelerating Batch Active Learning Using Continual Learning Techniques. (arXiv:2305.06408v2 [cs.LG] UPDATED)
    A major problem with Active Learning (AL) is high training costs since models are typically retrained from scratch after every query round. We start by demonstrating that standard AL on neural networks with warm starting fails, both to accelerate training and to avoid catastrophic forgetting when using fine-tuning over AL query rounds. We then develop a new class of techniques, circumventing this problem, by biasing further training towards previously labeled sets. We accomplish this by employing existing, and developing novel, replay-based Continual Learning (CL) algorithms that are effective at quickly learning the new without forgetting the old, especially when data comes from an evolving distribution. We call this paradigm Continual Active Learning (CAL). We show CAL achieves significant speedups using a plethora of replay schemes that use model distillation and that select diverse, uncertain points from the history. We conduct experiments across many data domains, including natural language, vision, medical imaging, and computational biology, each with different neural architectures and dataset sizes. CAL consistently provides a 3x reduction in training time, while retaining performance.
    On the convex formulations of robust Markov decision processes. (arXiv:2209.10187v2 [math.OC] UPDATED)
    Robust Markov decision processes (MDPs) are used for applications of dynamic optimization in uncertain environments and have been studied extensively. Many of the main properties and algorithms of MDPs, such as value iteration and policy iteration, extend directly to RMDPs. Surprisingly, there is no known analog of the MDP convex optimization formulation for solving RMDPs. This work describes the first convex optimization formulation of RMDPs under the classical sa-rectangularity and s-rectangularity assumptions. By using entropic regularization and exponential change of variables, we derive a convex formulation with a number of variables and constraints polynomial in the number of states and actions, but with large coefficients in the constraints. We further simplify the formulation for RMDPs with polyhedral, ellipsoidal, or entropy-based uncertainty sets, showing that, in these cases, RMDPs can be reformulated as conic programs based on exponential cones, quadratic cones, and non-negative orthants. Our work opens a new research direction for RMDPs and can serve as a first step toward obtaining a tractable convex formulation of RMDPs.
    Conformal Prediction Regions for Time Series using Linear Complementarity Programming. (arXiv:2304.01075v4 [eess.SY] UPDATED)
    Conformal prediction is a statistical tool for producing prediction regions of machine learning models that are valid with high probability. However, applying conformal prediction to time series data leads to conservative prediction regions. In fact, to obtain prediction regions over $T$ time steps with confidence $1-\delta$, {previous works require that each individual prediction region is valid} with confidence $1-\delta/T$. We propose an optimization-based method for reducing this conservatism to enable long horizon planning and verification when using learning-enabled time series predictors. Instead of considering prediction errors individually at each time step, we consider a parameterized prediction error over multiple time steps. By optimizing the parameters over an additional dataset, we find prediction regions that are not conservative. We show that this problem can be cast as a mixed integer linear complementarity program (MILCP), which we then relax into a linear complementarity program (LCP). Additionally, we prove that the relaxed LP has the same optimal cost as the original MILCP. Finally, we demonstrate the efficacy of our method on case studies using pedestrian trajectory predictors and F16 fighter jet altitude predictors.
    Do SSL Models Have D\'ej\`a Vu? A Case of Unintended Memorization in Self-supervised Learning. (arXiv:2304.13850v3 [cs.CV] UPDATED)
    Self-supervised learning (SSL) algorithms can produce useful image representations by learning to associate different parts of natural images with one another. However, when taken to the extreme, SSL models can unintendedly memorize specific parts in individual training samples rather than learning semantically meaningful associations. In this work, we perform a systematic study of the unintended memorization of image-specific information in SSL models -- which we refer to as d\'ej\`a vu memorization. Concretely, we show that given the trained model and a crop of a training image containing only the background (e.g., water, sky, grass), it is possible to infer the foreground object with high accuracy or even visually reconstruct it. Furthermore, we show that d\'ej\`a vu memorization is common to different SSL algorithms, is exacerbated by certain design choices, and cannot be detected by conventional techniques for evaluating representation quality. Our study of d\'ej\`a vu memorization reveals previously unknown privacy risks in SSL models, as well as suggests potential practical mitigation strategies. Code is available at https://github.com/facebookresearch/DejaVu.
    PhenDiff: Revealing Invisible Phenotypes with Conditional Diffusion Models. (arXiv:2312.08290v1 [eess.IV])
    Over the last five years, deep generative models have gradually been adopted for various tasks in biological research. Notably, image-to-image translation methods showed to be effective in revealing subtle phenotypic cell variations otherwise invisible to the human eye. Current methods to achieve this goal mainly rely on Generative Adversarial Networks (GANs). However, these models are known to suffer from some shortcomings such as training instability and mode collapse. Furthermore, the lack of robustness to invert a real image into the latent of a trained GAN prevents flexible editing of real images. In this work, we propose PhenDiff, an image-to-image translation method based on conditional diffusion models to identify subtle phenotypes in microscopy images. We evaluate this approach on biological datasets against previous work such as CycleGAN. We show that PhenDiff outperforms this baseline in terms of quality and diversity of the generated images. We then apply this method to display invisible phenotypic changes triggered by a rare neurodevelopmental disorder on microscopy images of organoids. Altogether, we demonstrate that PhenDiff is able to perform high quality biological image-to-image translation allowing to spot subtle phenotype variations on a real image.
    Stable Rivers: A Case Study in the Application of Text-to-Image Generative Models for Earth Sciences. (arXiv:2312.07833v1 [cs.CV])
    Text-to-image (TTI) generative models can be used to generate photorealistic images from a given text-string input. These models offer great potential to mitigate challenges to the uptake of machine learning in the earth sciences. However, the rapid increase in their use has raised questions about fairness and biases, with most research to-date focusing on social and cultural areas rather than domain-specific considerations. We conducted a case study for the earth sciences, focusing on the field of fluvial geomorphology, where we evaluated subject-area specific biases in the training data and downstream model performance of Stable Diffusion (v1.5). In addition to perpetuating Western biases, we found that the training data over-represented scenic locations, such as famous rivers and waterfalls, and showed serious under- and over-representation of many morphological and environmental terms. Despite biased training data, we found that with careful prompting, the Stable Diffusion model was able to generate photorealistic synthetic river images reproducing many important environmental and morphological characteristics. Furthermore, conditional control techniques, such as the use of condition maps with ControlNet were effective for providing additional constraints on output images. Despite great potential for the use of TTI models in the earth sciences field, we advocate for caution in sensitive applications, and advocate for domain-specific reviews of training data and image generation biases to mitigate perpetuation of existing biases.
    HappyFeat -- An interactive and efficient BCI framework for clinical applications. (arXiv:2310.02948v2 [q-bio.NC] UPDATED)
    Brain-Computer Interface (BCI) systems allow users to perform actions by translating their brain activity into commands. Such systems usually need a training phase, consisting in training a classification algorithm to discriminate between mental states using specific features from the recorded signals. This phase of feature selection and training is crucial for BCI performance and presents specific constraints to be met in a clinical context, such as post-stroke rehabilitation. In this paper, we present HappyFeat, a software making Motor Imagery (MI) based BCI experiments easier, by gathering all necessary manipulations and analysis in a single convenient GUI and via automation of experiment or analysis parameters. The resulting workflow allows for effortlessly selecting the best features, helping to achieve good BCI performance in time-constrained environments. Alternative features based on Functional Connectivity can be used and compared or combined with Power Spectral Density, allowing a network-oriented approach. We then give details of HappyFeat's main mechanisms, and a review of its performances in typical use cases. We also show that it can be used as an efficient tool for comparing different metrics extracted from the signals, to train the classification algorithm. To this end, we show a comparison between the commonly-used Power Spectral Density and network metrics based on Functional Connectivity. HappyFeat is available as an open-source project which can be freely downloaded on GitHub.
    An Invitation to Deep Reinforcement Learning. (arXiv:2312.08365v1 [cs.LG])
    Training a deep neural network to maximize a target objective has become the standard recipe for successful machine learning over the last decade. These networks can be optimized with supervised learning, if the target objective is differentiable. For many interesting problems, this is however not the case. Common objectives like intersection over union (IoU), bilingual evaluation understudy (BLEU) score or rewards cannot be optimized with supervised learning. A common workaround is to define differentiable surrogate losses, leading to suboptimal solutions with respect to the actual objective. Reinforcement learning (RL) has emerged as a promising alternative for optimizing deep neural networks to maximize non-differentiable objectives in recent years. Examples include aligning large language models via human feedback, code generation, object detection or control problems. This makes RL techniques relevant to the larger machine learning audience. The subject is, however, time intensive to approach due to the large range of methods, as well as the often very theoretical presentation. In this introduction, we take an alternative approach, different from classic reinforcement learning textbooks. Rather than focusing on tabular problems, we introduce reinforcement learning as a generalization of supervised learning, which we first apply to non-differentiable objectives and later to temporal problems. Assuming only basic knowledge of supervised learning, the reader will be able to understand state-of-the-art deep RL algorithms like proximal policy optimization (PPO) after reading this tutorial.
    GLOP: Learning Global Partition and Local Construction for Solving Large-scale Routing Problems in Real-time. (arXiv:2312.08224v1 [cs.AI])
    The recent end-to-end neural solvers have shown promise for small-scale routing problems but suffered from limited real-time scaling-up performance. This paper proposes GLOP (Global and Local Optimization Policies), a unified hierarchical framework that efficiently scales toward large-scale routing problems. GLOP partitions large routing problems into Travelling Salesman Problems (TSPs) and TSPs into Shortest Hamiltonian Path Problems. For the first time, we hybridize non-autoregressive neural heuristics for coarse-grained problem partitions and autoregressive neural heuristics for fine-grained route constructions, leveraging the scalability of the former and the meticulousness of the latter. Experimental results show that GLOP achieves competitive and state-of-the-art real-time performance on large-scale routing problems, including TSP, ATSP, CVRP, and PCTSP.
    Differentially Private Gradient Flow based on the Sliced Wasserstein Distance for Non-Parametric Generative Modeling. (arXiv:2312.08227v1 [stat.ML])
    Safeguarding privacy in sensitive training data is paramount, particularly in the context of generative modeling. This is done through either differentially private stochastic gradient descent, or with a differentially private metric for training models or generators. In this paper, we introduce a novel differentially private generative modeling approach based on parameter-free gradient flows in the space of probability measures. The proposed algorithm is a new discretized flow which operates through a particle scheme, utilizing drift derived from the sliced Wasserstein distance and computed in a private manner. Our experiments show that compared to a generator-based model, our proposed model can generate higher-fidelity data at a low privacy budget, offering a viable alternative to generator-based approaches.
    Dynamic Budget Throttling in Repeated Second-Price Auctions. (arXiv:2207.04690v7 [cs.GT] UPDATED)
    In today's online advertising markets, a crucial requirement for an advertiser is to control her total expenditure within a time horizon under some budget. Among various budget control methods, throttling has emerged as a popular choice, managing an advertiser's total expenditure by selecting only a subset of auctions to participate in. This paper provides a theoretical panorama of a single advertiser's dynamic budget throttling process in repeated second-price auctions. We first establish a lower bound on the regret and an upper bound on the asymptotic competitive ratio for any throttling algorithm, respectively, when the advertiser's values are stochastic and adversarial. Regarding the algorithmic side, we propose the OGD-CB algorithm, which guarantees a near-optimal expected regret with stochastic values. On the other hand, when values are adversarial, we prove that this algorithm also reaches the upper bound on the asymptotic competitive ratio. We further compare throttling with pacing, another widely adopted budget control method, in repeated second-price auctions. In the stochastic case, we demonstrate that pacing is generally superior to throttling for the advertiser, supporting the well-known result that pacing is asymptotically optimal in this scenario. However, in the adversarial case, we give an exciting result indicating that throttling is also an asymptotically optimal dynamic bidding strategy. Our results bridge the gaps in theoretical research of throttling in repeated auctions and comprehensively reveal the ability of this popular budget-smoothing strategy.
    Inferring Atmospheric Properties of Exoplanets with Flow Matching and Neural Importance Sampling. (arXiv:2312.08295v1 [astro-ph.IM])
    Atmospheric retrievals (AR) characterize exoplanets by estimating atmospheric parameters from observed light spectra, typically by framing the task as a Bayesian inference problem. However, traditional approaches such as nested sampling are computationally expensive, thus sparking an interest in solutions based on machine learning (ML). In this ongoing work, we first explore flow matching posterior estimation (FMPE) as a new ML-based method for AR and find that, in our case, it is more accurate than neural posterior estimation (NPE), but less accurate than nested sampling. We then combine both FMPE and NPE with importance sampling, in which case both methods outperform nested sampling in terms of accuracy and simulation efficiency. Going forward, our analysis suggests that simulation-based inference with likelihood-based importance sampling provides a framework for accurate and efficient AR that may become a valuable tool not only for the analysis of observational data from existing telescopes, but also for the development of new missions and instruments.
    Double Machine Learning for Static Panel Models with Fixed Effects. (arXiv:2312.08174v1 [econ.EM])
    Machine Learning (ML) algorithms are powerful data-driven tools for approximating high-dimensional or non-linear nuisance functions which are useful in practice because the true functional form of the predictors is ex-ante unknown. In this paper, we develop estimators of policy interventions from panel data which allow for non-linear effects of the confounding regressors, and investigate the performance of these estimators using three well-known ML algorithms, specifically, LASSO, classification and regression trees, and random forests. We use Double Machine Learning (DML) (Chernozhukov et al., 2018) for the estimation of causal effects of homogeneous treatments with unobserved individual heterogeneity (fixed effects) and no unobserved confounding by extending Robinson (1988)'s partially linear regression model. We develop three alternative approaches for handling unobserved individual heterogeneity based on extending the within-group estimator, first-difference estimator, and correlated random effect estimator (Mundlak, 1978) for non-linear models. Using Monte Carlo simulations, we find that conventional least squares estimators can perform well even if the data generating process is non-linear, but there are substantial performance gains in terms of bias reduction under a process where the true effect of the regressors is non-linear and discontinuous. However, for the same scenarios, we also find -- despite extensive hyperparameter tuning -- inference to be problematic for both tree-based learners because these lead to highly non-normal estimator distributions and the estimator variance being severely under-estimated. This contradicts the performance of trees in other circumstances and requires further investigation. Finally, we provide an illustrative example of DML for observational panel data showing the impact of the introduction of the national minimum wage in the UK.
    Differentially private inference via noisy optimization. (arXiv:2103.11003v4 [math.ST] UPDATED)
    We propose a general optimization-based framework for computing differentially private M-estimators and a new method for constructing differentially private confidence regions. Firstly, we show that robust statistics can be used in conjunction with noisy gradient descent or noisy Newton methods in order to obtain optimal private estimators with global linear or quadratic convergence, respectively. We establish local and global convergence guarantees, under both local strong convexity and self-concordance, showing that our private estimators converge with high probability to a small neighborhood of the non-private M-estimators. Secondly, we tackle the problem of parametric inference by constructing differentially private estimators of the asymptotic variance of our private M-estimators. This naturally leads to approximate pivotal statistics for constructing confidence regions and conducting hypothesis testing. We demonstrate the effectiveness of a bias correction that leads to enhanced small-sample empirical performance in simulations. We illustrate the benefits of our methods in several numerical examples.
    Causal Optimal Transport of Abstractions. (arXiv:2312.08107v1 [cs.LG])
    Causal abstraction (CA) theory establishes formal criteria for relating multiple structural causal models (SCMs) at different levels of granularity by defining maps between them. These maps have significant relevance for real-world challenges such as synthesizing causal evidence from multiple experimental environments, learning causally consistent representations at different resolutions, and linking interventions across multiple SCMs. In this work, we propose COTA, the first method to learn abstraction maps from observational and interventional data without assuming complete knowledge of the underlying SCMs. In particular, we introduce a multi-marginal Optimal Transport (OT) formulation that enforces do-calculus causal constraints, together with a cost function that relies on interventional information. We extensively evaluate COTA on synthetic and real world problems, and showcase its advantages over non-causal, independent and aggregated COTA formulations. Finally, we demonstrate the efficiency of our method as a data augmentation tool by comparing it against the state-of-the-art CA learning framework, which assumes fully specified SCMs, on a real-world downstream task.
    OCTDL: Optical Coherence Tomography Dataset for Image-Based Deep Learning Methods. (arXiv:2312.08255v1 [eess.IV])
    Optical coherence tomography (OCT) is a non-invasive imaging technique with extensive clinical applications in ophthalmology. OCT enables the visualization of the retinal layers, playing a vital role in the early detection and monitoring of retinal diseases. OCT uses the principle of light wave interference to create detailed images of the retinal microstructures, making it a valuable tool for diagnosing ocular conditions. This work presents an open-access OCT dataset (OCTDL) comprising over 1600 high-resolution OCT images labeled according to disease group and retinal pathology. The dataset consists of OCT records of patients with Age-related Macular Degeneration (AMD), Diabetic Macular Edema (DME), Epiretinal Membrane (ERM), Retinal Artery Occlusion (RAO), Retinal Vein Occlusion (RVO), and Vitreomacular Interface Disease (VID). The images were acquired with an Optovue Avanti RTVue XR using raster scanning protocols with dynamic scan length and image resolution. Each retinal b-scan was acquired by centering on the fovea and interpreted and cataloged by an experienced retinal specialist. In this work, we applied Deep Learning classification techniques to this new open-access dataset.
    Conformers are All You Need for Visual Speech Recognition. (arXiv:2302.10915v2 [cs.LG] UPDATED)
    Visual speech recognition models extract visual features in a hierarchical manner. At the lower level, there is a visual front-end with a limited temporal receptive field that processes the raw pixels depicting the lips or faces. At the higher level, there is an encoder that attends to the embeddings produced by the front-end over a large temporal receptive field. Previous work has focused on improving the visual front-end of the model to extract more useful features for speech recognition. Surprisingly, our work shows that complex visual front-ends are not necessary. Instead of allocating resources to a sophisticated visual front-end, we find that a linear visual front-end paired with a larger Conformer encoder results in lower latency, more efficient memory usage, and improved WER performance. We achieve a new state-of-the-art of 12.8% WER for visual speech recognition on the TED LRS3 dataset, which rivals the performance of audio-only models from just four years ago.
    Generating Novel Scene Compositions from Single Images and Videos. (arXiv:2103.13389v5 [cs.CV] UPDATED)
    Given a large dataset for training, generative adversarial networks (GANs) can achieve remarkable performance for the image synthesis task. However, training GANs in extremely low data regimes remains a challenge, as overfitting often occurs, leading to memorization or training divergence. In this work, we introduce SIV-GAN, an unconditional generative model that can generate new scene compositions from a single training image or a single video clip. We propose a two-branch discriminator architecture, with content and layout branches designed to judge internal content and scene layout realism separately from each other. This discriminator design enables synthesis of visually plausible, novel compositions of a scene, with varying content and layout, while preserving the context of the original sample. Compared to previous single image GANs, our model generates more diverse, higher quality images, while not being restricted to a single image setting. We further introduce a new challenging task of learning from a few frames of a single video. In this training setup the training images are highly similar to each other, which makes it difficult for prior GAN models to achieve a synthesis of both high quality and diversity.
    Noise in the reverse process improves the approximation capabilities of diffusion models. (arXiv:2312.07851v1 [cs.LG])
    In Score based Generative Modeling (SGMs), the state-of-the-art in generative modeling, stochastic reverse processes are known to perform better than their deterministic counterparts. This paper delves into the heart of this phenomenon, comparing neural ordinary differential equations (ODEs) and neural stochastic differential equations (SDEs) as reverse processes. We use a control theoretic perspective by posing the approximation of the reverse process as a trajectory tracking problem. We analyze the ability of neural SDEs to approximate trajectories of the Fokker-Planck equation, revealing the advantages of stochasticity. First, neural SDEs exhibit a powerful regularizing effect, enabling $L^2$ norm trajectory approximation surpassing the Wasserstein metric approximation achieved by neural ODEs under similar conditions, even when the reference vector field or score function is not Lipschitz. Applying this result, we establish the class of distributions that can be sampled using score matching in SGMs, relaxing the Lipschitz requirement on the gradient of the data distribution in existing literature. Second, we show that this approximation property is preserved when network width is limited to the input dimension of the network. In this limited width case, the weights act as control inputs, framing our analysis as a controllability problem for neural SDEs in probability density space. This sheds light on how noise helps to steer the system towards the desired solution and illuminates the empirical success of stochasticity in generative modeling.
    Distributional Preference Learning: Understanding and Accounting for Hidden Context in RLHF. (arXiv:2312.08358v1 [cs.LG])
    In practice, preference learning from human feedback depends on incomplete data with hidden context. Hidden context refers to data that affects the feedback received, but which is not represented in the data used to train a preference model. This captures common issues of data collection, such as having human annotators with varied preferences, cognitive processes that result in seemingly irrational behavior, and combining data labeled according to different criteria. We prove that standard applications of preference learning, including reinforcement learning from human feedback (RLHF), implicitly aggregate over hidden contexts according to a well-known voting rule called Borda count. We show this can produce counter-intuitive results that are very different from other methods which implicitly aggregate via expected utility. Furthermore, our analysis formalizes the way that preference learning from users with diverse values tacitly implements a social choice function. A key implication of this result is that annotators have an incentive to misreport their preferences in order to influence the learned model, leading to vulnerabilities in the deployment of RLHF. As a step towards mitigating these problems, we introduce a class of methods called distributional preference learning (DPL). DPL methods estimate a distribution of possible score values for each alternative in order to better account for hidden context. Experimental results indicate that applying DPL to RLHF for LLM chatbots identifies hidden context in the data and significantly reduces subsequent jailbreak vulnerability. Our code and data are available at https://github.com/cassidylaidlaw/hidden-context
    Modeling non-genetic information dynamics in cells using reservoir computing. (arXiv:2312.07977v1 [q-bio.CB])
    Virtually all cells use energy and ion-specific membrane pumps to maintain large transmembrane gradients of Na$^+$, K$^+$, Cl$^-$, Mg$^{++}$, and Ca$^{++}$. Although they consume up to 1/3 of a cell's energy budget, the corresponding evolutionary benefit of transmembrane ion gradients remain unclear. Here, we propose that ion gradients enable a dynamic and versatile biological system that acquires, analyzes, and responds to environmental information. We hypothesize environmental signals are transmitted into the cell by ion fluxes along pre-existing gradients through gated ion-specific membrane channels. The consequent changes of cytoplasmic ion concentration can generate a local response and orchestrate global or regional responses through wire-like ion fluxes along pre-existing and self-assembling cytoskeleton to engage the endoplasmic reticulum, mitochondria, and nucleus. Here, we frame our hypothesis through a quasi-physical (Cell-Reservoir) model that treats intra-cellular ion-based information dynamics as a sub-cellular process permitting spatiotemporally resolved cellular response that is also capable of learning complex nonlinear dynamical cellular behavior. We demonstrate the proposed ion dynamics permits rapid dissemination of response to information extrinsic perturbations that is consistent with experimental observations.
    Big Data -- Supply Chain Management Framework for Forecasting: Data Preprocessing and Machine Learning Techniques. (arXiv:2307.12971v2 [cs.LG] UPDATED)
    This article intends to systematically identify and comparatively analyze state-of-the-art supply chain (SC) forecasting strategies and technologies. A novel framework has been proposed incorporating Big Data Analytics in SC Management (problem identification, data sources, exploratory data analysis, machine-learning model training, hyperparameter tuning, performance evaluation, and optimization), forecasting effects on human-workforce, inventory, and overall SC. Initially, the need to collect data according to SC strategy and how to collect them has been discussed. The article discusses the need for different types of forecasting according to the period or SC objective. The SC KPIs and the error-measurement systems have been recommended to optimize the top-performing model. The adverse effects of phantom inventory on forecasting and the dependence of managerial decisions on the SC KPIs for determining model performance parameters and improving operations management, transparency, and planning efficiency have been illustrated. The cyclic connection within the framework introduces preprocessing optimization based on the post-process KPIs, optimizing the overall control process (inventory management, workforce determination, cost, production and capacity planning). The contribution of this research lies in the standard SC process framework proposal, recommended forecasting data analysis, forecasting effects on SC performance, machine learning algorithms optimization followed, and in shedding light on future research.
    Learn or Recall? Revisiting Incremental Learning with Pre-trained Language Models. (arXiv:2312.07887v1 [cs.CL])
    Incremental Learning (IL) has been a long-standing problem in both vision and Natural Language Processing (NLP) communities. In recent years, as Pre-trained Language Models (PLMs) have achieved remarkable progress in various NLP downstream tasks, utilizing PLMs as backbones has become a common practice in recent research of IL in NLP. Most assume that catastrophic forgetting is the biggest obstacle to achieving superior IL performance and propose various techniques to overcome this issue. However, we find that this assumption is problematic. Specifically, we revisit more than 20 methods on four classification tasks (Text Classification, Intent Classification, Relation Extraction, and Named Entity Recognition) under the two most popular IL settings (Class-Incremental and Task-Incremental) and reveal that most of them severely underestimate the inherent anti-forgetting ability of PLMs. Based on the observation, we propose a frustratingly easy method called SEQ* for IL with PLMs. The results show that SEQ* has competitive or superior performance compared to state-of-the-art (SOTA) IL methods and requires considerably less trainable parameters and training time. These findings urge us to revisit the IL with PLMs and encourage future studies to have a fundamental understanding of the catastrophic forgetting in PLMs. The data, code and scripts are publicly available at https://github.com/zzz47zzz/pretrained-lm-for-incremental-learning.
    Estimation of Concept Explanations Should be Uncertainty Aware. (arXiv:2312.08063v1 [cs.LG])
    Model explanations are very valuable for interpreting and debugging prediction models. We study a specific kind of global explanations called Concept Explanations, where the goal is to interpret a model using human-understandable concepts. Recent advances in multi-modal learning rekindled interest in concept explanations and led to several label-efficient proposals for estimation. However, existing estimation methods are unstable to the choice of concepts or dataset that is used for computing explanations. We observe that instability in explanations is due to high variance in point estimation of importance scores. We propose an uncertainty aware Bayesian estimation method, which readily improved reliability of the concept explanations. We demonstrate with theoretical analysis and empirical evaluation that explanations computed by our method are more reliable while also being label-efficient and faithful.
    Synthetic Data: Can We Trust Statistical Estimators?. (arXiv:2312.07837v1 [cs.LG])
    The increasing interest in data sharing makes synthetic data appealing. However, the analysis of synthetic data raises a unique set of methodological challenges. In this work, we highlight the importance of inferential utility and provide empirical evidence against naive inference from synthetic data (that handles these as if they were really observed). We argue that the rate of false-positive findings (type 1 error) will be unacceptably high, even when the estimates are unbiased. One of the reasons is the underestimation of the true standard error, which may even progressively increase with larger sample sizes due to slower convergence. This is especially problematic for deep generative models. Before publishing synthetic data, it is essential to develop statistical inference tools for such data.
    Combining propensity score methods with variational autoencoders for generating synthetic data in presence of latent sub-groups. (arXiv:2312.07781v1 [cs.LG])
    In settings requiring synthetic data generation based on a clinical cohort, e.g., due to data protection regulations, heterogeneity across individuals might be a nuisance that we need to control or faithfully preserve. The sources of such heterogeneity might be known, e.g., as indicated by sub-groups labels, or might be unknown and thus reflected only in properties of distributions, such as bimodality or skewness. We investigate how such heterogeneity can be preserved and controlled when obtaining synthetic data from variational autoencoders (VAEs), i.e., a generative deep learning technique that utilizes a low-dimensional latent representation. To faithfully reproduce unknown heterogeneity reflected in marginal distributions, we propose to combine VAEs with pre-transformations. For dealing with known heterogeneity due to sub-groups, we complement VAEs with models for group membership, specifically from propensity score regression. The evaluation is performed with a realistic simulation design that features sub-groups and challenging marginal distributions. The proposed approach faithfully recovers the latter, compared to synthetic data approaches that focus purely on marginal distributions. Propensity scores add complementary information, e.g., when visualized in the latent space, and enable sampling of synthetic data with or without sub-group specific characteristics. We also illustrate the proposed approach with real data from an international stroke trial that exhibits considerable distribution differences between study sites, in addition to bimodality. These results indicate that describing heterogeneity by statistical approaches, such as propensity score regression, might be more generally useful for complementing generative deep learning for obtaining synthetic data that faithfully reflects structure from clinical cohorts.
    Beyond Top-Class Agreement: Using Divergences to Forecast Performance under Distribution Shift. (arXiv:2312.08033v1 [cs.LG])
    Knowing if a model will generalize to data 'in the wild' is crucial for safe deployment. To this end, we study model disagreement notions that consider the full predictive distribution - specifically disagreement based on Hellinger distance, Jensen-Shannon and Kullback-Leibler divergence. We find that divergence-based scores provide better test error estimates and detection rates on out-of-distribution data compared to their top-1 counterparts. Experiments involve standard vision and foundation models.
    Machine Learning for the Multi-Dimensional Bin Packing Problem: Literature Review and Empirical Evaluation. (arXiv:2312.08103v1 [cs.LG])
    The Bin Packing Problem (BPP) is a well-established combinatorial optimization (CO) problem. Since it has many applications in our daily life, e.g. logistics and resource allocation, people are seeking efficient bin packing algorithms. On the other hand, researchers have been making constant advances in machine learning (ML), which is famous for its efficiency. In this article, we first formulate BPP, introducing its variants and practical constraints. Then, a comprehensive survey on ML for multi-dimensional BPP is provided. We further collect some public benchmarks of 3D BPP, and evaluate some online methods on the Cutting Stock Dataset. Finally, we share our perspective on challenges and future directions in BPP. To the best of our knowledge, this is the first systematic review of ML-related methods for BPP.
    Exploring Popularity Bias in Session-based Recommendation. (arXiv:2312.07855v1 [cs.IR])
    Existing work has revealed that large-scale offline evaluation of recommender systems for user-item interactions is prone to bias caused by the deployed system itself, as a form of closed loop feedback. Many adopt the \textit{propensity} concept to analyze or mitigate this empirical issue. In this work, we extend the analysis to session-based setup and adapted propensity calculation to the unique characteristics of session-based recommendation tasks. Our experiments incorporate neural models and KNN-based models, and cover both the music and the e-commerce domain. We study the distributions of propensity and different stratification techniques on different datasets and find that propensity-related traits are actually dataset-specific. We then leverage the effect of stratification and achieve promising results compared to the original models.
    SPD-DDPM: Denoising Diffusion Probabilistic Models in the Symmetric Positive Definite Space. (arXiv:2312.08200v1 [cs.LG])
    Symmetric positive definite~(SPD) matrices have shown important value and applications in statistics and machine learning, such as FMRI analysis and traffic prediction. Previous works on SPD matrices mostly focus on discriminative models, where predictions are made directly on $E(X|y)$, where $y$ is a vector and $X$ is an SPD matrix. However, these methods are challenging to handle for large-scale data, as they need to access and process the whole data. In this paper, inspired by denoising diffusion probabilistic model~(DDPM), we propose a novel generative model, termed SPD-DDPM, by introducing Gaussian distribution in the SPD space to estimate $E(X|y)$. Moreover, our model is able to estimate $p(X)$ unconditionally and flexibly without giving $y$. On the one hand, the model conditionally learns $p(X|y)$ and utilizes the mean of samples to obtain $E(X|y)$ as a prediction. On the other hand, the model unconditionally learns the probability distribution of the data $p(X)$ and generates samples that conform to this distribution. Furthermore, we propose a new SPD net which is much deeper than the previous networks and allows for the inclusion of conditional factors. Experiment results on toy data and real taxi data demonstrate that our models effectively fit the data distribution both unconditionally and unconditionally and provide accurate predictions.
    Pneumonia Detection on chest X-ray images Using Ensemble of Deep Convolutional Neural Networks. (arXiv:2312.07965v1 [eess.IV])
    Pneumonia is a life-threatening lung infection resulting from several different viral infections. Identifying and treating pneumonia on chest X-ray images can be difficult due to its similarity to other pulmonary diseases. Thus, the existing methods for predicting pneumonia cannot attain substantial levels of accuracy. Therefore, this paper presents a computer-aided classification of pneumonia, coined as Ensemble Learning (EL), to simplify the diagnosis process on chest X-ray images. Our proposal is based on Convolutional Neural Network (CNN) models, which are pre-trained CNN models that have been recently employed to enhance the performance of many medical tasks instead of training CNN models from scratch. We propose to use three well-known CNN pre-trained (DenseNet169, MobileNetV2 and Vision Transformer) using the ImageNet database. Then, these models are trained on the chest X-ray data set using fine-tuning. Finally, the results are obtained by combining the extracted features from these three models during the experimental phase. The proposed EL approach outperforms other existing state-of-the-art methods, and it obtains an accuracy of 93.91% and a F1-Score of 93.88% on the testing phase.
    Defending Our Privacy With Backdoors. (arXiv:2310.08320v2 [cs.LG] UPDATED)
    The proliferation of large AI models trained on uncurated, often sensitive web-scraped data has raised significant privacy concerns. One of the concerns is that adversaries can extract information about the training data using privacy attacks. Unfortunately, the task of removing specific information from the models without sacrificing performance is not straightforward and has proven to be challenging. We propose a rather easy yet effective defense based on backdoor attacks to remove private information such as names of individuals from models, and focus in this work on text encoders. Specifically, through strategic insertion of backdoors, we align the embeddings of sensitive phrases with those of neutral terms-"a person" instead of the person's name. Our empirical results demonstrate the effectiveness of our backdoor-based defense on CLIP by assessing its performance using a specialized privacy attack for zero-shot classifiers. Our approach provides not only a new "dual-use" perspective on backdoor attacks, but also presents a promising avenue to enhance the privacy of individuals within models trained on uncurated web-scraped data.
    Enhancing Robotic Navigation: An Evaluation of Single and Multi-Objective Reinforcement Learning Strategies. (arXiv:2312.07953v1 [cs.RO])
    This study presents a comparative analysis between single-objective and multi-objective reinforcement learning methods for training a robot to navigate effectively to an end goal while efficiently avoiding obstacles. Traditional reinforcement learning techniques, namely Deep Q-Network (DQN), Deep Deterministic Policy Gradient (DDPG), and Twin Delayed DDPG (TD3), have been evaluated using the Gazebo simulation framework in a variety of environments with parameters such as random goal and robot starting locations. These methods provide a numerical reward to the robot, offering an indication of action quality in relation to the goal. However, their limitations become apparent in complex settings where multiple, potentially conflicting, objectives are present. To address these limitations, we propose an approach employing Multi-Objective Reinforcement Learning (MORL). By modifying the reward function to return a vector of rewards, each pertaining to a distinct objective, the robot learns a policy that effectively balances the different goals, aiming to achieve a Pareto optimal solution. This comparative study highlights the potential for MORL in complex, dynamic robotic navigation tasks, setting the stage for future investigations into more adaptable and robust robotic behaviors.
    Erasing Self-Supervised Learning Backdoor by Cluster Activation Masking. (arXiv:2312.07955v1 [cs.CV])
    Researchers have recently found that Self-Supervised Learning (SSL) is vulnerable to backdoor attacks. The attacker can embed hidden SSL backdoors via a few poisoned examples in the training dataset and maliciously manipulate the behavior of downstream models. To defend against SSL backdoor attacks, a feasible route is to detect and remove the poisonous samples in the training set. However, the existing SSL backdoor defense method fails to detect the poisonous samples precisely. In this paper, we propose to erase the SSL backdoor by cluster activation masking and propose a novel PoisonCAM method. After obtaining the threat model trained on the poisoned dataset, our method can precisely detect poisonous samples based on the assumption that masking the backdoor trigger can effectively change the activation of a downstream clustering model. In experiments, our PoisonCAM achieves 96% accuracy for backdoor trigger detection compared to 3% of the state-of-the-art method on poisoned ImageNet-100. Moreover, our proposed PoisonCAM significantly improves the performance of the trained SSL model under backdoor attacks compared to the state-of-the-art method. Our code will be available at https://github.com/LivXue/PoisonCAM.
    Multi-perspective Feedback-attention Coupling Model for Continuous-time Dynamic Graphs. (arXiv:2312.07983v1 [cs.LG])
    Recently, representation learning over graph networks has gained popularity, with various models showing promising results. Despite this, several challenges persist: 1) most methods are designed for static or discrete-time dynamic graphs; 2) existing continuous-time dynamic graph algorithms focus on a single evolving perspective; and 3) many continuous-time dynamic graph approaches necessitate numerous temporal neighbors to capture long-term dependencies. In response, this paper introduces the Multi-Perspective Feedback-Attention Coupling (MPFA) model. MPFA incorporates information from both evolving and raw perspectives, efficiently learning the interleaved dynamics of observed processes. The evolving perspective employs temporal self-attention to distinguish continuously evolving temporal neighbors for information aggregation. Through dynamic updates, this perspective can capture long-term dependencies using a small number of temporal neighbors. Meanwhile, the raw perspective utilizes a feedback attention module with growth characteristic coefficients to aggregate raw neighborhood information. Experimental results on a self-organizing dataset and seven public datasets validate the efficacy and competitiveness of our proposed model.
    EdVAE: Mitigating Codebook Collapse with Evidential Discrete Variational Autoencoders. (arXiv:2310.05718v2 [cs.CV] UPDATED)
    Codebook collapse is a common problem in training deep generative models with discrete representation spaces like Vector Quantized Variational Autoencoders (VQ-VAEs). We observe that the same problem arises for the alternatively designed discrete variational autoencoders (dVAEs) whose encoder directly learns a distribution over the codebook embeddings to represent the data. We hypothesize that using the softmax function to obtain a probability distribution causes the codebook collapse by assigning overconfident probabilities to the best matching codebook elements. In this paper, we propose a novel way to incorporate evidential deep learning (EDL) instead of softmax to combat the codebook collapse problem of dVAE. We evidentially monitor the significance of attaining the probability distribution over the codebook embeddings, in contrast to softmax usage. Our experiments using various datasets show that our model, called EdVAE, mitigates codebook collapse while improving the reconstruction performance, and enhances the codebook usage compared to dVAE and VQ-VAE based models. Our code can be found at https://github.com/ituvisionlab/EdVAE .
    Hierarchical Classification of Financial Transactions Through Context-Fusion of Transformer-based Embeddings and Taxonomy-aware Attention Layer. (arXiv:2312.07730v1 [cs.LG])
    This work proposes the Two-headed DragoNet, a Transformer-based model for hierarchical multi-label classification of financial transactions. Our model is based on a stack of Transformers encoder layers that generate contextual embeddings from two short textual descriptors (merchant name and business activity), followed by a Context Fusion layer and two output heads that classify transactions according to a hierarchical two-level taxonomy (macro and micro categories). Finally, our proposed Taxonomy-aware Attention Layer corrects predictions that break categorical hierarchy rules defined in the given taxonomy. Our proposal outperforms classical machine learning methods in experiments of macro-category classification by achieving an F1-score of 93\% on a card dataset and 95% on a current account dataset.
    Learning to Transmit with Provable Guarantees in Wireless Federated Learning. (arXiv:2304.09329v2 [cs.LG] UPDATED)
    We propose a novel data-driven approach to allocate transmit power for federated learning (FL) over interference-limited wireless networks. The proposed method is useful in challenging scenarios where the wireless channel is changing during the FL training process and when the training data are not independent and identically distributed (non-i.i.d.) on the local devices. Intuitively, the power policy is designed to optimize the information received at the server end during the FL process under communication constraints. Ultimately, our goal is to improve the accuracy and efficiency of the global FL model being trained. The proposed power allocation policy is parameterized using graph convolutional networks (GCNs), and the associated constrained optimization problem is solved through a primal-dual (PD) algorithm. Theoretically, we show that the formulated problem has a zero duality gap and, once the power policy is parameterized, optimality depends on how expressive this parameterization is. Numerically, we demonstrate that the proposed method outperforms existing baselines under different wireless channel settings and varying degrees of data heterogeneity.
    Linear Combination of Exponential Moving Averages for Wireless Channel Prediction. (arXiv:2312.07945v1 [cs.NI])
    The ability to predict the behavior of a wireless channel in terms of the frame delivery ratio is quite valuable, and permits, e.g., to optimize the operating parameters of a wireless network at runtime, or to proactively react to the degradation of the channel quality, in order to meet the stringent requirements about dependability and end-to-end latency that typically characterize industrial applications. In this work, prediction models based on the exponential moving average (EMA) are investigated in depth, which are proven to outperform other simple statistical methods and whose performance is nearly as good as artificial neural networks, but with dramatically lower computational requirements. Regarding the innovation and motivation of this work, a new model that we called EMA linear combination (ELC), is introduced, explained, and evaluated experimentally. Its prediction accuracy, tested on some databases acquired from a real setup based on Wi-Fi devices, showed that ELC brings tangible improvements over EMA in any experimental conditions, the only drawback being a slight increase in computational complexity.
    GQKVA: Efficient Pre-training of Transformers by Grouping Queries, Keys, and Values. (arXiv:2311.03426v2 [cs.LG] UPDATED)
    Massive transformer-based models face several challenges, including slow and computationally intensive pre-training and over-parametrization. This paper addresses these challenges by proposing a versatile method called GQKVA, which generalizes query, key, and value grouping techniques. GQKVA is designed to speed up transformer pre-training while reducing the model size. Our experiments with various GQKVA variants highlight a clear trade-off between performance and model size, allowing for customized choices based on resource and time limitations. Our findings also indicate that the conventional multi-head attention approach is not always the best choice, as there are lighter and faster alternatives available. We tested our method on ViT, which achieved an approximate 0.3% increase in accuracy while reducing the model size by about 4% in the task of image classification. Additionally, our most aggressive model reduction experiment resulted in a reduction of approximately 15% in model size, with only around a 1% drop in accuracy.
    Interpretable factorization of clinical questionnaires to identify latent factors of psychopathology. (arXiv:2312.07762v1 [cs.LG])
    Psychiatry research seeks to understand the manifestations of psychopathology in behavior, as measured in questionnaire data, by identifying a small number of latent factors that explain them. While factor analysis is the traditional tool for this purpose, the resulting factors may not be interpretable, and may also be subject to confounding variables. Moreover, missing data are common, and explicit imputation is often required. To overcome these limitations, we introduce interpretability constrained questionnaire factorization (ICQF), a non-negative matrix factorization method with regularization tailored for questionnaire data. Our method aims to promote factor interpretability and solution stability. We provide an optimization procedure with theoretical convergence guarantees, and an automated procedure to detect latent dimensionality accurately. We validate these procedures using realistic synthetic data. We demonstrate the effectiveness of our method in a widely used general-purpose questionnaire, in two independent datasets (the Healthy Brain Network and Adolescent Brain Cognitive Development studies). Specifically, we show that ICQF improves interpretability, as defined by domain experts, while preserving diagnostic information across a range of disorders, and outperforms competing methods for smaller dataset sizes. This suggests that the regularization in our method matches domain characteristics. The python implementation for ICQF is available at \url{https://github.com/jefferykclam/ICQF}.
    On a Foundation Model for Operating Systems. (arXiv:2312.07813v1 [cs.OS])
    This paper lays down the research agenda for a domain-specific foundation model for operating systems (OSes). Our case for a foundation model revolves around the observations that several OS components such as CPU, memory, and network subsystems are interrelated and that OS traces offer the ideal dataset for a foundation model to grasp the intricacies of diverse OS components and their behavior in varying environments and workloads. We discuss a wide range of possibilities that then arise, from employing foundation models as policy agents to utilizing them as generators and predictors to assist traditional OS control algorithms. Our hope is that this paper spurs further research into OS foundation models and creating the next generation of operating systems for the evolving computing landscape.
    Radio Signal Classification by Adversarially Robust Quantum Machine Learning. (arXiv:2312.07821v1 [quant-ph])
    Radio signal classification plays a pivotal role in identifying the modulation scheme used in received radio signals, which is essential for demodulation and proper interpretation of the transmitted information. Researchers have underscored the high susceptibility of ML algorithms for radio signal classification to adversarial attacks. Such vulnerability could result in severe consequences, including misinterpretation of critical messages, interception of classified information, or disruption of communication channels. Recent advancements in quantum computing have revolutionized theories and implementations of computation, bringing the unprecedented development of Quantum Machine Learning (QML). It is shown that quantum variational classifiers (QVCs) provide notably enhanced robustness against classical adversarial attacks in image classification. However, no research has yet explored whether QML can similarly mitigate adversarial threats in the context of radio signal classification. This work applies QVCs to radio signal classification and studies their robustness to various adversarial attacks. We also propose the novel application of the approximate amplitude encoding (AAE) technique to encode radio signal data efficiently. Our extensive simulation results present that attacks generated on QVCs transfer well to CNN models, indicating that these adversarial examples can fool neural networks that they are not explicitly designed to attack. However, the converse is not true. QVCs primarily resist the attacks generated on CNNs. Overall, with comprehensive simulations, our results shed new light on the growing field of QML by bridging knowledge gaps in QAML in radio signal classification and uncovering the advantages of applying QML methods in practical applications.
    Abusive Span Detection for Vietnamese Narrative Texts. (arXiv:2312.07831v1 [cs.CL])
    Abuse in its various forms, including physical, psychological, verbal, sexual, financial, and cultural, has a negative impact on mental health. However, there are limited studies on applying natural language processing (NLP) in this field in Vietnam. Therefore, we aim to contribute by building a human-annotated Vietnamese dataset for detecting abusive content in Vietnamese narrative texts. We sourced these texts from VnExpress, Vietnam's popular online newspaper, where readers often share stories containing abusive content. Identifying and categorizing abusive spans in these texts posed significant challenges during dataset creation, but it also motivated our research. We experimented with lightweight baseline models by freezing PhoBERT and XLM-RoBERTa and using their hidden states in a BiLSTM to assess the complexity of the dataset. According to our experimental results, PhoBERT outperforms other models in both labeled and unlabeled abusive span detection tasks. These results indicate that it has the potential for future improvements.
    Brain-optimized inference improves reconstructions of fMRI brain activity. (arXiv:2312.07705v1 [q-bio.NC])
    The release of large datasets and developments in AI have led to dramatic improvements in decoding methods that reconstruct seen images from human brain activity. We evaluate the prospect of further improving recent decoding methods by optimizing for consistency between reconstructions and brain activity during inference. We sample seed reconstructions from a base decoding method, then iteratively refine these reconstructions using a brain-optimized encoding model that maps images to brain activity. At each iteration, we sample a small library of images from an image distribution (a diffusion model) conditioned on a seed reconstruction from the previous iteration. We select those that best approximate the measured brain activity when passed through our encoding model, and use these images for structural guidance during the generation of the small library in the next iteration. We reduce the stochasticity of the image distribution at each iteration, and stop when a criterion on the "width" of the image distribution is met. We show that when this process is applied to recent decoding methods, it outperforms the base decoding method as measured by human raters, a variety of image feature metrics, and alignment to brain activity. These results demonstrate that reconstruction quality can be significantly improved by explicitly aligning decoding distributions to brain activity distributions, even when the seed reconstruction is output from a state-of-the-art decoding algorithm. Interestingly, the rate of refinement varies systematically across visual cortex, with earlier visual areas generally converging more slowly and preferring narrower image distributions, relative to higher-level brain areas. Brain-optimized inference thus offers a succinct and novel method for improving reconstructions and exploring the diversity of representations across visual brain areas.
    Estimation of embedding vectors in high dimensions. (arXiv:2312.07802v1 [cs.LG])
    Embeddings are a basic initial feature extraction step in many machine learning models, particularly in natural language processing. An embedding attempts to map data tokens to a low-dimensional space where similar tokens are mapped to vectors that are close to one another by some metric in the embedding space. A basic question is how well can such embedding be learned? To study this problem, we consider a simple probability model for discrete data where there is some "true" but unknown embedding where the correlation of random variables is related to the similarity of the embeddings. Under this model, it is shown that the embeddings can be learned by a variant of low-rank approximate message passing (AMP) method. The AMP approach enables precise predictions of the accuracy of the estimation in certain high-dimensional limits. In particular, the methodology provides insight on the relations of key parameters such as the number of samples per value, the frequency of the terms, and the strength of the embedding correlation on the probability distribution. Our theoretical findings are validated by simulations on both synthetic data and real text data.
    Intelligence Primer. (arXiv:2008.07324v4 [cs.AI] UPDATED)
    Intelligence is a fundamental part of all living things, as well as the foundation for Artificial Intelligence. In this primer we explore the ideas associated with intelligence and, by doing so, understand the implications and constraints and potentially outline the capabilities of future systems. Artificial Intelligence, in the form of Machine Learning, has already had a significant impact on our lives. As an exploration, we journey into different parts of intelligence that appear essential. We hope that people find this helpful in determining the future. Also, during the exploration, we hope to create new thought-provoking questions. Intelligence is not a single weighable quantity but a subject that spans Biology, Physics, Philosophy, Cognitive Science, Neuroscience, Psychology, and Computer Science. The historian Yuval Noah Harari pointed out that engineers and scientists in the future will have to broaden their understandings to include disciplines such as Psychology, Philosophy, and Ethics. Fiction writers have long portrayed engineers and scientists as deficient in these areas. Today, in modern society, the emergence of Artificial Intelligence and legal requirements act as forcing functions to push these broader subjects into the foreground. We start with an introduction to intelligence and move quickly to more profound thoughts and ideas. We call this a Life, the Universe, and Everything primer, after the famous science fiction book by Douglas Adams. Forty-two may be the correct answer, but what are the questions?
    ScaLearn: Simple and Highly Parameter-Efficient Task Transfer by Learning to Scale. (arXiv:2310.01217v2 [cs.LG] UPDATED)
    Multi-task learning (MTL) has shown considerable practical benefits, particularly when using pre-trained language models (PLMs). While this is commonly achieved by simultaneously learning $n$ tasks under a joint optimization procedure, recent methods such as AdapterFusion structure the problem into two distinct stages: (i) task learning, where knowledge specific to a task is encapsulated within sets of parameters (e.g., adapters), and (ii) transfer, where this already learned knowledge is leveraged for a target task. This separation of concerns provides numerous benefits, such as promoting reusability, and addressing cases involving data privacy and societal concerns; on the flip side, current two-stage MTL methods come with the cost of introducing a substantial number of additional parameters. In this work, we address this issue by leveraging the usefulness of linearly scaling the output representations of source adapters for transfer learning. We introduce ScaLearn, a simple and highly parameter-efficient two-stage MTL method that capitalizes on the knowledge of the source tasks by learning a minimal set of scaling parameters that enable effective knowledge transfer to a target task. Our experiments on three benchmarks (GLUE, SuperGLUE, and HumSet) show that our ScaLearn, in addition to facilitating the benefits of two-stage MTL, consistently outperforms strong baselines with only a small number of transfer parameters - roughly 0.35% of those of AdapterFusion. Remarkably, we observe that ScaLearn maintains its strong abilities even when further reducing parameters through uniform scaling and layer-sharing, achieving similarly competitive results with only $8$ transfer parameters for each target task. Our proposed approach thus demonstrates the power of simple scaling as a promise for more efficient task transfer.
    I Open at the Close: A Deep Reinforcement Learning Evaluation of Open Streets Initiatives. (arXiv:2312.07680v1 [cs.LG])
    The open streets initiative "opens" streets to pedestrians and bicyclists by closing them to cars and trucks. The initiative, adopted by many cities across North America, increases community space in urban environments. But could open streets also make cities safer and less congested? We study this question by framing the choice of which streets to open as a reinforcement learning problem. In order to simulate the impact of opening streets, we first compare models for predicting vehicle collisions given network and temporal data. We find that a recurrent graph neural network, leveraging the graph structure and the short-term temporal dependence of the data, gives the best predictive performance. Then, with the ability to simulate collisions and traffic, we frame a reinforcement learning problem to find which streets to open. We compare the streets in the NYC Open Streets program to those proposed by a Q-learning algorithm. We find that the streets proposed by the Q-learning algorithm have reliably better outcomes, while streets in the program have similar outcomes to randomly selected streets. We present our work as a step toward principally choosing which streets to open for safer and less congested cities. All our code and data are available on Github.
    GP+: A Python Library for Kernel-based learning via Gaussian Processes. (arXiv:2312.07694v1 [cs.LG])
    In this paper we introduce GP+, an open-source library for kernel-based learning via Gaussian processes (GPs) which are powerful statistical models that are completely characterized by their parametric covariance and mean functions. GP+ is built on PyTorch and provides a user-friendly and object-oriented tool for probabilistic learning and inference. As we demonstrate with a host of examples, GP+ has a few unique advantages over other GP modeling libraries. We achieve these advantages primarily by integrating nonlinear manifold learning techniques with GPs' covariance and mean functions. As part of introducing GP+, in this paper we also make methodological contributions that (1) enable probabilistic data fusion and inverse parameter estimation, and (2) equip GPs with parsimonious parametric mean functions which span mixed feature spaces that have both categorical and quantitative variables. We demonstrate the impact of these contributions in the context of Bayesian optimization, multi-fidelity modeling, sensitivity analysis, and calibration of computer models.
    Regret Analysis of Policy Gradient Algorithm for Infinite Horizon Average Reward Markov Decision Processes. (arXiv:2309.01922v2 [cs.LG] UPDATED)
    In this paper, we consider an infinite horizon average reward Markov Decision Process (MDP). Distinguishing itself from existing works within this context, our approach harnesses the power of the general policy gradient-based algorithm, liberating it from the constraints of assuming a linear MDP structure. We propose a policy gradient-based algorithm and show its global convergence property. We then prove that the proposed algorithm has $\tilde{\mathcal{O}}({T}^{3/4})$ regret. Remarkably, this paper marks a pioneering effort by presenting the first exploration into regret-bound computation for the general parameterized policy gradient algorithm in the context of average reward scenarios.
    Diffusion Models Enable Zero-Shot Pose Estimation for Lower-Limb Prosthetic Users. (arXiv:2312.07854v1 [cs.CV])
    The application of 2D markerless gait analysis has garnered increasing interest and application within clinical settings. However, its effectiveness in the realm of lower-limb amputees has remained less than optimal. In response, this study introduces an innovative zero-shot method employing image generation diffusion models to achieve markerless pose estimation for lower-limb prosthetics, presenting a promising solution to gait analysis for this specific population. Our approach demonstrates an enhancement in detecting key points on prosthetic limbs over existing methods, and enables clinicians to gain invaluable insights into the kinematics of lower-limb amputees across the gait cycle. The outcomes obtained not only serve as a proof-of-concept for the feasibility of this zero-shot approach but also underscore its potential in advancing rehabilitation through gait analysis for this unique population.
    Visual Instruction Tuning. (arXiv:2304.08485v2 [cs.CV] UPDATED)
    Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. In this paper, we present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and LLM for general-purpose visual and language understanding.Our early experiments show that LLaVA demonstrates impressive multimodel chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions, and yields a 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%. We make GPT-4 generated visual instruction tuning data, our model and code base publicly available.
    On Estimating the Gradient of the Expected Information Gain in Bayesian Experimental Design. (arXiv:2308.09888v2 [stat.ML] UPDATED)
    Bayesian Experimental Design (BED), which aims to find the optimal experimental conditions for Bayesian inference, is usually posed as to optimize the expected information gain (EIG). The gradient information is often needed for efficient EIG optimization, and as a result the ability to estimate the gradient of EIG is essential for BED problems. The primary goal of this work is to develop methods for estimating the gradient of EIG, which, combined with the stochastic gradient descent algorithms, result in efficient optimization of EIG. Specifically, we first introduce a posterior expected representation of the EIG gradient with respect to the design variables. Based on this, we propose two methods for estimating the EIG gradient, UEEG-MCMC that leverages posterior samples generated through Markov Chain Monte Carlo (MCMC) to estimate the EIG gradient, and BEEG-AP that focuses on achieving high simulation efficiency by repeatedly using parameter samples. Theoretical analysis and numerical studies illustrate that UEEG-MCMC is robust agains the actual EIG value, while BEEG-AP is more efficient when the EIG value to be optimized is small. Moreover, both methods show superior performance compared to several popular benchmarks in our numerical experiments.
    Training of Neural Networks with Uncertain Data, A Mixture of Experts Approach. (arXiv:2312.08083v1 [stat.ML])
    This paper presents the "Uncertainty-aware Mixture of Experts" (uMoE), a novel approach designed to address aleatoric uncertainty in the training of predictive models based on Neural Networks (NNs). While existing methods primarily focus on managing uncertainty during infer-ence, uMoE integrates uncertainty directly into the train-ing process. The uMoE approach adopts a "Divide and Conquer" paradigm to partition the uncertain input space into more manageable subspaces. It consists of Expert components, each trained solely on the portion of input uncertainty corresponding to their subspace. On top of the Experts, a Gating Unit, guided by additional infor-mation about the distribution of uncertain inputs across these subspaces, learns to weight the Experts to minimize deviations from the ground truth. Our results highlight that uMoE significantly outperforms baseline methods in handling data uncertainty. Furthermore, we conducted a robustness analysis, illustrating its capability to adapt to varying levels of uncertainty and suggesting optimal threshold parameters. This innovative approach holds wide applicability across diverse data-driven domains, in-cluding biomedical signal processing, autonomous driv-ing, and production quality control.
    \emph{Lifted} RDT based capacity analysis of the 1-hidden layer treelike \emph{sign} perceptrons neural networks. (arXiv:2312.08257v1 [stat.ML])
    We consider the memorization capabilities of multilayered \emph{sign} perceptrons neural networks (SPNNs). A recent rigorous upper-bounding capacity characterization, obtained in \cite{Stojnictcmspnncaprdt23} utilizing the Random Duality Theory (RDT), demonstrated that adding neurons in a network configuration may indeed be very beneficial. Moreover, for particular \emph{treelike committee machines} (TCM) architectures with $d\leq 5$ neurons in the hidden layer, \cite{Stojnictcmspnncaprdt23} made a very first mathematically rigorous progress in over 30 years by lowering the previously best known capacity bounds of \cite{MitchDurb89}. Here, we first establish that the RDT bounds from \cite{Stojnictcmspnncaprdt23} scale as $\sim \sqrt{d}$ and can not on their own \emph{universally} (over the entire range of $d$) beat the best known $\sim \log(d)$ scaling of the bounds from \cite{MitchDurb89}. After recognizing that the progress from \cite{Stojnictcmspnncaprdt23} is therefore promising, but yet without a complete concretization, we then proceed by considering the recently developed fully lifted RDT (fl RDT) as an alternative. While the fl RDT is indeed a powerful juggernaut, it typically relies on heavy numerical evaluations. To avoid such heavy numerics, we here focus on a simplified, \emph{partially lifted}, variant and show that it allows for very neat, closed form, analytical capacity characterizations. Moreover, we obtain the concrete capacity bounds that \emph{universally} improve for \emph{any} $d$ over the best known ones of \cite{MitchDurb89}.
    A Novel Metric for Measuring Data Quality in Classification Applications (extended version). (arXiv:2312.08066v1 [cs.LG])
    Data quality is a key element for building and optimizing good learning models. Despite many attempts to characterize data quality, there is still a need for rigorous formalization and an efficient measure of the quality from available observations. Indeed, without a clear understanding of the training and testing processes, it is hard to evaluate the intrinsic performance of a model. Besides, tools allowing to measure data quality specific to machine learning are still lacking. In this paper, we introduce and explain a novel metric to measure data quality. This metric is based on the correlated evolution between the classification performance and the deterioration of data. The proposed method has the major advantage of being model-independent. Furthermore, we provide an interpretation of each criterion and examples of assessment levels. We confirm the utility of the proposed metric with intensive numerical experiments and detail some illustrative cases with controlled and interpretable qualities.
    ClusterDDPM: An EM clustering framework with Denoising Diffusion Probabilistic Models. (arXiv:2312.08029v1 [cs.LG])
    Variational autoencoder (VAE) and generative adversarial networks (GAN) have found widespread applications in clustering and have achieved significant success. However, the potential of these approaches may be limited due to VAE's mediocre generation capability or GAN's well-known instability during adversarial training. In contrast, denoising diffusion probabilistic models (DDPMs) represent a new and promising class of generative models that may unlock fresh dimensions in clustering. In this study, we introduce an innovative expectation-maximization (EM) framework for clustering using DDPMs. In the E-step, we aim to derive a mixture of Gaussian priors for the subsequent M-step. In the M-step, our focus lies in learning clustering-friendly latent representations for the data by employing the conditional DDPM and matching the distribution of latent representations to the mixture of Gaussian priors. We present a rigorous theoretical analysis of the optimization process in the M-step, proving that the optimizations are equivalent to maximizing the lower bound of the Q function within the vanilla EM framework under certain constraints. Comprehensive experiments validate the advantages of the proposed framework, showcasing superior performance in clustering, unsupervised conditional generation and latent representation learning.
    SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention. (arXiv:2312.07987v1 [cs.LG])
    The costly self-attention layers in modern Transformers require memory and compute quadratic in sequence length. Existing approximation methods usually underperform and fail to obtain significant speedups in practice. Here we present SwitchHead - a novel method that reduces both compute and memory requirements and achieves wall-clock speedup, while matching the language modeling performance of baseline Transformers with the same parameter budget. SwitchHead uses Mixture-of-Experts (MoE) layers for the value and output projections and requires 4 to 8 times fewer attention matrices than standard Transformers. Our novel attention can also be combined with MoE MLP layers, resulting in an efficient fully-MoE "SwitchHead" Transformer model. Our code is public.
    Video Dynamics Prior: An Internal Learning Approach for Robust Video Enhancements. (arXiv:2312.07835v1 [cs.CV])
    In this paper, we present a novel robust framework for low-level vision tasks, including denoising, object removal, frame interpolation, and super-resolution, that does not require any external training data corpus. Our proposed approach directly learns the weights of neural modules by optimizing over the corrupted test sequence, leveraging the spatio-temporal coherence and internal statistics of videos. Furthermore, we introduce a novel spatial pyramid loss that leverages the property of spatio-temporal patch recurrence in a video across the different scales of the video. This loss enhances robustness to unstructured noise in both the spatial and temporal domains. This further results in our framework being highly robust to degradation in input frames and yields state-of-the-art results on downstream tasks such as denoising, object removal, and frame interpolation. To validate the effectiveness of our approach, we conduct qualitative and quantitative evaluations on standard video datasets such as DAVIS, UCF-101, and VIMEO90K-T.
    An Online, Adaptive and Unsupervised Regression Framework with Drift Detection for Label Scarcity Contexts. (arXiv:2312.07682v1 [cs.LG])
    In scenarios where obtaining real-time labels proves challenging, conventional approaches may result in sub-optimal performance. This paper presents an optimal strategy for streaming contexts with limited labeled data, introducing an adaptive technique for unsupervised regression. The proposed method leverages a sparse set of initial labels and introduces an innovative drift detection mechanism to enable dynamic model adaptations in response to evolving patterns in the data. To enhance adaptability, we integrate the ADWIN (ADaptive WINdowing) algorithm with error generalization based on Root Mean Square Error (RMSE). ADWIN facilitates real-time drift detection, while RMSE provides a robust measure of model prediction accuracy. This combination enables our multivariate method to effectively navigate the challenges of streaming data, continuously adapting to changing patterns while maintaining a high level of predictive precision. Finally, we evaluate the performance of our multivariate method across various public datasets, comparing it to non-adapting baselines. Through comprehensive assessments, we demonstrate the superior efficacy of our adaptive regression technique for tasks where obtaining labels in real-time is a significant challenge. The results underscore the method's capacity to outperform traditional approaches and highlight its potential in scenarios characterized by label scarcity and evolving data patterns.
    Prototypical Self-Explainable Models Without Re-training. (arXiv:2312.07822v1 [cs.LG])
    Explainable AI (XAI) has unfolded in two distinct research directions with, on the one hand, post-hoc methods that explain the predictions of a pre-trained black-box model and, on the other hand, self-explainable models (SEMs) which are trained directly to provide explanations alongside their predictions. While the latter is preferred in most safety-critical scenarios, post-hoc approaches have received the majority of attention until now, owing to their simplicity and ability to explain base models without retraining. Current SEMs instead, require complex architectures and heavily regularized loss functions, thus necessitating specific and costly training. To address this shortcoming and facilitate wider use of SEMs, we propose a simple yet efficient universal method called KMEx (K-Means Explainer), which can convert any existing pre-trained model into a prototypical SEM. The motivation behind KMEx is to push towards more transparent deep learning-based decision-making via class-prototype-based explanations that are guaranteed to be diverse and trustworthy without retraining the base model. We compare models obtained from KMEx to state-of-the-art SEMs using an extensive qualitative evaluation to highlight the strengths and weaknesses of each model, further paving the way toward a more reliable and objective evaluation of SEMs.
    FULL-W2V: Fully Exploiting Data Reuse for W2V on GPU-Accelerated Systems. (arXiv:2312.07743v1 [cs.LG])
    Word2Vec remains one of the highly-impactful innovations in the field of Natural Language Processing (NLP) that represents latent grammatical and syntactical information in human text with dense vectors in a low dimension. Word2Vec has high computational cost due to the algorithm's inherent sequentiality, intensive memory accesses, and the large vocabularies it represents. While prior studies have investigated technologies to explore parallelism and improve memory system performance, they struggle to effectively gain throughput on powerful GPUs. We identify memory data access and latency as the primary bottleneck in prior works on GPUs, which prevents highly optimized kernels from attaining the architecture's peak performance. We present a novel algorithm, FULL-W2V, which maximally exploits the opportunities for data reuse in the W2V algorithm and leverages GPU architecture and resources to reduce access to low memory levels and improve temporal locality. FULL-W2V is capable of reducing accesses to GPU global memory significantly, e.g., by more than 89\%, compared to prior state-of-the-art GPU implementations, resulting in significant performance improvement that scales across successive hardware generations. Our prototype implementation achieves 2.97X speedup when ported from Nvidia Pascal P100 to Volta V100 cards, and outperforms the state-of-the-art by 5.72X on V100 cards with the same embedding quality. In-depth analysis indicates that the reduction of memory accesses through register and shared memory caching and high-throughput shared memory reduction leads to a significantly improved arithmetic intensity. FULL-W2V can potentially benefit many applications in NLP and other domains.
    Morphological Profiling for Drug Discovery in the Era of Deep Learning. (arXiv:2312.07899v1 [q-bio.QM])
    Morphological profiling is a valuable tool in phenotypic drug discovery. The advent of high-throughput automated imaging has enabled the capturing of a wide range of morphological features of cells or organisms in response to perturbations at the single-cell resolution. Concurrently, significant advances in machine learning and deep learning, especially in computer vision, have led to substantial improvements in analyzing large-scale high-content images at high-throughput. These efforts have facilitated understanding of compound mechanism-of-action (MOA), drug repurposing, characterization of cell morphodynamics under perturbation, and ultimately contributing to the development of novel therapeutics. In this review, we provide a comprehensive overview of the recent advances in the field of morphological profiling. We summarize the image profiling analysis workflow, survey a broad spectrum of analysis strategies encompassing feature engineering- and deep learning-based approaches, and introduce publicly available benchmark datasets. We place a particular emphasis on the application of deep learning in this pipeline, covering cell segmentation, image representation learning, and multimodal learning. Additionally, we illuminate the application of morphological profiling in phenotypic drug discovery and highlight potential challenges and opportunities in this field.
    CaVE: A Cone-Aligned Approach for Fast Predict-then-optimize with Binary Linear Programs. (arXiv:2312.07718v1 [cs.LG])
    The end-to-end predict-then-optimize framework, also known as decision-focused learning, has gained popularity for its ability to integrate optimization into the training procedure of machine learning models that predict the unknown cost (objective function) coefficients of optimization problems from contextual instance information. Naturally, most of the problems of interest in this space can be cast as integer linear programs. In this work, we focus on binary linear programs (BLPs) and propose a new end-to-end training method for predict-then-optimize. Our method, Cone-aligned Vector Estimation (CaVE), aligns the predicted cost vectors with the cone corresponding to the true optimal solution of a training instance. When the predicted cost vector lies inside the cone, the optimal solution to the linear relaxation of the binary problem is optimal w.r.t. to the true cost vector. Not only does this alignment produce decision-aware learning models, but it also dramatically reduces training time as it circumvents the need to solve BLPs to compute a loss function with its gradients. Experiments across multiple datasets show that our method exhibits a favorable trade-off between training time and solution quality, particularly with large-scale optimization problems such as vehicle routing, a hard BLP that has yet to benefit from predict-then-optimize methods in the literature due to its difficulty.
    Machine Learning and Citizen Science Approaches for Monitoring the Changing Environment. (arXiv:2312.07698v1 [cs.LG])
    This dissertation will combine new tools and methodologies to answer pressing questions regarding inundation area and hurricane events in complex, heterogeneous changing environments. In addition to remote sensing approaches, citizen science and machine learning are both emerging fields that harness advancing technology to answer environmental management and disaster response questions.
    A Perspective of Q-value Estimation on Offline-to-Online Reinforcement Learning. (arXiv:2312.07685v1 [cs.LG])
    Offline-to-online Reinforcement Learning (O2O RL) aims to improve the performance of offline pretrained policy using only a few online samples. Built on offline RL algorithms, most O2O methods focus on the balance between RL objective and pessimism, or the utilization of offline and online samples. In this paper, from a novel perspective, we systematically study the challenges that remain in O2O RL and identify that the reason behind the slow improvement of the performance and the instability of online finetuning lies in the inaccurate Q-value estimation inherited from offline pretraining. Specifically, we demonstrate that the estimation bias and the inaccurate rank of Q-value cause a misleading signal for the policy update, making the standard offline RL algorithms, such as CQL and TD3-BC, ineffective in the online finetuning. Based on this observation, we address the problem of Q-value estimation by two techniques: (1) perturbed value update and (2) increased frequency of Q-value updates. The first technique smooths out biased Q-value estimation with sharp peaks, preventing early-stage policy exploitation of sub-optimal actions. The second one alleviates the estimation bias inherited from offline pretraining by accelerating learning. Extensive experiments on the MuJoco and Adroit environments demonstrate that the proposed method, named SO2, significantly alleviates Q-value estimation issues, and consistently improves the performance against the state-of-the-art methods by up to 83.1%.  ( 2 min )
    Multimodal Sentiment Analysis: Perceived vs Induced Sentiments. (arXiv:2312.07627v1 [cs.CV])
    Social media has created a global network where people can easily access and exchange vast information. This information gives rise to a variety of opinions, reflecting both positive and negative viewpoints. GIFs stand out as a multimedia format offering a visually engaging way for users to communicate. In this research, we propose a multimodal framework that integrates visual and textual features to predict the GIF sentiment. It also incorporates attributes including face emotion detection and OCR generated captions to capture the semantic aspects of the GIF. The developed classifier achieves an accuracy of 82.7% on Twitter GIFs, which is an improvement over state-of-the-art models. Moreover, we have based our research on the ReactionGIF dataset, analysing the variance in sentiment perceived by the author and sentiment induced in the reader  ( 2 min )
    IndoorGNN: A Graph Neural Network based approach for Indoor Localization using WiFi RSSI. (arXiv:2312.07609v1 [eess.SP])
    Indoor localization is the process of determining the location of a person or object inside a building. Potential usage of indoor localization includes navigation, personalization, safety and security, and asset tracking. Commonly used technologies for indoor localization include WiFi, Bluetooth, RFID, and Ultra-wideband. Among these, WiFi's Received Signal Strength Indicator (RSSI)-based localization is preferred because of widely available WiFi Access Points (APs). We have two main contributions. First, we develop our method, 'IndoorGNN' which involves using a Graph Neural Network (GNN) based algorithm in a supervised manner to classify a specific location into a particular region based on the RSSI values collected at that location. Most of the ML algorithms that perform this classification require a large number of labeled data points (RSSI vectors with location information). Collecting such data points is a labor-intensive and time-consuming task. To overcome this challenge, as our second contribution, we demonstrate the performance of IndoorGNN on the restricted dataset. It shows a comparable prediction accuracy to that of the complete dataset. We performed experiments on the UJIIndoorLoc and MNAV datasets, which are real-world standard indoor localization datasets. Our experiments show that IndoorGNN gives better location prediction accuracies when compared with state-of-the-art existing conventional as well as GNN-based methods for this same task. It continues to outperform these algorithms even with restricted datasets. It is noteworthy that its performance does not decrease a lot with a decrease in the number of available data points. Our method can be utilized for navigation and wayfinding in complex indoor environments, asset tracking and building management, enhancing mobile applications with location-based services, and improving safety and security during emergencies.  ( 3 min )
    Decoding Working-Memory Load During n-Back Task Performance from High Channel NIRS Data. (arXiv:2312.07546v1 [eess.SP])
    Near-infrared spectroscopy (NIRS) can measure neural activity through blood oxygenation changes in the brain in a wearable form factor, enabling unique applications for research in and outside the lab. NIRS has proven capable of measuring cognitive states such as mental workload, often using machine learning (ML) based brain-computer interfaces (BCIs). To date, NIRS research has largely relied on probes with under ten to several hundred channels, although recently a new class of wearable NIRS devices with thousands of channels has emerged. This poses unique challenges for ML classification, as NIRS is typically limited by few training trials which results in severely under-determined estimation problems. So far, it is not well understood how such high-resolution data is best leveraged in practical BCIs and whether state-of-the-art (SotA) or better performance can be achieved. To address these questions, we propose an ML strategy to classify working-memory load that relies on spatio-temporal regularization and transfer learning from other subjects in a combination that has not been used in previous NIRS BCIs. The approach can be interpreted as an end-to-end generalized linear model and allows for a high degree of interpretability using channel-level or cortical imaging approaches. We show that using the proposed methodology, it is possible to achieve SotA decoding performance with high-resolution NIRS data. We also replicated several SotA approaches on our dataset of 43 participants wearing a 3198 dual-channel NIRS device while performing the n-Back task and show that these existing methods struggle in the high-channel regime and are largely outperformed by the proposed method. Our approach helps establish high-channel NIRS devices as a viable platform for SotA BCI and opens new applications using this class of headset while also enabling high-resolution model imaging and interpretation.  ( 3 min )
    Large Language Models for Intent-Driven Session Recommendations. (arXiv:2312.07552v1 [cs.CL])
    Intent-aware session recommendation (ISR) is pivotal in discerning user intents within sessions for precise predictions. Traditional approaches, however, face limitations due to their presumption of a uniform number of intents across all sessions. This assumption overlooks the dynamic nature of user sessions, where the number and type of intentions can significantly vary. In addition, these methods typically operate in latent spaces, thus hinder the model's transparency.Addressing these challenges, we introduce a novel ISR approach, utilizing the advanced reasoning capabilities of large language models (LLMs). First, this approach begins by generating an initial prompt that guides LLMs to predict the next item in a session, based on the varied intents manifested in user sessions. Then, to refine this process, we introduce an innovative prompt optimization mechanism that iteratively self-reflects and adjusts prompts. Furthermore, our prompt selection module, built upon the LLMs' broad adaptability, swiftly selects the most optimized prompts across diverse domains. This new paradigm empowers LLMs to discern diverse user intents at a semantic level, leading to more accurate and interpretable session recommendations. Our extensive experiments on three real-world datasets demonstrate the effectiveness of our method, marking a significant advancement in ISR systems.  ( 2 min )
    COVID-19 Detection Using Slices Processing Techniques and a Modified Xception Classifier from Computed Tomography Images. (arXiv:2312.07580v1 [eess.IV])
    This paper extends our previous method for COVID-19 diagnosis, proposing an enhanced solution for detecting COVID-19 from computed tomography (CT) images. To decrease model misclassifications, two key steps of image processing were employed. Firstly, the uppermost and lowermost slices were removed, preserving sixty percent of each patient's slices. Secondly, all slices underwent manual cropping to emphasize the lung areas. Subsequently, resized CT scans (224 by 224) were input into an Xception transfer learning model. Leveraging Xception's architecture and pre-trained weights, the modified model achieved binary classification. Promising results on the COV19-CT database showcased higher validation accuracy and macro F1 score at both the slice and patient levels compared to our previous solution and alternatives on the same dataset.  ( 2 min )
    Understanding (Un)Intended Memorization in Text-to-Image Generative Models. (arXiv:2312.07550v1 [cs.CV])
    Multimodal machine learning, especially text-to-image models like Stable Diffusion and DALL-E 3, has gained significance for transforming text into detailed images. Despite their growing use and remarkable generative capabilities, there is a pressing need for a detailed examination of these models' behavior, particularly with respect to memorization. Historically, memorization in machine learning has been context-dependent, with diverse definitions emerging from classification tasks to complex models like Large Language Models (LLMs) and Diffusion models. Yet, a definitive concept of memorization that aligns with the intricacies of text-to-image synthesis remains elusive. This understanding is vital as memorization poses privacy risks yet is essential for meeting user expectations, especially when generating representations of underrepresented entities. In this paper, we introduce a specialized definition of memorization tailored to text-to-image models, categorizing it into three distinct types according to user expectations. We closely examine the subtle distinctions between intended and unintended memorization, emphasizing the importance of balancing user privacy with the generative quality of the model outputs. Using the Stable Diffusion model, we offer examples to validate our memorization definitions and clarify their application.  ( 2 min )
    SE(3)-Invariant Multiparameter Persistent Homology for Chiral-Sensitive Molecular Property Prediction. (arXiv:2312.07633v1 [cs.LG])
    In this study, we present a novel computational method for generating molecular fingerprints using multiparameter persistent homology (MPPH). This technique holds considerable significance for drug discovery and materials science, where precise molecular property prediction is vital. By integrating SE(3)-invariance with Vietoris-Rips persistent homology, we effectively capture the three-dimensional representations of molecular chirality. This non-superimposable mirror image property directly influences the molecular interactions, serving as an essential factor in molecular property prediction. We explore the underlying topologies and patterns in molecular structures by applying Vietoris-Rips persistent homology across varying scales and parameters such as atomic weight, partial charge, bond type, and chirality. Our method's efficacy can be improved by incorporating additional parameters such as aromaticity, orbital hybridization, bond polarity, conjugated systems, as well as bond and torsion angles. Additionally, we leverage Stochastic Gradient Langevin Boosting in a Bayesian ensemble of GBDTs to obtain aleatoric and epistemic uncertainty estimates for gradient boosting models. With these uncertainty estimates, we prioritize high-uncertainty samples for active learning and model fine-tuning, benefiting scenarios where data labeling is costly or time consuming. Compared to conventional GNNs which usually suffer from oversmoothing and oversquashing, MPPH provides a more comprehensive and interpretable characterization of molecular data topology. We substantiate our approach with theoretical stability guarantees and demonstrate its superior performance over existing state-of-the-art methods in predicting molecular properties through extensive evaluations on the MoleculeNet benchmark datasets.  ( 2 min )
    Non-contact Multimodal Indoor Human Monitoring Systems: A Survey. (arXiv:2312.07601v1 [eess.SP])
    Indoor human monitoring systems leverage a wide range of sensors, including cameras, radio devices, and inertial measurement units, to collect extensive data from users and the environment. These sensors contribute diverse data modalities, such as video feeds from cameras, received signal strength indicators and channel state information from WiFi devices, and three-axis acceleration data from inertial measurement units. In this context, we present a comprehensive survey of multimodal approaches for indoor human monitoring systems, with a specific focus on their relevance in elderly care. Our survey primarily highlights non-contact technologies, particularly cameras and radio devices, as key components in the development of indoor human monitoring systems. Throughout this article, we explore well-established techniques for extracting features from multimodal data sources. Our exploration extends to methodologies for fusing these features and harnessing multiple modalities to improve the accuracy and robustness of machine learning models. Furthermore, we conduct comparative analysis across different data modalities in diverse human monitoring tasks and undertake a comprehensive examination of existing multimodal datasets. This extensive survey not only highlights the significance of indoor human monitoring systems but also affirms their versatile applications. In particular, we emphasize their critical role in enhancing the quality of elderly care, offering valuable insights into the development of non-contact monitoring solutions applicable to the needs of aging populations.  ( 3 min )
    Reacting like Humans: Incorporating Intrinsic Human Behaviors into NAO through Sound-Based Reactions for Enhanced Sociability. (arXiv:2312.07671v1 [cs.RO])
    Robots' acceptability among humans and their sociability can be significantly enhanced by incorporating human-like reactions. Humans can react to environmental events very quickly and without thinking. An instance where humans display natural reactions is when they encounter a sudden and loud sound that startles or frightens them. During such moments, individuals may instinctively move their hands, turn toward the origin of the sound, and try to determine the event's cause. This inherent behavior motivated us to explore this less-studied part of social robotics. In this work, a multi-modal system composed of an action generator, sound classifier, and YOLO object detector was designed to sense the environment and, in the presence of sudden loud sounds, show natural human fear reactions, and finally, locate the fear-causing sound source in the environment. These unique and valid generated motions and inferences could imitate intrinsic human reactions and enhance the sociability of robots. For motion generation, a model based on LSTM and MDN networks was proposed to synthesize various motions. Also, in the case of sound detection, a transfer learning model was preferred that used the spectrogram of sound signals as its input. After developing individual models for sound detection, motion generation, and image recognition, they were integrated into a comprehensive fear module that was implemented on the NAO robot. Finally, the fear module was tested in practical application and two groups of experts and non-experts filled out a questionnaire to evaluate the performance of the robot. Given our promising results, this preliminary exploratory research provides a fresh perspective on social robotics and could be a starting point for modeling intrinsic human behaviors and emotions in robots.  ( 3 min )
    AI-driven Structure Detection and Information Extraction from Historical Cadastral Maps (Early 19th Century Franciscean Cadastre in the Province of Styria) and Current High-resolution Satellite and Aerial Imagery for Remote Sensing. (arXiv:2312.07560v1 [cs.CV])
    Cadastres from the 19th century are a complex as well as rich source for historians and archaeologists, whose use presents them with great challenges. For archaeological and historical remote sensing, we have trained several Deep Learning models, CNNs as well as Vision Transformers, to extract large-scale data from this knowledge representation. We present the principle results of our work here and we present a the demonstrator of our browser-based tool that allows researchers and public stakeholders to quickly identify spots that featured buildings in the 19th century Franciscean Cadastre. The tool not only supports scholars and fellow researchers in building a better understanding of the settlement history of the region of Styria, it also helps public administration and fellow citizens to swiftly identify areas of heightened sensibility with regard to the cultural heritage of the region.  ( 2 min )
    Bayesian Online Learning for Consensus Prediction. (arXiv:2312.07679v1 [cs.LG])
    Given a pre-trained classifier and multiple human experts, we investigate the task of online classification where model predictions are provided for free but querying humans incurs a cost. In this practical but under-explored setting, oracle ground truth is not available. Instead, the prediction target is defined as the consensus vote of all experts. Given that querying full consensus can be costly, we propose a general framework for online Bayesian consensus estimation, leveraging properties of the multivariate hypergeometric distribution. Based on this framework, we propose a family of methods that dynamically estimate expert consensus from partial feedback by producing a posterior over expert and model beliefs. Analyzing this posterior induces an interpretable trade-off between querying cost and classification performance. We demonstrate the efficacy of our framework against a variety of baselines on CIFAR-10H and ImageNet-16H, two large-scale crowdsourced datasets.  ( 2 min )
    Characteristic Guidance: Non-linear Correction for DDPM at Large Guidance Scale. (arXiv:2312.07586v1 [cs.CV])
    Popular guidance for denoising diffusion probabilistic model (DDPM) linearly combines distinct conditional models together to provide enhanced control over samples. However, this approach overlooks nonlinear effects that become significant when guidance scale is large. To address this issue, we propose characteristic guidance, a novel method that provides non-linear correction for classifier-free guided DDPMs. Such correction forces the guided DDPMs to respect the Fokker-Planck equation of their underlying diffusion process, in a way that is first-principle, training-free, derivative-free, and compatible with existing sampling methods. Experiments show that characteristic guidance is robust to various applications, offers enhanced control over sample generation, suppresses color and exposure issues even for latent space sampling, and can handle physics problems such as the phase transitions.  ( 2 min )
    Contrastive News and Social Media Linking using BERT for Articles and Tweets across Dual Platforms. (arXiv:2312.07599v1 [cs.CL])
    X (formerly Twitter) has evolved into a contemporary agora, offering a platform for individuals to express opinions and viewpoints on current events. The majority of the topics discussed on Twitter are directly related to ongoing events, making it an important source for monitoring public discourse. However, linking tweets to specific news presents a significant challenge due to their concise and informal nature. Previous approaches, including topic models, graph-based models, and supervised classifiers, have fallen short in effectively capturing the unique characteristics of tweets and articles. Inspired by the success of the CLIP model in computer vision, which employs contrastive learning to model similarities between images and captions, this paper introduces a contrastive learning approach for training a representation space where linked articles and tweets exhibit proximity. We present our contrastive learning approach, CATBERT (Contrastive Articles Tweets BERT), leveraging pre-trained BERT models. The model is trained and tested on a dataset containing manually labeled English and Polish tweets and articles related to the Russian-Ukrainian war. We evaluate CATBERT's performance against traditional approaches like LDA, and the novel method based on OpenAI embeddings, which has not been previously applied to this task. Our findings indicate that CATBERT demonstrates superior performance in associating tweets with relevant news articles. Furthermore, we demonstrate the performance of the models when applied to finding the main topic -- represented by an article -- of the whole cascade of tweets. In this new task, we report the performance of the different models in dependence on the cascade size.  ( 3 min )
    Optimizing Likelihood-free Inference using Self-supervised Neural Symmetry Embeddings. (arXiv:2312.07615v1 [cs.LG])
    Likelihood-free inference is quickly emerging as a powerful tool to perform fast/effective parameter estimation. We demonstrate a technique of optimizing likelihood-free inference to make it even faster by marginalizing symmetries in a physical problem. In this approach, physical symmetries, for example, time-translation are learned using joint-embedding via self-supervised learning with symmetry data augmentations. Subsequently, parameter inference is performed using a normalizing flow where the embedding network is used to summarize the data before conditioning the parameters. We present this approach on two simple physical problems and we show faster convergence in a smaller number of parameters compared to a normalizing flow that does not use a pre-trained symmetry-informed representation.  ( 2 min )
    Investigating YOLO Models Towards Outdoor Obstacle Detection For Visually Impaired People. (arXiv:2312.07571v1 [cs.CV])
    The utilization of deep learning-based object detection is an effective approach to assist visually impaired individuals in avoiding obstacles. In this paper, we implemented seven different YOLO object detection models \textit{viz}., YOLO-NAS (small, medium, large), YOLOv8, YOLOv7, YOLOv6, and YOLOv5 and performed comprehensive evaluation with carefully tuned hyperparameters, to analyze how these models performed on images containing common daily-life objects presented on roads and sidewalks. After a systematic investigation, YOLOv8 was found to be the best model, which reached a precision of $80\%$ and a recall of $68.2\%$ on a well-known Obstacle Dataset which includes images from VOC dataset, COCO dataset, and TT100K dataset along with images collected by the researchers in the field. Despite being the latest model and demonstrating better performance in many other applications, YOLO-NAS was found to be suboptimal for the obstacle detection task.  ( 2 min )
    Annotating sleep states in children from wrist-worn accelerometer data using Machine Learning. (arXiv:2312.07561v1 [eess.SP])
    Sleep detection and annotation are crucial for researchers to understand sleep patterns, especially in children. With modern wrist-worn watches comprising built-in accelerometers, sleep logs can be collected. However, the annotation of these logs into distinct sleep events: onset and wakeup, proves to be challenging. These annotations must be automated, precise, and scalable. We propose to model the accelerometer data using different machine learning (ML) techniques such as support vectors, boosting, ensemble methods, and more complex approaches involving LSTMs and Region-based CNNs. Later, we aim to evaluate these approaches using the Event Detection Average Precision (EDAP) score (similar to the IOU metric) to eventually compare the predictive power and model performance.  ( 2 min )
    PaperQA: Retrieval-Augmented Generative Agent for Scientific Research. (arXiv:2312.07559v1 [cs.CL])
    Large Language Models (LLMs) generalize well across language tasks, but suffer from hallucinations and uninterpretability, making it difficult to assess their accuracy without ground-truth. Retrieval-Augmented Generation (RAG) models have been proposed to reduce hallucinations and provide provenance for how an answer was generated. Applying such models to the scientific literature may enable large-scale, systematic processing of scientific knowledge. We present PaperQA, a RAG agent for answering questions over the scientific literature. PaperQA is an agent that performs information retrieval across full-text scientific articles, assesses the relevance of sources and passages, and uses RAG to provide answers. Viewing this agent as a question answering model, we find it exceeds performance of existing LLMs and LLM agents on current science QA benchmarks. To push the field closer to how humans perform research on scientific literature, we also introduce LitQA, a more complex benchmark that requires retrieval and synthesis of information from full-text scientific papers across the literature. Finally, we demonstrate PaperQA's matches expert human researchers on LitQA.  ( 2 min )
    Active Inference and Intentional Behaviour. (arXiv:2312.07547v1 [q-bio.NC])
    Recent advances in theoretical biology suggest that basal cognition and sentient behaviour are emergent properties of in vitro cell cultures and neuronal networks, respectively. Such neuronal networks spontaneously learn structured behaviours in the absence of reward or reinforcement. In this paper, we characterise this kind of self-organisation through the lens of the free energy principle, i.e., as self-evidencing. We do this by first discussing the definitions of reactive and sentient behaviour in the setting of active inference, which describes the behaviour of agents that model the consequences of their actions. We then introduce a formal account of intentional behaviour, that describes agents as driven by a preferred endpoint or goal in latent state-spaces. We then investigate these forms of (reactive, sentient, and intentional) behaviour using simulations. First, we simulate the aforementioned in vitro experiments, in which neuronal cultures spontaneously learn to play Pong, by implementing nested, free energy minimising processes. The simulations are then used to deconstruct the ensuing predictive behaviour, leading to the distinction between merely reactive, sentient, and intentional behaviour, with the latter formalised in terms of inductive planning. This distinction is further studied using simple machine learning benchmarks (navigation in a grid world and the Tower of Hanoi problem), that show how quickly and efficiently adaptive behaviour emerges under an inductive form of active inference.  ( 2 min )
    Adaptive Proximal Policy Optimization with Upper Confidence Bound. (arXiv:2312.07624v1 [cs.LG])
    Trust Region Policy Optimization (TRPO) attractively optimizes the policy while constraining the update of the new policy within a trust region, ensuring the stability and monotonic optimization. Building on the theoretical guarantees of trust region optimization, Proximal Policy Optimization (PPO) successfully enhances the algorithm's sample efficiency and reduces deployment complexity by confining the update of the new and old policies within a surrogate trust region. However, this approach is limited by the fixed setting of surrogate trust region and is not sufficiently adaptive, because there is no theoretical proof that the optimal clipping bound remains consistent throughout the entire training process, truncating the ratio of the new and old policies within surrogate trust region can ensure that the algorithm achieves its best performance, therefore, exploring and researching a dynamic clip bound for improving PPO's performance can be quite beneficial. To design an adaptive clipped trust region and explore the dynamic clip bound's impact on the performance of PPO, we introduce an adaptive PPO-CLIP (Adaptive-PPO) method that dynamically explores and exploits the clip bound using a bandit during the online training process. Furthermore, ample experiments will initially demonstrate that our Adaptive-PPO exhibits sample efficiency and performance compared to PPO-CLIP.  ( 2 min )
    Go beyond End-to-End Training: Boosting Greedy Local Learning with Context Supply. (arXiv:2312.07636v1 [cs.LG])
    Traditional end-to-end (E2E) training of deep networks necessitates storing intermediate activations for back-propagation, resulting in a large memory footprint on GPUs and restricted model parallelization. As an alternative, greedy local learning partitions the network into gradient-isolated modules and trains supervisely based on local preliminary losses, thereby providing asynchronous and parallel training methods that substantially reduce memory cost. However, empirical experiments reveal that as the number of segmentations of the gradient-isolated module increases, the performance of the local learning scheme degrades substantially, severely limiting its expansibility. To avoid this issue, we theoretically analyze the greedy local learning from the standpoint of information theory and propose a ContSup scheme, which incorporates context supply between isolated modules to compensate for information loss. Experiments on benchmark datasets (i.e. CIFAR, SVHN, STL-10) achieve SOTA results and indicate that our proposed method can significantly improve the performance of greedy local learning with minimal memory and computational overhead, allowing for the boost of the number of isolated modules. Our codes are available at https://github.com/Tab-ct/ContSup.  ( 2 min )
    Classification with Partially Private Features. (arXiv:2312.07583v1 [cs.LG])
    In this paper, we consider differentially private classification when some features are sensitive, while the rest of the features and the label are not. We adapt the definition of differential privacy naturally to this setting. Our main contribution is a novel adaptation of AdaBoost that is not only provably differentially private, but also significantly outperforms a natural benchmark that assumes the entire data of the individual is sensitive in the experiments. As a surprising observation, we show that boosting randomly generated classifiers suffices to achieve high accuracy. Our approach easily adapts to the classical setting where all the features are sensitive, providing an alternate algorithm for differentially private linear classification with a much simpler privacy proof and comparable or higher accuracy than differentially private logistic regression on real-world datasets.  ( 2 min )
    CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor. (arXiv:2312.07661v1 [cs.CV])
    Existing open-vocabulary image segmentation methods require a fine-tuning step on mask annotations and/or image-text datasets. Mask labels are labor-intensive, which limits the number of categories in segmentation datasets. As a result, the open-vocabulary capacity of pre-trained VLMs is severely reduced after fine-tuning. However, without fine-tuning, VLMs trained under weak image-text supervision tend to make suboptimal mask predictions when there are text queries referring to non-existing concepts in the image. To alleviate these issues, we introduce a novel recurrent framework that progressively filters out irrelevant texts and enhances mask quality without training efforts. The recurrent unit is a two-stage segmenter built upon a VLM with frozen weights. Thus, our model retains the VLM's broad vocabulary space and strengthens its segmentation capability. Experimental results show that our method outperforms not only the training-free counterparts, but also those fine-tuned with millions of additional data samples, and sets new state-of-the-art records for both zero-shot semantic and referring image segmentation tasks. Specifically, we improve the current record by 28.8, 16.0, and 6.9 mIoU on Pascal VOC, COCO Object, and Pascal Context.  ( 2 min )
    Benchmarking Distribution Shift in Tabular Data with TableShift. (arXiv:2312.07577v1 [cs.LG])
    Robustness to distribution shift has become a growing concern for text and image models as they transition from research subjects to deployment in the real world. However, high-quality benchmarks for distribution shift in tabular machine learning tasks are still lacking despite the widespread real-world use of tabular data and differences in the models used for tabular data in comparison to text and images. As a consequence, the robustness of tabular models to distribution shift is poorly understood. To address this issue, we introduce TableShift, a distribution shift benchmark for tabular data. TableShift contains 15 binary classification tasks in total, each with an associated shift, and includes a diverse set of data sources, prediction targets, and distribution shifts. The benchmark covers domains including finance, education, public policy, healthcare, and civic participation, and is accessible using only a few lines of Python code via the TableShift API. We conduct a large-scale study comparing several state-of-the-art tabular data models alongside robust learning and domain generalization methods on the benchmark tasks. Our study demonstrates (1) a linear trend between in-distribution (ID) and out-of-distribution (OOD) accuracy; (2) domain robustness methods can reduce shift gaps but at the cost of reduced ID accuracy; (3) a strong relationship between shift gap (difference between ID and OOD performance) and shifts in the label distribution. The benchmark data, Python package, model implementations, and more information about TableShift are available at https://github.com/mlfoundations/tableshift and https://tableshift.org .  ( 2 min )
  • Open

    Big Data -- Supply Chain Management Framework for Forecasting: Data Preprocessing and Machine Learning Techniques. (arXiv:2307.12971v2 [cs.LG] UPDATED)
    This article intends to systematically identify and comparatively analyze state-of-the-art supply chain (SC) forecasting strategies and technologies. A novel framework has been proposed incorporating Big Data Analytics in SC Management (problem identification, data sources, exploratory data analysis, machine-learning model training, hyperparameter tuning, performance evaluation, and optimization), forecasting effects on human-workforce, inventory, and overall SC. Initially, the need to collect data according to SC strategy and how to collect them has been discussed. The article discusses the need for different types of forecasting according to the period or SC objective. The SC KPIs and the error-measurement systems have been recommended to optimize the top-performing model. The adverse effects of phantom inventory on forecasting and the dependence of managerial decisions on the SC KPIs for determining model performance parameters and improving operations management, transparency, and planning efficiency have been illustrated. The cyclic connection within the framework introduces preprocessing optimization based on the post-process KPIs, optimizing the overall control process (inventory management, workforce determination, cost, production and capacity planning). The contribution of this research lies in the standard SC process framework proposal, recommended forecasting data analysis, forecasting effects on SC performance, machine learning algorithms optimization followed, and in shedding light on future research.  ( 3 min )
    \emph{Lifted} RDT based capacity analysis of the 1-hidden layer treelike \emph{sign} perceptrons neural networks. (arXiv:2312.08257v1 [stat.ML])
    We consider the memorization capabilities of multilayered \emph{sign} perceptrons neural networks (SPNNs). A recent rigorous upper-bounding capacity characterization, obtained in \cite{Stojnictcmspnncaprdt23} utilizing the Random Duality Theory (RDT), demonstrated that adding neurons in a network configuration may indeed be very beneficial. Moreover, for particular \emph{treelike committee machines} (TCM) architectures with $d\leq 5$ neurons in the hidden layer, \cite{Stojnictcmspnncaprdt23} made a very first mathematically rigorous progress in over 30 years by lowering the previously best known capacity bounds of \cite{MitchDurb89}. Here, we first establish that the RDT bounds from \cite{Stojnictcmspnncaprdt23} scale as $\sim \sqrt{d}$ and can not on their own \emph{universally} (over the entire range of $d$) beat the best known $\sim \log(d)$ scaling of the bounds from \cite{MitchDurb89}. After recognizing that the progress from \cite{Stojnictcmspnncaprdt23} is therefore promising, but yet without a complete concretization, we then proceed by considering the recently developed fully lifted RDT (fl RDT) as an alternative. While the fl RDT is indeed a powerful juggernaut, it typically relies on heavy numerical evaluations. To avoid such heavy numerics, we here focus on a simplified, \emph{partially lifted}, variant and show that it allows for very neat, closed form, analytical capacity characterizations. Moreover, we obtain the concrete capacity bounds that \emph{universally} improve for \emph{any} $d$ over the best known ones of \cite{MitchDurb89}.  ( 2 min )
    Differentially private inference via noisy optimization. (arXiv:2103.11003v4 [math.ST] UPDATED)
    We propose a general optimization-based framework for computing differentially private M-estimators and a new method for constructing differentially private confidence regions. Firstly, we show that robust statistics can be used in conjunction with noisy gradient descent or noisy Newton methods in order to obtain optimal private estimators with global linear or quadratic convergence, respectively. We establish local and global convergence guarantees, under both local strong convexity and self-concordance, showing that our private estimators converge with high probability to a small neighborhood of the non-private M-estimators. Secondly, we tackle the problem of parametric inference by constructing differentially private estimators of the asymptotic variance of our private M-estimators. This naturally leads to approximate pivotal statistics for constructing confidence regions and conducting hypothesis testing. We demonstrate the effectiveness of a bias correction that leads to enhanced small-sample empirical performance in simulations. We illustrate the benefits of our methods in several numerical examples.  ( 2 min )
    Discretization-Induced Dirichlet Posterior for Robust Uncertainty Quantification on Regression. (arXiv:2308.09065v2 [cs.CV] UPDATED)
    Uncertainty quantification is critical for deploying deep neural networks (DNNs) in real-world applications. An Auxiliary Uncertainty Estimator (AuxUE) is one of the most effective means to estimate the uncertainty of the main task prediction without modifying the main task model. To be considered robust, an AuxUE must be capable of maintaining its performance and triggering higher uncertainties while encountering Out-of-Distribution (OOD) inputs, i.e., to provide robust aleatoric and epistemic uncertainty. However, for vision regression tasks, current AuxUE designs are mainly adopted for aleatoric uncertainty estimates, and AuxUE robustness has not been explored. In this work, we propose a generalized AuxUE scheme for more robust uncertainty quantification on regression tasks. Concretely, to achieve a more robust aleatoric uncertainty estimation, different distribution assumptions are considered for heteroscedastic noise, and Laplace distribution is finally chosen to approximate the prediction error. For epistemic uncertainty, we propose a novel solution named Discretization-Induced Dirichlet pOsterior (DIDO), which models the Dirichlet posterior on the discretized prediction error. Extensive experiments on age estimation, monocular depth estimation, and super-resolution tasks show that our proposed method can provide robust uncertainty estimates in the face of noisy inputs and that it can be scalable to both image-level and pixel-wise tasks. Code is available at https://github.com/ENSTA-U2IS/DIDO .  ( 3 min )
    Differentially Private Gradient Flow based on the Sliced Wasserstein Distance for Non-Parametric Generative Modeling. (arXiv:2312.08227v1 [stat.ML])
    Safeguarding privacy in sensitive training data is paramount, particularly in the context of generative modeling. This is done through either differentially private stochastic gradient descent, or with a differentially private metric for training models or generators. In this paper, we introduce a novel differentially private generative modeling approach based on parameter-free gradient flows in the space of probability measures. The proposed algorithm is a new discretized flow which operates through a particle scheme, utilizing drift derived from the sliced Wasserstein distance and computed in a private manner. Our experiments show that compared to a generator-based model, our proposed model can generate higher-fidelity data at a low privacy budget, offering a viable alternative to generator-based approaches.  ( 2 min )
    A Hitchhiker's Guide to Geometric GNNs for 3D Atomic Systems. (arXiv:2312.07511v1 [cs.LG] CROSS LISTED)
    Recent advances in computational modelling of atomic systems, spanning molecules, proteins, and materials, represent them as geometric graphs with atoms embedded as nodes in 3D Euclidean space. In these graphs, the geometric attributes transform according to the inherent physical symmetries of 3D atomic systems, including rotations and translations in Euclidean space, as well as node permutations. In recent years, Geometric Graph Neural Networks have emerged as the preferred machine learning architecture powering applications ranging from protein structure prediction to molecular simulations and material generation. Their specificity lies in the inductive biases they leverage -- such as physical symmetries and chemical properties -- to learn informative representations of these geometric graphs. In this opinionated paper, we provide a comprehensive and self-contained overview of the field of Geometric GNNs for 3D atomic systems. We cover fundamental background material and introduce a pedagogical taxonomy of Geometric GNN architectures:(1) invariant networks, (2) equivariant networks in Cartesian basis, (3) equivariant networks in spherical basis, and (4) unconstrained networks. Additionally, we outline key datasets and application areas and suggest future research directions. The objective of this work is to present a structured perspective on the field, making it accessible to newcomers and aiding practitioners in gaining an intuition for its mathematical abstractions.  ( 2 min )
    On Estimating the Gradient of the Expected Information Gain in Bayesian Experimental Design. (arXiv:2308.09888v2 [stat.ML] UPDATED)
    Bayesian Experimental Design (BED), which aims to find the optimal experimental conditions for Bayesian inference, is usually posed as to optimize the expected information gain (EIG). The gradient information is often needed for efficient EIG optimization, and as a result the ability to estimate the gradient of EIG is essential for BED problems. The primary goal of this work is to develop methods for estimating the gradient of EIG, which, combined with the stochastic gradient descent algorithms, result in efficient optimization of EIG. Specifically, we first introduce a posterior expected representation of the EIG gradient with respect to the design variables. Based on this, we propose two methods for estimating the EIG gradient, UEEG-MCMC that leverages posterior samples generated through Markov Chain Monte Carlo (MCMC) to estimate the EIG gradient, and BEEG-AP that focuses on achieving high simulation efficiency by repeatedly using parameter samples. Theoretical analysis and numerical studies illustrate that UEEG-MCMC is robust agains the actual EIG value, while BEEG-AP is more efficient when the EIG value to be optimized is small. Moreover, both methods show superior performance compared to several popular benchmarks in our numerical experiments.  ( 2 min )
    Meta-learning to Calibrate Gaussian Processes with Deep Kernels for Regression Uncertainty Estimation. (arXiv:2312.07952v1 [stat.ML])
    Although Gaussian processes (GPs) with deep kernels have been successfully used for meta-learning in regression tasks, its uncertainty estimation performance can be poor. We propose a meta-learning method for calibrating deep kernel GPs for improving regression uncertainty estimation performance with a limited number of training data. The proposed method meta-learns how to calibrate uncertainty using data from various tasks by minimizing the test expected calibration error, and uses the knowledge for unseen tasks. We design our model such that the adaptation and calibration for each task can be performed without iterative procedures, which enables effective meta-learning. In particular, a task-specific uncalibrated output distribution is modeled by a GP with a task-shared encoder network, and it is transformed to a calibrated one using a cumulative density function of a task-specific Gaussian mixture model (GMM). By integrating the GP and GMM into our neural network-based model, we can meta-learn model parameters in an end-to-end fashion. Our experiments demonstrate that the proposed method improves uncertainty estimation performance while keeping high regression performance compared with the existing methods using real-world datasets in few-shot settings.  ( 2 min )
    Optimizing accuracy and diversity: a multi-task approach to forecast combinations. (arXiv:2310.20545v2 [cs.LG] UPDATED)
    Forecast combination involves using multiple forecasts to create a single, more accurate prediction. Recently, feature-based forecasting has been employed to either select the most appropriate forecasting models or to optimize the weights of their combination. In this paper, we present a multi-task optimization paradigm that focuses on solving both problems simultaneously and enriches current operational research approaches to forecasting. In essence, it incorporates an additional learning and optimization task into the standard feature-based forecasting approach, focusing on the identification of an optimal set of forecasting methods. During the training phase, an optimization model with linear constraints and quadratic objective function is employed to identify accurate and diverse methods for each time series. Moreover, within the training phase, a neural network is used to learn the behavior of that optimization model. Once training is completed the candidate set of methods is identified using the network. The proposed approach elicits the essential role of diversity in feature-based forecasting and highlights the interplay between model combination and model selection when optimizing forecasting ensembles. Experimental results on a large set of series from the M4 competition dataset show that our proposal enhances point forecast accuracy compared to state-of-the-art methods.  ( 2 min )
    The Blessing of Heterogeneity in Federated Q-Learning: Linear Speedup and Beyond. (arXiv:2305.10697v2 [cs.LG] UPDATED)
    When the data used for reinforcement learning (RL) are collected by multiple agents in a distributed manner, federated versions of RL algorithms allow collaborative learning without the need for agents to share their local data. In this paper, we consider federated Q-learning, which aims to learn an optimal Q-function by periodically aggregating local Q-estimates trained on local data alone. Focusing on infinite-horizon tabular Markov decision processes, we provide sample complexity guarantees for both the synchronous and asynchronous variants of federated Q-learning. In both cases, our bounds exhibit a linear speedup with respect to the number of agents and near-optimal dependencies on other salient problem parameters. In the asynchronous setting, existing analyses of federated Q-learning, which adopt an equally weighted averaging of local Q-estimates, require that every agent covers the entire state-action space. In contrast, our improved sample complexity scales inverse proportionally to the minimum entry of the average stationary state-action occupancy distribution of all agents, thus only requiring the agents to collectively cover the entire state-action space, unveiling the blessing of heterogeneity in enabling collaborative learning by relaxing the coverage requirement of the single-agent case. However, its sample complexity still suffers when the local trajectories are highly heterogeneous. In response, we propose a novel federated Q-learning algorithm with importance averaging, giving larger weights to more frequently visited state-action pairs, which achieves a robust linear speedup as if all trajectories are centrally processed, regardless of the heterogeneity of local behavior policies.  ( 3 min )
    Mixed moving average field guided learning for spatio-temporal data. (arXiv:2301.00736v3 [stat.ML] UPDATED)
    Influenced mixed moving average fields are a versatile modeling class for spatio-temporal data. However, their predictive distribution is not generally known. Under this modeling assumption, we define a novel spatio-temporal embedding and a theory-guided machine learning approach that employs a generalized Bayesian algorithm to make ensemble forecasts. We employ Lipschitz predictors and determine fixed-time and any-time PAC Bayesian bounds in the batch learning setting. Performing causal forecast is a highlight of our methodology as its potential application to data with spatial and temporal short and long-range dependence. We then test the performance of our learning methodology by using linear predictors and data sets simulated from a spatio-temporal Ornstein-Uhlenbeck process.  ( 2 min )
    Towards Optimal Statistical Watermarking. (arXiv:2312.07930v1 [cs.LG])
    We study statistical watermarking by formulating it as a hypothesis testing problem, a general framework which subsumes all previous statistical watermarking methods. Key to our formulation is a coupling of the output tokens and the rejection region, realized by pseudo-random generators in practice, that allows non-trivial trade-off between the Type I error and Type II error. We characterize the Uniformly Most Powerful (UMP) watermark in this context. In the most common scenario where the output is a sequence of $n$ tokens, we establish matching upper and lower bounds on the number of i.i.d. tokens required to guarantee small Type I and Type II errors. Our rate scales as $\Theta(h^{-1} \log (1/h))$ with respect to the average entropy per token $h$ and thus greatly improves the $O(h^{-2})$ rate in the previous works. For scenarios where the detector lacks knowledge of the model's distribution, we introduce the concept of model-agnostic watermarking and establish the minimax bounds for the resultant increase in Type II error. Moreover, we formulate the robust watermarking problem where user is allowed to perform a class of perturbation on the generated texts, and characterize the optimal type II error of robust UMP tests via a linear programming problem. To the best of our knowledge, this is the first systematic statistical treatment on the watermarking problem with near-optimal rates in the i.i.d. setting, and might be of interest for future works.  ( 3 min )
    The Choice of Noninformative Priors for Thompson Sampling in Multiparameter Bandit Models. (arXiv:2302.14407v2 [cs.LG] UPDATED)
    Thompson sampling (TS) has been known for its outstanding empirical performance supported by theoretical guarantees across various reward models in the classical stochastic multi-armed bandit problems. Nonetheless, its optimality is often restricted to specific priors due to the common observation that TS is fairly insensitive to the choice of the prior when it comes to asymptotic regret bounds. However, when the model contains multiple parameters, the optimality of TS highly depends on the choice of priors, which casts doubt on the generalizability of previous findings to other models. To address this gap, this study explores the impact of selecting noninformative priors, offering insights into the performance of TS when dealing with new models that lack theoretical understanding. We first extend the regret analysis of TS to the model of uniform distributions with unknown supports, which would be the simplest non-regular model. Our findings reveal that changing noninformative priors can significantly affect the expected regret, aligning with previously known results in other multiparameter bandit models. Although the uniform prior is shown to be optimal, we highlight the inherent limitation of its optimality, which is limited to specific parameterizations and emphasizes the significance of the invariance property of priors. In light of this limitation, we propose a slightly modified TS-based policy, called TS with Truncation (TS-T), which can achieve the asymptotic optimality for the Gaussian models and the uniform models by using the reference prior and the Jeffreys prior that are invariant under one-to-one reparameterizations. This policy provides an alternative approach to achieving optimality by employing fine-tuned truncation, which would be much easier than hunting for optimal priors in practice.  ( 3 min )
    On the Stability of Iterative Retraining of Generative Models on their own Data. (arXiv:2310.00429v3 [cs.LG] UPDATED)
    Deep generative models have made tremendous progress in modeling complex data, often exhibiting generation quality that surpasses a typical human's ability to discern the authenticity of samples. Undeniably, a key driver of this success is enabled by the massive amounts of web-scale data consumed by these models. Due to these models' striking performance and ease of availability, the web will inevitably be increasingly populated with synthetic content. Such a fact directly implies that future iterations of generative models must contend with the reality that their training is curated from both clean data and artificially generated data from past models. In this paper, we develop a framework to rigorously study the impact of training generative models on mixed datasets (of real and synthetic data) on their stability. We first prove the stability of iterative training under the condition that the initial generative models approximate the data distribution well enough and the proportion of clean training data (w.r.t. synthetic data) is large enough. We empirically validate our theory on both synthetic and natural images by iteratively training normalizing flows and state-of-the-art diffusion models on CIFAR10 and FFHQ.  ( 2 min )
    Characteristic Circuits. (arXiv:2312.07790v1 [cs.LG])
    In many real-world scenarios, it is crucial to be able to reliably and efficiently reason under uncertainty while capturing complex relationships in data. Probabilistic circuits (PCs), a prominent family of tractable probabilistic models, offer a remedy to this challenge by composing simple, tractable distributions into a high-dimensional probability distribution. However, learning PCs on heterogeneous data is challenging and densities of some parametric distributions are not available in closed form, limiting their potential use. We introduce characteristic circuits (CCs), a family of tractable probabilistic models providing a unified formalization of distributions over heterogeneous data in the spectral domain. The one-to-one relationship between characteristic functions and probability measures enables us to learn high-dimensional distributions on heterogeneous data domains and facilitates efficient probabilistic inference even when no closed-form density function is available. We show that the structure and parameters of CCs can be learned efficiently from the data and find that CCs outperform state-of-the-art density estimators for heterogeneous data domains on common benchmark data sets.  ( 2 min )
    Randomly pivoted Cholesky: Practical approximation of a kernel matrix with few entry evaluations. (arXiv:2207.06503v5 [math.NA] UPDATED)
    The randomly pivoted partial Cholesky algorithm (RPCholesky) computes a factorized rank-k approximation of an N x N positive-semidefinite (psd) matrix. RPCholesky requires only (k + 1) N entry evaluations and O(k^2 N) additional arithmetic operations, and it can be implemented with just a few lines of code. The method is particularly useful for approximating a kernel matrix. This paper offers a thorough new investigation of the empirical and theoretical behavior of this fundamental algorithm. For matrix approximation problems that arise in scientific machine learning, experiments show that RPCholesky matches or beats the performance of alternative algorithms. Moreover, RPCholesky provably returns low-rank approximations that are nearly optimal. The simplicity, effectiveness, and robustness of RPCholesky strongly support its use in scientific computing and machine learning applications.  ( 2 min )
    On the fast convergence of minibatch heavy ball momentum. (arXiv:2206.07553v4 [cs.LG] UPDATED)
    Simple stochastic momentum methods are widely used in machine learning optimization, but their good practical performance is at odds with an absence of theoretical guarantees of acceleration in the literature. In this work, we aim to close the gap between theory and practice by showing that stochastic heavy ball momentum retains the fast linear rate of (deterministic) heavy ball momentum on quadratic optimization problems, at least when minibatching with a sufficiently large batch size. The algorithm we study can be interpreted as an accelerated randomized Kaczmarz algorithm with minibatching and heavy ball momentum. The analysis relies on carefully decomposing the momentum transition matrix, and using new spectral norm concentration bounds for products of independent random matrices. We provide numerical illustrations demonstrating that our bounds are reasonably sharp.  ( 2 min )
    Distributional Preference Learning: Understanding and Accounting for Hidden Context in RLHF. (arXiv:2312.08358v1 [cs.LG])
    In practice, preference learning from human feedback depends on incomplete data with hidden context. Hidden context refers to data that affects the feedback received, but which is not represented in the data used to train a preference model. This captures common issues of data collection, such as having human annotators with varied preferences, cognitive processes that result in seemingly irrational behavior, and combining data labeled according to different criteria. We prove that standard applications of preference learning, including reinforcement learning from human feedback (RLHF), implicitly aggregate over hidden contexts according to a well-known voting rule called Borda count. We show this can produce counter-intuitive results that are very different from other methods which implicitly aggregate via expected utility. Furthermore, our analysis formalizes the way that preference learning from users with diverse values tacitly implements a social choice function. A key implication of this result is that annotators have an incentive to misreport their preferences in order to influence the learned model, leading to vulnerabilities in the deployment of RLHF. As a step towards mitigating these problems, we introduce a class of methods called distributional preference learning (DPL). DPL methods estimate a distribution of possible score values for each alternative in order to better account for hidden context. Experimental results indicate that applying DPL to RLHF for LLM chatbots identifies hidden context in the data and significantly reduces subsequent jailbreak vulnerability. Our code and data are available at https://github.com/cassidylaidlaw/hidden-context  ( 3 min )
    Fit Like You Sample: Sample-Efficient Generalized Score Matching from Fast Mixing Diffusions. (arXiv:2306.09332v3 [cs.DS] UPDATED)
    Score matching is an approach to learning probability distributions parametrized up to a constant of proportionality (e.g. Energy-Based Models). The idea is to fit the score of the distribution, rather than the likelihood, thus avoiding the need to evaluate the constant of proportionality. While there's a clear algorithmic benefit, the statistical "cost'' can be steep: recent work by Koehler et al. 2022 showed that for distributions that have poor isoperimetric properties (a large Poincar\'e or log-Sobolev constant), score matching is substantially statistically less efficient than maximum likelihood. However, many natural realistic distributions, e.g. multimodal distributions as simple as a mixture of two Gaussians in one dimension -- have a poor Poincar\'e constant. In this paper, we show a close connection between the mixing time of a broad class of Markov processes with generator $\mathcal{L}$ and an appropriately chosen generalized score matching loss that tries to fit $\frac{\mathcal{O} p}{p}$. This allows us to adapt techniques to speed up Markov chains to construct better score-matching losses. In particular, ``preconditioning'' the diffusion can be translated to an appropriate ``preconditioning'' of the score loss. Lifting the chain by adding a temperature like in simulated tempering can be shown to result in a Gaussian-convolution annealed score matching loss, similar to Song and Ermon, 2019. Moreover, we show that if the distribution being learned is a finite mixture of Gaussians in $d$ dimensions with a shared covariance, the sample complexity of annealed score matching is polynomial in the ambient dimension, the diameter of the means, and the smallest and largest eigenvalues of the covariance -- obviating the Poincar\'e constant-based lower bounds of the basic score matching loss shown in Koehler et al. 2022.  ( 3 min )
    The Effective Horizon Explains Deep RL Performance in Stochastic Environments. (arXiv:2312.08369v1 [stat.ML])
    Reinforcement learning (RL) theory has largely focused on proving minimax sample complexity bounds. These require strategic exploration algorithms that use relatively limited function classes for representing the policy or value function. Our goal is to explain why deep RL algorithms often perform well in practice, despite using random exploration and much more expressive function classes like neural networks. Our work arrives at an explanation by showing that many stochastic MDPs can be solved by performing only a few steps of value iteration on the random policy's Q function and then acting greedily. When this is true, we find that it is possible to separate the exploration and learning components of RL, making it much easier to analyze. We introduce a new RL algorithm, SQIRL, that iteratively learns a near-optimal policy by exploring randomly to collect rollouts and then performing a limited number of steps of fitted-Q iteration over those rollouts. Any regression algorithm that satisfies basic in-distribution generalization properties can be used in SQIRL to efficiently solve common MDPs. This can explain why deep RL works neural networks, since it is empirically established that neural networks generalize well in-distribution. Furthermore, SQIRL explains why random exploration works well in practice, since we show many environments can be solved by estimating the random policy's Q-function and then applying zero or a few steps of value iteration. We leverage SQIRL to derive instance-dependent sample complexity bounds for RL that are exponential only in an "effective horizon" of lookahead and on the complexity of the class used for function approximation. Empirically, we also find that SQIRL performance strongly correlates with PPO and DQN performance in a variety of stochastic environments, supporting that our theoretical analysis is predictive of practical performance.  ( 3 min )
    Active learning with biased non-response to label requests. (arXiv:2312.08150v1 [cs.LG])
    Active learning can improve the efficiency of training prediction models by identifying the most informative new labels to acquire. However, non-response to label requests can impact active learning's effectiveness in real-world contexts. We conceptualise this degradation by considering the type of non-response present in the data, demonstrating that biased non-response is particularly detrimental to model performance. We argue that this sort of non-response is particularly likely in contexts where the labelling process, by nature, relies on user interactions. To mitigate the impact of biased non-response, we propose a cost-based correction to the sampling strategy--the Upper Confidence Bound of the Expected Utility (UCB-EU)--that can, plausibly, be applied to any active learning algorithm. Through experiments, we demonstrate that our method successfully reduces the harm from labelling non-response in many settings. However, we also characterise settings where the non-response bias in the annotations remains detrimental under UCB-EU for particular sampling methods and data generating processes. Finally, we evaluate our method on a real-world dataset from e-commerce platform Taobao. We show that UCB-EU yields substantial performance improvements to conversion models that are trained on clicked impressions. Most generally, this research serves to both better conceptualise the interplay between types of non-response and model improvements via active learning, and to provide a practical, easy to implement correction that helps mitigate model degradation.  ( 2 min )
    Capacity of the treelike sign perceptrons neural networks with one hidden layer -- RDT based upper bounds. (arXiv:2312.08244v1 [cond-mat.dis-nn])
    We study the capacity of \emph{sign} perceptrons neural networks (SPNN) and particularly focus on 1-hidden layer \emph{treelike committee machine} (TCM) architectures. Similarly to what happens in the case of a single perceptron neuron, it turns out that, in a statistical sense, the capacity of a corresponding multilayered network architecture consisting of multiple \emph{sign} perceptrons also undergoes the so-called phase transition (PT) phenomenon. This means: (i) for certain range of system parameters (size of data, number of neurons), the network can be properly trained to accurately memorize \emph{all} elements of the input dataset; and (ii) outside the region such a training does not exist. Clearly, determining the corresponding phase transition curve that separates these regions is an extraordinary task and among the most fundamental questions related to the performance of any network. Utilizing powerful mathematical engine called Random Duality Theory (RDT), we establish a generic framework for determining the upper bounds on the 1-hidden layer TCM SPNN capacity. Moreover, we do so for \emph{any} given (odd) number of neurons. We further show that the obtained results \emph{exactly} match the replica symmetry predictions of \cite{EKTVZ92,BHS92}, thereby proving that the statistical physics based results are not only nice estimates but also mathematically rigorous bounds as well. Moreover, for $d\leq 5$, we obtain the capacity values that improve on the best known rigorous ones of \cite{MitchDurb89}, thereby establishing a first, mathematically rigorous, progress in well over 30 years.  ( 3 min )
    Estimation of embedding vectors in high dimensions. (arXiv:2312.07802v1 [cs.LG])
    Embeddings are a basic initial feature extraction step in many machine learning models, particularly in natural language processing. An embedding attempts to map data tokens to a low-dimensional space where similar tokens are mapped to vectors that are close to one another by some metric in the embedding space. A basic question is how well can such embedding be learned? To study this problem, we consider a simple probability model for discrete data where there is some "true" but unknown embedding where the correlation of random variables is related to the similarity of the embeddings. Under this model, it is shown that the embeddings can be learned by a variant of low-rank approximate message passing (AMP) method. The AMP approach enables precise predictions of the accuracy of the estimation in certain high-dimensional limits. In particular, the methodology provides insight on the relations of key parameters such as the number of samples per value, the frequency of the terms, and the strength of the embedding correlation on the probability distribution. Our theoretical findings are validated by simulations on both synthetic data and real text data.  ( 2 min )
    Minimax-optimal estimation for sparse multi-reference alignment with collision-free signals. (arXiv:2312.07839v1 [math.ST])
    The Multi-Reference Alignment (MRA) problem aims at the recovery of an unknown signal from repeated observations under the latent action of a group of cyclic isometries, in the presence of additive noise of high intensity $\sigma$. It is a more tractable version of the celebrated cryo EM model. In the crucial high noise regime, it is known that its sample complexity scales as $\sigma^6$. Recent investigations have shown that for the practically significant setting of sparse signals, the sample complexity of the maximum likelihood estimator asymptotically scales with the noise level as $\sigma^4$. In this work, we investigate minimax optimality for signal estimation under the MRA model for so-called collision-free signals. In particular, this signal class covers the setting of generic signals of dilute sparsity (wherein the support size $s=O(L^{1/3})$, where $L$ is the ambient dimension. We demonstrate that the minimax optimal rate of estimation in for the sparse MRA problem in this setting is $\sigma^2/\sqrt{n}$, where $n$ is the sample size. In particular, this widely generalizes the sample complexity asymptotics for the restricted MLE in this setting, establishing it as the statistically optimal estimator. Finally, we demonstrate a concentration inequality for the restricted MLE on its deviations from the ground truth.  ( 2 min )
    Double Machine Learning for Static Panel Models with Fixed Effects. (arXiv:2312.08174v1 [econ.EM])
    Machine Learning (ML) algorithms are powerful data-driven tools for approximating high-dimensional or non-linear nuisance functions which are useful in practice because the true functional form of the predictors is ex-ante unknown. In this paper, we develop estimators of policy interventions from panel data which allow for non-linear effects of the confounding regressors, and investigate the performance of these estimators using three well-known ML algorithms, specifically, LASSO, classification and regression trees, and random forests. We use Double Machine Learning (DML) (Chernozhukov et al., 2018) for the estimation of causal effects of homogeneous treatments with unobserved individual heterogeneity (fixed effects) and no unobserved confounding by extending Robinson (1988)'s partially linear regression model. We develop three alternative approaches for handling unobserved individual heterogeneity based on extending the within-group estimator, first-difference estimator, and correlated random effect estimator (Mundlak, 1978) for non-linear models. Using Monte Carlo simulations, we find that conventional least squares estimators can perform well even if the data generating process is non-linear, but there are substantial performance gains in terms of bias reduction under a process where the true effect of the regressors is non-linear and discontinuous. However, for the same scenarios, we also find -- despite extensive hyperparameter tuning -- inference to be problematic for both tree-based learners because these lead to highly non-normal estimator distributions and the estimator variance being severely under-estimated. This contradicts the performance of trees in other circumstances and requires further investigation. Finally, we provide an illustrative example of DML for observational panel data showing the impact of the introduction of the national minimum wage in the UK.  ( 3 min )
    Bayesian Online Learning for Consensus Prediction. (arXiv:2312.07679v1 [cs.LG])
    Given a pre-trained classifier and multiple human experts, we investigate the task of online classification where model predictions are provided for free but querying humans incurs a cost. In this practical but under-explored setting, oracle ground truth is not available. Instead, the prediction target is defined as the consensus vote of all experts. Given that querying full consensus can be costly, we propose a general framework for online Bayesian consensus estimation, leveraging properties of the multivariate hypergeometric distribution. Based on this framework, we propose a family of methods that dynamically estimate expert consensus from partial feedback by producing a posterior over expert and model beliefs. Analyzing this posterior induces an interpretable trade-off between querying cost and classification performance. We demonstrate the efficacy of our framework against a variety of baselines on CIFAR-10H and ImageNet-16H, two large-scale crowdsourced datasets.  ( 2 min )
    Go beyond End-to-End Training: Boosting Greedy Local Learning with Context Supply. (arXiv:2312.07636v1 [cs.LG])
    Traditional end-to-end (E2E) training of deep networks necessitates storing intermediate activations for back-propagation, resulting in a large memory footprint on GPUs and restricted model parallelization. As an alternative, greedy local learning partitions the network into gradient-isolated modules and trains supervisely based on local preliminary losses, thereby providing asynchronous and parallel training methods that substantially reduce memory cost. However, empirical experiments reveal that as the number of segmentations of the gradient-isolated module increases, the performance of the local learning scheme degrades substantially, severely limiting its expansibility. To avoid this issue, we theoretically analyze the greedy local learning from the standpoint of information theory and propose a ContSup scheme, which incorporates context supply between isolated modules to compensate for information loss. Experiments on benchmark datasets (i.e. CIFAR, SVHN, STL-10) achieve SOTA results and indicate that our proposed method can significantly improve the performance of greedy local learning with minimal memory and computational overhead, allowing for the boost of the number of isolated modules. Our codes are available at https://github.com/Tab-ct/ContSup.  ( 2 min )
    Synthetic Data: Can We Trust Statistical Estimators?. (arXiv:2312.07837v1 [cs.LG])
    The increasing interest in data sharing makes synthetic data appealing. However, the analysis of synthetic data raises a unique set of methodological challenges. In this work, we highlight the importance of inferential utility and provide empirical evidence against naive inference from synthetic data (that handles these as if they were really observed). We argue that the rate of false-positive findings (type 1 error) will be unacceptably high, even when the estimates are unbiased. One of the reasons is the underestimation of the true standard error, which may even progressively increase with larger sample sizes due to slower convergence. This is especially problematic for deep generative models. Before publishing synthetic data, it is essential to develop statistical inference tools for such data.  ( 2 min )
    SPD-DDPM: Denoising Diffusion Probabilistic Models in the Symmetric Positive Definite Space. (arXiv:2312.08200v1 [cs.LG])
    Symmetric positive definite~(SPD) matrices have shown important value and applications in statistics and machine learning, such as FMRI analysis and traffic prediction. Previous works on SPD matrices mostly focus on discriminative models, where predictions are made directly on $E(X|y)$, where $y$ is a vector and $X$ is an SPD matrix. However, these methods are challenging to handle for large-scale data, as they need to access and process the whole data. In this paper, inspired by denoising diffusion probabilistic model~(DDPM), we propose a novel generative model, termed SPD-DDPM, by introducing Gaussian distribution in the SPD space to estimate $E(X|y)$. Moreover, our model is able to estimate $p(X)$ unconditionally and flexibly without giving $y$. On the one hand, the model conditionally learns $p(X|y)$ and utilizes the mean of samples to obtain $E(X|y)$ as a prediction. On the other hand, the model unconditionally learns the probability distribution of the data $p(X)$ and generates samples that conform to this distribution. Furthermore, we propose a new SPD net which is much deeper than the previous networks and allows for the inclusion of conditional factors. Experiment results on toy data and real taxi data demonstrate that our models effectively fit the data distribution both unconditionally and unconditionally and provide accurate predictions.  ( 2 min )
    A New Perspective On Denoising Based On Optimal Transport. (arXiv:2312.08135v1 [math.ST])
    In the standard formulation of the denoising problem, one is given a probabilistic model relating a latent variable $\Theta \in \Omega \subset \mathbb{R}^m \; (m\ge 1)$ and an observation $Z \in \mathbb{R}^d$ according to: $Z \mid \Theta \sim p(\cdot\mid \Theta)$ and $\Theta \sim G^*$, and the goal is to construct a map to recover the latent variable from the observation. The posterior mean, a natural candidate for estimating $\Theta$ from $Z$, attains the minimum Bayes risk (under the squared error loss) but at the expense of over-shrinking the $Z$, and in general may fail to capture the geometric features of the prior distribution $G^*$ (e.g., low dimensionality, discreteness, sparsity, etc.). To rectify these drawbacks, in this paper we take a new perspective on this denoising problem that is inspired by optimal transport (OT) theory and use it to propose a new OT-based denoiser at the population level setting. We rigorously prove that, under general assumptions on the model, our OT-based denoiser is well-defined and unique, and is closely connected to solutions to a Monge OT problem. We then prove that, under appropriate identifiability assumptions on the model, our OT-based denoiser can be recovered solely from information of the marginal distribution of $Z$ and the posterior mean of the model, after solving a linear relaxation problem over a suitable space of couplings that is reminiscent of a standard multimarginal OT (MOT) problem. In particular, thanks to Tweedie's formula, when the likelihood model $\{ p(\cdot \mid \theta) \}_{\theta \in \Omega}$ is an exponential family of distributions, the OT-based denoiser can be recovered solely from the marginal distribution of $Z$. In general, our family of OT-like relaxations is of interest in its own right and for the denoising problem suggests alternative numerical methods inspired by the rich literature on computational OT.  ( 3 min )
    Combinatorial Stochastic-Greedy Bandit. (arXiv:2312.08057v1 [cs.LG])
    We propose a novel combinatorial stochastic-greedy bandit (SGB) algorithm for combinatorial multi-armed bandit problems when no extra information other than the joint reward of the selected set of $n$ arms at each time step $t\in [T]$ is observed. SGB adopts an optimized stochastic-explore-then-commit approach and is specifically designed for scenarios with a large set of base arms. Unlike existing methods that explore the entire set of unselected base arms during each selection step, our SGB algorithm samples only an optimized proportion of unselected arms and selects actions from this subset. We prove that our algorithm achieves a $(1-1/e)$-regret bound of $\mathcal{O}(n^{\frac{1}{3}} k^{\frac{2}{3}} T^{\frac{2}{3}} \log(T)^{\frac{2}{3}})$ for monotone stochastic submodular rewards, which outperforms the state-of-the-art in terms of the cardinality constraint $k$. Furthermore, we empirically evaluate the performance of our algorithm in the context of online constrained social influence maximization. Our results demonstrate that our proposed approach consistently outperforms the other algorithms, increasing the performance gap as $k$ grows.  ( 2 min )
    Causal Optimal Transport of Abstractions. (arXiv:2312.08107v1 [cs.LG])
    Causal abstraction (CA) theory establishes formal criteria for relating multiple structural causal models (SCMs) at different levels of granularity by defining maps between them. These maps have significant relevance for real-world challenges such as synthesizing causal evidence from multiple experimental environments, learning causally consistent representations at different resolutions, and linking interventions across multiple SCMs. In this work, we propose COTA, the first method to learn abstraction maps from observational and interventional data without assuming complete knowledge of the underlying SCMs. In particular, we introduce a multi-marginal Optimal Transport (OT) formulation that enforces do-calculus causal constraints, together with a cost function that relies on interventional information. We extensively evaluate COTA on synthetic and real world problems, and showcase its advantages over non-causal, independent and aggregated COTA formulations. Finally, we demonstrate the efficiency of our method as a data augmentation tool by comparing it against the state-of-the-art CA learning framework, which assumes fully specified SCMs, on a real-world downstream task.  ( 2 min )
    TERM Model: Tensor Ring Mixture Model for Density Estimation. (arXiv:2312.08075v1 [cs.LG])
    Efficient probability density estimation is a core challenge in statistical machine learning. Tensor-based probabilistic graph methods address interpretability and stability concerns encountered in neural network approaches. However, a substantial number of potential tensor permutations can lead to a tensor network with the same structure but varying expressive capabilities. In this paper, we take tensor ring decomposition for density estimator, which significantly reduces the number of permutation candidates while enhancing expressive capability compared with existing used decompositions. Additionally, a mixture model that incorporates multiple permutation candidates with adaptive weights is further designed, resulting in increased expressive flexibility and comprehensiveness. Different from the prevailing directions of tensor network structure/permutation search, our approach provides a new viewpoint inspired by ensemble learning. This approach acknowledges that suboptimal permutations can offer distinctive information besides that of optimal permutations. Experiments show the superiority of the proposed approach in estimating probability density for moderately dimensional datasets and sampling to capture intricate details.  ( 2 min )
    Training of Neural Networks with Uncertain Data, A Mixture of Experts Approach. (arXiv:2312.08083v1 [stat.ML])
    This paper presents the "Uncertainty-aware Mixture of Experts" (uMoE), a novel approach designed to address aleatoric uncertainty in the training of predictive models based on Neural Networks (NNs). While existing methods primarily focus on managing uncertainty during infer-ence, uMoE integrates uncertainty directly into the train-ing process. The uMoE approach adopts a "Divide and Conquer" paradigm to partition the uncertain input space into more manageable subspaces. It consists of Expert components, each trained solely on the portion of input uncertainty corresponding to their subspace. On top of the Experts, a Gating Unit, guided by additional infor-mation about the distribution of uncertain inputs across these subspaces, learns to weight the Experts to minimize deviations from the ground truth. Our results highlight that uMoE significantly outperforms baseline methods in handling data uncertainty. Furthermore, we conducted a robustness analysis, illustrating its capability to adapt to varying levels of uncertainty and suggesting optimal threshold parameters. This innovative approach holds wide applicability across diverse data-driven domains, in-cluding biomedical signal processing, autonomous driv-ing, and production quality control.  ( 2 min )
    GP+: A Python Library for Kernel-based learning via Gaussian Processes. (arXiv:2312.07694v1 [cs.LG])
    In this paper we introduce GP+, an open-source library for kernel-based learning via Gaussian processes (GPs) which are powerful statistical models that are completely characterized by their parametric covariance and mean functions. GP+ is built on PyTorch and provides a user-friendly and object-oriented tool for probabilistic learning and inference. As we demonstrate with a host of examples, GP+ has a few unique advantages over other GP modeling libraries. We achieve these advantages primarily by integrating nonlinear manifold learning techniques with GPs' covariance and mean functions. As part of introducing GP+, in this paper we also make methodological contributions that (1) enable probabilistic data fusion and inverse parameter estimation, and (2) equip GPs with parsimonious parametric mean functions which span mixed feature spaces that have both categorical and quantitative variables. We demonstrate the impact of these contributions in the context of Bayesian optimization, multi-fidelity modeling, sensitivity analysis, and calibration of computer models.  ( 2 min )

  • Open

    Voice Search Revolution: Data-Driven SEO Strategies for Future Success
    With the rise of voice search, how can businesses adapt their SEO strategies to optimize for conversational queries, backed by data-driven insights? Voice search is causing changes to occur in search engine optimization. Users are using more natural language and conversational queries with voice-activated devices. Businesses need to adjust SEO strategies for changing search behavior.… Read More »Voice Search Revolution: Data-Driven SEO Strategies for Future Success The post Voice Search Revolution: Data-Driven SEO Strategies for Future Success appeared first on Data Science Central.  ( 26 min )
  • Open

    AI meets climate: MIT Energy and Climate Hack 2023
    The Energy and Climate Hack presented opportunities for students and companies to collaborate and develop innovative solutions.  ( 8 min )
  • Open

    Boost productivity on Amazon SageMaker Studio: Introducing JupyterLab Spaces and generative AI tools
    Amazon SageMaker Studio offers a broad set of fully managed integrated development environments (IDEs) for machine learning (ML) development, including JupyterLab, Code Editor based on Code-OSS (Visual Studio Code Open Source), and RStudio. It provides access to the most comprehensive set of tools for each step of ML development, from preparing data to building, training, […]  ( 16 min )
    How AWS Prototyping enabled ICL-Group to build computer vision models on Amazon SageMaker
    This is a customer post jointly authored by ICL and AWS employees. ICL is a multi-national manufacturing and mining corporation based in Israel that manufactures products based on unique minerals and fulfills humanity’s essential needs, primarily in three markets: agriculture, food, and engineered materials. Their mining sites use industrial equipment that has to be monitored […]  ( 8 min )
    Automate PDF pre-labeling for Amazon Comprehend
    Amazon Comprehend is a natural-language processing (NLP) service that provides pre-trained and custom APIs to derive insights from textual data. Amazon Comprehend customers can train custom named entity recognition (NER) models to extract entities of interest, such as location, person name, and date, that are unique to their business. To train a custom model, you […]  ( 8 min )
    Improve your Stable Diffusion prompts with Retrieval Augmented Generation
    Text-to-image generation is a rapidly growing field of artificial intelligence with applications in a variety of areas, such as media and entertainment, gaming, ecommerce product visualization, advertising and marketing, architectural design and visualization, artistic creations, and medical imaging. Stable Diffusion is a text-to-image model that empowers you to create high-quality images within seconds. In November […]  ( 9 min )
    Streamlining ETL data processing at Talent.com with Amazon SageMaker
    This post outlines the ETL pipeline we developed for feature processing for training and deploying a job recommender model at Talent.com. Our pipeline uses SageMaker Processing jobs for efficient data processing and feature extraction at a large scale. Feature extraction code is implemented in Python enabling the use of popular ML libraries to perform feature extraction at scale, without the need to port the code to use PySpark.  ( 10 min )
  • Open

    ‘Forza Horizon’ Races Over to GeForce NOW
    This GFN Thursday is burning rubber with the latest Forza Horizon games from Microsoft Studios. Check them out on PC Game Pass. Plus, give the gift of cloud gaming with the latest membership bundle, which includes a free, three-month PC Game Pass subscription with the purchase of a six-month GeForce NOW Ultimate membership. It’s all Read article >  ( 6 min )
  • Open

    Superalignment Fast Grants
    We’re launching $10M in grants to support technical research towards the alignment and safety of superhuman AI systems, including weak-to-strong generalization, interpretability, scalable oversight, and more.  ( 2 min )
    Practices for Governing Agentic AI Systems
    No content preview  ( 1 min )
    Weak-to-strong generalization
    We present a new research direction for superalignment, together with promising initial results: can we leverage the generalization properties of deep learning to control strong models with weak supervisors?  ( 3 min )
  • Open

    Beyond Expected Return: Accounting for Policy Reproducibility when Evaluating Reinforcement Learning Algorithms. (arXiv:2312.07178v1 [cs.LG])
    Many applications in Reinforcement Learning (RL) usually have noise or stochasticity present in the environment. Beyond their impact on learning, these uncertainties lead the exact same policy to perform differently, i.e. yield different return, from one roll-out to another. Common evaluation procedures in RL summarise the consequent return distributions using solely the expected return, which does not account for the spread of the distribution. Our work defines this spread as the policy reproducibility: the ability of a policy to obtain similar performance when rolled out many times, a crucial property in some real-world applications. We highlight that existing procedures that only use the expected return are limited on two fronts: first an infinite number of return distributions with a wide range of performance-reproducibility trade-offs can have the same expected return, limiting its effectiveness when used for comparing policies; second, the expected return metric does not leave any room for practitioners to choose the best trade-off value for considered applications. In this work, we address these limitations by recommending the use of Lower Confidence Bound, a metric taken from Bayesian optimisation that provides the user with a preference parameter to choose a desired performance-reproducibility trade-off. We also formalise and quantify policy reproducibility, and demonstrate the benefit of our metrics using extensive experiments of popular RL algorithms on common uncertain RL tasks.  ( 2 min )
    PLASTIC: Improving Input and Label Plasticity for Sample Efficient Reinforcement Learning. (arXiv:2306.10711v3 [cs.LG] UPDATED)
    In Reinforcement Learning (RL), enhancing sample efficiency is crucial, particularly in scenarios when data acquisition is costly and risky. In principle, off-policy RL algorithms can improve sample efficiency by allowing multiple updates per environment interaction. However, these multiple updates often lead the model to overfit to earlier interactions, which is referred to as the loss of plasticity. Our study investigates the underlying causes of this phenomenon by dividing plasticity into two aspects. Input plasticity, which denotes the model's adaptability to changing input data, and label plasticity, which denotes the model's adaptability to evolving input-output relationships. Synthetic experiments on the CIFAR-10 dataset reveal that finding smoother minima of loss landscape enhances input plasticity, whereas refined gradient propagation improves label plasticity. Leveraging these findings, we introduce the PLASTIC algorithm, which harmoniously combines techniques to address both concerns. With minimal architectural modifications, PLASTIC achieves competitive performance on benchmarks including Atari-100k and Deepmind Control Suite. This result emphasizes the importance of preserving the model's plasticity to elevate the sample efficiency in RL. The code is available at https://github.com/dojeon-ai/plastic.  ( 2 min )
    PIGEON: Predicting Image Geolocations. (arXiv:2307.05845v3 [cs.CV] UPDATED)
    Planet-scale image geolocalization remains a challenging problem due to the diversity of images originating from anywhere in the world. Although approaches based on vision transformers have made significant progress in geolocalization accuracy, success in prior literature is constrained to narrow distributions of images of landmarks, and performance has not generalized to unseen places. We present a new geolocalization system that combines semantic geocell creation, multi-task contrastive pretraining, and a novel loss function. Additionally, our work is the first to perform retrieval over location clusters for guess refinements. We train two models for evaluations on street-level data and general-purpose image geolocalization; the first model, PIGEON, is trained on data from the game of Geoguessr and is capable of placing over 40% of its guesses within 25 kilometers of the target location globally. We also develop a bot and deploy PIGEON in a blind experiment against humans, ranking in the top 0.01% of players. We further challenge one of the world's foremost professional Geoguessr players to a series of six matches with millions of viewers, winning all six games. Our second model, PIGEOTTO, differs in that it is trained on a dataset of images from Flickr and Wikipedia, achieving state-of-the-art results on a wide range of image geolocalization benchmarks, outperforming the previous SOTA by up to 7.7 percentage points on the city accuracy level and up to 38.8 percentage points on the country level. Our findings suggest that PIGEOTTO is the first image geolocalization model that effectively generalizes to unseen places and that our approach can pave the way for highly accurate, planet-scale image geolocalization systems. Our code is available on GitHub.  ( 3 min )
    Early Stopping for Deep Image Prior. (arXiv:2112.06074v4 [cs.CV] UPDATED)
    Deep image prior (DIP) and its variants have showed remarkable potential for solving inverse problems in computer vision, without any extra training data. Practical DIP models are often substantially overparameterized. During the fitting process, these models learn mostly the desired visual content first, and then pick up the potential modeling and observational noise, i.e., overfitting. Thus, the practicality of DIP often depends critically on good early stopping (ES) that captures the transition period. In this regard, the majority of DIP works for vision tasks only demonstrates the potential of the models -- reporting the peak performance against the ground truth, but provides no clue about how to operationally obtain near-peak performance without access to the groundtruth. In this paper, we set to break this practicality barrier of DIP, and propose an efficient ES strategy, which consistently detects near-peak performance across several vision tasks and DIP variants. Based on a simple measure of dispersion of consecutive DIP reconstructions, our ES method not only outpaces the existing ones -- which only work in very narrow domains, but also remains effective when combined with a number of methods that try to mitigate the overfitting. The code is available at https://github.com/sun-umn/Early_Stopping_for_DIP.  ( 3 min )
    Deep Internal Learning: Deep Learning from a Single Input. (arXiv:2312.07425v1 [cs.LG])
    Deep learning in general focuses on training a neural network from large labeled datasets. Yet, in many cases there is value in training a network just from the input at hand. This may involve training a network from scratch using a single input or adapting an already trained network to a provided input example at inference time. This survey paper aims at covering deep internal-learning techniques that have been proposed in the past few years for these two important directions. While our main focus will be on image processing problems, most of the approaches that we survey are derived for general signals (vectors with recurring patterns that can be distinguished from noise) and are therefore applicable to other modalities. We believe that the topic of internal-learning is very important in many signal and image processing problems where training data is scarce and diversity is large on the one hand, and on the other, there is a lot of structure in the data that can be exploited.  ( 2 min )
    Cross-client Label Propagation for Transductive and Semi-Supervised Federated Learning. (arXiv:2210.06434v4 [cs.LG] UPDATED)
    We present Cross-Client Label Propagation(XCLP), a new method for transductive federated learning. XCLP estimates a data graph jointly from the data of multiple clients and computes labels for the unlabeled data by propagating label information across the graph. To avoid clients having to share their data with anyone, XCLP employs two cryptographically secure protocols: secure Hamming distance computation and secure summation. We demonstrate two distinct applications of XCLP within federated learning. In the first, we use it in a one-shot way to predict labels for unseen test points. In the second, we use it to repeatedly pseudo-label unlabeled training data in a federated semi-supervised setting. Experiments on both real federated and standard benchmark datasets show that in both applications XCLP achieves higher classification accuracy than alternative approaches.  ( 2 min )
    Ahpatron: A New Budgeted Online Kernel Learning Machine with Tighter Mistake Bound. (arXiv:2312.07032v1 [cs.LG])
    In this paper, we study the mistake bound of online kernel learning on a budget. We propose a new budgeted online kernel learning model, called Ahpatron, which significantly improves the mistake bound of previous work and resolves the open problem posed by Dekel, Shalev-Shwartz, and Singer (2005). We first present an aggressive variant of Perceptron, named AVP, a model without budget, which uses an active updating rule. Then we design a new budget maintenance mechanism, which removes a half of examples,and projects the removed examples onto a hypothesis space spanned by the remaining examples. Ahpatron adopts the above mechanism to approximate AVP. Theoretical analyses prove that Ahpatron has tighter mistake bounds, and experimental results show that Ahpatron outperforms the state-of-the-art algorithms on the same or a smaller budget.  ( 2 min )
    Text2AC-Zero: Consistent Synthesis of Animated Characters using 2D Diffusion. (arXiv:2312.07133v1 [cs.CV])
    We propose a zero-shot approach for consistent Text-to-Animated-Characters synthesis based on pre-trained Text-to-Image (T2I) diffusion models. Existing Text-to-Video (T2V) methods are expensive to train and require large-scale video datasets to produce diverse characters and motions. At the same time, their zero-shot alternatives fail to produce temporally consistent videos. We strive to bridge this gap, and we introduce a zero-shot approach that produces temporally consistent videos of animated characters and requires no training or fine-tuning. We leverage existing text-based motion diffusion models to generate diverse motions that we utilize to guide a T2I model. To achieve temporal consistency, we introduce the Spatial Latent Alignment module that exploits cross-frame dense correspondences that we compute to align the latents of the video frames. Furthermore, we propose Pixel-Wise Guidance to steer the diffusion process in a direction that minimizes visual discrepancies. Our proposed approach generates temporally consistent videos with diverse motions and styles, outperforming existing zero-shot T2V approaches in terms of pixel-wise consistency and user preference.  ( 2 min )
    GateNet: A novel Neural Network Architecture for Automated Flow Cytometry Gating. (arXiv:2312.07316v1 [cs.LG])
    Flow cytometry is widely used to identify cell populations in patient-derived fluids such as peripheral blood (PB) or cerebrospinal fluid (CSF). While ubiquitous in research and clinical practice, flow cytometry requires gating, i.e. cell type identification which requires labor-intensive and error-prone manual adjustments. To facilitate this process, we designed GateNet, the first neural network architecture enabling full end-to-end automated gating without the need to correct for batch effects. We train GateNet with over 8,000,000 events based on N=127 PB and CSF samples which were manually labeled independently by four experts. We show that for novel, unseen samples, GateNet achieves human-level performance (F1 score ranging from 0.910 to 0.997). In addition we apply GateNet to a publicly available dataset confirming generalization with an F1 score of 0.936. As our implementation utilizes graphics processing units (GPU), gating only needs 15 microseconds per event. Importantly, we also show that GateNet only requires ~10 samples to reach human-level performance, rendering it widely applicable in all domains of flow cytometry.  ( 2 min )
    Time-Selective RNN for Device-Free Multi-Room Human Presence Detection Using WiFi CSI. (arXiv:2304.13107v2 [cs.AI] UPDATED)
    Device-free human presence detection is a crucial technology for various applications, including home automation, security, and healthcare. While camera-based systems have traditionally been used for this purpose, they raise privacy concerns. To address this issue, recent research has explored the use of wireless channel state information (CSI) extracted from commercial WiFi access points (APs) to provide detailed channel characteristics. In this paper, we propose a device-free human presence detection system for multi-room scenarios using a time-selective conditional dual feature extract recurrent network (TCD-FERN). Our system is designed to capture significant time features on current human features using a dynamic and static data preprocessing technique. We extract both moving and spatial features of people and differentiate between line-of-sight (LoS) and non-line-of-sight (NLoS) cases. Subcarrier fusion is carried out in order to provide more objective variation of each sample while reducing the computational complexity. A voting scheme is further adopted to mitigate the feature attenuation problem caused by room partitions, with around 3% improvement of human presence detection accuracy. Experimental results have revealed the significant improvement of leveraging subcarrier fusion, dual-feature recurrent network, time selection and condition mechanisms. Compared to the existing works in open literature, our proposed TCD-FERN system can achieve above 97% of human presence detection accuracy for multi-room scenarios with the adoption of fewer WiFi APs.  ( 3 min )
    Smooth, exact rotational symmetrization for deep learning on point clouds. (arXiv:2305.19302v2 [cs.CV] UPDATED)
    Point clouds are versatile representations of 3D objects and have found widespread application in science and engineering. Many successful deep-learning models have been proposed that use them as input. The domain of chemical and materials modeling is especially challenging because exact compliance with physical constraints is highly desirable for a model to be usable in practice. These constraints include smoothness and invariance with respect to translations, rotations, and permutations of identical atoms. If these requirements are not rigorously fulfilled, atomistic simulations might lead to absurd outcomes even if the model has excellent accuracy. Consequently, dedicated architectures, which achieve invariance by restricting their design space, have been developed. General-purpose point-cloud models are more varied but often disregard rotational symmetry. We propose a general symmetrization method that adds rotational equivariance to any given model while preserving all the other requirements. Our approach simplifies the development of better atomic-scale ML schemes by relaxing the constraints on the design space and making it possible to incorporate ideas that proved effective in other domains. We demonstrate this idea by introducing the Point Edge Transformer (PET) architecture, which is not intrinsically equivariant but achieves state-of-the-art performance on several benchmark datasets of molecules and solids. A-posteriori application of our general protocol makes PET exactly equivariant, with minimal changes to its accuracy.  ( 3 min )
    Investigation into the Training Dynamics of Learned Optimizers. (arXiv:2312.07174v1 [cs.LG])
    Optimization is an integral part of modern deep learning. Recently, the concept of learned optimizers has emerged as a way to accelerate this optimization process by replacing traditional, hand-crafted algorithms with meta-learned functions. Despite the initial promising results of these methods, issues with stability and generalization still remain, limiting their practical use. Moreover, their inner workings and behavior under different conditions are not yet fully understood, making it difficult to come up with improvements. For this reason, our work examines their optimization trajectories from the perspective of network architecture symmetries and parameter update distributions. Furthermore, by contrasting the learned optimizers with their manually designed counterparts, we identify several key insights that demonstrate how each approach can benefit from the strengths of the other.  ( 2 min )
    MammoFL: Mammographic Breast Density Estimation using Federated Learning. (arXiv:2206.05575v4 [eess.IV] UPDATED)
    In this study, we automate quantitative mammographic breast density estimation with neural networks and show that this tool is a strong use case for federated learning on multi-institutional datasets. Our dataset included bilateral CC-view and MLO-view mammographic images from two separate institutions. Two U-Nets were separately trained on algorithm-generated labels to perform segmentation of the breast and dense tissue from these images and subsequently calculate breast percent density (PD). The networks were trained with federated learning and compared to three non-federated baselines, one trained on each single-institution dataset and one trained on the aggregated multi-institution dataset. We demonstrate that training on multi-institution datasets is critical to algorithm generalizability. We further show that federated learning on multi-institutional datasets improves model generalization to unseen data at nearly the same level as centralized training on multi-institutional datasets, indicating that federated learning can be applied to our method to improve algorithm generalizability while maintaining patient privacy.  ( 2 min )
    Gotta be SAFE: A New Framework for Molecular Design. (arXiv:2310.10773v2 [cs.LG] UPDATED)
    Traditional molecular string representations, such as SMILES, often pose challenges for AI-driven molecular design due to their non-sequential depiction of molecular substructures. To address this issue, we introduce Sequential Attachment-based Fragment Embedding (SAFE), a novel line notation for chemical structures. SAFE reimagines SMILES strings as an unordered sequence of interconnected fragment blocks while maintaining compatibility with existing SMILES parsers. It streamlines complex generative tasks, including scaffold decoration, fragment linking, polymer generation, and scaffold hopping, while facilitating autoregressive generation for fragment-constrained design, thereby eliminating the need for intricate decoding or graph-based models. We demonstrate the effectiveness of SAFE by training an 87-million-parameter GPT2-like model on a dataset containing 1.1 billion SAFE representations. Through targeted experimentation, we show that our SAFE-GPT model exhibits versatile and robust optimization performance. SAFE opens up new avenues for the rapid exploration of chemical space under various constraints, promising breakthroughs in AI-driven molecular design.  ( 2 min )
    ICL Markup: Structuring In-Context Learning using Soft-Token Tags. (arXiv:2312.07405v1 [cs.CL])
    Large pretrained language models (LLMs) can be rapidly adapted to a wide variety of tasks via a text-to-text approach, where the instruction and input are fed to the model in natural language. Combined with in-context learning (ICL), this paradigm is impressively flexible and powerful. However, it also burdens users with an overwhelming number of choices, many of them arbitrary. Inspired by markup languages like HTML, we contribute a method of using soft-token tags to compose prompt templates. This approach reduces arbitrary decisions and streamlines the application of ICL. Our method is a form of meta-learning for ICL; it learns these tags in advance during a parameter-efficient fine-tuning ``warm-up'' process. The tags can subsequently be used in templates for ICL on new, unseen tasks without any additional fine-tuning. Our experiments with this approach yield promising initial results, improving LLM performance on important enterprise applications such as few-shot and open-world intent detection, as well as text classification in news and legal domains.  ( 2 min )
    Towards Optimal Sobolev Norm Rates for the Vector-Valued Regularized Least-Squares Algorithm. (arXiv:2312.07186v1 [stat.ML])
    We present the first optimal rates for infinite-dimensional vector-valued ridge regression on a continuous scale of norms that interpolate between $L_2$ and the hypothesis space, which we consider as a vector-valued reproducing kernel Hilbert space. These rates allow to treat the misspecified case in which the true regression function is not contained in the hypothesis space. We combine standard assumptions on the capacity of the hypothesis space with a novel tensor product construction of vector-valued interpolation spaces in order to characterize the smoothness of the regression function. Our upper bound not only attains the same rate as real-valued kernel ridge regression, but also removes the assumption that the target regression function is bounded. For the lower bound, we reduce the problem to the scalar setting using a projection argument. We show that these rates are optimal in most cases and independent of the dimension of the output space. We illustrate our results for the special case of vector-valued Sobolev spaces.  ( 2 min )
    Safe Multi-Task Bayesian Optimization. (arXiv:2312.07281v1 [cs.LG])
    Bayesian optimization has become a powerful tool for safe online optimization of systems, due to its high sample efficiency and noise robustness. For further speed-up reduced physical models of the system can be incorporated into the optimization to accelerate the process, since the models are able to offer an approximation of the actual system, and sampling from them is significantly cheaper. The similarity between model and reality is represented by additional hyperparameters and learned within the optimization process. Safety is an important criteria for online optimization methods like Bayesian optimization, which has been addressed by recent literature, which provide safety guarantees under the assumption of known hyperparameters. However, in practice this is not applicable. Therefore, we extend the robust Gaussian process uniform error bounds to meet the multi-task setting, which involves the calculation of a confidence region from the hyperparameter posterior distribution utilizing Markov chain Monte Carlo methods. Then, using the robust safety bounds, Bayesian optimization is applied to safely optimize the system while incorporating measurements of the models. Simulations show that the optimization can be significantly accelerated compared to other state-of-the-art safe Bayesian optimization methods depending on the fidelity of the models.  ( 2 min )
    Distributionally Robust Statistical Verification with Imprecise Neural Networks. (arXiv:2308.14815v3 [cs.AI] UPDATED)
    A particularly challenging problem in AI safety is providing guarantees on the behavior of high-dimensional autonomous systems. Verification approaches centered around reachability analysis fail to scale, and purely statistical approaches are constrained by the distributional assumptions about the sampling process. Instead, we pose a distributionally robust version of the statistical verification problem for black-box systems, where our performance guarantees hold over a large family of distributions. This paper proposes a novel approach based on a combination of active learning, uncertainty quantification, and neural network verification. A central piece of our approach is an ensemble technique called Imprecise Neural Networks, which provides the uncertainty to guide active learning. The active learning uses an exhaustive neural-network verification tool Sherlock to collect samples. An evaluation on multiple physical simulators in the openAI gym Mujoco environments with reinforcement-learned controllers demonstrates that our approach can provide useful and scalable guarantees for high-dimensional systems.
    Data-Free Hard-Label Robustness Stealing Attack. (arXiv:2312.05924v2 [cs.CV] UPDATED)
    The popularity of Machine Learning as a Service (MLaaS) has led to increased concerns about Model Stealing Attacks (MSA), which aim to craft a clone model by querying MLaaS. Currently, most research on MSA assumes that MLaaS can provide soft labels and that the attacker has a proxy dataset with a similar distribution. However, this fails to encapsulate the more practical scenario where only hard labels are returned by MLaaS and the data distribution remains elusive. Furthermore, most existing work focuses solely on stealing the model accuracy, neglecting the model robustness, while robustness is essential in security-sensitive scenarios, e.g., face-scan payment. Notably, improving model robustness often necessitates the use of expensive techniques such as adversarial training, thereby further making stealing robustness a more lucrative prospect. In response to these identified gaps, we introduce a novel Data-Free Hard-Label Robustness Stealing (DFHL-RS) attack in this paper, which enables the stealing of both model accuracy and robustness by simply querying hard labels of the target model without the help of any natural data. Comprehensive experiments demonstrate the effectiveness of our method. The clone model achieves a clean accuracy of 77.86% and a robust accuracy of 39.51% against AutoAttack, which are only 4.71% and 8.40% lower than the target model on the CIFAR-10 dataset, significantly exceeding the baselines. Our code is available at: https://github.com/LetheSec/DFHL-RS-Attack.
    Convergence of the Chambolle-Pock Algorithm in the Absence of Monotonicity. (arXiv:2312.06540v1 [math.OC] CROSS LISTED)
    The Chambolle-Pock algorithm (CPA), also known as the primal-dual hybrid gradient method (PDHG), has surged in popularity in the last decade due to its success in solving convex/monotone structured problems. This work provides convergence results for problems with varying degrees of (non)monotonicity, quantified through a so-called oblique weak Minty condition on the associated primal-dual operator. Our results reveal novel stepsize and relaxation parameter ranges which do not only depend on the norm of the linear mapping, but also on its other singular values. In particular, in nonmonotone settings, in addition to the classical stepsize conditions for CPA, extra bounds on the stepsizes and relaxation parameters are required. On the other hand, in the strongly monotone setting, the relaxation parameter is allowed to exceed the classical upper bound of two. Moreover, sufficient convergence conditions are obtained when the individual operators belong to the recently introduced class of semimonotone operators. Since this class of operators encompasses many traditional operator classes including (hypo)- and co(hypo)monotone operators, this analysis recovers and extends existing results for CPA. Several examples are provided for the aforementioned problem classes to demonstrate and establish tightness of the proposed stepsize ranges.
    Granularity at Scale: Estimating Neighborhood Socioeconomic Indicators from High-Resolution Orthographic Imagery and Hybrid Learning. (arXiv:2309.16808v2 [cs.CV] UPDATED)
    Many areas of the world are without basic information on the socioeconomic well-being of the residing population due to limitations in existing data collection methods. Overhead images obtained remotely, such as from satellite or aircraft, can help serve as windows into the state of life on the ground and help "fill in the gaps" where community information is sparse, with estimates at smaller geographic scales requiring higher resolution sensors. Concurrent with improved sensor resolutions, recent advancements in machine learning and computer vision have made it possible to quickly extract features from and detect patterns in image data, in the process correlating these features with other information. In this work, we explore how well two approaches, a supervised convolutional neural network and semi-supervised clustering based on bag-of-visual-words, estimate population density, median household income, and educational attainment of individual neighborhoods from publicly available high-resolution imagery of cities throughout the United States. Results and analyses indicate that features extracted from the imagery can accurately estimate the density (R$^2$ up to 0.81) of neighborhoods, with the supervised approach able to explain about half the variation in a population's income and education. In addition to the presented approaches serving as a basis for further geographic generalization, the novel semi-supervised approach provides a foundation for future work seeking to estimate fine-scale information from aerial imagery without the need for label data.
    Reward Certification for Policy Smoothed Reinforcement Learning. (arXiv:2312.06436v2 [cs.LG] UPDATED)
    Reinforcement Learning (RL) has achieved remarkable success in safety-critical areas, but it can be weakened by adversarial attacks. Recent studies have introduced "smoothed policies" in order to enhance its robustness. Yet, it is still challenging to establish a provable guarantee to certify the bound of its total reward. Prior methods relied primarily on computing bounds using Lipschitz continuity or calculating the probability of cumulative reward above specific thresholds. However, these techniques are only suited for continuous perturbations on the RL agent's observations and are restricted to perturbations bounded by the $l_2$-norm. To address these limitations, this paper proposes a general black-box certification method capable of directly certifying the cumulative reward of the smoothed policy under various $l_p$-norm bounded perturbations. Furthermore, we extend our methodology to certify perturbations on action spaces. Our approach leverages f-divergence to measure the distinction between the original distribution and the perturbed distribution, subsequently determining the certification bound by solving a convex optimisation problem. We provide a comprehensive theoretical analysis and run sufficient experiments in multiple environments. Our results show that our method not only improves the certified lower bound of mean cumulative reward but also demonstrates better efficiency than state-of-the-art techniques.
    Adaptive Image Registration: A Hybrid Approach Integrating Deep Learning and Optimization Functions for Enhanced Precision. (arXiv:2311.15497v2 [cs.CV] UPDATED)
    Image registration has traditionally been done using two distinct approaches: learning based methods, relying on robust deep neural networks, and optimization-based methods, applying complex mathematical transformations to warp images accordingly. Of course, both paradigms offer advantages and disadvantages, and, in this work, we seek to combine their respective strengths into a single streamlined framework, using the outputs of the learning based method as initial parameters for optimization while prioritizing computational power for the image pairs that offer the greatest loss. Our investigations showed that an improvement of 1.5% in testing when utilizing the best performing state-of-the-art model as the backbone of the framework, while maintaining the same inference time and a substantial 0.94% points performance gain in deformation field smoothness.
    Non-monotone Sequential Submodular Maximization. (arXiv:2308.08641v2 [cs.LG] UPDATED)
    In this paper, we study a fundamental problem in submodular optimization, which is called sequential submodular maximization. Specifically, we aim to select and rank a group of $k$ items from a ground set $V$ such that the weighted summation of $k$ (possibly non-monotone) submodular functions $f_1, \cdots ,f_k: 2^V \rightarrow \mathbb{R}^+$ is maximized, here each function $f_j$ takes the first $j$ items from this sequence as input. The existing research on sequential submodular maximization has predominantly concentrated on the monotone setting, assuming that the submodular functions are non-decreasing. However, in various real-world scenarios, like diversity-aware recommendation systems, adding items to an existing set might negatively impact the overall utility. In response, this paper pioneers the examination of the aforementioned problem with non-monotone submodular functions and offers effective solutions for both flexible and fixed length constraints, as well as a special case with identical utility functions. The empirical evaluations further validate the effectiveness of our proposed algorithms in the domain of video recommendations. The results of this research have implications in various fields, including recommendation systems and assortment optimization, where the ordering of items significantly impacts the overall value obtained.
    DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing. (arXiv:2306.14435v5 [cs.CV] UPDATED)
    Accurate and controllable image editing is a challenging task that has attracted significant attention recently. Notably, DragGAN is an interactive point-based image editing framework that achieves impressive editing results with pixel-level precision. However, due to its reliance on generative adversarial networks (GANs), its generality is limited by the capacity of pretrained GAN models. In this work, we extend this editing framework to diffusion models and propose a novel approach DragDiffusion. By harnessing large-scale pretrained diffusion models, we greatly enhance the applicability of interactive point-based editing on both real and diffusion-generated images. Our approach involves optimizing the diffusion latents to achieve precise spatial control. The supervision signal of this optimization process is from the diffusion model's UNet features, which are known to contain rich semantic and geometric information. Moreover, we introduce two additional techniques, namely LoRA fine-tuning and latent-MasaCtrl, to further preserve the identity of the original image. Lastly, we present a challenging benchmark dataset called DragBench -- the first benchmark to evaluate the performance of interactive point-based image editing methods. Experiments across a wide range of challenging cases (e.g., images with multiple objects, diverse object categories, various styles, etc.) demonstrate the versatility and generality of DragDiffusion. Code: https://github.com/Yujun-Shi/DragDiffusion.
    Gated Linear Attention Transformers with Hardware-Efficient Training. (arXiv:2312.06635v2 [cs.LG] UPDATED)
    Transformers with linear attention allow for efficient parallel training but can simultaneously be formulated as an RNN with 2D (matrix-valued) hidden states, thus enjoying linear (with respect to output length) inference complexity. Recent works such as RetNet (Sun et al., 2023) and TransNormerLLM (Qin et al., 2023a) observe that adding a global decay term to the additive RNN update rule greatly improves performance, sometimes outperforming standard Transformers with softmax attention when trained at scale. In this work we show that adding a data-dependent gating mechanism further improves performance. We derive a parallel form of this gated linear attention layer that enables efficient training. However, a straightforward, numerically stable implementation of this parallel form requires generalized matrix multiplications in log-space for numerical stability, and thus cannot take advantage of tensor cores on modern GPUs which are optimized for standard matrix multiplications. We develop a hardware-efficient version of the parallel form that can still make use of tensor cores through block-parallel computations over sequence chunks. Experiments on moderate-scale language modeling (340M-parameter models trained on 15B tokens, 1.3B-parameter models trained on 100B tokens) show that gated linear attention (GLA) Transformers perform competitively against a strong LLaMA-architecture Transformer baseline (Touvron et al., 2023) as well as Mamba (Gu & Dao, 2023), a recently introduced state-space model with a data-dependent state transition mechanism. For training speed, our Triton-based implementation performs comparably to CUDA-optimized FlashAttention-2 (Dao, 2023) under the regular 2048 training length setting, while outperforming FlashAttention-2 when training on longer sequences beyond 4096.
    ArCL: Enhancing Contrastive Learning with Augmentation-Robust Representations. (arXiv:2303.01092v2 [cs.LG] UPDATED)
    Self-Supervised Learning (SSL) is a paradigm that leverages unlabeled data for model training. Empirical studies show that SSL can achieve promising performance in distribution shift scenarios, where the downstream and training distributions differ. However, the theoretical understanding of its transferability remains limited. In this paper, we develop a theoretical framework to analyze the transferability of self-supervised contrastive learning, by investigating the impact of data augmentation on it. Our results reveal that the downstream performance of contrastive learning depends largely on the choice of data augmentation. Moreover, we show that contrastive learning fails to learn domain-invariant features, which limits its transferability. Based on these theoretical insights, we propose a novel method called Augmentation-robust Contrastive Learning (ArCL), which guarantees to learn domain-invariant features and can be easily integrated with existing contrastive learning algorithms. We conduct experiments on several datasets and show that ArCL significantly improves the transferability of contrastive learning.  ( 2 min )
    INFLECT-DGNN: Influencer Prediction with Dynamic Graph Neural Networks. (arXiv:2307.08131v3 [cs.SI] UPDATED)
    Leveraging network information for predictive modeling has become widespread in many domains. Within the realm of referral and targeted marketing, influencer detection stands out as an area that could greatly benefit from the incorporation of dynamic network representation due to the ongoing development of customer-brand relationships. To elaborate this idea, we introduce INFLECT-DGNN, a new framework for INFLuencer prEdiCTion with Dynamic Graph Neural Networks that combines Graph Neural Networks (GNN) and Recurrent Neural Networks (RNN) with weighted loss functions, the Synthetic Minority Oversampling TEchnique (SMOTE) adapted for graph data, and a carefully crafted rolling-window strategy. To evaluate predictive performance, we utilize a unique corporate data set with networks of three cities and derive a profit-driven evaluation methodology for influencer prediction. Our results show how using RNN to encode temporal attributes alongside GNNs significantly improves predictive performance. We compare the results of various models to demonstrate the importance of capturing graph representation, temporal dependencies, and using a profit-driven methodology for evaluation.  ( 2 min )
    FoPro-KD: Fourier Prompted Effective Knowledge Distillation for Long-Tailed Medical Image Recognition. (arXiv:2305.17421v2 [eess.IV] UPDATED)
    Representational transfer from publicly available models is a promising technique for improving medical image classification, especially in long-tailed datasets with rare diseases. However, existing methods often overlook the frequency-dependent behavior of these models, thereby limiting their effectiveness in transferring representations and generalizations to rare diseases. In this paper, we propose FoPro-KD, a novel framework that leverages the power of frequency patterns learned from frozen pre-trained models to enhance their transferability and compression, presenting a few unique insights: 1) We demonstrate that leveraging representations from publicly available pre-trained models can substantially improve performance, specifically for rare classes, even when utilizing representations from a smaller pre-trained model. 2) We observe that pre-trained models exhibit frequency preferences, which we explore using our proposed Fourier Prompt Generator (FPG), allowing us to manipulate specific frequencies in the input image, enhancing the discriminative representational transfer. 3) By amplifying or diminishing these frequencies in the input image, we enable Effective Knowledge Distillation (EKD). EKD facilitates the transfer of knowledge from pre-trained models to smaller models. Through extensive experiments in long-tailed gastrointestinal image recognition and skin lesion classification, where rare diseases are prevalent, our FoPro-KD framework outperforms existing methods, enabling more accessible medical models for rare disease classification. Code is available at https://github.com/xmed-lab/FoPro-KD.
    Weak-signal extraction enabled by deep-neural-network denoising of diffraction data. (arXiv:2209.09247v3 [eess.IV] UPDATED)
    Removal or cancellation of noise has wide-spread applications for imaging and acoustics. In every-day-life applications, denoising may even include generative aspects, which are unfaithful to the ground truth. For scientific use, however, denoising must reproduce the ground truth accurately. Here, we show how data can be denoised via a deep convolutional neural network such that weak signals appear with quantitative accuracy. In particular, we study X-ray diffraction on crystalline materials. We demonstrate that weak signals stemming from charge ordering, insignificant in the noisy data, become visible and accurate in the denoised data. This success is enabled by supervised training of a deep neural network with pairs of measured low- and high-noise data. We demonstrate that using artificial noise does not yield such quantitatively accurate results. Our approach thus illustrates a practical strategy for noise filtering that can be applied to challenging acquisition problems.
    Adaptive learning of density ratios in RKHS. (arXiv:2307.16164v2 [cs.LG] UPDATED)
    Estimating the ratio of two probability densities from finitely many observations of the densities is a central problem in machine learning and statistics with applications in two-sample testing, divergence estimation, generative modeling, covariate shift adaptation, conditional density estimation, and novelty detection. In this work, we analyze a large class of density ratio estimation methods that minimize a regularized Bregman divergence between the true density ratio and a model in a reproducing kernel Hilbert space (RKHS). We derive new finite-sample error bounds, and we propose a Lepskii type parameter choice principle that minimizes the bounds without knowledge of the regularity of the density ratio. In the special case of quadratic loss, our method adaptively achieves a minimax optimal error rate. A numerical illustration is provided.
    Promoting Counterfactual Robustness through Diversity. (arXiv:2312.06564v2 [cs.LG] UPDATED)
    Counterfactual explanations shed light on the decisions of black-box models by explaining how an input can be altered to obtain a favourable decision from the model (e.g., when a loan application has been rejected). However, as noted recently, counterfactual explainers may lack robustness in the sense that a minor change in the input can cause a major change in the explanation. This can cause confusion on the user side and open the door for adversarial attacks. In this paper, we study some sources of non-robustness. While there are fundamental reasons for why an explainer that returns a single counterfactual cannot be robust in all instances, we show that some interesting robustness guarantees can be given by reporting multiple rather than a single counterfactual. Unfortunately, the number of counterfactuals that need to be reported for the theoretical guarantees to hold can be prohibitively large. We therefore propose an approximation algorithm that uses a diversity criterion to select a feasible number of most relevant explanations and study its robustness empirically. Our experiments indicate that our method improves the state-of-the-art in generating robust explanations, while maintaining other desirable properties and providing competitive computational performance.
    Stochastic Nonlinear Control via Finite-dimensional Spectral Dynamic Embedding. (arXiv:2304.03907v2 [cs.LG] UPDATED)
    This paper presents an approach, Spectral Dynamics Embedding Control (SDEC), to optimal control for nonlinear stochastic systems. This method leverages an infinite-dimensional feature to linearly represent the state-action value function and exploits finite-dimensional truncation approximation for practical implementation. To characterize the effectiveness of these finite dimensional approximations, we provide an in-depth theoretical analysis to characterize the approximation error induced by the finite-dimension truncation and statistical error induced by finite-sample approximation in both policy evaluation and policy optimization. Our analysis includes two prominent kernel approximation methods: truncations onto random features and Nystrom features. We also empirically test the algorithm and compare the performance with Koopman-based, iLQR, and energy-based methods on a few benchmark problems.
    Coupled Confusion Correction: Learning from Crowds with Sparse Annotations. (arXiv:2312.07331v1 [cs.LG])
    As the size of the datasets getting larger, accurately annotating such datasets is becoming more impractical due to the expensiveness on both time and economy. Therefore, crowd-sourcing has been widely adopted to alleviate the cost of collecting labels, which also inevitably introduces label noise and eventually degrades the performance of the model. To learn from crowd-sourcing annotations, modeling the expertise of each annotator is a common but challenging paradigm, because the annotations collected by crowd-sourcing are usually highly-sparse. To alleviate this problem, we propose Coupled Confusion Correction (CCC), where two models are simultaneously trained to correct the confusion matrices learned by each other. Via bi-level optimization, the confusion matrices learned by one model can be corrected by the distilled data from the other. Moreover, we cluster the ``annotator groups'' who share similar expertise so that their confusion matrices could be corrected together. In this way, the expertise of the annotators, especially of those who provide seldom labels, could be better captured. Remarkably, we point out that the annotation sparsity not only means the average number of labels is low, but also there are always some annotators who provide very few labels, which is neglected by previous works when constructing synthetic crowd-sourcing annotations. Based on that, we propose to use Beta distribution to control the generation of the crowd-sourcing labels so that the synthetic annotations could be more consistent with the real-world ones. Extensive experiments are conducted on two types of synthetic datasets and three real-world datasets, the results of which demonstrate that CCC significantly outperforms state-of-the-art approaches.
    Reconstructing Turbulent Flows Using Physics-Aware Spatio-Temporal Dynamics and Test-Time Refinement. (arXiv:2304.12130v3 [physics.flu-dyn] UPDATED)
    Simulating turbulence is critical for many societally important applications in aerospace engineering, environmental science, the energy industry, and biomedicine. Large eddy simulation (LES) has been widely used as an alternative to direct numerical simulation (DNS) for simulating turbulent flows due to its reduced computational cost. However, LES is unable to capture all of the scales of turbulent transport accurately. Reconstructing DNS from low-resolution LES is critical for many scientific and engineering disciplines, but it poses many challenges to existing super-resolution methods due to the spatio-temporal complexity of turbulent flows. In this work, we propose a new physics-guided neural network for reconstructing the sequential DNS from low-resolution LES data. The proposed method leverages the partial differential equation that underlies the flow dynamics in the design of spatio-temporal model architecture. A degradation-based refinement method is also developed to enforce physical constraints and further reduce the accumulated reconstruction errors over long periods. The results on two different types of turbulent flow data confirm the superiority of the proposed method in reconstructing the high-resolution DNS data and preserving the physical characteristics of flow transport.
    The Copycat Perceptron: Smashing Barriers Through Collective Learning. (arXiv:2308.03743v2 [cond-mat.dis-nn] UPDATED)
    We characterize the equilibrium properties of a model of $y$ coupled binary perceptrons in the teacher-student scenario, subject to a learning rule, with an explicit ferromagnetic coupling proportional to the Hamming distance between the students' weights. In contrast to recent works, we analyze a more general setting in which thermal noise is present that affects each student's generalization performance. In the nonzero temperature regime, we find that the coupling of replicas produces a bend of the phase diagram towards smaller values of $\alpha$: This suggests that the free energy landscape gets smoother around the solution with perfect generalization (i.e., the teacher's) at a fixed fraction of examples, allowing standard thermal updates such as Simulated Annealing to easily reach the teacher solution and avoid entrapment in metastable states as it happens in the unreplicated case, even in the so-called computationally easy regime. These results provide additional analytic and numerical evidence for the recently conjectured Bayes-optimal property of Replicated Simulated Annealing (RSA) for a sufficient number of replicas. From a learning perspective, these results also suggest that multiple students working together (in this case reviewing the same data) are able to learn the same rule both significantly faster and with fewer examples, a property that could be exploited in the context of cooperative and federated learning.
    Removing Dust from CMB Observations with Diffusion Models. (arXiv:2310.16285v2 [astro-ph.CO] UPDATED)
    In cosmology, the quest for primordial $B$-modes in cosmic microwave background (CMB) observations has highlighted the critical need for a refined model of the Galactic dust foreground. We investigate diffusion-based modeling of the dust foreground and its interest for component separation. Under the assumption of a Gaussian CMB with known cosmology (or covariance matrix), we show that diffusion models can be trained on examples of dust emission maps such that their sampling process directly coincides with posterior sampling in the context of component separation. We illustrate this on simulated mixtures of dust emission and CMB. We show that common summary statistics (power spectrum, Minkowski functionals) of the components are well recovered by this process. We also introduce a model conditioned by the CMB cosmology that outperforms models trained using a single cosmology on component separation. Such a model will be used in future work for diffusion-based cosmological inference.
    Regret-Optimal Model-Free Reinforcement Learning for Discounted MDPs with Short Burn-In Time. (arXiv:2305.15546v2 [cs.LG] UPDATED)
    A crucial problem in reinforcement learning is learning the optimal policy. We study this in tabular infinite-horizon discounted Markov decision processes under the online setting. The existing algorithms either fail to achieve regret optimality or have to incur a high memory and computational cost. In addition, existing optimal algorithms all require a long burn-in time in order to achieve optimal sample efficiency, i.e., their optimality is not guaranteed unless sample size surpasses a high threshold. We address both open problems by introducing a model-free algorithm that employs variance reduction and a novel technique that switches the execution policy in a slow-yet-adaptive manner. This is the first regret-optimal model-free algorithm in the discounted setting, with the additional benefit of a low burn-in time.
    Learning Broadcast Protocols. (arXiv:2306.14284v2 [cs.FL] UPDATED)
    The problem of learning a computational model from examples has been receiving growing attention. For the particularly challenging problem of learning models of distributed systems, existing results are restricted to models with a fixed number of interacting processes. In this work we look for the first time (to the best of our knowledge) at the problem of learning a distributed system with an arbitrary number of processes, assuming only that there exists a cutoff, i.e., a number of processes that is sufficient to produce all observable behaviors. Specifically, we consider fine broadcast protocols, these are broadcast protocols (BPs) with a finite cutoff and no hidden states. We provide a learning algorithm that can infer a correct BP from a sample that is consistent with a fine BP, and a minimal equivalent BP if the sample is sufficiently complete. On the negative side we show that (a) characteristic sets of exponential size are unavoidable, (b) the consistency problem for fine BPs is NP hard, and (c) that fine BPs are not polynomially predictable.
    State-Wise Safe Reinforcement Learning With Pixel Observations. (arXiv:2311.02227v2 [cs.LG] UPDATED)
    In the context of safe exploration, Reinforcement Learning (RL) has long grappled with the challenges of balancing the tradeoff between maximizing rewards and minimizing safety violations, particularly in complex environments with contact-rich or non-smooth dynamics, and when dealing with high-dimensional pixel observations. Furthermore, incorporating state-wise safety constraints in the exploration and learning process, where the agent must avoid unsafe regions without prior knowledge, adds another layer of complexity. In this paper, we propose a novel pixel-observation safe RL algorithm that efficiently encodes state-wise safety constraints with unknown hazard regions through a newly introduced latent barrier-like function learning mechanism. As a joint learning framework, our approach begins by constructing a latent dynamics model with low-dimensional latent spaces derived from pixel observations. We then build and learn a latent barrier-like function on top of the latent dynamics and conduct policy optimization simultaneously, thereby improving both safety and the total expected return. Experimental evaluations on the safety-gym benchmark suite demonstrate that our proposed method significantly reduces safety violations throughout the training process, and demonstrates faster safety convergence compared to existing methods while achieving competitive results in reward return.
    DiffAIL: Diffusion Adversarial Imitation Learning. (arXiv:2312.06348v2 [cs.LG] UPDATED)
    Imitation learning aims to solve the problem of defining reward functions in real-world decision-making tasks. The current popular approach is the Adversarial Imitation Learning (AIL) framework, which matches expert state-action occupancy measures to obtain a surrogate reward for forward reinforcement learning. However, the traditional discriminator is a simple binary classifier and doesn't learn an accurate distribution, which may result in failing to identify expert-level state-action pairs induced by the policy interacting with the environment. To address this issue, we propose a method named diffusion adversarial imitation learning (DiffAIL), which introduces the diffusion model into the AIL framework. Specifically, DiffAIL models the state-action pairs as unconditional diffusion models and uses diffusion loss as part of the discriminator's learning objective, which enables the discriminator to capture better expert demonstrations and improve generalization. Experimentally, the results show that our method achieves state-of-the-art performance and significantly surpasses expert demonstration on two benchmark tasks, including the standard state-action setting and state-only settings. Our code can be available at the link https://github.com/ML-Group-SDU/DiffAIL.
    Multi-Modal Conformal Prediction Regions by Optimizing Convex Shape Templates. (arXiv:2312.07434v1 [cs.LG])
    Conformal prediction is a statistical tool for producing prediction regions for machine learning models that are valid with high probability. A key component of conformal prediction algorithms is a non-conformity score function that quantifies how different a model's prediction is from the unknown ground truth value. Essentially, these functions determine the shape and the size of the conformal prediction regions. However, little work has gone into finding non-conformity score functions that produce prediction regions that are multi-modal and practical, i.e., that can efficiently be used in engineering applications. We propose a method that optimizes parameterized shape template functions over calibration data, which results in non-conformity score functions that produce prediction regions with minimum volume. Our approach results in prediction regions that are multi-modal, so they can properly capture residuals of distributions that have multiple modes, and practical, so each region is convex and can be easily incorporated into downstream tasks, such as a motion planner using conformal prediction regions. Our method applies to general supervised learning tasks, while we illustrate its use in time-series prediction. We provide a toolbox and present illustrative case studies of F16 fighter jets and autonomous vehicles, showing an up to $68\%$ reduction in prediction region area.
    SocialStigmaQA: A Benchmark to Uncover Stigma Amplification in Generative Language Models. (arXiv:2312.07492v1 [cs.CL])
    Current datasets for unwanted social bias auditing are limited to studying protected demographic features such as race and gender. In this work, we introduce a comprehensive benchmark that is meant to capture the amplification of social bias, via stigmas, in generative language models. We start with a comprehensive list of 93 stigmas documented in social science literature and curate a question-answering (QA) dataset which involves simple social situations. Our benchmark, SocialStigmaQA, contains roughly 10K prompts, with a variety of prompt styles, carefully constructed to systematically test for both social bias and model robustness. We present results for SocialStigmaQA with two widely used open source generative language models and we demonstrate that the output generated by these models considerably amplifies existing social bias against stigmatized groups. Specifically, we find that the proportion of socially biased output ranges from 45% to 59% across a variety of decoding strategies and prompting styles. We discover that the deliberate design of the templates in our benchmark (e.g., by adding biasing text to the prompt or varying the answer that indicates bias) impact the model tendencies to generate socially biased output. Additionally, we report on patterns in the generated chain-of-thought output, finding a variety of problems from subtle bias to evidence of a lack of reasoning. Warning: This paper contains examples of text which is toxic, biased, and harmful.
    GTRL: An Entity Group-Aware Temporal Knowledge Graph Representation Learning Method. (arXiv:2302.11091v2 [cs.LG] UPDATED)
    Temporal Knowledge Graph (TKG) representation learning embeds entities and event types into a continuous low-dimensional vector space by integrating the temporal information, which is essential for downstream tasks, e.g., event prediction and question answering. Existing methods stack multiple graph convolution layers to model the influence of distant entities, leading to the over-smoothing problem. To alleviate the problem, recent studies infuse reinforcement learning to obtain paths that contribute to modeling the influence of distant entities. However, due to the limited number of hops, these studies fail to capture the correlation between entities that are far apart and even unreachable. To this end, we propose GTRL, an entity Group-aware Temporal knowledge graph Representation Learning method. GTRL is the first work that incorporates the entity group modeling to capture the correlation between entities by stacking only a finite number of layers. Specifically, the entity group mapper is proposed to generate entity groups from entities in a learning way. Based on entity groups, the implicit correlation encoder is introduced to capture implicit correlations between any pairwise entity groups. In addition, the hierarchical GCNs are exploited to accomplish the message aggregation and representation updating on the entity group graph and the entity graph. Finally, GRUs are employed to capture the temporal dependency in TKGs. Extensive experiments on three real-world datasets demonstrate that GTRL achieves the state-of-the-art performances on the event prediction task, outperforming the best baseline by an average of 13.44%, 9.65%, 12.15%, and 15.12% in MRR, Hits@1, Hits@3, and Hits@10, respectively.
    Graph Neural Network-based surrogate model for granular flows. (arXiv:2305.05218v2 [physics.geo-ph] UPDATED)
    Accurate simulation of granular flow dynamics is crucial for assessing various geotechnical risks, including landslides and debris flows. Granular flows involve a dynamic rearrangement of particles exhibiting complex transitions from solid-like to fluid-like responses. Traditional continuum and discrete numerical methods are limited by their computational cost in simulating large-scale systems. Statistical or machine learning-based models offer an alternative. Still, they are largely empirical, based on a limited set of parameters. Due to their permutation-dependent learning, traditional machine learning-based models require huge training data to generalize. To resolve these problems, we use a graph neural network, a state-of-the-art machine learning architecture that learns local interactions. Graphs represent the state of dynamically changing granular flows and the interaction laws, such as energy and momentum exchange between grains. We develop a graph neural network-based simulator (GNS) that takes the current state of granular flow and predicts the next state using Euler explicit integration by learning the local interaction laws. We train GNS on different granular trajectories. We then assess the performance of GNS by predicting granular column collapse. GNS accurately predicts flow dynamics for column collapses with different aspect ratios unseen during training. GNS is hundreds of times faster than high-fidelity numerical simulators. The model also generalizes to domains much larger than the training data, handling more than twice the number of particles than it was trained on.
    A probabilistic forecast methodology for volatile electricity prices in the Australian National Electricity Market. (arXiv:2311.07289v2 [cs.LG] UPDATED)
    The South Australia region of the Australian National Electricity Market (NEM) displays some of the highest levels of price volatility observed in modern electricity markets. This paper outlines an approach to probabilistic forecasting under these extreme conditions, including spike filtration and several post-processing steps. We propose using quantile regression as an ensemble tool for probabilistic forecasting, with our combined forecasts achieving superior results compared to all constituent models. Within our ensemble framework, we demonstrate that averaging models with varying training length periods leads to a more adaptive model and increased prediction accuracy. The applicability of the final model is evaluated by comparing our median forecasts with the point forecasts available from the Australian NEM operator, with our model outperforming these NEM forecasts by a significant margin.
    Early Weight Averaging meets High Learning Rates for LLM Pre-training. (arXiv:2306.03241v2 [cs.LG] UPDATED)
    Training Large Language Models (LLMs) incurs significant cost; hence, any strategy that accelerates model convergence is helpful. In this paper, we investigate the ability of a simple idea checkpoint averaging along the trajectory of a training run to improve both convergence and generalization quite early on during training. Here we show that models trained with high learning rates observe higher gains due to checkpoint averaging. Furthermore, these gains are amplified when checkpoints are sampled with considerable spacing in training steps. Our training recipe outperforms conventional training and popular checkpoint averaging baselines such as exponential moving average (EMA) and stochastic moving average (SWA). We evaluate our training recipe by pre-training LLMs, where high learning rates are inherently preferred due to extremely large batch sizes. Specifically, we pre-trained nanoGPT-2 models of varying sizes, small (125M), medium (335M), and large (770M)on the OpenWebText dataset, comprised of 9B tokens. Additionally, we present results for publicly available Pythia LLMs, ranging from 1B to 12B, which were trained on the PILE-deduped dataset containing 207B tokens.
    Dynamics Harmonic Analysis of Robotic Systems: Application in Data-Driven Koopman Modelling. (arXiv:2312.07457v1 [cs.RO])
    We introduce the use of harmonic analysis to decompose the state space of symmetric robotic systems into orthogonal isotypic subspaces. These are lower-dimensional spaces that capture distinct, symmetric, and synergistic motions. For linear dynamics, we characterize how this decomposition leads to a subdivision of the dynamics into independent linear systems on each subspace, a property we term dynamics harmonic analysis (DHA). To exploit this property, we use Koopman operator theory to propose an equivariant deep-learning architecture that leverages the properties of DHA to learn a global linear model of system dynamics. Our architecture, validated on synthetic systems and the dynamics of locomotion of a quadrupedal robot, demonstrates enhanced generalization, sample efficiency, and interpretability, with less trainable parameters and computational costs.
    Distributional Bellman Operators over Mean Embeddings. (arXiv:2312.07358v1 [stat.ML])
    We propose a novel algorithmic framework for distributional reinforcement learning, based on learning finite-dimensional mean embeddings of return distributions. We derive several new algorithms for dynamic programming and temporal-difference learning based on this framework, provide asymptotic convergence theory, and examine the empirical performance of the algorithms on a suite of tabular tasks. Further, we show that this approach can be straightforwardly combined with deep reinforcement learning, and obtain a new deep RL agent that improves over baseline distributional approaches on the Arcade Learning Environment.  ( 2 min )
    Good regularity creates large learning rate implicit biases: edge of stability, balancing, and catapult. (arXiv:2310.17087v2 [cs.LG] UPDATED)
    Large learning rates, when applied to gradient descent for nonconvex optimization, yield various implicit biases including the edge of stability (Cohen et al., 2021), balancing (Wang et al., 2022), and catapult (Lewkowycz et al., 2020). These phenomena cannot be well explained by classical optimization theory. Though significant theoretical progress has been made in understanding these implicit biases, it remains unclear for which objective functions would they be more likely. This paper provides an initial step in answering this question and also shows that these implicit biases are in fact various tips of the same iceberg. To establish these results, we develop a global convergence theory under large learning rates, for a family of nonconvex functions without globally Lipschitz continuous gradient, which was typically assumed in existing convergence analysis. Specifically, these phenomena are more likely to occur when the optimization objective function has good regularity. This regularity, together with gradient descent using a large learning rate that favors flatter regions, results in these nontrivial dynamical behaviors. Another corollary is the first non-asymptotic convergence rate bound for large-learning-rate gradient descent optimization of nonconvex functions. Although our theory only applies to specific functions so far, the possibility of extrapolating it to neural networks is also experimentally validated, for which different choices of loss, activation functions, and other techniques such as batch normalization can all affect regularity significantly and lead to very different training dynamics.
    Score dynamics: scaling molecular dynamics with picosecond timesteps via conditional diffusion model. (arXiv:2310.01678v2 [physics.comp-ph] UPDATED)
    We propose score dynamics, a general framework for learning accelerated evolution operators with large timesteps from molecular-dynamics simulations. SD is centered around scores, or derivatives of the transition log-probability with respect to the dynamical degrees of freedom. The latter play the same role as force fields in MD but are used in denoising diffusion probability models to generate discrete transitions of the dynamical variables in an SD timestep, which can be orders of magnitude larger than a typical MD timestep. In this work, we construct graph neural network based score dynamics models of realistic molecular systems that are evolved with 10 ps timesteps. We demonstrate the efficacy of score dynamics with case studies of alanine dipeptide and short alkanes in aqueous solution. Both equilibrium predictions derived from the stationary distributions of the conditional probability and kinetic predictions for the transition rates and transition paths are in good agreement with MD. Our current SD implementation is about two orders of magnitude faster than the MD counterpart for the systems studied in this work. Open challenges and possible future remedies to improve score dynamics are also discussed.
    Simple diffusion: End-to-end diffusion for high resolution images. (arXiv:2301.11093v2 [cs.CV] UPDATED)
    Currently, applying diffusion models in pixel space of high resolution images is difficult. Instead, existing approaches focus on diffusion in lower dimensional spaces (latent diffusion), or have multiple super-resolution levels of generation referred to as cascades. The downside is that these approaches add additional complexity to the diffusion framework. This paper aims to improve denoising diffusion for high resolution images while keeping the model as simple as possible. The paper is centered around the research question: How can one train a standard denoising diffusion models on high resolution images, and still obtain performance comparable to these alternate approaches? The four main findings are: 1) the noise schedule should be adjusted for high resolution images, 2) It is sufficient to scale only a particular part of the architecture, 3) dropout should be added at specific locations in the architecture, and 4) downsampling is an effective strategy to avoid high resolution feature maps. Combining these simple yet effective techniques, we achieve state-of-the-art on image generation among diffusion models without sampling modifiers on ImageNet.
    MMICT: Boosting Multi-Modal Fine-Tuning with In-Context Examples. (arXiv:2312.06363v2 [cs.AI] UPDATED)
    Although In-Context Learning (ICL) brings remarkable performance gains to Large Language Models (LLMs), the improvements remain lower than fine-tuning on downstream tasks. This paper introduces Multi-Modal In-Context Tuning (MMICT), a novel multi-modal fine-tuning paradigm that boosts multi-modal fine-tuning by fully leveraging the promising ICL capability of multi-modal LLMs (MM-LLMs). We propose the Multi-Modal Hub (M-Hub), a unified module that captures various multi-modal features according to different inputs and objectives. Based on M-Hub, MMICT enables MM-LLMs to learn from in-context visual-guided textual features and subsequently generate outputs conditioned on the textual-guided visual features. Moreover, leveraging the flexibility of M-Hub, we design a variety of in-context demonstrations. Extensive experiments on a diverse range of downstream multi-modal tasks demonstrate that MMICT significantly outperforms traditional fine-tuning strategy and the vanilla ICT method that directly takes the concatenation of all information from different modalities as input.
    FairSISA: Ensemble Post-Processing to Improve Fairness of Unlearning in LLMs. (arXiv:2312.07420v1 [cs.LG])
    Training large language models (LLMs) is a costly endeavour in terms of time and computational resources. The large amount of training data used during the unsupervised pre-training phase makes it difficult to verify all data and, unfortunately, undesirable data may be ingested during training. Re-training from scratch is impractical and has led to the creation of the 'unlearning' discipline where models are modified to "unlearn" undesirable information without retraining. However, any modification can alter the behaviour of LLMs, especially on key dimensions such as fairness. This is the first work that examines this interplay between unlearning and fairness for LLMs. In particular, we focus on a popular unlearning framework known as SISA [Bourtoule et al., 2021], which creates an ensemble of models trained on disjoint shards. We evaluate the performance-fairness trade-off for SISA, and empirically demsontrate that SISA can indeed reduce fairness in LLMs. To remedy this, we propose post-processing bias mitigation techniques for ensemble models produced by SISA. We adapt the post-processing fairness improvement technique from [Hardt et al., 2016] to design three methods that can handle model ensembles, and prove that one of the methods is an optimal fair predictor for ensemble of models. Through experimental results, we demonstrate the efficacy of our post-processing framework called 'FairSISA'.
    Building Variable-sized Models via Learngene Pool. (arXiv:2312.05743v2 [cs.LG] UPDATED)
    Recently, Stitchable Neural Networks (SN-Net) is proposed to stitch some pre-trained networks for quickly building numerous networks with different complexity and performance trade-offs. In this way, the burdens of designing or training the variable-sized networks, which can be used in application scenarios with diverse resource constraints, are alleviated. However, SN-Net still faces a few challenges. 1) Stitching from multiple independently pre-trained anchors introduces high storage resource consumption. 2) SN-Net faces challenges to build smaller models for low resource constraints. 3). SN-Net uses an unlearned initialization method for stitch layers, limiting the final performance. To overcome these challenges, motivated by the recently proposed Learngene framework, we propose a novel method called Learngene Pool. Briefly, Learngene distills the critical knowledge from a large pre-trained model into a small part (termed as learngene) and then expands this small part into a few variable-sized models. In our proposed method, we distill one pretrained large model into multiple small models whose network blocks are used as learngene instances to construct the learngene pool. Since only one large model is used, we do not need to store more large models as SN-Net and after distilling, smaller learngene instances can be created to build small models to satisfy low resource constraints. We also insert learnable transformation matrices between the instances to stitch them into variable-sized models to improve the performance of these models. Exhaustive experiments have been implemented and the results validate the effectiveness of the proposed Learngene Pool compared with SN-Net.
    Optimal Rates for Regularized Conditional Mean Embedding Learning. (arXiv:2208.01711v3 [stat.ML] UPDATED)
    We address the consistency of a kernel ridge regression estimate of the conditional mean embedding (CME), which is an embedding of the conditional distribution of $Y$ given $X$ into a target reproducing kernel Hilbert space $\mathcal{H}_Y$. The CME allows us to take conditional expectations of target RKHS functions, and has been employed in nonparametric causal and Bayesian inference. We address the misspecified setting, where the target CME is in the space of Hilbert-Schmidt operators acting from an input interpolation space between $\mathcal{H}_X$ and $L_2$, to $\mathcal{H}_Y$. This space of operators is shown to be isomorphic to a newly defined vector-valued interpolation space. Using this isomorphism, we derive a novel and adaptive statistical learning rate for the empirical CME estimator under the misspecified setting. Our analysis reveals that our rates match the optimal $O(\log n / n)$ rates without assuming $\mathcal{H}_Y$ to be finite dimensional. We further establish a lower bound on the learning rate, which shows that the obtained upper bound is optimal.
    Bayesian Optimization with Conformal Prediction Sets. (arXiv:2210.12496v4 [cs.LG] UPDATED)
    Bayesian optimization is a coherent, ubiquitous approach to decision-making under uncertainty, with applications including multi-arm bandits, active learning, and black-box optimization. Bayesian optimization selects decisions (i.e. objective function queries) with maximal expected utility with respect to the posterior distribution of a Bayesian model, which quantifies reducible, epistemic uncertainty about query outcomes. In practice, subjectively implausible outcomes can occur regularly for two reasons: 1) model misspecification and 2) covariate shift. Conformal prediction is an uncertainty quantification method with coverage guarantees even for misspecified models and a simple mechanism to correct for covariate shift. We propose conformal Bayesian optimization, which directs queries towards regions of search space where the model predictions have guaranteed validity, and investigate its behavior on a suite of black-box optimization tasks and tabular ranking tasks. In many cases we find that query coverage can be significantly improved without harming sample-efficiency.
    QSMVM: QoS-aware and social-aware multimetric routing protocol for video-streaming services over MANETs. (arXiv:2312.07414v1 [cs.NI])
    A mobile ad hoc network (MANET) is a set of autonomous mobile devices connected by wireless links in a distributed manner and without a fixed infrastructure. Real-time multimedia services, such as video-streaming over MANETs, offers very promising applications, e.g. two members of a group of tourists who want to share a video transmitted through the MANET they form; a video-streaming service deployed over a MANET where users watch a film; among other examples. On the other hand, social web technologies, where people actively interact online with others through social networks, are leading to a socialization of networks. Information of interaction among users is being used to provide socially-enhanced software. To achieve this, we need to know the strength of the relationship between a given user and each user they interact with. This strength of the relationship can be measured through a concept called tie strength (TS), first introduced by Mark Granovetter in 1973. In this article, we modify our previous proposal named multipath multimedia dynamic source routing (MMDSR) protocol to include a social metric TS in the decisions taken by the forwarding algorithm. We find a trade-off between the quality of service (QoS) and the trust level between users who form the forwarding path in the MANET. Our goal is to increase the trust metric while the QoS is not affected significantly.
    A General Implicit Framework for Fast NeRF Composition and Rendering. (arXiv:2308.04669v3 [cs.CV] UPDATED)
    A variety of Neural Radiance Fields (NeRF) methods have recently achieved remarkable success in high render speed. However, current accelerating methods are specialized and incompatible with various implicit methods, preventing real-time composition over various types of NeRF works. Because NeRF relies on sampling along rays, it is possible to provide general guidance for acceleration. To that end, we propose a general implicit pipeline for composing NeRF objects quickly. Our method enables the casting of dynamic shadows within or between objects using analytical light sources while allowing multiple NeRF objects to be seamlessly placed and rendered together with any arbitrary rigid transformations. Mainly, our work introduces a new surface representation known as Neural Depth Fields (NeDF) that quickly determines the spatial relationship between objects by allowing direct intersection computation between rays and implicit surfaces. It leverages an intersection neural network to query NeRF for acceleration instead of depending on an explicit spatial structure.Our proposed method is the first to enable both the progressive and interactive composition of NeRF objects. Additionally, it also serves as a previewing plugin for a range of existing NeRF works.
    DifAttack: Query-Efficient Black-Box Attack via Disentangled Feature Space. (arXiv:2309.14585v2 [cs.CV] UPDATED)
    This work investigates efficient score-based black-box adversarial attacks with a high Attack Success Rate (ASR) and good generalizability. We design a novel attack method based on a Disentangled Feature space, called DifAttack, which differs significantly from the existing ones operating over the entire feature space. Specifically, DifAttack firstly disentangles an image's latent feature into an adversarial feature and a visual feature, where the former dominates the adversarial capability of an image, while the latter largely determines its visual appearance. We train an autoencoder for the disentanglement by using pairs of clean images and their Adversarial Examples (AEs) generated from available surrogate models via white-box attack methods. Eventually, DifAttack iteratively optimizes the adversarial feature according to the query feedback from the victim model until a successful AE is generated, while keeping the visual feature unaltered. In addition, due to the avoidance of using surrogate models' gradient information when optimizing AEs for black-box models, our proposed DifAttack inherently possesses better attack capability in the open-set scenario, where the training dataset of the victim model is unknown. Extensive experimental results demonstrate that our method achieves significant improvements in ASR and query efficiency simultaneously, especially in the targeted attack and open-set scenarios. The code will be available at https://github.com/csjunjun/DifAttack.git soon.
    Factorized Discriminant Analysis for Genetic Signatures of Neuronal Phenotypes. (arXiv:2010.02171v7 [q-bio.QM] UPDATED)
    Navigating the complex landscape of single-cell transcriptomic data presents significant challenges. Central to this challenge is the identification of a meaningful representation of high-dimensional gene expression patterns that sheds light on the structural and functional properties of cell types. Pursuing model interpretability and computational simplicity, we often look for a linear transformation of the original data that aligns with key phenotypic features of cells. In response to this need, we introduce factorized linear discriminant analysis (FLDA), a novel method for linear dimensionality reduction. The crux of FLDA lies in identifying a linear function of gene expression levels that is highly correlated with one phenotypic feature while minimizing the influence of others. To augment this method, we integrate it with a sparsity-based regularization algorithm. This integration is crucial as it selects a subset of genes pivotal to a specific phenotypic feature or a combination thereof. To illustrate the effectiveness of FLDA, we apply it to transcriptomic datasets from neurons in the Drosophila optic lobe. We demonstrate that FLDA not only captures the inherent structural patterns aligned with phenotypic features but also uncovers key genes associated with each phenotype.
    Large Foundation Models for Power Systems. (arXiv:2312.07044v1 [eess.SY])
    Foundation models, such as Large Language Models (LLMs), can respond to a wide range of format-free queries without any task-specific data collection or model training, creating various research and application opportunities for the modeling and operation of large-scale power systems. In this paper, we outline how such large foundation model such as GPT-4 are developed, and discuss how they can be leveraged in challenging power and energy system tasks. We first investigate the potential of existing foundation models by validating their performance on four representative tasks across power system domains, including the optimal power flow (OPF), electric vehicle (EV) scheduling, knowledge retrieval for power engineering technical reports, and situation awareness. Our results indicate strong capabilities of such foundation models on boosting the efficiency and reliability of power system operational pipelines. We also provide suggestions and projections on future deployment of foundation models in power system applications.
    DEFT: Dexterous Fine-Tuning for Real-World Hand Policies. (arXiv:2310.19797v2 [cs.RO] UPDATED)
    Dexterity is often seen as a cornerstone of complex manipulation. Humans are able to perform a host of skills with their hands, from making food to operating tools. In this paper, we investigate these challenges, especially in the case of soft, deformable objects as well as complex, relatively long-horizon tasks. However, learning such behaviors from scratch can be data inefficient. To circumvent this, we propose a novel approach, DEFT (DExterous Fine-Tuning for Hand Policies), that leverages human-driven priors, which are executed directly in the real world. In order to improve upon these priors, DEFT involves an efficient online optimization procedure. With the integration of human-based learning and online fine-tuning, coupled with a soft robotic hand, DEFT demonstrates success across various tasks, establishing a robust, data-efficient pathway toward general dexterous manipulation. Please see our website at https://dexterous-finetuning.github.io for video results.
    AI Control: Improving Safety Despite Intentional Subversion. (arXiv:2312.06942v1 [cs.LG])
    As large language models (LLMs) become more powerful and are deployed more autonomously, it will be increasingly important to prevent them from causing harmful outcomes. Researchers have investigated a variety of safety techniques for this purpose, e.g. using models to review the outputs of other models, or red-teaming techniques to surface subtle failure modes. However, researchers have not evaluated whether such techniques still ensure safety if the model is itself intentionally trying to subvert them. In this paper, we develop and evaluate pipelines of safety techniques ("protocols") that are robust to intentional subversion. We investigate a scenario in which we want to solve a sequence of programming problems, using access to a powerful but untrusted model (in our case, GPT-4), access to a less powerful trusted model (in our case, GPT-3.5), and limited access to human contractors who provide high-quality trusted labor. We investigate protocols that aim to never submit solutions containing backdoors, which we operationalize here as logical errors that are not caught by test cases. We investigate a range of protocols and test each against strategies that the untrusted model could use to subvert them. One protocol is what we call trusted editing. This protocol first asks GPT-4 to write code, and then asks GPT-3.5 to rate the suspiciousness of that code. If the code is below some suspiciousness threshold, it is submitted. Otherwise, GPT-3.5 edits the solution to remove parts that seem suspicious and then submits the edited code. Another protocol is untrusted monitoring. This protocol asks GPT-4 to write code, and then asks another instance of GPT-4 whether the code is backdoored, using various techniques to prevent the GPT-4 instances from colluding. These protocols improve substantially on simple baselines.
    Promoting Fairness in GNNs: A Characterization of Stability. (arXiv:2309.03648v3 [cs.LG] UPDATED)
    The Lipschitz bound, a technique from robust statistics, can limit the maximum changes in the output concerning the input, taking into account associated irrelevant biased factors. It is an efficient and provable method for examining the output stability of machine learning models without incurring additional computation costs. Recently, Graph Neural Networks (GNNs), which operate on non-Euclidean data, have gained significant attention. However, no previous research has investigated the GNN Lipschitz bounds to shed light on stabilizing model outputs, especially when working on non-Euclidean data with inherent biases. Given the inherent biases in common graph data used for GNN training, it poses a serious challenge to constraining the GNN output perturbations induced by input biases, thereby safeguarding fairness during training. Recently, despite the Lipschitz constant's use in controlling the stability of Euclideanneural networks, the calculation of the precise Lipschitz constant remains elusive for non-Euclidean neural networks like GNNs, especially within fairness contexts. To narrow this gap, we begin with the general GNNs operating on an attributed graph, and formulate a Lipschitz bound to limit the changes in the output regarding biases associated with the input. Additionally, we theoretically analyze how the Lipschitz constant of a GNN model could constrain the output perturbations induced by biases learned from data for fairness training. We experimentally validate the Lipschitz bound's effectiveness in limiting biases of the model output. Finally, from a training dynamics perspective, we demonstrate why the theoretical Lipschitz bound can effectively guide the GNN training to better trade-off between accuracy and fairness.
    Equivariant Flow Matching with Hybrid Probability Transport. (arXiv:2312.07168v1 [cs.LG])
    The generation of 3D molecules requires simultaneously deciding the categorical features~(atom types) and continuous features~(atom coordinates). Deep generative models, especially Diffusion Models (DMs), have demonstrated effectiveness in generating feature-rich geometries. However, existing DMs typically suffer from unstable probability dynamics with inefficient sampling speed. In this paper, we introduce geometric flow matching, which enjoys the advantages of both equivariant modeling and stabilized probability dynamics. More specifically, we propose a hybrid probability path where the coordinates probability path is regularized by an equivariant optimal transport, and the information between different modalities is aligned. Experimentally, the proposed method could consistently achieve better performance on multiple molecule generation benchmarks with 4.75$\times$ speed up of sampling on average.
    Class Probability Matching Using Kernel Methods for Label Shift Adaptation. (arXiv:2312.07282v1 [stat.ML])
    In domain adaptation, covariate shift and label shift problems are two distinct and complementary tasks. In covariate shift adaptation where the differences in data distribution arise from variations in feature probabilities, existing approaches naturally address this problem based on \textit{feature probability matching} (\textit{FPM}). However, for label shift adaptation where the differences in data distribution stem solely from variations in class probability, current methods still use FPM on the $d$-dimensional feature space to estimate the class probability ratio on the one-dimensional label space. To address label shift adaptation more naturally and effectively, inspired by a new representation of the source domain's class probability, we propose a new framework called \textit{class probability matching} (\textit{CPM}) which matches two class probability functions on the one-dimensional label space to estimate the class probability ratio, fundamentally different from FPM operating on the $d$-dimensional feature space. Furthermore, by incorporating the kernel logistic regression into the CPM framework to estimate the conditional probability, we propose an algorithm called \textit{class probability matching using kernel methods} (\textit{CPMKM}) for label shift adaptation. From the theoretical perspective, we establish the optimal convergence rates of CPMKM with respect to the cross-entropy loss for multi-class label shift adaptation. From the experimental perspective, comparisons on real datasets demonstrate that CPMKM outperforms existing FPM-based and maximum-likelihood-based algorithms.
    Online Saddle Point Problem and Online Convex-Concave Optimization. (arXiv:2312.06957v1 [cs.LG])
    Centered around solving the Online Saddle Point problem, this paper introduces the Online Convex-Concave Optimization (OCCO) framework, which involves a sequence of two-player time-varying convex-concave games. We propose the generalized duality gap (Dual-Gap) as the performance metric and establish the parallel relationship between OCCO with Dual-Gap and Online Convex Optimization (OCO) with regret. To demonstrate the natural extension of OCCO from OCO, we develop two algorithms, the implicit online mirror descent-ascent and its optimistic variant. Analysis reveals that their duality gaps share similar expression forms with the corresponding dynamic regrets arising from implicit updates in OCO. Empirical results further substantiate the effectiveness of our algorithms. Simultaneously, we unveil that the dynamic Nash equilibrium regret, which was initially introduced in a recent paper, has inherent defects.
    The Computational Complexity of Concise Hypersphere Classification. (arXiv:2312.07103v1 [cs.LG])
    Hypersphere classification is a classical and foundational method that can provide easy-to-process explanations for the classification of real-valued and binary data. However, obtaining an (ideally concise) explanation via hypersphere classification is much more difficult when dealing with binary data than real-valued data. In this paper, we perform the first complexity-theoretic study of the hypersphere classification problem for binary data. We use the fine-grained parameterized complexity paradigm to analyze the impact of structural properties that may be present in the input data as well as potential conciseness constraints. Our results include stronger lower bounds and new fixed-parameter algorithms for hypersphere classification of binary data, which can find an exact and concise explanation when one exists.
    Scalable Motion Style Transfer with Constrained Diffusion Generation. (arXiv:2312.07311v1 [cs.CV])
    Current training of motion style transfer systems relies on consistency losses across style domains to preserve contents, hindering its scalable application to a large number of domains and private data. Recent image transfer works show the potential of independent training on each domain by leveraging implicit bridging between diffusion models, with the content preservation, however, limited to simple data patterns. We address this by imposing biased sampling in backward diffusion while maintaining the domain independence in the training stage. We construct the bias from the source domain keyframes and apply them as the gradient of content constraints, yielding a framework with keyframe manifold constraint gradients (KMCGs). Our validation demonstrates the success of training separate models to transfer between as many as ten dance motion styles. Comprehensive experiments find a significant improvement in preserving motion contents in comparison to baseline and ablative diffusion-based style transfer models. In addition, we perform a human study for a subjective assessment of the quality of generated dance motions. The results validate the competitiveness of KMCGs.
    Adversarial Purification with the Manifold Hypothesis. (arXiv:2210.14404v4 [cs.LG] UPDATED)
    In this work, we formulate a novel framework for adversarial robustness using the manifold hypothesis. This framework provides sufficient conditions for defending against adversarial examples. We develop an adversarial purification method with this framework. Our method combines manifold learning with variational inference to provide adversarial robustness without the need for expensive adversarial training. Experimentally, our approach can provide adversarial robustness even if attackers are aware of the existence of the defense. In addition, our method can also serve as a test-time defense mechanism for variational autoencoders.
    Analyze the Robustness of Classifiers under Label Noise. (arXiv:2312.07271v1 [cs.LG])
    This study explores the robustness of label noise classifiers, aiming to enhance model resilience against noisy data in complex real-world scenarios. Label noise in supervised learning, characterized by erroneous or imprecise labels, significantly impairs model performance. This research focuses on the increasingly pertinent issue of label noise's impact on practical applications. Addressing the prevalent challenge of inaccurate training data labels, we integrate adversarial machine learning (AML) and importance reweighting techniques. Our approach involves employing convolutional neural networks (CNN) as the foundational model, with an emphasis on parameter adjustment for individual training samples. This strategy is designed to heighten the model's focus on samples critically influencing performance.
    Momentum Particle Maximum Likelihood. (arXiv:2312.07335v1 [cs.LG])
    Maximum likelihood estimation (MLE) of latent variable models is often recast as an optimization problem over the extended space of parameters and probability distributions. For example, the Expectation Maximization (EM) algorithm can be interpreted as coordinate descent applied to a suitable free energy functional over this space. Recently, this perspective has been combined with insights from optimal transport and Wasserstein gradient flows to develop particle-based algorithms applicable to wider classes of models than standard EM. Drawing inspiration from prior works which interpret `momentum-enriched' optimisation algorithms as discretizations of ordinary differential equations, we propose an analogous dynamical systems-inspired approach to minimizing the free energy functional over the extended space of parameters and probability distributions. The result is a dynamic system that blends elements of Nesterov's Accelerated Gradient method, the underdamped Langevin diffusion, and particle methods. Under suitable assumptions, we establish quantitative convergence of the proposed system to the unique minimiser of the functional in continuous time. We then propose a numerical discretization of this system which enables its application to parameter estimation in latent variable models. Through numerical experiments, we demonstrate that the resulting algorithm converges faster than existing methods and compares favourably with other (approximate) MLE algorithms.
    ReRoGCRL: Representation-based Robustness in Goal-Conditioned Reinforcement Learning. (arXiv:2312.07392v1 [cs.LG])
    While Goal-Conditioned Reinforcement Learning (GCRL) has gained attention, its algorithmic robustness, particularly against adversarial perturbations, remains unexplored. Unfortunately, the attacks and robust representation training methods specifically designed for traditional RL are not so effective when applied to GCRL. To address this challenge, we propose the \textit{Semi-Contrastive Representation} attack, a novel approach inspired by the adversarial contrastive attack. Unlike existing attacks in RL, it only necessitates information from the policy function and can be seamlessly implemented during deployment. Furthermore, to mitigate the vulnerability of existing GCRL algorithms, we introduce \textit{Adversarial Representation Tactics}. This strategy combines \textit{Semi-Contrastive Adversarial Augmentation} with \textit{Sensitivity-Aware Regularizer}. It improves the adversarial robustness of the underlying agent against various types of perturbations. Extensive experiments validate the superior performance of our attack and defence mechanism across multiple state-of-the-art GCRL algorithms. Our tool {\bf ReRoGCRL} is available at \url{https://github.com/TrustAI/ReRoGCRL}.
    How Well Does GPT-4V(ision) Adapt to Distribution Shifts? A Preliminary Investigation. (arXiv:2312.07424v1 [cs.LG])
    In machine learning, generalization against distribution shifts -- where deployment conditions diverge from the training scenarios -- is crucial, particularly in fields like climate modeling, biomedicine, and autonomous driving. The emergence of foundation models, distinguished by their extensive pretraining and task versatility, has led to an increased interest in their adaptability to distribution shifts. GPT-4V(ision) acts as the most advanced publicly accessible multimodal foundation model, with extensive applications across various domains, including anomaly detection, video understanding, image generation, and medical diagnosis. However, its robustness against data distributions remains largely underexplored. Addressing this gap, this study rigorously evaluates GPT-4V's adaptability and generalization capabilities in dynamic environments, benchmarking against prominent models like CLIP and LLaVA. We delve into GPT-4V's zero-shot generalization across 13 diverse datasets spanning natural, medical, and molecular domains. We further investigate its adaptability to controlled data perturbations and examine the efficacy of in-context learning as a tool to enhance its adaptation. Our findings delineate GPT-4V's capability boundaries in distribution shifts, shedding light on its strengths and limitations across various scenarios. Importantly, this investigation contributes to our understanding of how AI foundation models generalize to distribution shifts, offering pivotal insights into their adaptability and robustness. Code is publicly available at https://github.com/jameszhou-gl/gpt-4v-distribution-shift.
    A churn prediction dataset from the telecom sector: a new benchmark for uplift modeling. (arXiv:2312.07206v1 [cs.LG])
    Uplift modeling, also known as individual treatment effect (ITE) estimation, is an important approach for data-driven decision making that aims to identify the causal impact of an intervention on individuals. This paper introduces a new benchmark dataset for uplift modeling focused on churn prediction, coming from a telecom company in Belgium, Orange Belgium. Churn, in this context, refers to customers terminating their subscription to the telecom service. This is the first publicly available dataset offering the possibility to evaluate the efficiency of uplift modeling on the churn prediction problem. Moreover, its unique characteristics make it more challenging than the few other public uplift datasets.
    Complex Recurrent Spectral Network. (arXiv:2312.07296v1 [cs.LG])
    This paper presents a novel approach to advancing artificial intelligence (AI) through the development of the Complex Recurrent Spectral Network ($\mathbb{C}$-RSN), an innovative variant of the Recurrent Spectral Network (RSN) model. The $\mathbb{C}$-RSN is designed to address a critical limitation in existing neural network models: their inability to emulate the complex processes of biological neural networks dynamically and accurately. By integrating key concepts from dynamical systems theory and leveraging principles from statistical mechanics, the $\mathbb{C}$-RSN model introduces localized non-linearity, complex fixed eigenvalues, and a distinct separation of memory and input processing functionalities. These features collectively enable the $\mathbb{C}$-RSN evolving towards a dynamic, oscillating final state that more closely mirrors biological cognition. Central to this work is the exploration of how the $\mathbb{C}$-RSN manages to capture the rhythmic, oscillatory dynamics intrinsic to biological systems, thanks to its complex eigenvalue structure and the innovative segregation of its linear and non-linear components. The model's ability to classify data through a time-dependent function, and the localization of information processing, is demonstrated with an empirical evaluation using the MNIST dataset. Remarkably, distinct items supplied as a sequential input yield patterns in time which bear the indirect imprint of the insertion order (and of the time of separation between contiguous insertions).
    Local Function Complexity for Active Learning via Mixture of Gaussian Processes. (arXiv:1902.10664v6 [cs.LG] UPDATED)
    Inhomogeneities in real-world data, e.g., due to changes in the observation noise level or variations in the structural complexity of the source function, pose a unique set of challenges for statistical inference. Accounting for them can greatly improve predictive power when physical resources or computation time is limited. In this paper, we draw on recent theoretical results on the estimation of local function complexity (LFC), derived from the domain of local polynomial smoothing (LPS), to establish a notion of local structural complexity, which is used to develop a model-agnostic active learning (AL) framework. Due to its reliance on pointwise estimates, the LPS model class is not robust and scalable concerning large input space dimensions that typically come along with real-world problems. Here, we derive and estimate the Gaussian process regression (GPR)-based analog of the LPS-based LFC and use it as a substitute in the above framework to make it robust and scalable. We assess the effectiveness of our LFC estimate in an AL application on a prototypical low-dimensional synthetic dataset, before taking on the challenging real-world task of reconstructing a quantum chemical force field for a small organic molecule and demonstrating state-of-the-art performance with a significantly reduced training demand.
    EdgePruner: Poisoned Edge Pruning in Graph Contrastive Learning. (arXiv:2312.07022v1 [cs.CR])
    Graph Contrastive Learning (GCL) is unsupervised graph representation learning that can obtain useful representation of unknown nodes. The node representation can be utilized as features of downstream tasks. However, GCL is vulnerable to poisoning attacks as with existing learning models. A state-of-the-art defense cannot sufficiently negate adverse effects by poisoned graphs although such a defense introduces adversarial training in the GCL. To achieve further improvement, pruning adversarial edges is important. To the best of our knowledge, the feasibility remains unexplored in the GCL domain. In this paper, we propose a simple defense for GCL, EdgePruner. We focus on the fact that the state-of-the-art poisoning attack on GCL tends to mainly add adversarial edges to create poisoned graphs, which means that pruning edges is important to sanitize the graphs. Thus, EdgePruner prunes edges that contribute to minimizing the contrastive loss based on the node representation obtained after training on poisoned graphs by GCL. Furthermore, we focus on the fact that nodes with distinct features are connected by adversarial edges in poisoned graphs. Thus, we introduce feature similarity between neighboring nodes to help more appropriately determine adversarial edges. This similarity is helpful in further eliminating adverse effects from poisoned graphs on various datasets. Finally, EdgePruner outputs a graph that yields the minimum contrastive loss as the sanitized graph. Our results demonstrate that pruning adversarial edges is feasible on six datasets. EdgePruner can improve the accuracy of node classification under the attack by up to 5.55% compared with that of the state-of-the-art defense. Moreover, we show that EdgePruner is immune to an adaptive attack.
    APG: Adaptive Parameter Generation Network for Click-Through Rate Prediction. (arXiv:2203.16218v3 [cs.IR] UPDATED)
    In many web applications, deep learning-based CTR prediction models (deep CTR models for short) are widely adopted. Traditional deep CTR models learn patterns in a static manner, i.e., the network parameters are the same across all the instances. However, such a manner can hardly characterize each of the instances which may have different underlying distributions. It actually limits the representation power of deep CTR models, leading to sub-optimal results. In this paper, we propose an efficient, effective, and universal module, named as Adaptive Parameter Generation network (APG), which can dynamically generate parameters for deep CTR models on-the-fly based on different instances. Extensive experimental evaluation results show that APG can be applied to a variety of deep CTR models and significantly improve their performance. Meanwhile, APG can reduce the time cost by 38.7\% and memory usage by 96.6\% compared to a regular deep CTR model. We have deployed APG in the industrial sponsored search system and achieved 3\% CTR gain and 1\% RPM gain respectively.
    LLMs Perform Poorly at Concept Extraction in Cyber-security Research Literature. (arXiv:2312.07110v1 [cs.CL])
    The cybersecurity landscape evolves rapidly and poses threats to organizations. To enhance resilience, one needs to track the latest developments and trends in the domain. It has been demonstrated that standard bibliometrics approaches show their limits in such a fast-evolving domain. For this purpose, we use large language models (LLMs) to extract relevant knowledge entities from cybersecurity-related texts. We use a subset of arXiv preprints on cybersecurity as our data and compare different LLMs in terms of entity recognition (ER) and relevance. The results suggest that LLMs do not produce good knowledge entities that reflect the cybersecurity context, but our results show some potential for noun extractors. For this reason, we developed a noun extractor boosted with some statistical analysis to extract specific and relevant compound nouns from the domain. Later, we tested our model to identify trends in the LLM domain. We observe some limitations, but it offers promising results to monitor the evolution of emergent trends.
    System-level Safety Guard: Safe Tracking Control through Uncertain Neural Network Dynamics Models. (arXiv:2312.06810v1 [cs.RO])
    The Neural Network (NN), as a black-box function approximator, has been considered in many control and robotics applications. However, difficulties in verifying the overall system safety in the presence of uncertainties hinder the modular deployment of NN in safety-critical systems. In this paper, we leverage the NNs as predictive models for trajectory tracking of unknown dynamical systems. We consider controller design in the presence of both intrinsic uncertainty and uncertainties from other system modules. In this setting, we formulate the constrained trajectory tracking problem and show that it can be solved using Mixed-integer Linear Programming (MILP). The proposed MILP-based solution enjoys a provable safety guarantee for the overall system, and the approach is empirically demonstrated in robot navigation and obstacle avoidance through simulations. The demonstration videos are available at https://xiaolisean.github.io/publication/2023-11-01-L4DC2024.
    Predictive variational autoencoder for learning robust representations of time-series data. (arXiv:2312.06932v1 [cs.LG])
    Variational autoencoders (VAEs) have been used extensively to discover low-dimensional latent factors governing neural activity and animal behavior. However, without careful model selection, the uncovered latent factors may reflect noise in the data rather than true underlying features, rendering such representations unsuitable for scientific interpretation. Existing solutions to this problem involve introducing additional measured variables or data augmentations specific to a particular data type. We propose a VAE architecture that predicts the next point in time and show that it mitigates the learning of spurious features. In addition, we introduce a model selection metric based on smoothness over time in the latent space. We show that together these two constraints on VAEs to be smooth over time produce robust latent representations and faithfully recover latent factors on synthetic datasets.
    Anytime Approximate Formal Feature Attribution. (arXiv:2312.06973v1 [cs.AI])
    Widespread use of artificial intelligence (AI) algorithms and machine learning (ML) models on the one hand and a number of crucial issues pertaining to them warrant the need for explainable artificial intelligence (XAI). A key explainability question is: given this decision was made, what are the input features which contributed to the decision? Although a range of XAI approaches exist to tackle this problem, most of them have significant limitations. Heuristic XAI approaches suffer from the lack of quality guarantees, and often try to approximate Shapley values, which is not the same as explaining which features contribute to a decision. A recent alternative is so-called formal feature attribution (FFA), which defines feature importance as the fraction of formal abductive explanations (AXp's) containing the given feature. This measures feature importance from the view of formally reasoning about the model's behavior. It is challenging to compute FFA using its definition because that involves counting AXp's, although one can approximate it. Based on these results, this paper makes several contributions. First, it gives compelling evidence that computing FFA is intractable, even if the set of contrastive formal explanations (CXp's) is provided, by proving that the problem is #P-hard. Second, by using the duality between AXp's and CXp's, it proposes an efficient heuristic to switch from CXp enumeration to AXp enumeration on-the-fly resulting in an adaptive explanation enumeration algorithm effectively approximating FFA in an anytime fashion. Finally, experimental results obtained on a range of widely used datasets demonstrate the effectiveness of the proposed FFA approximation approach in terms of the error of FFA approximation as well as the number of explanations computed and their diversity given a fixed time limit.
    Benchmarking Deep Learning Classifiers for SAR Automatic Target Recognition. (arXiv:2312.06940v1 [cs.CV])
    Synthetic Aperture Radar SAR Automatic Target Recognition ATR is a key technique of remote-sensing image recognition which can be supported by deep neural networks The existing works of SAR ATR mostly focus on improving the accuracy of the target recognition while ignoring the systems performance in terms of speed and storage which is critical to real-world applications of SAR ATR For decision-makers aiming to identify a proper deep learning model to deploy in a SAR ATR system it is important to understand the performance of different candidate deep learning models and determine the best model accordingly This paper comprehensively benchmarks several advanced deep learning models for SAR ATR with multiple distinct SAR imagery datasets Specifically we train and test five SAR image classifiers based on Residual Neural Networks ResNet18 ResNet34 ResNet50 Graph Neural Network GNN and Vision Transformer for Small-Sized Datasets (SS-ViT) We select three datasets MSTAR GBSAR and SynthWakeSAR that offer heterogeneity We evaluate and compare the five classifiers concerning their classification accuracy runtime performance in terms of inference throughput and analytical performance in terms of number of parameters number of layers model size and number of operations Experimental results show that the GNN classifier outperforms with respect to throughput and latency However it is also shown that no clear model winner emerges from all of our chosen metrics and a one model rules all case is doubtful in the domain of SAR ATR
    Feature Norm Regularized Federated Learning: Transforming Skewed Distributions into Global Insights. (arXiv:2312.06951v1 [cs.LG])
    In the field of federated learning, addressing non-independent and identically distributed (non-i.i.d.) data remains a quintessential challenge for improving global model performance. This work introduces the Feature Norm Regularized Federated Learning (FNR-FL) algorithm, which uniquely incorporates class average feature norms to enhance model accuracy and convergence in non-i.i.d. scenarios. Our comprehensive analysis reveals that FNR-FL not only accelerates convergence but also significantly surpasses other contemporary federated learning algorithms in test accuracy, particularly under feature distribution skew scenarios. The novel modular design of FNR-FL facilitates seamless integration with existing federated learning frameworks, reinforcing its adaptability and potential for widespread application. We substantiate our claims through rigorous empirical evaluations, demonstrating FNR-FL's exceptional performance across various skewed data distributions. Relative to FedAvg, FNR-FL exhibits a substantial 66.24\% improvement in accuracy and a significant 11.40\% reduction in training time, underscoring its enhanced effectiveness and efficiency.
    DYAD: A Descriptive Yet Abjuring Density efficient approximation to linear neural network layers. (arXiv:2312.06881v1 [cs.LG])
    We devise, implement and performance-asses DYAD, a layer which can serve as a faster and more memory-efficient approximate replacement for linear layers, (nn.Linear() in Pytorch). These layers appear in common subcomponents, such as in the ff module of Transformers. DYAD is based on a bespoke near-sparse matrix structure which approximates the dense "weight" matrix W that matrix-multiplies the input in the typical realization of such a layer, a.k.a DENSE. Our alternative near-sparse matrix structure is decomposable to a sum of 2 matrices permutable to a block-sparse counterpart. These can be represented as 3D tensors, which in unison allow a faster execution of matrix multiplication with the mini-batched input matrix X compared to DENSE (O(rows(W ) x cols(W )) --> O( rows(W ) x cols(W ) # of blocks )). As the crux of our experiments, we pretrain both DYAD and DENSE variants of 2 sizes of the OPT arch and 1 size of the Pythia arch, including at different token scales of the babyLM benchmark. We find DYAD to be competitive (>= 90%) of DENSE performance on zero-shot (e.g. BLIMP), few-shot (OPENLM) and finetuning (GLUE) benchmarks, while being >=7-15% faster to train on-GPU even at 125m scale, besides surfacing larger speedups at increasing scale and model width.
    One-dimensional Convolutional Neural Networks for Detecting Transiting Exoplanets. (arXiv:2312.07161v1 [astro-ph.EP])
    The transit method is one of the most relevant exoplanet detection techniques, which consists of detecting periodic eclipses in the light curves of stars. This is not always easy due to the presence of noise in the light curves, which is induced, for example, by the response of a telescope to stellar flux. For this reason, we aimed to develop an artificial neural network model that is able to detect these transits in light curves obtained from different telescopes and surveys. We created artificial light curves with and without transits to try to mimic those expected for the extended mission of the Kepler telescope (K2) in order to train and validate a 1D convolutional neural network model, which was later tested, obtaining an accuracy of 99.02 % and an estimated error (loss function) of 0.03. These results, among others, helped to confirm that the 1D CNN is a good choice for working with non-phased-folded Mandel and Agol light curves with transits. It also reduces the number of light curves that have to be visually inspected to decide if they present transit-like signals and decreases the time needed for analyzing each (with respect to traditional analysis).
    Learning Polynomial Representations of Physical Objects with Application to Certifying Correct Packing Configurations. (arXiv:2312.06791v1 [math.OC])
    This paper introduces a novel approach for learning polynomial representations of physical objects. Given a point cloud data set associated with a physical object, we solve a one-class classification problem to bound the data points by a polynomial sublevel set while harnessing Sum-of-Squares (SOS) programming to enforce prior shape knowledge constraints. By representing objects as polynomial sublevel sets we further show it is possible to construct a secondary SOS program to certify whether objects are packed correctly, that is object boundaries do not overlap and are inside some container set. While not employing reinforcement learning (RL) in this work, our proposed secondary SOS program does provide a potential surrogate reward function for RL algorithms, autonomously rewarding agents that propose object rotations and translations that correctly pack objects within a given container set.
    Spectral State Space Models. (arXiv:2312.06837v1 [cs.LG])
    This paper studies sequence modeling for prediction tasks with long range dependencies. We propose a new formulation for state space models based on learning linear dynamical systems with the spectral filtering algorithm [HSZ17]. This gives rise to a novel sequence prediction architecture we call spectral state space models. The resulting models are evaluated on synthetic dynamical systems. These evaluations support the theoretical benefits of spectral filtering for tasks requiring very long range memory.
    Humans vs Large Language Models: Judgmental Forecasting in an Era of Advanced AI. (arXiv:2312.06941v1 [cs.LG])
    This study investigates the forecasting accuracy of human experts versus Large Language Models (LLMs) in the retail sector, particularly during standard and promotional sales periods. Utilizing a controlled experimental setup with 123 human forecasters and five LLMs, including ChatGPT4, ChatGPT3.5, Bard, Bing, and Llama2, we evaluated forecasting precision through Mean Absolute Percentage Error. Our analysis centered on the effect of the following factors on forecasters performance: the supporting statistical model (baseline and advanced), whether the product was on promotion, and the nature of external impact. The findings indicate that LLMs do not consistently outperform humans in forecasting accuracy and that advanced statistical forecasting models do not uniformly enhance the performance of either human forecasters or LLMs. Both human and LLM forecasters exhibited increased forecasting errors, particularly during promotional periods and under the influence of positive external impacts. Our findings call for careful consideration when integrating LLMs into practical forecasting processes.
    Rethinking Compression: Reduced Order Modelling of Latent Features in Large Language Models. (arXiv:2312.07046v1 [cs.LG])
    Due to the substantial scale of Large Language Models (LLMs), the direct application of conventional compression methodologies proves impractical. The computational demands associated with even minimal gradient updates present challenges, particularly on consumer-grade hardware. This paper introduces an innovative approach for the parametric and practical compression of LLMs based on reduced order modelling, which entails low-rank decomposition within the feature space and re-parameterization in the weight space. Notably, this compression technique operates in a layer-wise manner, obviating the need for a GPU device and enabling the compression of billion-scale models within stringent constraints of both memory and time. Our method represents a significant advancement in model compression by leveraging matrix decomposition, demonstrating superior efficacy compared to the prevailing state-of-the-art structured pruning method.
    Resetting a fixed broken ELBO. (arXiv:2312.06828v1 [stat.ML])
    Variational autoencoders (VAEs) are one class of generative probabilistic latent-variable models designed for inference based on known data. They balance reconstruction and regularizer terms. A variational approximation produces an evidence lower bound (ELBO). Multiplying the regularizer term by beta provides a beta-VAE/ELBO, improving disentanglement of the latent space. However, any beta value different than unity violates the laws of conditional probability. To provide a similarly-parameterized VAE, we develop a Renyi (versus Shannon) entropy VAE, and a variational approximation RELBO that introduces a similar parameter. The Renyi VAE has an additional Renyi regularizer-like term with a conditional distribution that is not learned. The term is evaluated essentially analytically using a Singular Value Decomposition method.
    Can a Transformer Represent a Kalman Filter?. (arXiv:2312.06937v1 [cs.LG])
    Transformers are a class of autoregressive deep learning architectures which have recently achieved state-of-the-art performance in various vision, language, and robotics tasks. We revisit the problem of Kalman Filtering in linear dynamical systems and show that Transformers can approximate the Kalman Filter in a strong sense. Specifically, for any observable LTI system we construct an explicit causally-masked Transformer which implements the Kalman Filter, up to a small additive error which is bounded uniformly in time; we call our construction the Transformer Filter. Our construction is based on a two-step reduction. We first show that a softmax self-attention block can exactly represent a certain Gaussian kernel smoothing estimator. We then show that this estimator closely approximates the Kalman Filter. We also investigate how the Transformer Filter can be used for measurement-feedback control and prove that the resulting nonlinear controllers closely approximate the performance of standard optimal control policies such as the LQG controller.
    Remote Sensing Vision-Language Foundation Models without Annotations via Ground Remote Alignment. (arXiv:2312.06960v1 [cs.CV])
    We introduce a method to train vision-language models for remote-sensing images without using any textual annotations. Our key insight is to use co-located internet imagery taken on the ground as an intermediary for connecting remote-sensing images and language. Specifically, we train an image encoder for remote sensing images to align with the image encoder of CLIP using a large amount of paired internet and satellite images. Our unsupervised approach enables the training of a first-of-its-kind large-scale vision language model (VLM) for remote sensing images at two different resolutions. We show that these VLMs enable zero-shot, open-vocabulary image classification, retrieval, segmentation and visual question answering for satellite images. On each of these tasks, our VLM trained without textual annotations outperforms existing VLMs trained with supervision, with gains of up to 20% for classification and 80% for segmentation.
    From Complexity to Clarity: Analytical Expressions of Deep Neural Network Weights via Clifford's Geometric Algebra and Convexity. (arXiv:2309.16512v2 [cs.LG] UPDATED)
    In this paper, we introduce a novel analysis of neural networks based on geometric (Clifford) algebra and convex optimization. We show that optimal weights of deep ReLU neural networks are given by the wedge product of training samples when trained with standard regularized loss. Furthermore, the training problem reduces to convex optimization over wedge product features, which encode the geometric structure of the training dataset. This structure is given in terms of signed volumes of triangles and parallelotopes generated by data vectors. The convex problem finds a small subset of samples via $\ell_1$ regularization to discover only relevant wedge product features. Our analysis provides a novel perspective on the inner workings of deep neural networks and sheds light on the role of the hidden layers.
    Generating High-Resolution Regional Precipitation Using Conditional Diffusion Model. (arXiv:2312.07112v1 [cs.LG])
    Climate downscaling is a crucial technique within climate research, serving to project low-resolution (LR) climate data to higher resolutions (HR). Previous research has demonstrated the effectiveness of deep learning for downscaling tasks. However, most deep learning models for climate downscaling may not perform optimally for high scaling factors (i.e., 4x, 8x) due to their limited ability to capture the intricate details required for generating HR climate data. Furthermore, climate data behaves differently from image data, necessitating a nuanced approach when employing deep generative models. In response to these challenges, this paper presents a deep generative model for downscaling climate data, specifically precipitation on a regional scale. We employ a denoising diffusion probabilistic model (DDPM) conditioned on multiple LR climate variables. The proposed model is evaluated using precipitation data from the Community Earth System Model (CESM) v1.2.2 simulation. Our results demonstrate significant improvements over existing baselines, underscoring the effectiveness of the conditional diffusion model in downscaling climate data.
    LoRA-Enhanced Distillation on Guided Diffusion Models. (arXiv:2312.06899v1 [cs.CV])
    Diffusion models, such as Stable Diffusion (SD), offer the ability to generate high-resolution images with diverse features, but they come at a significant computational and memory cost. In classifier-free guided diffusion models, prolonged inference times are attributed to the necessity of computing two separate diffusion models at each denoising step. Recent work has shown promise in improving inference time through distillation techniques, teaching the model to perform similar denoising steps with reduced computations. However, the application of distillation introduces additional memory overhead to these already resource-intensive diffusion models, making it less practical. To address these challenges, our research explores a novel approach that combines Low-Rank Adaptation (LoRA) with model distillation to efficiently compress diffusion models. This approach not only reduces inference time but also mitigates memory overhead, and notably decreases memory consumption even before applying distillation. The results are remarkable, featuring a significant reduction in inference time due to the distillation process and a substantial 50% reduction in memory consumption. Our examination of the generated images underscores that the incorporation of LoRA-enhanced distillation maintains image quality and alignment with the provided prompts. In summary, while conventional distillation tends to increase memory consumption, LoRA-enhanced distillation offers optimization without any trade-offs or compromises in quality.
    On Classification-Calibration of Gamma-Phi Losses. (arXiv:2302.07321v2 [stat.ML] UPDATED)
    Gamma-Phi losses constitute a family of multiclass classification loss functions that generalize the logistic and other common losses, and have found application in the boosting literature. We establish the first general sufficient condition for the classification-calibration (CC) of such losses. To our knowledge, this sufficient condition gives the first family of nonconvex multiclass surrogate losses for which CC has been fully justified. In addition, we show that a previously proposed sufficient condition is in fact not sufficient. This contribution highlights a technical issue that is important in the study of multiclass CC but has been neglected in prior work.
    Faster Stochastic Variance Reduction Methods for Compositional MiniMax Optimization. (arXiv:2308.09604v2 [cs.LG] UPDATED)
    This paper delves into the realm of stochastic optimization for compositional minimax optimization - a pivotal challenge across various machine learning domains, including deep AUC and reinforcement learning policy evaluation. Despite its significance, the problem of compositional minimax optimization is still under-explored. Adding to the complexity, current methods of compositional minimax optimization are plagued by sub-optimal complexities or heavy reliance on sizable batch sizes. To respond to these constraints, this paper introduces a novel method, called Nested STOchastic Recursive Momentum (NSTORM), which can achieve the optimal sample complexity of $O(\kappa^3 /\epsilon^3 )$ to obtain the $\epsilon$-accuracy solution. We also demonstrate that NSTORM can achieve the same sample complexity under the Polyak-\L ojasiewicz (PL)-condition - an insightful extension of its capabilities. Yet, NSTORM encounters an issue with its requirement for low learning rates, potentially constraining its real-world applicability in machine learning. To overcome this hurdle, we present ADAptive NSTORM (ADA-NSTORM) with adaptive learning rates. We demonstrate that ADA-NSTORM can achieve the same sample complexity but the experimental results show its more effectiveness. All the proposed complexities indicate that our proposed methods can match lower bounds to existing minimax optimizations, without requiring a large batch size in each iteration. Extensive experiments support the efficiency of our proposed methods.
    Toward Robustness in Multi-label Classification: A Data Augmentation Strategy against Imbalance and Noise. (arXiv:2312.07087v1 [cs.LG])
    Multi-label classification poses challenges due to imbalanced and noisy labels in training data. We propose a unified data augmentation method, named BalanceMix, to address these challenges. Our approach includes two samplers for imbalanced labels, generating minority-augmented instances with high diversity. It also refines multi-labels at the label-wise granularity, categorizing noisy labels as clean, re-labeled, or ambiguous for robust optimization. Extensive experiments on three benchmark datasets demonstrate that BalanceMix outperforms existing state-of-the-art methods. We release the code at https://github.com/DISL-Lab/BalanceMix.
    General Tail Bounds for Non-Smooth Stochastic Mirror Descent. (arXiv:2312.07142v1 [cs.LG])
    In this paper, we provide novel tail bounds on the optimization error of Stochastic Mirror Descent for convex and Lipschitz objectives. Our analysis extends the existing tail bounds from the classical light-tailed Sub-Gaussian noise case to heavier-tailed noise regimes. We study the optimization error of the last iterate as well as the average of the iterates. We instantiate our results in two important cases: a class of noise with exponential tails and one with polynomial tails. A remarkable feature of our results is that they do not require an upper bound on the diameter of the domain. Finally, we support our theory with illustrative experiments that compare the behavior of the average of the iterates with that of the last iterate in heavy-tailed noise regimes.
    Improving Offline-to-Online Reinforcement Learning with Q-Ensembles. (arXiv:2306.06871v3 [cs.LG] UPDATED)
    Offline reinforcement learning (RL) is a learning paradigm where an agent learns from a fixed dataset of experience. However, learning solely from a static dataset can limit the performance due to the lack of exploration. To overcome it, offline-to-online RL combines offline pre-training with online fine-tuning, which enables the agent to further refine its policy by interacting with the environment in real-time. Despite its benefits, existing offline-to-online RL methods suffer from performance degradation and slow improvement during the online phase. To tackle these challenges, we propose a novel framework called Ensemble-based Offline-to-Online (E2O) RL. By increasing the number of Q-networks, we seamlessly bridge offline pre-training and online fine-tuning without degrading performance. Moreover, to expedite online performance enhancement, we appropriately loosen the pessimism of Q-value estimation and incorporate ensemble-based exploration mechanisms into our framework. Experimental results demonstrate that E2O can substantially improve the training stability, learning efficiency, and final performance of existing offline RL methods during online fine-tuning on a range of locomotion and navigation tasks, significantly outperforming existing offline-to-online RL methods.
    Dynamic Adversarial Attacks on Autonomous Driving Systems. (arXiv:2312.06701v1 [cs.RO])
    This paper introduces an attacking mechanism to challenge the resilience of autonomous driving systems. Specifically, we manipulate the decision-making processes of an autonomous vehicle by dynamically displaying adversarial patches on a screen mounted on another moving vehicle. These patches are optimized to deceive the object detection models into misclassifying targeted objects, e.g., traffic signs. Such manipulation has significant implications for critical multi-vehicle interactions such as intersection crossing and lane changing, which are vital for safe and efficient autonomous driving systems. Particularly, we make four major contributions. First, we introduce a novel adversarial attack approach where the patch is not co-located with its target, enabling more versatile and stealthy attacks. Moreover, our method utilizes dynamic patches displayed on a screen, allowing for adaptive changes and movement, enhancing the flexibility and performance of the attack. To do so, we design a Screen Image Transformation Network (SIT-Net), which simulates environmental effects on the displayed images, narrowing the gap between simulated and real-world scenarios. Further, we integrate a positional loss term into the adversarial training process to increase the success rate of the dynamic attack. Finally, we shift the focus from merely attacking perceptual systems to influencing the decision-making algorithms of self-driving systems. Our experiments demonstrate the first successful implementation of such dynamic adversarial attacks in real-world autonomous driving scenarios, paving the way for advancements in the field of robust and secure autonomous driving.
    Forecasting Intraday Power Output by a Set of PV Systems using Recurrent Neural Networks and Physical Covariates. (arXiv:2303.08459v2 [cs.LG] UPDATED)
    Accurate intraday forecasts of the power output by PhotoVoltaic (PV) systems are critical to improve the operation of energy distribution grids. We describe a neural autoregressive model which aims at performing such intraday forecasts. We build upon a physical, deterministic PV performance model, the output of which being used as covariates in the context of the neural model. In addition, our application data relates to a geographically distributed set of PV systems. We address all PV sites with a single neural model, which embeds the information about the PV site in specific covariates. We use a scale-free approach which does rely on explicit modelling of seasonal effects. Our proposal repurposes a model initially used in the retail sector, and discloses a novel truncated Gaussian output distribution. An ablation study and a comparison to alternative architectures from the literature shows that the components in the best performing proposed model variant work synergistically to reach a skill score of 15.72% with respect to the physical model, used as a baseline.
    An Association Test Based on Kernel-Based Neural Networks for Complex Genetic Association Analysis. (arXiv:2312.06669v1 [q-bio.QM])
    The advent of artificial intelligence, especially the progress of deep neural networks, is expected to revolutionize genetic research and offer unprecedented potential to decode the complex relationships between genetic variants and disease phenotypes, which could mark a significant step toward improving our understanding of the disease etiology. While deep neural networks hold great promise for genetic association analysis, limited research has been focused on developing neural-network-based tests to dissect complex genotype-phenotype associations. This complexity arises from the opaque nature of neural networks and the absence of defined limiting distributions. We have previously developed a kernel-based neural network model (KNN) that synergizes the strengths of linear mixed models with conventional neural networks. KNN adopts a computationally efficient minimum norm quadratic unbiased estimator (MINQUE) algorithm and uses KNN structure to capture the complex relationship between large-scale sequencing data and a disease phenotype of interest. In the KNN framework, we introduce a MINQUE-based test to assess the joint association of genetic variants with the phenotype, which considers non-linear and non-additive effects and follows a mixture of chi-square distributions. We also construct two additional tests to evaluate and interpret linear and non-linear/non-additive genetic effects, including interaction effects. Our simulations show that our method consistently controls the type I error rate under various conditions and achieves greater power than a commonly used sequence kernel association test (SKAT), especially when involving non-linear and interaction effects. When applied to real data from the UK Biobank, our approach identified genes associated with hippocampal volume, which can be further replicated and evaluated for their role in the pathogenesis of Alzheimer's disease.
    Perceiving University Student's Opinions from Google App Reviews. (arXiv:2312.06705v1 [cs.CL])
    Google app market captures the school of thought of users from every corner of the globe via ratings and text reviews, in a multilinguistic arena. The potential information from the reviews cannot be extracted manually, due to its exponential growth. So, Sentiment analysis, by machine learning and deep learning algorithms employing NLP, explicitly uncovers and interprets the emotions. This study performs the sentiment classification of the app reviews and identifies the university student's behavior towards the app market via exploratory analysis. We applied machine learning algorithms using the TP, TF, and TF IDF text representation scheme and evaluated its performance on Bagging, an ensemble learning method. We used word embedding, Glove, on the deep learning paradigms. Our model was trained on Google app reviews and tested on Student's App Reviews(SAR). The various combinations of these algorithms were compared amongst each other using F score and accuracy and inferences were highlighted graphically. SVM, amongst other classifiers, gave fruitful accuracy(93.41%), F score(89%) on bigram and TF IDF scheme. Bagging enhanced the performance of LR and NB with accuracy of 87.88% and 86.69% and F score of 86% and 78% respectively. Overall, LSTM on Glove embedding recorded the highest accuracy(95.2%) and F score(88%).
    Class-Prototype Conditional Diffusion Model for Continual Learning with Generative Replay. (arXiv:2312.06710v1 [cs.LG])
    Mitigating catastrophic forgetting is a key hurdle in continual learning. Deep Generative Replay (GR) provides techniques focused on generating samples from prior tasks to enhance the model's memory capabilities. With the progression in generative AI, generative models have advanced from Generative Adversarial Networks (GANs) to the more recent Diffusion Models (DMs). A major issue is the deterioration in the quality of generated data compared to the original, as the generator continuously self-learns from its outputs. This degradation can lead to the potential risk of catastrophic forgetting occurring in the classifier. To address this, we propose the Class-Prototype Conditional Diffusion Model (CPDM), a GR-based approach for continual learning that enhances image quality in generators and thus reduces catastrophic forgetting in classifiers. The cornerstone of CPDM is a learnable class-prototype that captures the core characteristics of images in a given class. This prototype, integrated into the diffusion model's denoising process, ensures the generation of high-quality images. It maintains its effectiveness for old tasks even when new tasks are introduced, preserving image generation quality and reducing the risk of catastrophic forgetting in classifiers. Our empirical studies on diverse datasets demonstrate that our proposed method significantly outperforms existing state-of-the-art models, highlighting its exceptional ability to preserve image quality and enhance the model's memory retention.
    Contextual Bandits with Online Neural Regression. (arXiv:2312.07145v1 [cs.LG])
    Recent works have shown a reduction from contextual bandits to online regression under a realizability assumption [Foster and Rakhlin, 2020, Foster and Krishnamurthy, 2021]. In this work, we investigate the use of neural networks for such online regression and associated Neural Contextual Bandits (NeuCBs). Using existing results for wide networks, one can readily show a ${\mathcal{O}}(\sqrt{T})$ regret for online regression with square loss, which via the reduction implies a ${\mathcal{O}}(\sqrt{K} T^{3/4})$ regret for NeuCBs. Departing from this standard approach, we first show a $\mathcal{O}(\log T)$ regret for online regression with almost convex losses that satisfy QG (Quadratic Growth) condition, a generalization of the PL (Polyak-\L ojasiewicz) condition, and that have a unique minima. Although not directly applicable to wide networks since they do not have unique minima, we show that adding a suitable small random perturbation to the network predictions surprisingly makes the loss satisfy QG with unique minima. Based on such a perturbed prediction, we show a ${\mathcal{O}}(\log T)$ regret for online regression with both squared loss and KL loss, and subsequently convert these respectively to $\tilde{\mathcal{O}}(\sqrt{KT})$ and $\tilde{\mathcal{O}}(\sqrt{KL^*} + K)$ regret for NeuCB, where $L^*$ is the loss of the best policy. Separately, we also show that existing regret bounds for NeuCBs are $\Omega(T)$ or assume i.i.d. contexts, unlike this work. Finally, our experimental results on various datasets demonstrate that our algorithms, especially the one based on KL loss, persistently outperform existing algorithms.
    Intelligent Virtual Assistants with LLM-based Process Automation. (arXiv:2312.06677v1 [cs.LG])
    While intelligent virtual assistants like Siri, Alexa, and Google Assistant have become ubiquitous in modern life, they still face limitations in their ability to follow multi-step instructions and accomplish complex goals articulated in natural language. However, recent breakthroughs in large language models (LLMs) show promise for overcoming existing barriers by enhancing natural language processing and reasoning capabilities. Though promising, applying LLMs to create more advanced virtual assistants still faces challenges like ensuring robust performance and handling variability in real-world user commands. This paper proposes a novel LLM-based virtual assistant that can automatically perform multi-step operations within mobile apps based on high-level user requests. The system represents an advance in assistants by providing an end-to-end solution for parsing instructions, reasoning about goals, and executing actions. LLM-based Process Automation (LLMPA) has modules for decomposing instructions, generating descriptions, detecting interface elements, predicting next actions, and error checking. Experiments demonstrate the system completing complex mobile operation tasks in Alipay based on natural language instructions. This showcases how large language models can enable automated assistants to accomplish real-world tasks. The main contributions are the novel LLMPA architecture optimized for app process automation, the methodology for applying LLMs to mobile apps, and demonstrations of multi-step task completion in a real-world environment. Notably, this work represents the first real-world deployment and extensive evaluation of a large language model-based virtual assistant in a widely used mobile application with an enormous user base numbering in the hundreds of millions.
    Honeybee: Locality-enhanced Projector for Multimodal LLM. (arXiv:2312.06742v1 [cs.CV])
    In Multimodal Large Language Models (MLLMs), a visual projector plays a crucial role in bridging pre-trained vision encoders with LLMs, enabling profound visual understanding while harnessing the LLMs' robust capabilities. Despite the importance of the visual projector, it has been relatively less explored. In this study, we first identify two essential projector properties: (i) flexibility in managing the number of visual tokens, crucial for MLLMs' overall efficiency, and (ii) preservation of local context from visual features, vital for spatial understanding. Based on these findings, we propose a novel projector design that is both flexible and locality-enhanced, effectively satisfying the two desirable properties. Additionally, we present comprehensive strategies to effectively utilize multiple and multifaceted instruction datasets. Through extensive experiments, we examine the impact of individual design choices. Finally, our proposed MLLM, Honeybee, remarkably outperforms previous state-of-the-art methods across various benchmarks, including MME, MMBench, SEED-Bench, and LLaVA-Bench, achieving significantly higher efficiency. Code and models are available at https://github.com/kakaobrain/honeybee.
    SplitOut: Out-of-the-Box Training-Hijacking Detection in Split Learning via Outlier Detection. (arXiv:2302.08618v2 [cs.LG] UPDATED)
    Split learning enables efficient and privacy-aware training of a deep neural network by splitting a neural network so that the clients (data holders) compute the first layers and only share the intermediate output with the central compute-heavy server. This paradigm introduces a new attack medium in which the server has full control over what the client models learn, which has already been exploited to infer the private data of clients and to implement backdoors in the client models. Although previous work has shown that clients can successfully detect such training-hijacking attacks, the proposed methods rely on heuristics, require tuning of many hyperparameters, and do not fully utilize the clients' capabilities. In this work, we show that given modest assumptions regarding the clients' compute capabilities, an out-of-the-box outlier detection method can be used to detect existing training-hijacking attacks with almost-zero false positive rates. We conclude through experiments on different tasks that the simplicity of our approach we name SplitOut makes it a more viable and reliable alternative compared to the earlier detection methods.
    Ensemble flow reconstruction in the atmospheric boundary layer from spatially limited measurements through latent diffusion models. (arXiv:2303.00836v2 [physics.ao-ph] UPDATED)
    Due to costs and practical constraints, field campaigns in the atmospheric boundary layer typically only measure a fraction of the atmospheric volume of interest. Machine learning techniques have previously successfully reconstructed unobserved regions of flow in canonical fluid mechanics problems and two-dimensional geophysical flows, but these techniques have not yet been demonstrated in the three-dimensional atmospheric boundary layer. Here, we conduct a numerical analogue of a field campaign with spatially limited measurements using large-eddy simulation. We pose flow reconstruction as an inpainting problem, and reconstruct realistic samples of turbulent, three-dimensional flow with the use of a latent diffusion model. The diffusion model generates physically plausible turbulent structures on larger spatial scales, even when input observations cover less than 1% of the volume. Through a combination of qualitative visualization and quantitative assessment, we demonstrate that the diffusion model generates meaningfully diverse samples when conditioned on just one observation. These samples successfully serve as initial conditions for a large-eddy simulation code. We find that diffusion models show promise and potential for other applications for other turbulent flow reconstruction problems.
    Quantifying disparities in intimate partner violence: a machine learning method to correct for underreporting. (arXiv:2110.04133v4 [cs.CY] UPDATED)
    Estimating the prevalence of a medical condition, or the proportion of the population in which it occurs, is a fundamental problem in healthcare and public health. Accurate estimates of the relative prevalence across groups -- capturing, for example, that a condition affects women more frequently than men -- facilitate effective and equitable health policy which prioritizes groups who are disproportionately affected by a condition. However, it is difficult to estimate relative prevalence when a medical condition is underreported. In this work, we provide a method for accurately estimating the relative prevalence of underreported medical conditions, building upon the positive unlabeled learning framework. We show that under the commonly made covariate shift assumption -- i.e., that the probability of having a disease conditional on symptoms remains constant across groups -- we can recover the relative prevalence, even without restrictive assumptions commonly made in positive unlabeled learning and even if it is impossible to recover the absolute prevalence. We conduct experiments on synthetic and real health data which demonstrate our method's ability to recover the relative prevalence more accurately than do baselines, and demonstrate the method's robustness to plausible violations of the covariate shift assumption. We conclude by illustrating the applicability of our method to case studies of intimate partner violence and hate speech.
    SPRING: Studying the Paper and Reasoning to Play Games. (arXiv:2305.15486v3 [cs.AI] UPDATED)
    Open-world survival games pose significant challenges for AI algorithms due to their multi-tasking, deep exploration, and goal prioritization requirements. Despite reinforcement learning (RL) being popular for solving games, its high sample complexity limits its effectiveness in complex open-world games like Crafter or Minecraft. We propose a novel approach, SPRING, to read the game's original academic paper and use the knowledge learned to reason and play the game through a large language model (LLM). Prompted with the LaTeX source as game context and a description of the agent's current observation, our SPRING framework employs a directed acyclic graph (DAG) with game-related questions as nodes and dependencies as edges. We identify the optimal action to take in the environment by traversing the DAG and calculating LLM responses for each node in topological order, with the LLM's answer to final node directly translating to environment actions. In our experiments, we study the quality of in-context "reasoning" induced by different forms of prompts under the setting of the Crafter open-world environment. Our experiments suggest that LLMs, when prompted with consistent chain-of-thought, have great potential in completing sophisticated high-level trajectories. Quantitatively, SPRING with GPT-4 outperforms all state-of-the-art RL baselines, trained for 1M steps, without any training. Finally, we show the potential of games as a test bed for LLMs.
    FP8-BERT: Post-Training Quantization for Transformer. (arXiv:2312.05725v2 [cs.AI] UPDATED)
    Transformer-based models, such as BERT, have been widely applied in a wide range of natural language processing tasks. However, one inevitable side effect is that they require massive memory storage and inference cost when deployed in production. Quantization is one of the popularized ways to alleviate the cost. However, the previous 8-bit quantization strategy based on INT8 data format either suffers from the degradation of accuracy in a Post-Training Quantization (PTQ) fashion or requires an expensive Quantization-Aware Training (QAT) process. Recently, a new numeric format FP8 (i.e. floating-point of 8-bits) has been proposed and supported in commercial AI computing platforms such as H100. In this paper, we empirically validate the effectiveness of FP8 as a way to do Post-Training Quantization without significant loss of accuracy, with a simple calibration and format conversion process. We adopt the FP8 standard proposed by NVIDIA Corp. (2022) in our extensive experiments of BERT variants on GLUE and SQuAD v1.1 datasets, and show that PTQ with FP8 can significantly improve the accuracy upon that with INT8, to the extent of the full-precision model.
    Experimental Investigation of Machine Learning based Soft-Failure Management using the Optical Spectrum. (arXiv:2312.07208v1 [cs.NI])
    The demand for high-speed data is exponentially growing. To conquer this, optical networks underwent significant changes getting more complex and versatile. The increasing complexity necessitates the fault management to be more adaptive to enhance network assurance. In this paper, we experimentally compare the performance of soft-failure management of different machine learning algorithms. We further introduce a machine-learning based soft-failure management framework. It utilizes a variational autoencoder based generative adversarial network (VAE-GAN) running on optical spectral data obtained by optical spectrum analyzers. The framework is able to reliably run on a fraction of available training data as well as identifying unknown failure types. The investigations show, that the VAE-GAN outperforms the other machine learning algorithms when up to 10\% of the total training data is available in identification tasks. Furthermore, the advanced training mechanism for the GAN shows a high F1-score for unknown spectrum identification. The failure localization comparison shows the advantage of a low complexity neural network in combination with a VAE over established machine learning algorithms.
    BIRB: A Generalization Benchmark for Information Retrieval in Bioacoustics. (arXiv:2312.07439v1 [cs.LG])
    The ability for a machine learning model to cope with differences in training and deployment conditions--e.g. in the presence of distribution shift or the generalization to new classes altogether--is crucial for real-world use cases. However, most empirical work in this area has focused on the image domain with artificial benchmarks constructed to measure individual aspects of generalization. We present BIRB, a complex benchmark centered on the retrieval of bird vocalizations from passively-recorded datasets given focal recordings from a large citizen science corpus available for training. We propose a baseline system for this collection of tasks using representation learning and a nearest-centroid search. Our thorough empirical evaluation and analysis surfaces open research directions, suggesting that BIRB fills the need for a more realistic and complex benchmark to drive progress on robustness to distribution shifts and generalization of ML models.
    Disentanglement Learning via Topology. (arXiv:2308.12696v2 [cs.LG] UPDATED)
    We propose TopDis (Topological Disentanglement), a method for learning disentangled representations via adding multi-scale topological loss term. Disentanglement is a crucial property of data representations substantial for the explainability and robustness of deep learning models and a step towards high-level cognition. The state-of-the-art method based on VAE minimizes the total correlation of the joint distribution of latent variables. We take a different perspective on disentanglement by analyzing topological properties of data manifolds. In particular, we optimize the topological similarity for data manifolds traversals. To the best of our knowledge, our paper is the first one to propose a differentiable topological loss for disentanglement. Our experiments have shown that the proposed topological loss improves disentanglement scores such as MIG, FactorVAE score, SAP score and DCI disentanglement score with respect to state-of-the-art results. Our method works in an unsupervised manner, permitting to apply it for problems without labeled factors of variation. Additionally, we show how to use the proposed topological loss to find disentangled directions in a trained GAN.
    Evolving Reservoirs for Meta Reinforcement Learning. (arXiv:2312.06695v1 [cs.LG])
    Animals often demonstrate a remarkable ability to adapt to their environments during their lifetime. They do so partly due to the evolution of morphological and neural structures. These structures capture features of environments shared between generations to bias and speed up lifetime learning. In this work, we propose a computational model for studying a mechanism that can enable such a process. We adopt a computational framework based on meta reinforcement learning as a model of the interplay between evolution and development. At the evolutionary scale, we evolve reservoirs, a family of recurrent neural networks that differ from conventional networks in that one optimizes not the weight values but hyperparameters of the architecture: the later control macro-level properties, such as memory and dynamics. At the developmental scale, we employ these evolved reservoirs to facilitate the learning of a behavioral policy through Reinforcement Learning (RL). Within an RL agent, a reservoir encodes the environment state before providing it to an action policy. We evaluate our approach on several 2D and 3D simulated environments. Our results show that the evolution of reservoirs can improve the learning of diverse challenging tasks. We study in particular three hypotheses: the use of an architecture combining reservoirs and reinforcement learning could enable (1) solving tasks with partial observability, (2) generating oscillatory dynamics that facilitate the learning of locomotion tasks, and (3) facilitating the generalization of learned behaviors to new tasks unknown during the evolution phase.
    Segment Anything Model for Medical Images?. (arXiv:2304.14660v5 [eess.IV] UPDATED)
    The Segment Anything Model (SAM) is the first foundation model for general image segmentation. It has achieved impressive results on various natural image segmentation tasks. However, medical image segmentation (MIS) is more challenging because of the complex modalities, fine anatomical structures, uncertain and complex object boundaries, and wide-range object scales. To fully validate SAM's performance on medical data, we collected and sorted 53 open-source datasets and built a large medical segmentation dataset with 18 modalities, 84 objects, 125 object-modality paired targets, 1050K 2D images, and 6033K masks. We comprehensively analyzed different models and strategies on the so-called COSMOS 1050K dataset. Our findings mainly include the following: 1) SAM showed remarkable performance in some specific objects but was unstable, imperfect, or even totally failed in other situations. 2) SAM with the large ViT-H showed better overall performance than that with the small ViT-B. 3) SAM performed better with manual hints, especially box, than the Everything mode. 4) SAM could help human annotation with high labeling quality and less time. 5) SAM was sensitive to the randomness in the center point and tight box prompts, and may suffer from a serious performance drop. 6) SAM performed better than interactive methods with one or a few points, but will be outpaced as the number of points increases. 7) SAM's performance correlated to different factors, including boundary complexity, intensity differences, etc. 8) Finetuning the SAM on specific medical tasks could improve its average DICE performance by 4.39% and 6.68% for ViT-B and ViT-H, respectively. We hope that this comprehensive report can help researchers explore the potential of SAM applications in MIS, and guide how to appropriately use and develop SAM.
    Protein Design with Guided Discrete Diffusion. (arXiv:2305.20009v2 [cs.LG] UPDATED)
    A popular approach to protein design is to combine a generative model with a discriminative model for conditional sampling. The generative model samples plausible sequences while the discriminative model guides a search for sequences with high fitness. Given its broad success in conditional sampling, classifier-guided diffusion modeling is a promising foundation for protein design, leading many to develop guided diffusion models for structure with inverse folding to recover sequences. In this work, we propose diffusioN Optimized Sampling (NOS), a guidance method for discrete diffusion models that follows gradients in the hidden states of the denoising network. NOS makes it possible to perform design directly in sequence space, circumventing significant limitations of structure-based methods, including scarce data and challenging inverse design. Moreover, we use NOS to generalize LaMBO, a Bayesian optimization procedure for sequence design that facilitates multiple objectives and edit-based constraints. The resulting method, LaMBO-2, enables discrete diffusions and stronger performance with limited edits through a novel application of saliency maps. We apply LaMBO-2 to a real-world protein design task, optimizing antibodies for higher expression yield and binding affinity to several therapeutic targets under locality and developability constraints, attaining a 99% expression rate and 40% binding rate in exploratory in vitro experiments.
    Can LLM-Generated Misinformation Be Detected?. (arXiv:2309.13788v2 [cs.CL] UPDATED)
    The advent of Large Language Models (LLMs) has made a transformative impact. However, the potential that LLMs such as ChatGPT can be exploited to generate misinformation has posed a serious concern to online safety and public trust. A fundamental research question is: will LLM-generated misinformation cause more harm than human-written misinformation? We propose to tackle this question from the perspective of detection difficulty. We first build a taxonomy of LLM-generated misinformation. Then we categorize and validate the potential real-world methods for generating misinformation with LLMs. Then, through extensive empirical investigation, we discover that LLM-generated misinformation can be harder to detect for humans and detectors compared to human-written misinformation with the same semantics, which suggests it can have more deceptive styles and potentially cause more harm. We also discuss the implications of our discovery on combating misinformation in the age of LLMs and the countermeasures.
    AI capabilities can be significantly improved without expensive retraining. (arXiv:2312.07413v1 [cs.AI])
    State-of-the-art AI systems can be significantly improved without expensive retraining via "post-training enhancements"-techniques applied after initial training like fine-tuning the system to use a web browser. We review recent post-training enhancements, categorizing them into five types: tool-use, prompting methods, scaffolding, solution selection, and data generation. Different enhancements improve performance on different tasks, making it hard to compare their significance. So we translate improvements from different enhancements into a common currency, the compute-equivalent gain: how much additional training compute would be needed to improve performance by the same amount as the enhancement. Our non-experimental work shows that post-training enhancements have significant benefits: most surveyed enhancements improve benchmark performance by more than a 5x increase in training compute, some by more than 20x. Post-training enhancements are relatively cheap to develop: fine-tuning costs are typically <1% of the original training cost. Governing the development of capable post-training enhancements may be challenging because frontier models could be enhanced by a wide range of actors.
    Detect, Retrieve, Comprehend: A Flexible Framework for Zero-Shot Document-Level Question Answering. (arXiv:2210.01959v3 [cs.CL] UPDATED)
    Researchers produce thousands of scholarly documents containing valuable technical knowledge. The community faces the laborious task of reading these documents to identify, extract, and synthesize information. To automate information gathering, document-level question answering (QA) offers a flexible framework where human-posed questions can be adapted to extract diverse knowledge. Finetuning QA systems requires access to labeled data (tuples of context, question and answer). However, data curation for document QA is uniquely challenging because the context (i.e. answer evidence passage) needs to be retrieved from potentially long, ill-formatted documents. Existing QA datasets sidestep this challenge by providing short, well-defined contexts that are unrealistic in real-world applications. We present a three-stage document QA approach: (1) text extraction from PDF; (2) evidence retrieval from extracted texts to form well-posed contexts; (3) QA to extract knowledge from contexts to return high-quality answers -- extractive, abstractive, or Boolean. Using QASPER for evaluation, our detect-retrieve-comprehend (DRC) system achieves a +7.19 improvement in Answer-F1 over existing baselines while delivering superior context selection. Our results demonstrate that DRC holds tremendous promise as a flexible framework for practical scientific document QA.
    DIFFender: Diffusion-Based Adversarial Defense against Patch Attacks. (arXiv:2306.09124v3 [cs.CV] UPDATED)
    Adversarial attacks, particularly patch attacks, pose significant threats to the robustness and reliability of deep learning models. Developing reliable defenses against patch attacks is crucial for real-world applications, yet current research in this area is unsatisfactory. In this paper, we propose DIFFender, a novel defense method that leverages a text-guided diffusion model to defend against adversarial patches. DIFFender includes two main stages: patch localization and patch restoration. In the localization stage, we find and exploit an intriguing property of the diffusion model to precisely identify the locations of adversarial patches. In the restoration stage, we employ the diffusion model to reconstruct the adversarial regions in the images while preserving the integrity of the visual content. Thanks to the former finding, these two stages can be simultaneously guided by a unified diffusion model. Thus, we can utilize the close interaction between them to improve the whole defense performance. Moreover, we propose a few-shot prompt-tuning algorithm to fine-tune the diffusion model, enabling the pre-trained diffusion model to adapt to the defense task easily. We conduct extensive experiments on image classification, face recognition, and further in the physical world, demonstrating that our proposed method exhibits superior robustness under strong adaptive attacks and generalizes well across various scenarios, diverse classifiers, and multiple patch attack methods.
    Facial Emotion Recognition in VR Games. (arXiv:2312.06925v1 [cs.HC])
    Emotion detection is a crucial component of Games User Research (GUR), as it allows game developers to gain insights into players' emotional experiences and tailor their games accordingly. However, detecting emotions in Virtual Reality (VR) games is challenging due to the Head-Mounted Display (HMD) that covers the top part of the player's face, namely, their eyes and eyebrows, which provide crucial information for recognizing the impression. To tackle this we used a Convolutional Neural Network (CNN) to train a model to predict emotions in full-face images where the eyes and eyebrows are covered. We used the FER2013 dataset, which we modified to cover eyes and eyebrows in images. The model in these images can accurately recognize seven different emotions which are anger, happiness, disgust, fear, impartiality, sadness and surprise. We assessed the model's performance by testing it on two VR games and using it to detect players' emotions. We collected self-reported emotion data from the players after the gameplay sessions. We analyzed the data collected from our experiment to understand which emotions players experience during the gameplay. We found that our approach has the potential to enhance gameplay analysis by enabling the detection of players' emotions in VR games, which can help game developers create more engaging and immersive game experiences.
    Multi-Granularity Framework for Unsupervised Representation Learning of Time Series. (arXiv:2312.07248v1 [cs.LG])
    Representation learning plays a critical role in the analysis of time series data and has high practical value across a wide range of applications. including trend analysis, time series data retrieval and forecasting. In practice, data confusion is a significant issue as it can considerably impact the effectiveness and accuracy of data analysis, machine learning models and decision-making processes. In general, previous studies did not consider the variability at various levels of granularity, thus resulting in inadequate information utilization, which further exacerbated the issue of data confusion. This paper proposes an unsupervised framework to realize multi-granularity representation learning for time series. Specifically, we employed a cross-granularity transformer to develop an association between fine- and coarse-grained representations. In addition, we introduced a retrieval task as an unsupervised training task to learn the multi-granularity representation of time series. Moreover, a novel loss function was designed to obtain the comprehensive multi-granularity representation of the time series via unsupervised learning. The experimental results revealed that the proposed framework demonstrates significant advantages over alternative representation learning models.
    Integral Continual Learning Along the Tangent Vector Field of Tasks. (arXiv:2211.13108v3 [cs.LG] UPDATED)
    We propose a lightweight continual learning method which incorporates information from specialized datasets incrementally, by integrating it along the vector field of "generalist" models. The tangent plane to the specialist model acts as a generalist guide and avoids the kind of over-fitting that leads to catastrophic forgetting, while exploiting the convexity of the optimization landscape in the tangent plane. It maintains a small fixed-size memory buffer, as low as 0.4% of the source datasets, which is updated by simple resampling. Our method achieves strong performance across various buffer sizes for different datasets. Specifically, in the class-incremental setting we outperform the existing methods that do not require distillation by an average of 18.77% and 28.48%, for Seq-CIFAR-10 and Seq-TinyImageNet respectively. Our method can easily be used in conjunction with existing replay-based continual learning methods. When memory buffer constraints are relaxed to allow storage of metadata such as logits, we attain an error reduction of 17.84% towards the paragon performance on Seq-CIFAR-10.
    Symptom-based Machine Learning Models for the Early Detection of COVID-19: A Narrative Review. (arXiv:2312.06832v1 [cs.LG])
    Despite the widespread testing protocols for COVID-19, there are still significant challenges in early detection of the disease, which is crucial for preventing its spread and optimizing patient outcomes. Owing to the limited testing capacity in resource-strapped settings and the limitations of the available traditional methods of testing, it has been established that a fast and efficient strategy is important to fully stop the virus. Machine learning models can analyze large datasets, incorporating patient-reported symptoms, clinical data, and medical imaging. Symptom-based detection methods have been developed to predict COVID-19, and they have shown promising results. In this paper, we provide an overview of the landscape of symptoms-only machine learning models for predicting COVID-19, including their performance and limitations. The review will also examine the performance of symptom-based models when compared to image-based models. Because different studies used varying datasets, methodologies, and performance metrics. Selecting the model that performs best relies on the context and objectives of the research. However, based on the results, we observed that ensemble classifier performed exceptionally well in predicting the occurrence of COVID-19 based on patient symptoms with the highest overall accuracy of 97.88%. Gradient Boosting Algorithm achieved an AUC (Area Under the Curve) of 0.90 and identified key features contributing to the decision-making process. Image-based models, as observed in the analyzed studies, have consistently demonstrated higher accuracy than symptom-based models, often reaching impressive levels ranging from 96.09% to as high as 99%.
    High-Cadence Thermospheric Density Estimation enabled by Machine Learning on Solar Imagery. (arXiv:2312.06845v1 [physics.space-ph])
    Accurate estimation of thermospheric density is critical for precise modeling of satellite drag forces in low Earth orbit (LEO). Improving this estimation is crucial to tasks such as state estimation, collision avoidance, and re-entry calculations. The largest source of uncertainty in determining thermospheric density is modeling the effects of space weather driven by solar and geomagnetic activity. Current operational models rely on ground-based proxy indices which imperfectly correlate with the complexity of solar outputs and geomagnetic responses. In this work, we directly incorporate NASA's Solar Dynamics Observatory (SDO) extreme ultraviolet (EUV) spectral images into a neural thermospheric density model to determine whether the predictive performance of the model is increased by using space-based EUV imagery data instead of, or in addition to, the ground-based proxy indices. We demonstrate that EUV imagery can enable predictions with much higher temporal resolution and replace ground-based proxies while significantly increasing performance relative to current operational models. Our method paves the way for assimilating EUV image data into operational thermospheric density forecasting models for use in LEO satellite navigation processes.
    Graph AI in Medicine. (arXiv:2310.13767v2 [cs.LG] UPDATED)
    In clinical artificial intelligence (AI), graph representation learning, mainly through graph neural networks (GNNs), stands out for its capability to capture intricate relationships within structured clinical datasets. With diverse data -- from patient records to imaging -- GNNs process data holistically by viewing modalities as nodes interconnected by their relationships. Graph AI facilitates model transfer across clinical tasks, enabling models to generalize across patient populations without additional parameters or minimal re-training. However, the importance of human-centered design and model interpretability in clinical decision-making cannot be overstated. Since graph AI models capture information through localized neural transformations defined on graph relationships, they offer both an opportunity and a challenge in elucidating model rationale. Knowledge graphs can enhance interpretability by aligning model-driven insights with medical knowledge. Emerging graph models integrate diverse data modalities through pre-training, facilitate interactive feedback loops, and foster human-AI collaboration, paving the way to clinically meaningful predictions.
    PatchMorph: A Stochastic Deep Learning Approach for Unsupervised 3D Brain Image Registration with Small Patches. (arXiv:2312.06958v1 [cs.CV])
    We introduce "PatchMorph," an new stochastic deep learning algorithm tailored for unsupervised 3D brain image registration. Unlike other methods, our method uses compact patches of a constant small size to derive solutions that can combine global transformations with local deformations. This approach minimizes the memory footprint of the GPU during training, but also enables us to operate on numerous amounts of randomly overlapping small patches during inference to mitigate image and patch boundary problems. PatchMorph adeptly handles world coordinate transformations between two input images, accommodating variances in attributes such as spacing, array sizes, and orientations. The spatial resolution of patches transitions from coarse to fine, addressing both global and local attributes essential for aligning the images. Each patch offers a unique perspective, together converging towards a comprehensive solution. Experiments on human T1 MRI brain images and marmoset brain images from serial 2-photon tomography affirm PatchMorph's superior performance.
    Physics Informed Neural Network for Option Pricing. (arXiv:2312.06711v1 [q-fin.PR])
    We apply a physics-informed deep-learning approach the PINN approach to the Black-Scholes equation for pricing American and European options. We test our approach on both simulated as well as real market data, compare it to analytical/numerical benchmarks. Our model is able to accurately capture the price behaviour on simulation data, while also exhibiting reasonable performance for market data. We also experiment with the architecture and learning process of our PINN model to provide more understanding of convergence and stability issues that impact performance.
    Privacy-Aware Energy Consumption Modeling of Connected Battery Electric Vehicles using Federated Learning. (arXiv:2312.07371v1 [cs.LG])
    Battery Electric Vehicles (BEVs) are increasingly significant in modern cities due to their potential to reduce air pollution. Precise and real-time estimation of energy consumption for them is imperative for effective itinerary planning and optimizing vehicle systems, which can reduce driving range anxiety and decrease energy costs. As public awareness of data privacy increases, adopting approaches that safeguard data privacy in the context of BEV energy consumption modeling is crucial. Federated Learning (FL) is a promising solution mitigating the risk of exposing sensitive information to third parties by allowing local data to remain on devices and only sharing model updates with a central server. Our work investigates the potential of using FL methods, such as FedAvg, and FedPer, to improve BEV energy consumption prediction while maintaining user privacy. We conducted experiments using data from 10 BEVs under simulated real-world driving conditions. Our results demonstrate that the FedAvg-LSTM model achieved a reduction of up to 67.84\% in the MAE value of the prediction results. Furthermore, we explored various real-world scenarios and discussed how FL methods can be employed in those cases. Our findings show that FL methods can effectively improve the performance of BEV energy consumption prediction while maintaining user privacy.
    Densify Your Labels: Unsupervised Clustering with Bipartite Matching for Weakly Supervised Point Cloud Segmentation. (arXiv:2312.06799v1 [cs.CV])
    We propose a weakly supervised semantic segmentation method for point clouds that predicts "per-point" labels from just "whole-scene" annotations while achieving the performance of recent fully supervised approaches. Our core idea is to propagate the scene-level labels to each point in the point cloud by creating pseudo labels in a conservative way. Specifically, we over-segment point cloud features via unsupervised clustering and associate scene-level labels with clusters through bipartite matching, thus propagating scene labels only to the most relevant clusters, leaving the rest to be guided solely via unsupervised clustering. We empirically demonstrate that over-segmentation and bipartite assignment plays a crucial role. We evaluate our method on ScanNet and S3DIS datasets, outperforming state of the art, and demonstrate that we can achieve results comparable to fully supervised methods.
    Fast Training of Diffusion Transformer with Extreme Masking for 3D Point Clouds Generation. (arXiv:2312.07231v1 [cs.CV])
    Diffusion Transformers have recently shown remarkable effectiveness in generating high-quality 3D point clouds. However, training voxel-based diffusion models for high-resolution 3D voxels remains prohibitively expensive due to the cubic complexity of attention operators, which arises from the additional dimension of voxels. Motivated by the inherent redundancy of 3D compared to 2D, we propose FastDiT-3D, a novel masked diffusion transformer tailored for efficient 3D point cloud generation, which greatly reduces training costs. Specifically, we draw inspiration from masked autoencoders to dynamically operate the denoising process on masked voxelized point clouds. We also propose a novel voxel-aware masking strategy to adaptively aggregate background/foreground information from voxelized point clouds. Our method achieves state-of-the-art performance with an extreme masking ratio of nearly 99%. Moreover, to improve multi-category 3D generation, we introduce Mixture-of-Expert (MoE) in 3D diffusion model. Each category can learn a distinct diffusion path with different experts, relieving gradient conflict. Experimental results on the ShapeNet dataset demonstrate that our method achieves state-of-the-art high-fidelity and diverse 3D point cloud generation performance. Our FastDiT-3D improves 1-Nearest Neighbor Accuracy and Coverage metrics when generating 128-resolution voxel point clouds, using only 6.5% of the original training cost.
    Efficient Object Detection in Autonomous Driving using Spiking Neural Networks: Performance, Energy Consumption Analysis, and Insights into Open-set Object Discovery. (arXiv:2312.07466v1 [cs.CV])
    Besides performance, efficiency is a key design driver of technologies supporting vehicular perception. Indeed, a well-balanced trade-off between performance and energy consumption is crucial for the sustainability of autonomous vehicles. In this context, the diversity of real-world contexts in which autonomous vehicles can operate motivates the need for empowering perception models with the capability to detect, characterize and identify newly appearing objects by themselves. In this manuscript we elaborate on this threefold conundrum (performance, efficiency and open-world learning) for object detection modeling tasks over image data collected from vehicular scenarios. Specifically, we show that well-performing and efficient models can be realized by virtue of Spiking Neural Networks (SNNs), reaching competitive levels of detection performance when compared to their non-spiking counterparts at dramatic energy consumption savings (up to 85%) and a slightly improved robustness against image noise. Our experiments herein offered also expose qualitatively the complexity of detecting new objects based on the preliminary results of a simple approach to discriminate potential object proposals in the captured image.
    Classifying complex documents: comparing bespoke solutions to large language models. (arXiv:2312.07182v1 [cs.CL])
    Here we search for the best automated classification approach for a set of complex legal documents. Our classification task is not trivial: our aim is to classify ca 30,000 public courthouse records from 12 states and 267 counties at two different levels using nine sub-categories. Specifically, we investigated whether a fine-tuned large language model (LLM) can achieve the accuracy of a bespoke custom-trained model, and what is the amount of fine-tuning necessary.
    Cross-modal Contrastive Learning with Asymmetric Co-attention Network for Video Moment Retrieval. (arXiv:2312.07435v1 [cs.CV])
    Video moment retrieval is a challenging task requiring fine-grained interactions between video and text modalities. Recent work in image-text pretraining has demonstrated that most existing pretrained models suffer from information asymmetry due to the difference in length between visual and textual sequences. We question whether the same problem also exists in the video-text domain with an auxiliary need to preserve both spatial and temporal information. Thus, we evaluate a recently proposed solution involving the addition of an asymmetric co-attention network for video grounding tasks. Additionally, we incorporate momentum contrastive loss for robust, discriminative representation learning in both modalities. We note that the integration of these supplementary modules yields better performance compared to state-of-the-art models on the TACoS dataset and comparable results on ActivityNet Captions, all while utilizing significantly fewer parameters with respect to baseline.
    Non-Stationary Bandits with Auto-Regressive Temporal Dependency. (arXiv:2210.16386v3 [cs.LG] UPDATED)
    Traditional multi-armed bandit (MAB) frameworks, predominantly examined under stochastic or adversarial settings, often overlook the temporal dynamics inherent in many real-world applications such as recommendation systems and online advertising. This paper introduces a novel non-stationary MAB framework that captures the temporal structure of these real-world dynamics through an auto-regressive (AR) reward structure. We propose an algorithm that integrates two key mechanisms: (i) an alternation mechanism adept at leveraging temporal dependencies to dynamically balance exploration and exploitation, and (ii) a restarting mechanism designed to discard out-of-date information. Our algorithm achieves a regret upper bound that nearly matches the lower bound, with regret measured against a robust dynamic benchmark. Finally, via a real-world case study on tourism demand prediction, we demonstrate both the efficacy of our algorithm and the broader applicability of our techniques to more complex, rapidly evolving time series.
    A Novel Differentiable Loss Function for Unsupervised Graph Neural Networks in Graph Partitioning. (arXiv:2312.06877v1 [cs.LG])
    In this paper, we explore the graph partitioning problem, a pivotal combina-torial optimization challenge with extensive applications in various fields such as science, technology, and business. Recognized as an NP-hard prob-lem, graph partitioning lacks polynomial-time algorithms for its resolution. Recently, there has been a burgeoning interest in leveraging machine learn-ing, particularly approaches like supervised, unsupervised, and reinforce-ment learning, to tackle such NP-hard problems. However, these methods face significant hurdles: supervised learning is constrained by the necessity of labeled solution instances, which are often computationally impractical to obtain; reinforcement learning grapples with instability in the learning pro-cess; and unsupervised learning contends with the absence of a differentia-ble loss function, a consequence of the discrete nature of most combinatorial optimization problems. Addressing these challenges, our research introduces a novel pipeline employing an unsupervised graph neural network to solve the graph partitioning problem. The core innovation of this study is the for-mulation of a differentiable loss function tailored for this purpose. We rigor-ously evaluate our methodology against contemporary state-of-the-art tech-niques, focusing on metrics: cuts and balance, and our findings reveal that our is competitive with these leading methods.
    Exploring Novel Object Recognition and Spontaneous Location Recognition Machine Learning Analysis Techniques in Alzheimer's Mice. (arXiv:2312.06914v1 [cs.LG])
    Understanding object recognition patterns in mice is crucial for advancing behavioral neuroscience and has significant implications for human health, particularly in the realm of Alzheimer's research. This study is centered on the development, application, and evaluation of a state-of-the-art computational pipeline designed to analyze such behaviors, specifically focusing on Novel Object Recognition (NOR) and Spontaneous Location Recognition (SLR) tasks. The pipeline integrates three advanced computational models: Any-Maze for initial data collection, DeepLabCut for detailed pose estimation, and Convolutional Neural Networks (CNNs) for nuanced behavioral classification. Employed across four distinct mouse groups, this pipeline demonstrated high levels of accuracy and robustness. Despite certain challenges like video quality limitations and the need for manual calculations, the results affirm the pipeline's efficacy and potential for scalability. The study serves as a proof of concept for a multidimensional computational approach to behavioral neuroscience, emphasizing the pipeline's versatility and readiness for future, more complex analyses.
    Forced Exploration in Bandit Problems. (arXiv:2312.07285v1 [cs.LG])
    The multi-armed bandit(MAB) is a classical sequential decision problem. Most work requires assumptions about the reward distribution (e.g., bounded), while practitioners may have difficulty obtaining information about these distributions to design models for their problems, especially in non-stationary MAB problems. This paper aims to design a multi-armed bandit algorithm that can be implemented without using information about the reward distribution while still achieving substantial regret upper bounds. To this end, we propose a novel algorithm alternating between greedy rule and forced exploration. Our method can be applied to Gaussian, Bernoulli and other subgaussian distributions, and its implementation does not require additional information. We employ a unified analysis method for different forced exploration strategies and provide problem-dependent regret upper bounds for stationary and piecewise-stationary settings. Furthermore, we compare our algorithm with popular bandit algorithms on different reward distributions.
    DeepAccident: A Motion and Accident Prediction Benchmark for V2X Autonomous Driving. (arXiv:2304.01168v4 [cs.CV] UPDATED)
    Safety is the primary priority of autonomous driving. Nevertheless, no published dataset currently supports the direct and explainable safety evaluation for autonomous driving. In this work, we propose DeepAccident, a large-scale dataset generated via a realistic simulator containing diverse accident scenarios that frequently occur in real-world driving. The proposed DeepAccident dataset includes 57K annotated frames and 285K annotated samples, approximately 7 times more than the large-scale nuScenes dataset with 40k annotated samples. In addition, we propose a new task, end-to-end motion and accident prediction, which can be used to directly evaluate the accident prediction ability for different autonomous driving algorithms. Furthermore, for each scenario, we set four vehicles along with one infrastructure to record data, thus providing diverse viewpoints for accident scenarios and enabling V2X (vehicle-to-everything) research on perception and prediction tasks. Finally, we present a baseline V2X model named V2XFormer that demonstrates superior performance for motion and accident prediction and 3D object detection compared to the single-vehicle model.
    Rethinking Gauss-Newton for learning over-parameterized models. (arXiv:2302.02904v3 [cs.LG] UPDATED)
    This work studies the global convergence and implicit bias of Gauss Newton's (GN) when optimizing over-parameterized one-hidden layer networks in the mean-field regime. We first establish a global convergence result for GN in the continuous-time limit exhibiting a faster convergence rate compared to GD due to improved conditioning. We then perform an empirical study on a synthetic regression task to investigate the implicit bias of GN's method. While GN is consistently faster than GD in finding a global optimum, the learned model generalizes well on test data when starting from random initial weights with a small variance and using a small step size to slow down convergence. Specifically, our study shows that such a setting results in a hidden learning phenomenon, where the dynamics are able to recover features with good generalization properties despite the model having sub-optimal training and test performances due to an under-optimized linear layer. This study exhibits a trade-off between the convergence speed of GN and the generalization ability of the learned solution.
    Language-Guided Transformer for Federated Multi-Label Classification. (arXiv:2312.07165v1 [cs.CV])
    Federated Learning (FL) is an emerging paradigm that enables multiple users to collaboratively train a robust model in a privacy-preserving manner without sharing their private data. Most existing approaches of FL only consider traditional single-label image classification, ignoring the impact when transferring the task to multi-label image classification. Nevertheless, it is still challenging for FL to deal with user heterogeneity in their local data distribution in the real-world FL scenario, and this issue becomes even more severe in multi-label image classification. Inspired by the recent success of Transformers in centralized settings, we propose a novel FL framework for multi-label classification. Since partial label correlation may be observed by local clients during training, direct aggregation of locally updated models would not produce satisfactory performances. Thus, we propose a novel FL framework of Language-Guided Transformer (FedLGT) to tackle this challenging task, which aims to exploit and transfer knowledge across different clients for learning a robust global model. Through extensive experiments on various multi-label datasets (e.g., FLAIR, MS-COCO, etc.), we show that our FedLGT is able to achieve satisfactory performance and outperforms standard FL techniques under multi-label FL scenarios. Code is available at https://github.com/Jack24658735/FedLGT.
    Instrumental Variable Estimation for Causal Inference in Longitudinal Data with Time-Dependent Latent Confounders. (arXiv:2312.07175v1 [cs.LG])
    Causal inference from longitudinal observational data is a challenging problem due to the difficulty in correctly identifying the time-dependent confounders, especially in the presence of latent time-dependent confounders. Instrumental variable (IV) is a powerful tool for addressing the latent confounders issue, but the traditional IV technique cannot deal with latent time-dependent confounders in longitudinal studies. In this work, we propose a novel Time-dependent Instrumental Factor Model (TIFM) for time-varying causal effect estimation from data with latent time-dependent confounders. At each time-step, the proposed TIFM method employs the Recurrent Neural Network (RNN) architecture to infer latent IV, and then uses the inferred latent IV factor for addressing the confounding bias caused by the latent time-dependent confounders. We provide a theoretical analysis for the proposed TIFM method regarding causal effect estimation in longitudinal data. Extensive evaluation with synthetic datasets demonstrates the effectiveness of TIFM in addressing causal effect estimation over time. We further apply TIFM to a climate dataset to showcase the potential of the proposed method in tackling real-world problems.
    Identifying Drivers of Predictive Uncertainty using Variance Feature Attribution. (arXiv:2312.07252v1 [cs.LG])
    Explainability and uncertainty quantification are two pillars of trustable artificial intelligence. However, the reasoning behind uncertainty estimates is generally left unexplained. Identifying the drivers of uncertainty complements explanations of point predictions in recognizing potential model limitations. It facilitates the detection of oversimplification in the uncertainty estimation process. Explanations of uncertainty enhance communication and trust in decisions. They allow for verifying whether the main drivers of model uncertainty are relevant and may impact model usage. So far, the subject of explaining uncertainties has been rarely studied. The few exceptions in existing literature are tailored to Bayesian neural networks or rely heavily on technically intricate approaches, hindering their broad adoption. We propose variance feature attribution, a simple and scalable solution to explain predictive aleatoric uncertainties. First, we estimate uncertainty as predictive variance by equipping a neural network with a Gaussian output distribution by adding a variance output neuron. Thereby, we can rely on pre-trained point prediction models and fine-tune them for meaningful variance estimation. Second, we apply out-of-the-box explainers on the variance output of these models to explain the uncertainty estimation. We evaluate our approach in a synthetic setting where the data-generating process is known. We show that our method can explain uncertainty influences more reliably and faster than the established baseline CLUE. We fine-tune a state-of-the-art age regression model to estimate uncertainty and obtain attributions. Our explanations highlight potential sources of uncertainty, such as laugh lines. Variance feature attribution provides accurate explanations for uncertainty estimates with little modifications to the model architecture and low computational overhead.
    Dozerformer: Sequence Adaptive Sparse Transformer for Multivariate Time Series Forecasting. (arXiv:2312.06874v1 [cs.LG])
    Transformers have achieved remarkable performance in multivariate time series(MTS) forecasting due to their capability to capture long-term dependencies. However, the canonical attention mechanism has two key limitations: (1) its quadratic time complexity limits the sequence length, and (2) it generates future values from the entire historical sequence. To address this, we propose a Dozer Attention mechanism consisting of three sparse components: (1) Local, each query exclusively attends to keys within a localized window of neighboring time steps. (2) Stride, enables each query to attend to keys at predefined intervals. (3) Vary, allows queries to selectively attend to keys from a subset of the historical sequence. Notably, the size of this subset dynamically expands as forecasting horizons extend. Those three components are designed to capture essential attributes of MTS data, including locality, seasonality, and global temporal dependencies. Additionally, we present the Dozerformer Framework, incorporating the Dozer Attention mechanism for the MTS forecasting task. We evaluated the proposed Dozerformer framework with recent state-of-the-art methods on nine benchmark datasets and confirmed its superior performance. The code will be released after the manuscript is accepted.
    Mixture-of-Linear-Experts for Long-term Time Series Forecasting. (arXiv:2312.06786v1 [cs.LG])
    Long-term time series forecasting (LTSF) aims to predict future values of a time series given the past values. The current state-of-the-art (SOTA) on this problem is attained in some cases by linear-centric models, which primarily feature a linear mapping layer. However, due to their inherent simplicity, they are not able to adapt their prediction rules to periodic changes in time series patterns. To address this challenge, we propose a Mixture-of-Experts-style augmentation for linear-centric models and propose Mixture-of-Linear-Experts (MoLE). Instead of training a single model, MoLE trains multiple linear-centric models (i.e., experts) and a router model that weighs and mixes their outputs. While the entire framework is trained end-to-end, each expert learns to specialize in a specific temporal pattern, and the router model learns to compose the experts adaptively. Experiments show that MoLE reduces forecasting error of linear-centric models, including DLinear, RLinear, and RMLP, in over 78% of the datasets and settings we evaluated. By using MoLE existing linear-centric models can achieve SOTA LTSF results in 68% of the experiments that PatchTST reports and we compare to, whereas existing single-head linear-centric models achieve SOTA results in only 25% of cases. Additionally, MoLE models achieve SOTA in all settings for the newly released Weather2K datasets.
    Adversarial Estimation of Topological Dimension with Harmonic Score Maps. (arXiv:2312.06869v1 [cs.LG])
    Quantification of the number of variables needed to locally explain complex data is often the first step to better understanding it. Existing techniques from intrinsic dimension estimation leverage statistical models to glean this information from samples within a neighborhood. However, existing methods often rely on well-picked hyperparameters and ample data as manifold dimension and curvature increases. Leveraging insight into the fixed point of the score matching objective as the score map is regularized by its Dirichlet energy, we show that it is possible to retrieve the topological dimension of the manifold learned by the score map. We then introduce a novel method to measure the learned manifold's topological dimension (i.e., local intrinsic dimension) using adversarial attacks, thereby generating useful interpretations of the learned manifold.
    Understanding and Leveraging the Learning Phases of Neural Networks. (arXiv:2312.06887v1 [cs.LG])
    The learning dynamics of deep neural networks are not well understood. The information bottleneck (IB) theory proclaimed separate fitting and compression phases. But they have since been heavily debated. We comprehensively analyze the learning dynamics by investigating a layer's reconstruction ability of the input and prediction performance based on the evolution of parameters during training. We empirically show the existence of three phases using common datasets and architectures such as ResNet and VGG: (i) near constant reconstruction loss, (ii) decrease, and (iii) increase. We also derive an empirically grounded data model and prove the existence of phases for single-layer networks. Technically, our approach leverages classical complexity analysis. It differs from IB by relying on measuring reconstruction loss rather than information theoretic measures to relate information of intermediate layers and inputs. Our work implies a new best practice for transfer learning: We show empirically that the pre-training of a classifier should stop well before its performance is optimal.
    Steering Llama 2 via Contrastive Activation Addition. (arXiv:2312.06681v1 [cs.CL])
    We introduce Contrastive Activation Addition (CAA), an innovative method for steering language models by modifying activations during their forward passes. CAA computes ``steering vectors'' by averaging the difference in residual stream activations between pairs of positive and negative examples of a particular behavior such as factual versus hallucinatory responses. During inference, these steering vectors are added at all token positions after the user's prompt with either a positive or negative coefficient, allowing precise control over the degree of the targeted behavior. We evaluate CAA's effectiveness on Llama 2 Chat using both multiple-choice behavioral question datasets and open-ended generation tasks. We demonstrate that CAA significantly alters model behavior, outperforms traditional methods like finetuning and few-shot prompting, and minimally reduces capabilities. Moreover, by employing various activation space interpretation methods, we gain deeper insights into CAA's mechanisms. CAA both accurately steers model outputs and also sheds light on how high-level concepts are represented in Large Language Models (LLMs).
    Perseus: Removing Energy Bloat from Large Model Training. (arXiv:2312.06902v1 [cs.LG])
    Training large AI models on numerous GPUs consumes a massive amount of energy. We observe that not all energy consumed during training directly contributes to end-to-end training throughput, and a significant portion can be removed without slowing down training, which we call energy bloat. In this work, we identify two independent sources of energy bloat in large model training, intrinsic and extrinsic, and propose Perseus, a unified optimization framework that mitigates both. Perseus obtains the "iteration time-energy" Pareto frontier of any large model training job using an efficient iterative graph cut-based algorithm and schedules energy consumption of its forward and backward computations across time to remove intrinsic and extrinsic energy bloat. Evaluation on large models like GPT-3 and Bloom shows that Perseus reduces energy consumption of large model training by up to 30%, enabling savings otherwise unobtainable before.
    RACER: Rational Artificial Intelligence Car-following-model Enhanced by Reality. (arXiv:2312.07003v1 [cs.AI])
    This paper introduces RACER, the Rational Artificial Intelligence Car-following model Enhanced by Reality, a cutting-edge deep learning car-following model, that satisfies partial derivative constraints, designed to predict Adaptive Cruise Control (ACC) driving behavior while staying theoretically feasible. Unlike conventional models, RACER effectively integrates Rational Driving Constraints (RDCs), crucial tenets of actual driving, resulting in strikingly accurate and realistic predictions. Against established models like the Optimal Velocity Relative Velocity (OVRV), a car-following Neural Network (NN), and a car-following Physics-Informed Neural Network (PINN), RACER excels across key metrics, such as acceleration, velocity, and spacing. Notably, it displays a perfect adherence to the RDCs, registering zero violations, in stark contrast to other models. This study highlights the immense value of incorporating physical constraints within AI models, especially for augmenting safety measures in transportation. It also paves the way for future research to test these models against human driving data, with the potential to guide safer and more rational driving behavior. The versatility of the proposed model, including its potential to incorporate additional derivative constraints and broader architectural applications, enhances its appeal and broadens its impact within the scientific community.
    Learning to Denoise Unreliable Interactions for Link Prediction on Biomedical Knowledge Graph. (arXiv:2312.06682v1 [cs.AI])
    Link prediction in biomedical knowledge graphs (KGs) aims at predicting unknown interactions between entities, including drug-target interaction (DTI) and drug-drug interaction (DDI), which is critical for drug discovery and therapeutics. Previous methods prefer to utilize the rich semantic relations and topological structure of the KG to predict missing links, yielding promising outcomes. However, all these works only focus on improving the predictive performance without considering the inevitable noise and unreliable interactions existing in the KGs, which limits the development of KG-based computational methods. To address these limitations, we propose a Denoised Link Prediction framework, called DenoisedLP. DenoisedLP obtains reliable interactions based on the local subgraph by denoising noisy links in a learnable way, providing a universal module for mining underlying task-relevant relations. To collaborate with the smoothed semantic information, DenoisedLP introduces the semantic subgraph by blurring conflict relations around the predicted link. By maximizing the mutual information between the reliable structure and smoothed semantic relations, DenoisedLP emphasizes the informative interactions for predicting relation-specific links. Experimental results on real-world datasets demonstrate that DenoisedLP outperforms state-of-the-art methods on DTI and DDI prediction tasks, and verify the effectiveness and robustness of denoising unreliable interactions on the contaminated KGs.
    Dynamically configured physics-informed neural network in topology optimization applications. (arXiv:2312.06993v1 [cs.LG])
    Integration of machine learning (ML) into the topology optimization (TO) framework is attracting increasing attention, but data acquisition in data-driven models is prohibitive. Compared with popular ML methods, the physics-informed neural network (PINN) can avoid generating enormous amounts of data when solving forward problems and additionally provide better inference. To this end, a dynamically configured PINN-based topology optimization (DCPINN-TO) method is proposed. The DCPINN is composed of two subnetworks, namely the backbone neural network (NN) and the coefficient NN, where the coefficient NN has fewer trainable parameters. The designed architecture aims to dynamically configure trainable parameters; that is, an inexpensive NN is used to replace an expensive one at certain optimization cycles. Furthermore, an active sampling strategy is proposed to selectively sample collocations depending on the pseudo-densities at each optimization cycle. In this manner, the number of collocations will decrease with the optimization process but will hardly affect it. The Gaussian integral is used to calculate the strain energy of elements, which yields a byproduct of decoupling the mapping of the material at the collocations. Several examples with different resolutions validate the feasibility of the DCPINN-TO method, and multiload and multiconstraint problems are employed to illustrate its generalization. In addition, compared to finite element analysis-based TO (FEA-TO), the accuracy of the displacement prediction and optimization results indicate that the DCPINN-TO method is effective and efficient.
    Leveraging Generative Language Models for Weakly Supervised Sentence Component Analysis in Video-Language Joint Learning. (arXiv:2312.06699v1 [cs.CV])
    A thorough comprehension of textual data is a fundamental element in multi-modal video analysis tasks. However, recent works have shown that the current models do not achieve a comprehensive understanding of the textual data during the training for the target downstream tasks. Orthogonal to the previous approaches to this limitation, we postulate that understanding the significance of the sentence components according to the target task can potentially enhance the performance of the models. Hence, we utilize the knowledge of a pre-trained large language model (LLM) to generate text samples from the original ones, targeting specific sentence components. We propose a weakly supervised importance estimation module to compute the relative importance of the components and utilize them to improve different video-language tasks. Through rigorous quantitative analysis, our proposed method exhibits significant improvement across several video-language tasks. In particular, our approach notably enhances video-text retrieval by a relative improvement of 8.3\% in video-to-text and 1.4\% in text-to-video retrieval over the baselines, in terms of R@1. Additionally, in video moment retrieval, average mAP shows a relative improvement ranging from 2.0\% to 13.7 \% across different baselines.
    Neural Architecture Codesign for Fast Bragg Peak Analysis. (arXiv:2312.05978v2 [cs.LG] UPDATED)
    We develop an automated pipeline to streamline neural architecture codesign for fast, real-time Bragg peak analysis in high-energy diffraction microscopy. Traditional approaches, notably pseudo-Voigt fitting, demand significant computational resources, prompting interest in deep learning models for more efficient solutions. Our method employs neural architecture search and AutoML to enhance these models, including hardware costs, leading to the discovery of more hardware-efficient neural architectures. Our results match the performance, while achieving a 13$\times$ reduction in bit operations compared to the previous state-of-the-art. We show further speedup through model compression techniques such as quantization-aware-training and neural network pruning. Additionally, our hierarchical search space provides greater flexibility in optimization, which can easily extend to other tasks and domains.
    Risk Preferences of Learning Algorithms. (arXiv:2205.04619v3 [cs.LG] UPDATED)
    Agents' learning from feedback shapes economic outcomes, and many economic decision-makers today employ learning algorithms to make consequential choices. This note shows that a widely used learning algorithm, $\varepsilon$-Greedy, exhibits emergent risk aversion: it prefers actions with lower variance. When presented with actions of the same expectation, under a wide range of conditions, $\varepsilon$-Greedy chooses the lower-variance action with probability approaching one. This emergent preference can have wide-ranging consequences, ranging from concerns about fairness to homogenization, and holds transiently even when the riskier action has a strictly higher expected payoff. We discuss two methods to correct this bias. The first method requires the algorithm to reweight data as a function of how likely the actions were to be chosen. The second requires the algorithm to have optimistic estimates of actions for which it has not collected much data. We show that risk-neutrality is restored with these corrections.
    Efficient Cross-Domain Federated Learning by MixStyle Approximation. (arXiv:2312.07064v1 [cs.LG])
    With the advent of interconnected and sensor-equipped edge devices, Federated Learning (FL) has gained significant attention, enabling decentralized learning while maintaining data privacy. However, FL faces two challenges in real-world tasks: expensive data labeling and domain shift between source and target samples. In this paper, we introduce a privacy-preserving, resource-efficient FL concept for client adaptation in hardware-constrained environments. Our approach includes server model pre-training on source data and subsequent fine-tuning on target data via low-end clients. The local client adaptation process is streamlined by probabilistic mixing of instance-level feature statistics approximated from source and target domain data. The adapted parameters are transferred back to the central server and globally aggregated. Preliminary results indicate that our method reduces computational and transmission costs while maintaining competitive performance on downstream tasks.
    Focus on Hiders: Exploring Hidden Threats for Enhancing Adversarial Training. (arXiv:2312.07067v1 [cs.LG])
    Adversarial training is often formulated as a min-max problem, however, concentrating only on the worst adversarial examples causes alternating repetitive confusion of the model, i.e., previously defended or correctly classified samples are not defensible or accurately classifiable in subsequent adversarial training. We characterize such non-ignorable samples as "hiders", which reveal the hidden high-risk regions within the secure area obtained through adversarial training and prevent the model from finding the real worst cases. We demand the model to prevent hiders when defending against adversarial examples for improving accuracy and robustness simultaneously. By rethinking and redefining the min-max optimization problem for adversarial training, we propose a generalized adversarial training algorithm called Hider-Focused Adversarial Training (HFAT). HFAT introduces the iterative evolution optimization strategy to simplify the optimization problem and employs an auxiliary model to reveal hiders, effectively combining the optimization directions of standard adversarial training and prevention hiders. Furthermore, we introduce an adaptive weighting mechanism that facilitates the model in adaptively adjusting its focus between adversarial examples and hiders during different training periods. We demonstrate the effectiveness of our method based on extensive experiments, and ensure that HFAT can provide higher robustness and accuracy.
    Context Matter: Data-Efficient Augmentation of Large Language Models for Scientific Applications. (arXiv:2312.07069v1 [cs.CL])
    In this paper, we explore the challenges inherent to Large Language Models (LLMs) like GPT-4, particularly their propensity for hallucinations, logic mistakes, and incorrect conclusions when tasked with answering complex questions. The capacity of LLMs to present erroneous answers in a coherent and semantically rigorous manner further complicates the detection of factual inaccuracies. This issue is especially pronounced in fields that require specialized expertise. Our work delves into these challenges, aiming to enhance the understanding and mitigation of such errors, thereby contributing to the improvement of LLM accuracy and reliability in scientific and other specialized domains. Our findings reveal a non-linear relationship between the context's relevancy and the answers' measured quality. In addition, we demonstrate that with the correct calibration, it is possible to automate the grading procedure -- a finding suggesting that, at least to some degree, the LLMs can be used to self-examine the quality of their own performance. Finally, we describe an experimental platform that can be seen as a proof-of-concept of the techniques described in this work.
    Grounded Decoding: Guiding Text Generation with Grounded Models for Embodied Agents. (arXiv:2303.00855v2 [cs.RO] UPDATED)
    Recent progress in large language models (LLMs) has demonstrated the ability to learn and leverage Internet-scale knowledge through pre-training with autoregressive models. Unfortunately, applying such models to settings with embodied agents, such as robots, is challenging due to their lack of experience with the physical world, inability to parse non-language observations, and ignorance of rewards or safety constraints that robots may require. On the other hand, language-conditioned robotic policies that learn from interaction data can provide the necessary grounding that allows the agent to be correctly situated in the real world, but such policies are limited by the lack of high-level semantic understanding due to the limited breadth of the interaction data available for training them. Thus, if we want to make use of the semantic knowledge in a language model while still situating it in an embodied setting, we must construct an action sequence that is both likely according to the language model and also realizable according to grounded models of the environment. We frame this as a problem similar to probabilistic filtering: decode a sequence that both has high probability under the language model and high probability under a set of grounded model objectives. We demonstrate how such grounded models can be obtained across three simulation and real-world domains, and that the proposed decoding strategy is able to solve complex, long-horizon embodiment tasks in a robotic setting by leveraging the knowledge of both models. The project's website can be found at grounded-decoding.github.io.
    On the notion of Hallucinations from the lens of Bias and Validity in Synthetic CXR Images. (arXiv:2312.06979v1 [eess.IV])
    Medical imaging has revolutionized disease diagnosis, yet the potential is hampered by limited access to diverse and privacy-conscious datasets. Open-source medical datasets, while valuable, suffer from data quality and clinical information disparities. Generative models, such as diffusion models, aim to mitigate these challenges. At Stanford, researchers explored the utility of a fine-tuned Stable Diffusion model (RoentGen) for medical imaging data augmentation. Our work examines specific considerations to expand the Stanford research question, Could Stable Diffusion Solve a Gap in Medical Imaging Data? from the lens of bias and validity of the generated outcomes. We leveraged RoentGen to produce synthetic Chest-XRay (CXR) images and conducted assessments on bias, validity, and hallucinations. Diagnostic accuracy was evaluated by a disease classifier, while a COVID classifier uncovered latent hallucinations. The bias analysis unveiled disparities in classification performance among various subgroups, with a pronounced impact on the Female Hispanic subgroup. Furthermore, incorporating race and gender into input prompts exacerbated fairness issues in the generated images. The quality of synthetic images exhibited variability, particularly in certain disease classes, where there was more significant uncertainty compared to the original images. Additionally, we observed latent hallucinations, with approximately 42% of the images incorrectly indicating COVID, hinting at the presence of hallucinatory elements. These identifications provide new research directions towards interpretability of synthetic CXR images, for further understanding of associated risks and patient safety in medical applications.
    Local Spatiotemporal Representation Learning for Longitudinally-consistent Neuroimage Analysis. (arXiv:2206.04281v4 [cs.CV] UPDATED)
    Recent self-supervised advances in medical computer vision exploit global and local anatomical self-similarity for pretraining prior to downstream tasks such as segmentation. However, current methods assume i.i.d. image acquisition, which is invalid in clinical study designs where follow-up longitudinal scans track subject-specific temporal changes. Further, existing self-supervised methods for medically-relevant image-to-image architectures exploit only spatial or temporal self-similarity and only do so via a loss applied at a single image-scale, with naive multi-scale spatiotemporal extensions collapsing to degenerate solutions. To these ends, this paper makes two contributions: (1) It presents a local and multi-scale spatiotemporal representation learning method for image-to-image architectures trained on longitudinal images. It exploits the spatiotemporal self-similarity of learned multi-scale intra-subject features for pretraining and develops several feature-wise regularizations that avoid collapsed identity representations; (2) During finetuning, it proposes a surprisingly simple self-supervised segmentation consistency regularization to exploit intra-subject correlation. Benchmarked in the one-shot segmentation setting, the proposed framework outperforms both well-tuned randomly-initialized baselines and current self-supervised techniques designed for both i.i.d. and longitudinal datasets. These improvements are demonstrated across both longitudinal neurodegenerative adult MRI and developing infant brain MRI and yield both higher performance and longitudinal consistency.
    HyperRouter: Towards Efficient Training and Inference of Sparse Mixture of Experts. (arXiv:2312.07035v1 [cs.LG])
    By routing input tokens to only a few split experts, Sparse Mixture-of-Experts has enabled efficient training of large language models. Recent findings suggest that fixing the routers can achieve competitive performance by alleviating the collapsing problem, where all experts eventually learn similar representations. However, this strategy has two key limitations: (i) the policy derived from random routers might be sub-optimal, and (ii) it requires extensive resources during training and evaluation, leading to limited efficiency gains. This work introduces \HyperRout, which dynamically generates the router's parameters through a fixed hypernetwork and trainable embeddings to achieve a balance between training the routers and freezing them to learn an improved routing policy. Extensive experiments across a wide range of tasks demonstrate the superior performance and efficiency gains of \HyperRouter compared to existing routing methods. Our implementation is publicly available at {\url{{https://github.com/giangdip2410/HyperRouter}}}.
    Building Trustworthy NeuroSymbolic AI Systems: Consistency, Reliability, Explainability, and Safety. (arXiv:2312.06798v1 [cs.AI])
    Explainability and Safety engender Trust. These require a model to exhibit consistency and reliability. To achieve these, it is necessary to use and analyze data and knowledge with statistical and symbolic AI methods relevant to the AI application - neither alone will do. Consequently, we argue and seek to demonstrate that the NeuroSymbolic AI approach is better suited for making AI a trusted AI system. We present the CREST framework that shows how Consistency, Reliability, user-level Explainability, and Safety are built on NeuroSymbolic methods that use data and knowledge to support requirements for critical applications such as health and well-being. This article focuses on Large Language Models (LLMs) as the chosen AI system within the CREST framework. LLMs have garnered substantial attention from researchers due to their versatility in handling a broad array of natural language processing (NLP) scenarios. For example, ChatGPT and Google's MedPaLM have emerged as highly promising platforms for providing information in general and health-related queries, respectively. Nevertheless, these models remain black boxes despite incorporating human feedback and instruction-guided tuning. For instance, ChatGPT can generate unsafe responses despite instituting safety guardrails. CREST presents a plausible approach harnessing procedural and graph-based knowledge within a NeuroSymbolic framework to shed light on the challenges associated with LLMs.
    Diffusion Schr\"odinger Bridge Matching. (arXiv:2303.16852v3 [stat.ML] UPDATED)
    Solving transport problems, i.e. finding a map transporting one given distribution to another, has numerous applications in machine learning. Novel mass transport methods motivated by generative modeling have recently been proposed, e.g. Denoising Diffusion Models (DDMs) and Flow Matching Models (FMMs) implement such a transport through a Stochastic Differential Equation (SDE) or an Ordinary Differential Equation (ODE). However, while it is desirable in many applications to approximate the deterministic dynamic Optimal Transport (OT) map which admits attractive properties, DDMs and FMMs are not guaranteed to provide transports close to the OT map. In contrast, Schr\"odinger bridges (SBs) compute stochastic dynamic mappings which recover entropy-regularized versions of OT. Unfortunately, existing numerical methods approximating SBs either scale poorly with dimension or accumulate errors across iterations. In this work, we introduce Iterative Markovian Fitting (IMF), a new methodology for solving SB problems, and Diffusion Schr\"odinger Bridge Matching (DSBM), a novel numerical algorithm for computing IMF iterates. DSBM significantly improves over previous SB numerics and recovers as special/limiting cases various recent transport methods. We demonstrate the performance of DSBM on a variety of problems.
    Accelerating Scalable Graph Neural Network Inference with Node-Adaptive Propagation. (arXiv:2310.10998v2 [cs.LG] UPDATED)
    Graph neural networks (GNNs) have exhibited exceptional efficacy in a diverse array of applications. However, the sheer size of large-scale graphs presents a significant challenge to real-time inference with GNNs. Although existing Scalable GNNs leverage linear propagation to preprocess the features and accelerate the training and inference procedure, these methods still suffer from scalability issues when making inferences on unseen nodes, as the feature preprocessing requires the graph to be known and fixed. To further accelerate Scalable GNNs inference in this inductive setting, we propose an online propagation framework and two novel node-adaptive propagation methods that can customize the optimal propagation depth for each node based on its topological information and thereby avoid redundant feature propagation. The trade-off between accuracy and latency can be flexibly managed through simple hyper-parameters to accommodate various latency constraints. Moreover, to compensate for the inference accuracy loss caused by the potential early termination of propagation, we further propose Inception Distillation to exploit the multi-scale receptive field information within graphs. The rigorous and comprehensive experimental study on public datasets with varying scales and characteristics demonstrates that the proposed inference acceleration framework outperforms existing state-of-the-art graph inference acceleration methods in terms of accuracy and efficiency. Particularly, the superiority of our approach is notable on datasets with larger scales, yielding a 75x inference speedup on the largest Ogbn-products dataset.
    The unreasonable effectiveness of AI CADe polyp detectors to generalize to new countries. (arXiv:2312.06833v1 [cs.LG])
    $\textbf{Background and aims}$: Artificial Intelligence (AI) Computer-Aided Detection (CADe) is commonly used for polyp detection, but data seen in clinical settings can differ from model training. Few studies evaluate how well CADe detectors perform on colonoscopies from countries not seen during training, and none are able to evaluate performance without collecting expensive and time-intensive labels. $\textbf{Methods}$: We trained a CADe polyp detector on Israeli colonoscopy videos (5004 videos, 1106 hours) and evaluated on Japanese videos (354 videos, 128 hours) by measuring the True Positive Rate (TPR) versus false alarms per minute (FAPM). We introduce a colonoscopy dissimilarity measure called "MAsked mediCal Embedding Distance" (MACE) to quantify differences between colonoscopies, without labels. We evaluated CADe on all Japan videos and on those with the highest MACE. $\textbf{Results}$: MACE correctly quantifies that narrow-band imaging (NBI) and chromoendoscopy (CE) frames are less similar to Israel data than Japan whitelight (bootstrapped z-test, |z| > 690, p 45.2, p 47.3, p < $10^{-8}$, $\delta$ = 1.5% for both). $\textbf{Conclusion}$: Differences that prevent CADe detectors from performing well in non-medical settings do not degrade the performance of our AI CADe polyp detector when applied to data from a new country. MACE can help medical AI models internationalize by identifying the most "dissimilar" data on which to evaluate models.  ( 3 min )
    Using Analytics on Student Created Data to Content Validate Pedagogical Tools. (arXiv:2312.06871v1 [cs.AI])
    Conceptual and simulation models can function as useful pedagogical tools, however it is important to categorize different outcomes when evaluating them in order to more meaningfully interpret results. VERA is a ecology-based conceptual modeling software that enables users to simulate interactions between biotics and abiotics in an ecosystem, allowing users to form and then verify hypothesis through observing a time series of the species populations. In this paper, we classify this time series into common patterns found in the domain of ecological modeling through two methods, hierarchical clustering and curve fitting, illustrating a general methodology for showing content validity when combining different pedagogical tools. When applied to a diverse sample of 263 models containing 971 time series collected from three different VERA user categories: a Georgia Tech (GATECH), North Georgia Technical College (NGTC), and ``Self Directed Learners'', results showed agreement between both classification methods on 89.38\% of the sample curves in the test set. This serves as a good indication that our methodology for determining content validity was successful.  ( 2 min )
    RAFIC: Retrieval-Augmented Few-shot Image Classification. (arXiv:2312.06868v1 [cs.CV])
    Few-shot image classification is the task of classifying unseen images to one of N mutually exclusive classes, using only a small number of training examples for each class. The limited availability of these examples (denoted as K) presents a significant challenge to classification accuracy in some cases. To address this, we have developed a method for augmenting the set of K with an addition set of A retrieved images. We call this system Retrieval-Augmented Few-shot Image Classification (RAFIC). Through a series of experiments, we demonstrate that RAFIC markedly improves performance of few-shot image classification across two challenging datasets. RAFIC consists of two main components: (a) a retrieval component which uses CLIP, LAION-5B, and faiss, in order to efficiently retrieve images similar to the supplied images, and (b) retrieval meta-learning, which learns to judiciously utilize the retrieved images. Code and data is available at github.com/amirziai/rafic.  ( 2 min )
    ELSA: Partial Weight Freezing for Overhead-Free Sparse Network Deployment. (arXiv:2312.06872v1 [cs.LG])
    We present ELSA, a practical solution for creating deep networks that can easily be deployed at different levels of sparsity. The core idea is to embed one or more sparse networks within a single dense network as a proper subset of the weights. At prediction time, any sparse model can be extracted effortlessly simply be zeroing out weights according to a predefined mask. ELSA is simple, powerful and highly flexible. It can use essentially any existing technique for network sparsification and network training. In particular, it does not restrict the loss function, architecture or the optimization technique. Our experiments show that ELSA's advantages of flexible deployment comes with no or just a negligible reduction in prediction quality compared to the standard way of using multiple sparse networks that are trained and stored independently.  ( 2 min )
    Self-supervised Machine Learning Based Approach to Orbit Modelling Applied to Space Traffic Management. (arXiv:2312.06854v1 [physics.space-ph])
    This paper presents a novel methodology for improving the performance of machine learning based space traffic management tasks through the use of a pre-trained orbit model. Taking inspiration from BERT-like self-supervised language models in the field of natural language processing, we introduce ORBERT, and demonstrate the ability of such a model to leverage large quantities of readily available orbit data to learn meaningful representations that can be used to aid in downstream tasks. As a proof of concept of this approach we consider the task of all vs. all conjunction screening, phrased here as a machine learning time series classification task. We show that leveraging unlabelled orbit data leads to improved performance, and that the proposed approach can be particularly beneficial for tasks where the availability of labelled data is limited.  ( 2 min )
    Multimodal Pretraining of Medical Time Series and Notes. (arXiv:2312.06855v1 [cs.LG])
    Within the intensive care unit (ICU), a wealth of patient data, including clinical measurements and clinical notes, is readily available. This data is a valuable resource for comprehending patient health and informing medical decisions, but it also contains many challenges in analysis. Deep learning models show promise in extracting meaningful patterns, but they require extensive labeled data, a challenge in critical care. To address this, we propose a novel approach employing self-supervised pretraining, focusing on the alignment of clinical measurements and notes. Our approach combines contrastive and masked token prediction tasks during pretraining. Semi-supervised experiments on the MIMIC-III dataset demonstrate the effectiveness of our self-supervised pretraining. In downstream tasks, including in-hospital mortality prediction and phenotyping, our pretrained model outperforms baselines in settings where only a fraction of the data is labeled, emphasizing its ability to enhance ICU data analysis. Notably, our method excels in situations where very few labels are available, as evidenced by an increase in the AUC-ROC for in-hospital mortality by 0.17 and in AUC-PR for phenotyping by 0.1 when only 1% of labels are accessible. This work advances self-supervised learning in the healthcare domain, optimizing clinical insights from abundant yet challenging ICU data.  ( 2 min )
    NDELS: A Novel Approach for Nighttime Dehazing, Low-Light Enhancement, and Light Suppression. (arXiv:2312.06850v1 [cs.CV])
    This paper tackles the intricate challenge of improving the quality of nighttime images under hazy and low-light conditions. Overcoming issues including nonuniform illumination glows, texture blurring, glow effects, color distortion, noise disturbance, and overall, low light have proven daunting. Despite the inherent difficulties, this paper introduces a pioneering solution named Nighttime Dehazing, Low-Light Enhancement, and Light Suppression (NDELS). NDELS utilizes a unique network that combines three essential processes to enhance visibility, brighten low-light regions, and effectively suppress glare from bright light sources. In contrast to limited progress in nighttime dehazing, unlike its daytime counterpart, NDELS presents a comprehensive and innovative approach. The efficacy of NDELS is rigorously validated through extensive comparisons with eight state-of-the-art algorithms across four diverse datasets. Experimental results showcase the superior performance of our method, demonstrating its outperformance in terms of overall image quality, including color and edge enhancement. Quantitative (PSNR, SSIM) and qualitative metrics (CLIPIQA, MANIQA, TRES), measure these results.  ( 2 min )
    Extracting Self-Consistent Causal Insights from Users Feedback with LLMs and In-context Learning. (arXiv:2312.06820v1 [cs.AI])
    Microsoft Windows Feedback Hub is designed to receive customer feedback on a wide variety of subjects including critical topics such as power and battery. Feedback is one of the most effective ways to have a grasp of users' experience with Windows and its ecosystem. However, the sheer volume of feedback received by Feedback Hub makes it immensely challenging to diagnose the actual cause of reported issues. To better understand and triage issues, we leverage Double Machine Learning (DML) to associate users' feedback with telemetry signals. One of the main challenges we face in the DML pipeline is the necessity of domain knowledge for model design (e.g., causal graph), which sometimes is either not available or hard to obtain. In this work, we take advantage of reasoning capabilities in Large Language Models (LLMs) to generate a prior model that which to some extent compensates for the lack of domain knowledge and could be used as a heuristic for measuring feedback informativeness. Our LLM-based approach is able to extract previously known issues, uncover new bugs, and identify sequences of events that lead to a bug, while minimizing out-of-domain outputs.  ( 2 min )
    Model Breadcrumbs: Scaling Multi-Task Model Merging with Sparse Masks. (arXiv:2312.06795v1 [cs.LG])
    The rapid development of AI systems has been greatly influenced by the emergence of foundation models. A common approach for targeted problems involves fine-tuning these pre-trained foundation models for specific target tasks, resulting in a rapid spread of models fine-tuned across a diverse array of tasks. This work focuses on the problem of merging multiple fine-tunings of the same foundation model derived from a spectrum of auxiliary tasks. We introduce a new simple method, Model Breadcrumbs, which consists of a sparsely defined set of weights that carve out a trajectory within the weight space of a pre-trained model, enhancing task performance when traversed. These breadcrumbs are constructed by subtracting the weights from a pre-trained model before and after fine-tuning, followed by a sparsification process that eliminates weight outliers and negligible perturbations. Our experiments demonstrate the effectiveness of Model Breadcrumbs to simultaneously improve performance across multiple tasks. This contribution aligns with the evolving paradigm of updatable machine learning, reminiscent of the collaborative principles underlying open-source software development, fostering a community-driven effort to reliably update machine learning models. Our method is shown to be more efficient and unlike previous proposals does not require hyperparameter tuning for each new task added. Through extensive experimentation involving various models, tasks, and modalities we establish that integrating Model Breadcrumbs offers a simple, efficient, and highly effective approach for constructing multi-task models and facilitating updates to foundation models.  ( 2 min )
  • Open

    Wiener Chaos in Kernel Regression: Towards Untangling Aleatoric and Epistemic Uncertainty. (arXiv:2312.07387v1 [stat.ML])
    Gaussian Processes (GPs) are a versatile method that enables different approaches towards learning for dynamics and control. Gaussianity assumptions appear in two dimensions in GPs: The positive semi-definite kernel of the underlying reproducing kernel Hilbert space is used to construct the co-variance of a Gaussian distribution over functions, while measurement noise (i.e. data corruption) is usually modeled as i.i.d. additive Gaussian. In this note, we relax the latter Gaussianity assumption, i.e., we consider kernel ridge regression with additive i.i.d. non-Gaussian measurement noise. To apply the usual kernel trick, we rely on the representation of the uncertainty via polynomial chaos expansions, which are series expansions for random variables of finite variance introduced by Norbert Wiener. We derive and discuss the analytic $\mathcal{L}^2$ solution to the arising Wiener kernel regression. Considering a polynomial system as numerical example, we show that our approach allows to untangle the effects of epistemic and aleatoric uncertainties.
    Class Probability Matching Using Kernel Methods for Label Shift Adaptation. (arXiv:2312.07282v1 [stat.ML])
    In domain adaptation, covariate shift and label shift problems are two distinct and complementary tasks. In covariate shift adaptation where the differences in data distribution arise from variations in feature probabilities, existing approaches naturally address this problem based on \textit{feature probability matching} (\textit{FPM}). However, for label shift adaptation where the differences in data distribution stem solely from variations in class probability, current methods still use FPM on the $d$-dimensional feature space to estimate the class probability ratio on the one-dimensional label space. To address label shift adaptation more naturally and effectively, inspired by a new representation of the source domain's class probability, we propose a new framework called \textit{class probability matching} (\textit{CPM}) which matches two class probability functions on the one-dimensional label space to estimate the class probability ratio, fundamentally different from FPM operating on the $d$-dimensional feature space. Furthermore, by incorporating the kernel logistic regression into the CPM framework to estimate the conditional probability, we propose an algorithm called \textit{class probability matching using kernel methods} (\textit{CPMKM}) for label shift adaptation. From the theoretical perspective, we establish the optimal convergence rates of CPMKM with respect to the cross-entropy loss for multi-class label shift adaptation. From the experimental perspective, comparisons on real datasets demonstrate that CPMKM outperforms existing FPM-based and maximum-likelihood-based algorithms.
    Adaptive learning of density ratios in RKHS. (arXiv:2307.16164v2 [cs.LG] UPDATED)
    Estimating the ratio of two probability densities from finitely many observations of the densities is a central problem in machine learning and statistics with applications in two-sample testing, divergence estimation, generative modeling, covariate shift adaptation, conditional density estimation, and novelty detection. In this work, we analyze a large class of density ratio estimation methods that minimize a regularized Bregman divergence between the true density ratio and a model in a reproducing kernel Hilbert space (RKHS). We derive new finite-sample error bounds, and we propose a Lepskii type parameter choice principle that minimizes the bounds without knowledge of the regularity of the density ratio. In the special case of quadratic loss, our method adaptively achieves a minimax optimal error rate. A numerical illustration is provided.
    Regret-Optimal Model-Free Reinforcement Learning for Discounted MDPs with Short Burn-In Time. (arXiv:2305.15546v2 [cs.LG] UPDATED)
    A crucial problem in reinforcement learning is learning the optimal policy. We study this in tabular infinite-horizon discounted Markov decision processes under the online setting. The existing algorithms either fail to achieve regret optimality or have to incur a high memory and computational cost. In addition, existing optimal algorithms all require a long burn-in time in order to achieve optimal sample efficiency, i.e., their optimality is not guaranteed unless sample size surpasses a high threshold. We address both open problems by introducing a model-free algorithm that employs variance reduction and a novel technique that switches the execution policy in a slow-yet-adaptive manner. This is the first regret-optimal model-free algorithm in the discounted setting, with the additional benefit of a low burn-in time.
    Good regularity creates large learning rate implicit biases: edge of stability, balancing, and catapult. (arXiv:2310.17087v2 [cs.LG] UPDATED)
    Large learning rates, when applied to gradient descent for nonconvex optimization, yield various implicit biases including the edge of stability (Cohen et al., 2021), balancing (Wang et al., 2022), and catapult (Lewkowycz et al., 2020). These phenomena cannot be well explained by classical optimization theory. Though significant theoretical progress has been made in understanding these implicit biases, it remains unclear for which objective functions would they be more likely. This paper provides an initial step in answering this question and also shows that these implicit biases are in fact various tips of the same iceberg. To establish these results, we develop a global convergence theory under large learning rates, for a family of nonconvex functions without globally Lipschitz continuous gradient, which was typically assumed in existing convergence analysis. Specifically, these phenomena are more likely to occur when the optimization objective function has good regularity. This regularity, together with gradient descent using a large learning rate that favors flatter regions, results in these nontrivial dynamical behaviors. Another corollary is the first non-asymptotic convergence rate bound for large-learning-rate gradient descent optimization of nonconvex functions. Although our theory only applies to specific functions so far, the possibility of extrapolating it to neural networks is also experimentally validated, for which different choices of loss, activation functions, and other techniques such as batch normalization can all affect regularity significantly and lead to very different training dynamics.
    Forced Exploration in Bandit Problems. (arXiv:2312.07285v1 [cs.LG])
    The multi-armed bandit(MAB) is a classical sequential decision problem. Most work requires assumptions about the reward distribution (e.g., bounded), while practitioners may have difficulty obtaining information about these distributions to design models for their problems, especially in non-stationary MAB problems. This paper aims to design a multi-armed bandit algorithm that can be implemented without using information about the reward distribution while still achieving substantial regret upper bounds. To this end, we propose a novel algorithm alternating between greedy rule and forced exploration. Our method can be applied to Gaussian, Bernoulli and other subgaussian distributions, and its implementation does not require additional information. We employ a unified analysis method for different forced exploration strategies and provide problem-dependent regret upper bounds for stationary and piecewise-stationary settings. Furthermore, we compare our algorithm with popular bandit algorithms on different reward distributions.
    The Gaussian-Linear Hidden Markov model: a Python package. (arXiv:2312.07151v1 [q-bio.NC])
    We propose the Gaussian-Linear Hidden Markov model (GLHMM), a generalisation of different types of HMMs commonly used in neuroscience. In short, the GLHMM is a general framework where linear regression is used to flexibly parameterise the Gaussian state distribution, thereby accommodating a wide range of uses -including unsupervised, encoding and decoding models. GLHMM is implemented as a Python toolbox with an emphasis on statistical testing and out-of-sample prediction -i.e. aimed at finding and characterising brain-behaviour associations. The toolbox uses a stochastic variational inference approach, enabling it to handle large data sets at reasonable computational time. Overall, the approach can be applied to several data modalities, including animal recordings or non-brain data, and applied over a broad range of experimental paradigms. For demonstration, we show examples with fMRI, electrocorticography, magnetoencephalo-graphy and pupillometry.
    On Classification-Calibration of Gamma-Phi Losses. (arXiv:2302.07321v2 [stat.ML] UPDATED)
    Gamma-Phi losses constitute a family of multiclass classification loss functions that generalize the logistic and other common losses, and have found application in the boosting literature. We establish the first general sufficient condition for the classification-calibration (CC) of such losses. To our knowledge, this sufficient condition gives the first family of nonconvex multiclass surrogate losses for which CC has been fully justified. In addition, we show that a previously proposed sufficient condition is in fact not sufficient. This contribution highlights a technical issue that is important in the study of multiclass CC but has been neglected in prior work.
    Luck, skill, and depth of competition in games and social hierarchies. (arXiv:2312.04711v1 [physics.soc-ph] CROSS LISTED)
    Patterns of wins and losses in pairwise contests, such as occur in sports and games, consumer research and paired comparison studies, and human and animal social hierarchies, are commonly analyzed using probabilistic models that allow one to quantify the strength of competitors or predict the outcome of future contests. Here we generalize this approach to incorporate two additional features: an element of randomness or luck that leads to upset wins, and a "depth of competition" variable that measures the complexity of a game or hierarchy. Fitting the resulting model to a large collection of data sets we estimate depth and luck in a range of games, sports, and social situations. In general, we find that social competition tends to be "deep," meaning it has a pronounced hierarchy with many distinct levels, but also that there is often a nonzero chance of an upset victory, meaning that dominance challenges can be won even by significant underdogs. Competition in sports and games, by contrast, tends to be shallow and in most cases there is little evidence of upset wins, beyond those already implied by the shallowness of the hierarchy.
    Optimal Rates for Regularized Conditional Mean Embedding Learning. (arXiv:2208.01711v3 [stat.ML] UPDATED)
    We address the consistency of a kernel ridge regression estimate of the conditional mean embedding (CME), which is an embedding of the conditional distribution of $Y$ given $X$ into a target reproducing kernel Hilbert space $\mathcal{H}_Y$. The CME allows us to take conditional expectations of target RKHS functions, and has been employed in nonparametric causal and Bayesian inference. We address the misspecified setting, where the target CME is in the space of Hilbert-Schmidt operators acting from an input interpolation space between $\mathcal{H}_X$ and $L_2$, to $\mathcal{H}_Y$. This space of operators is shown to be isomorphic to a newly defined vector-valued interpolation space. Using this isomorphism, we derive a novel and adaptive statistical learning rate for the empirical CME estimator under the misspecified setting. Our analysis reveals that our rates match the optimal $O(\log n / n)$ rates without assuming $\mathcal{H}_Y$ to be finite dimensional. We further establish a lower bound on the learning rate, which shows that the obtained upper bound is optimal.
    Neural Likelihood Surfaces for Spatial Processes with Computationally Intensive or Intractable Likelihoods. (arXiv:2305.04634v2 [stat.ME] UPDATED)
    In spatial statistics, fast and accurate parameter estimation, coupled with a reliable means of uncertainty quantification, can be challenging when fitting a spatial process to real-world data because the likelihood function might be slow to evaluate or wholly intractable. In this work, we propose using convolutional neural networks to learn the likelihood function of a spatial process. Through a specifically designed classification task, our neural network implicitly learns the likelihood function, even in situations where the exact likelihood is not explicitly available. Once trained on the classification task, our neural network is calibrated using Platt scaling which improves the accuracy of the neural likelihood surfaces. To demonstrate our approach, we compare neural likelihood surfaces and the resulting maximum likelihood estimates and approximate confidence regions with the equivalent for exact or approximate likelihood for two different spatial processes: a Gaussian process and a Brown-Resnick process which have computationally intensive and intractable likelihoods, respectively. We conclude that our method provides fast and accurate parameter estimation with a reliable method of uncertainty quantification in situations where standard methods are either undesirably slow or inaccurate. The method is applicable to any spatial process on a grid from which fast simulations are available.  ( 2 min )
    Distributional Bellman Operators over Mean Embeddings. (arXiv:2312.07358v1 [stat.ML])
    We propose a novel algorithmic framework for distributional reinforcement learning, based on learning finite-dimensional mean embeddings of return distributions. We derive several new algorithms for dynamic programming and temporal-difference learning based on this framework, provide asymptotic convergence theory, and examine the empirical performance of the algorithms on a suite of tabular tasks. Further, we show that this approach can be straightforwardly combined with deep reinforcement learning, and obtain a new deep RL agent that improves over baseline distributional approaches on the Arcade Learning Environment.  ( 2 min )
    Resetting a fixed broken ELBO. (arXiv:2312.06828v1 [stat.ML])
    Variational autoencoders (VAEs) are one class of generative probabilistic latent-variable models designed for inference based on known data. They balance reconstruction and regularizer terms. A variational approximation produces an evidence lower bound (ELBO). Multiplying the regularizer term by beta provides a beta-VAE/ELBO, improving disentanglement of the latent space. However, any beta value different than unity violates the laws of conditional probability. To provide a similarly-parameterized VAE, we develop a Renyi (versus Shannon) entropy VAE, and a variational approximation RELBO that introduces a similar parameter. The Renyi VAE has an additional Renyi regularizer-like term with a conditional distribution that is not learned. The term is evaluated essentially analytically using a Singular Value Decomposition method.  ( 2 min )
    Convex Parameter Estimation of Perturbed Multivariate Generalized Gaussian Distributions. (arXiv:2312.07479v1 [stat.ME])
    The multivariate generalized Gaussian distribution (MGGD), also known as the multivariate exponential power (MEP) distribution, is widely used in signal and image processing. However, estimating MGGD parameters, which is required in practical applications, still faces specific theoretical challenges. In particular, establishing convergence properties for the standard fixed-point approach when both the distribution mean and the scatter (or the precision) matrix are unknown is still an open problem. In robust estimation, imposing classical constraints on the precision matrix, such as sparsity, has been limited by the non-convexity of the resulting cost function. This paper tackles these issues from an optimization viewpoint by proposing a convex formulation with well-established convergence properties. We embed our analysis in a noisy scenario where robustness is induced by modelling multiplicative perturbations. The resulting framework is flexible as it combines a variety of regularizations for the precision matrix, the mean and model perturbations. This paper presents proof of the desired theoretical properties, specifies the conditions preserving these properties for different regularization choices and designs a general proximal primal-dual optimization strategy. The experiments show a more accurate precision and covariance matrix estimation with similar performance for the mean vector parameter compared to Tyler's M-estimator. In a high-dimensional setting, the proposed method outperforms the classical GLASSO, one of its robust extensions, and the regularized Tyler's estimator.  ( 2 min )
    From Complexity to Clarity: Analytical Expressions of Deep Neural Network Weights via Clifford's Geometric Algebra and Convexity. (arXiv:2309.16512v2 [cs.LG] UPDATED)
    In this paper, we introduce a novel analysis of neural networks based on geometric (Clifford) algebra and convex optimization. We show that optimal weights of deep ReLU neural networks are given by the wedge product of training samples when trained with standard regularized loss. Furthermore, the training problem reduces to convex optimization over wedge product features, which encode the geometric structure of the training dataset. This structure is given in terms of signed volumes of triangles and parallelotopes generated by data vectors. The convex problem finds a small subset of samples via $\ell_1$ regularization to discover only relevant wedge product features. Our analysis provides a novel perspective on the inner workings of deep neural networks and sheds light on the role of the hidden layers.  ( 2 min )
    Ahpatron: A New Budgeted Online Kernel Learning Machine with Tighter Mistake Bound. (arXiv:2312.07032v1 [cs.LG])
    In this paper, we study the mistake bound of online kernel learning on a budget. We propose a new budgeted online kernel learning model, called Ahpatron, which significantly improves the mistake bound of previous work and resolves the open problem posed by Dekel, Shalev-Shwartz, and Singer (2005). We first present an aggressive variant of Perceptron, named AVP, a model without budget, which uses an active updating rule. Then we design a new budget maintenance mechanism, which removes a half of examples,and projects the removed examples onto a hypothesis space spanned by the remaining examples. Ahpatron adopts the above mechanism to approximate AVP. Theoretical analyses prove that Ahpatron has tighter mistake bounds, and experimental results show that Ahpatron outperforms the state-of-the-art algorithms on the same or a smaller budget.  ( 2 min )
    MCFNet: Multi-scale Covariance Feature Fusion Network for Real-time Semantic Segmentation. (arXiv:2312.07207v1 [cs.CV])
    The low-level spatial detail information and high-level semantic abstract information are both essential to the semantic segmentation task. The features extracted by the deep network can obtain rich semantic information, while a lot of spatial information is lost. However, how to recover spatial detail information effectively and fuse it with high-level semantics has not been well addressed so far. In this paper, we propose a new architecture based on Bilateral Segmentation Network (BiseNet) called Multi-scale Covariance Feature Fusion Network (MCFNet). Specifically, this network introduces a new feature refinement module and a new feature fusion module. Furthermore, a gating unit named L-Gate is proposed to filter out invalid information and fuse multi-scale features. We evaluate our proposed model on Cityscapes, CamVid datasets and compare it with the state-of-the-art methods. Extensive experiments show that our method achieves competitive success. On Cityscapes, we achieve 75.5% mIOU with a speed of 151.3 FPS.  ( 2 min )
    Diffusion Schr\"odinger Bridge Matching. (arXiv:2303.16852v3 [stat.ML] UPDATED)
    Solving transport problems, i.e. finding a map transporting one given distribution to another, has numerous applications in machine learning. Novel mass transport methods motivated by generative modeling have recently been proposed, e.g. Denoising Diffusion Models (DDMs) and Flow Matching Models (FMMs) implement such a transport through a Stochastic Differential Equation (SDE) or an Ordinary Differential Equation (ODE). However, while it is desirable in many applications to approximate the deterministic dynamic Optimal Transport (OT) map which admits attractive properties, DDMs and FMMs are not guaranteed to provide transports close to the OT map. In contrast, Schr\"odinger bridges (SBs) compute stochastic dynamic mappings which recover entropy-regularized versions of OT. Unfortunately, existing numerical methods approximating SBs either scale poorly with dimension or accumulate errors across iterations. In this work, we introduce Iterative Markovian Fitting (IMF), a new methodology for solving SB problems, and Diffusion Schr\"odinger Bridge Matching (DSBM), a novel numerical algorithm for computing IMF iterates. DSBM significantly improves over previous SB numerics and recovers as special/limiting cases various recent transport methods. We demonstrate the performance of DSBM on a variety of problems.  ( 2 min )
    Can a Transformer Represent a Kalman Filter?. (arXiv:2312.06937v1 [cs.LG])
    Transformers are a class of autoregressive deep learning architectures which have recently achieved state-of-the-art performance in various vision, language, and robotics tasks. We revisit the problem of Kalman Filtering in linear dynamical systems and show that Transformers can approximate the Kalman Filter in a strong sense. Specifically, for any observable LTI system we construct an explicit causally-masked Transformer which implements the Kalman Filter, up to a small additive error which is bounded uniformly in time; we call our construction the Transformer Filter. Our construction is based on a two-step reduction. We first show that a softmax self-attention block can exactly represent a certain Gaussian kernel smoothing estimator. We then show that this estimator closely approximates the Kalman Filter. We also investigate how the Transformer Filter can be used for measurement-feedback control and prove that the resulting nonlinear controllers closely approximate the performance of standard optimal control policies such as the LQG controller.  ( 2 min )
    Simple diffusion: End-to-end diffusion for high resolution images. (arXiv:2301.11093v2 [cs.CV] UPDATED)
    Currently, applying diffusion models in pixel space of high resolution images is difficult. Instead, existing approaches focus on diffusion in lower dimensional spaces (latent diffusion), or have multiple super-resolution levels of generation referred to as cascades. The downside is that these approaches add additional complexity to the diffusion framework. This paper aims to improve denoising diffusion for high resolution images while keeping the model as simple as possible. The paper is centered around the research question: How can one train a standard denoising diffusion models on high resolution images, and still obtain performance comparable to these alternate approaches? The four main findings are: 1) the noise schedule should be adjusted for high resolution images, 2) It is sufficient to scale only a particular part of the architecture, 3) dropout should be added at specific locations in the architecture, and 4) downsampling is an effective strategy to avoid high resolution feature maps. Combining these simple yet effective techniques, we achieve state-of-the-art on image generation among diffusion models without sampling modifiers on ImageNet.  ( 2 min )
    Contextual Bandits with Online Neural Regression. (arXiv:2312.07145v1 [cs.LG])
    Recent works have shown a reduction from contextual bandits to online regression under a realizability assumption [Foster and Rakhlin, 2020, Foster and Krishnamurthy, 2021]. In this work, we investigate the use of neural networks for such online regression and associated Neural Contextual Bandits (NeuCBs). Using existing results for wide networks, one can readily show a ${\mathcal{O}}(\sqrt{T})$ regret for online regression with square loss, which via the reduction implies a ${\mathcal{O}}(\sqrt{K} T^{3/4})$ regret for NeuCBs. Departing from this standard approach, we first show a $\mathcal{O}(\log T)$ regret for online regression with almost convex losses that satisfy QG (Quadratic Growth) condition, a generalization of the PL (Polyak-\L ojasiewicz) condition, and that have a unique minima. Although not directly applicable to wide networks since they do not have unique minima, we show that adding a suitable small random perturbation to the network predictions surprisingly makes the loss satisfy QG with unique minima. Based on such a perturbed prediction, we show a ${\mathcal{O}}(\log T)$ regret for online regression with both squared loss and KL loss, and subsequently convert these respectively to $\tilde{\mathcal{O}}(\sqrt{KT})$ and $\tilde{\mathcal{O}}(\sqrt{KL^*} + K)$ regret for NeuCB, where $L^*$ is the loss of the best policy. Separately, we also show that existing regret bounds for NeuCBs are $\Omega(T)$ or assume i.i.d. contexts, unlike this work. Finally, our experimental results on various datasets demonstrate that our algorithms, especially the one based on KL loss, persistently outperform existing algorithms.  ( 2 min )
    Analyze the Robustness of Classifiers under Label Noise. (arXiv:2312.07271v1 [cs.LG])
    This study explores the robustness of label noise classifiers, aiming to enhance model resilience against noisy data in complex real-world scenarios. Label noise in supervised learning, characterized by erroneous or imprecise labels, significantly impairs model performance. This research focuses on the increasingly pertinent issue of label noise's impact on practical applications. Addressing the prevalent challenge of inaccurate training data labels, we integrate adversarial machine learning (AML) and importance reweighting techniques. Our approach involves employing convolutional neural networks (CNN) as the foundational model, with an emphasis on parameter adjustment for individual training samples. This strategy is designed to heighten the model's focus on samples critically influencing performance.  ( 2 min )
    Prediction De-Correlated Inference. (arXiv:2312.06478v1 [stat.ME] CROSS LISTED)
    Leveraging machine-learning methods to predict outcomes on some unlabeled datasets and then using these pseudo-outcomes in subsequent statistical inference is common in modern data analysis. Inference in this setting is often called post-prediction inference. We propose a novel, assumption-lean framework for inference under post-prediction setting, called \emph{Prediction De-Correlated inference} (PDC). Our approach can automatically adapt to any black-box machine-learning model and consistently outperforms supervised methods. The PDC framework also offers easy extensibility for accommodating multiple predictive models. Both numerical results and real-world data analysis support our theoretical results.  ( 2 min )
    Towards Optimal Sobolev Norm Rates for the Vector-Valued Regularized Least-Squares Algorithm. (arXiv:2312.07186v1 [stat.ML])
    We present the first optimal rates for infinite-dimensional vector-valued ridge regression on a continuous scale of norms that interpolate between $L_2$ and the hypothesis space, which we consider as a vector-valued reproducing kernel Hilbert space. These rates allow to treat the misspecified case in which the true regression function is not contained in the hypothesis space. We combine standard assumptions on the capacity of the hypothesis space with a novel tensor product construction of vector-valued interpolation spaces in order to characterize the smoothness of the regression function. Our upper bound not only attains the same rate as real-valued kernel ridge regression, but also removes the assumption that the target regression function is bounded. For the lower bound, we reduce the problem to the scalar setting using a projection argument. We show that these rates are optimal in most cases and independent of the dimension of the output space. We illustrate our results for the special case of vector-valued Sobolev spaces.  ( 2 min )
    Local Function Complexity for Active Learning via Mixture of Gaussian Processes. (arXiv:1902.10664v6 [cs.LG] UPDATED)
    Inhomogeneities in real-world data, e.g., due to changes in the observation noise level or variations in the structural complexity of the source function, pose a unique set of challenges for statistical inference. Accounting for them can greatly improve predictive power when physical resources or computation time is limited. In this paper, we draw on recent theoretical results on the estimation of local function complexity (LFC), derived from the domain of local polynomial smoothing (LPS), to establish a notion of local structural complexity, which is used to develop a model-agnostic active learning (AL) framework. Due to its reliance on pointwise estimates, the LPS model class is not robust and scalable concerning large input space dimensions that typically come along with real-world problems. Here, we derive and estimate the Gaussian process regression (GPR)-based analog of the LPS-based LFC and use it as a substitute in the above framework to make it robust and scalable. We assess the effectiveness of our LFC estimate in an AL application on a prototypical low-dimensional synthetic dataset, before taking on the challenging real-world task of reconstructing a quantum chemical force field for a small organic molecule and demonstrating state-of-the-art performance with a significantly reduced training demand.  ( 3 min )
    Investigation into the Training Dynamics of Learned Optimizers. (arXiv:2312.07174v1 [cs.LG])
    Optimization is an integral part of modern deep learning. Recently, the concept of learned optimizers has emerged as a way to accelerate this optimization process by replacing traditional, hand-crafted algorithms with meta-learned functions. Despite the initial promising results of these methods, issues with stability and generalization still remain, limiting their practical use. Moreover, their inner workings and behavior under different conditions are not yet fully understood, making it difficult to come up with improvements. For this reason, our work examines their optimization trajectories from the perspective of network architecture symmetries and parameter update distributions. Furthermore, by contrasting the learned optimizers with their manually designed counterparts, we identify several key insights that demonstrate how each approach can benefit from the strengths of the other.  ( 2 min )
    Bayesian Optimization with Conformal Prediction Sets. (arXiv:2210.12496v4 [cs.LG] UPDATED)
    Bayesian optimization is a coherent, ubiquitous approach to decision-making under uncertainty, with applications including multi-arm bandits, active learning, and black-box optimization. Bayesian optimization selects decisions (i.e. objective function queries) with maximal expected utility with respect to the posterior distribution of a Bayesian model, which quantifies reducible, epistemic uncertainty about query outcomes. In practice, subjectively implausible outcomes can occur regularly for two reasons: 1) model misspecification and 2) covariate shift. Conformal prediction is an uncertainty quantification method with coverage guarantees even for misspecified models and a simple mechanism to correct for covariate shift. We propose conformal Bayesian optimization, which directs queries towards regions of search space where the model predictions have guaranteed validity, and investigate its behavior on a suite of black-box optimization tasks and tabular ranking tasks. In many cases we find that query coverage can be significantly improved without harming sample-efficiency.  ( 2 min )
    Safe Multi-Task Bayesian Optimization. (arXiv:2312.07281v1 [cs.LG])
    Bayesian optimization has become a powerful tool for safe online optimization of systems, due to its high sample efficiency and noise robustness. For further speed-up reduced physical models of the system can be incorporated into the optimization to accelerate the process, since the models are able to offer an approximation of the actual system, and sampling from them is significantly cheaper. The similarity between model and reality is represented by additional hyperparameters and learned within the optimization process. Safety is an important criteria for online optimization methods like Bayesian optimization, which has been addressed by recent literature, which provide safety guarantees under the assumption of known hyperparameters. However, in practice this is not applicable. Therefore, we extend the robust Gaussian process uniform error bounds to meet the multi-task setting, which involves the calculation of a confidence region from the hyperparameter posterior distribution utilizing Markov chain Monte Carlo methods. Then, using the robust safety bounds, Bayesian optimization is applied to safely optimize the system while incorporating measurements of the models. Simulations show that the optimization can be significantly accelerated compared to other state-of-the-art safe Bayesian optimization methods depending on the fidelity of the models.  ( 2 min )
    General Tail Bounds for Non-Smooth Stochastic Mirror Descent. (arXiv:2312.07142v1 [cs.LG])
    In this paper, we provide novel tail bounds on the optimization error of Stochastic Mirror Descent for convex and Lipschitz objectives. Our analysis extends the existing tail bounds from the classical light-tailed Sub-Gaussian noise case to heavier-tailed noise regimes. We study the optimization error of the last iterate as well as the average of the iterates. We instantiate our results in two important cases: a class of noise with exponential tails and one with polynomial tails. A remarkable feature of our results is that they do not require an upper bound on the diameter of the domain. Finally, we support our theory with illustrative experiments that compare the behavior of the average of the iterates with that of the last iterate in heavy-tailed noise regimes.  ( 2 min )
    Identifying Drivers of Predictive Uncertainty using Variance Feature Attribution. (arXiv:2312.07252v1 [cs.LG])
    Explainability and uncertainty quantification are two pillars of trustable artificial intelligence. However, the reasoning behind uncertainty estimates is generally left unexplained. Identifying the drivers of uncertainty complements explanations of point predictions in recognizing potential model limitations. It facilitates the detection of oversimplification in the uncertainty estimation process. Explanations of uncertainty enhance communication and trust in decisions. They allow for verifying whether the main drivers of model uncertainty are relevant and may impact model usage. So far, the subject of explaining uncertainties has been rarely studied. The few exceptions in existing literature are tailored to Bayesian neural networks or rely heavily on technically intricate approaches, hindering their broad adoption. We propose variance feature attribution, a simple and scalable solution to explain predictive aleatoric uncertainties. First, we estimate uncertainty as predictive variance by equipping a neural network with a Gaussian output distribution by adding a variance output neuron. Thereby, we can rely on pre-trained point prediction models and fine-tune them for meaningful variance estimation. Second, we apply out-of-the-box explainers on the variance output of these models to explain the uncertainty estimation. We evaluate our approach in a synthetic setting where the data-generating process is known. We show that our method can explain uncertainty influences more reliably and faster than the established baseline CLUE. We fine-tune a state-of-the-art age regression model to estimate uncertainty and obtain attributions. Our explanations highlight potential sources of uncertainty, such as laugh lines. Variance feature attribution provides accurate explanations for uncertainty estimates with little modifications to the model architecture and low computational overhead.  ( 3 min )

  • Open

    Three MIT students selected as inaugural MIT-Pillar AI Collective Fellows
    The graduate students will aim to commercialize innovations in AI, machine learning, and data science.  ( 8 min )
    Deep neural networks show promise as models of human hearing
    Study shows computational models trained to perform auditory tasks display an internal organization similar to that of the human auditory cortex.  ( 9 min )
    Closing the design-to-manufacturing gap for optical devices
    A new method enables optical devices that more closely match their design specifications, boosting accuracy and efficiency.  ( 10 min )
  • Open

    How Is AI Used in Fraud Detection?
    The Wild West had gunslingers, bank robberies and bounties — today’s digital frontier has identity theft, credit card fraud and chargebacks. Cashing in on financial fraud has become a multibillion-dollar criminal enterprise. And generative AI in the hands of fraudsters only promises to make this more profitable. Credit card losses worldwide are expected to reach Read article >  ( 9 min )
    Pie From the Sky: Drone Startup Delivers Pizza, Meds and Side of Excitement
    Zipline isn’t just some pie-in-the-sky drone startup. The San Francisco-based company has completed more than 800,000 deliveries in seven countries since its start in 2011. It recently added services for Seattle’s Pagliacci Pizza, vitamin and supplement giant GNC, and large health systems like Intermountain Health, OhioHealth and Michigan Medicine. Zipline developed its drones — which Read article >  ( 6 min )
  • Open

    Create summaries of recordings using generative AI with Amazon Bedrock and Amazon Transcribe
    Meeting notes are a crucial part of collaboration, yet they often fall through the cracks. Between leading discussions, listening closely, and typing notes, it’s easy for key information to slip away unrecorded. Even when notes are captured, they can be disorganized or illegible, rendering them useless. In this post, we explore how to use Amazon […]  ( 8 min )
    Fine-tune Llama 2 using QLoRA and Deploy it on Amazon SageMaker with AWS Inferentia2
    In this post, we showcase fine-tuning a Llama 2 model using a Parameter-Efficient Fine-Tuning (PEFT) method and deploy the fine-tuned model on AWS Inferentia2. We use the AWS Neuron software development kit (SDK) to access the AWS Inferentia2 device and benefit from its high performance. We then use a large model inference container powered by […]  ( 10 min )
    Build an end-to-end MLOps pipeline using Amazon SageMaker Pipelines, GitHub, and GitHub Actions
    Machine learning (ML) models do not operate in isolation. To deliver value, they must integrate into existing production systems and infrastructure, which necessitates considering the entire ML lifecycle during design and development. ML operations, known as MLOps, focus on streamlining, automating, and monitoring ML models throughout their lifecycle. Building a robust MLOps pipeline demands cross-functional […]  ( 13 min )
  • Open

    Fine-grained file differences
    The diff utility compares files by lines, which is often what you’d like it to do. But sometimes you’d like more granularity. For example, supposed we want to compare two versions of Psalm 23. Here are the first three verses in the King James version: The Lord is my shepherd; I shall not want. He […] Fine-grained file differences first appeared on John D. Cook.  ( 6 min )
  • Open

    Partnership with Axel Springer to deepen beneficial use of AI in journalism
    Axel Springer is the first publishing house globally to partner with us on a deeper integration of journalism in AI technologies.  ( 2 min )
  • Open

    BioinspiredLLM: Conversational Large Language Model for the Mechanics of Biological and Bio-inspired Materials. (arXiv:2309.08788v2 [cond-mat.mtrl-sci] UPDATED)
    The study of biological materials and bio-inspired materials science is well established; however, surprisingly little knowledge has been systematically translated to engineering solutions. To accelerate discovery and guide insights, an open-source autoregressive transformer large language model (LLM), BioinspiredLLM, is reported. The model was finetuned with a corpus of over a thousand peer-reviewed articles in the field of structural biological and bio-inspired materials and can be prompted to recall information, assist with research tasks, and function as an engine for creativity. The model has proven that it is able to accurately recall information about biological materials and is further enhanced with enhanced reasoning ability, as well as with retrieval-augmented generation to incorporate new data during generation that can also help to traceback sources, update the knowledge base, and connect knowledge domains. BioinspiredLLM also has been shown to develop sound hypotheses regarding biological materials design and remarkably so for materials that have never been explicitly studied before. Lastly, the model showed impressive promise in collaborating with other generative artificial intelligence models in a workflow that can reshape the traditional materials design process. This collaborative generative artificial intelligence method can stimulate and enhance bio-inspired materials design workflows. Biological materials are at a critical intersection of multiple scientific fields and models like BioinspiredLLM help to connect knowledge domains.  ( 3 min )
    Improving Computational Efficiency for Powered Descent Guidance via Transformer-based Tight Constraint Prediction. (arXiv:2311.05135v2 [math.OC] UPDATED)
    In this work, we present Transformer-based Powered Descent Guidance (T-PDG), a scalable algorithm for reducing the computational complexity of the direct optimization formulation of the spacecraft powered descent guidance problem. T-PDG uses data from prior runs of trajectory optimization algorithms to train a transformer neural network, which accurately predicts the relationship between problem parameters and the globally optimal solution for the powered descent guidance problem. The solution is encoded as the set of tight constraints corresponding to the constrained minimum-cost trajectory and the optimal final time of landing. By leveraging the attention mechanism of transformer neural networks, large sequences of time series data can be accurately predicted when given only the spacecraft state and landing site parameters. When applied to the real problem of Mars powered descent guidance, T-PDG reduces the time for computing the 3 degree of freedom fuel-optimal trajectory, when compared to lossless convexification, from an order of 1-8 seconds to less than 500 milliseconds. A safe and optimal solution is guaranteed by including a feasibility check in T-PDG before returning the final trajectory.  ( 2 min )
    Forward Invariance in Neural Network Controlled Systems. (arXiv:2309.09043v2 [eess.SY] UPDATED)
    We present a framework based on interval analysis and monotone systems theory to certify and search for forward invariant sets in nonlinear systems with neural network controllers. The framework (i) constructs localized first-order inclusion functions for the closed-loop system using Jacobian bounds and existing neural network verification tools; (ii) builds a dynamical embedding system where its evaluation along a single trajectory directly corresponds with a nested family of hyper-rectangles provably converging to an attractive set of the original system; (iii) utilizes linear transformations to build families of nested paralleletopes with the same properties. The framework is automated in Python using our interval analysis toolbox $\texttt{npinterval}$, in conjunction with the symbolic arithmetic toolbox $\texttt{sympy}$, demonstrated on an $8$-dimensional leader-follower system.  ( 2 min )
    Variational Automatic Curriculum Learning for Sparse-Reward Cooperative Multi-Agent Problems. (arXiv:2111.04613v2 [cs.LG] CROSS LISTED)
    We introduce a curriculum learning algorithm, Variational Automatic Curriculum Learning (VACL), for solving challenging goal-conditioned cooperative multi-agent reinforcement learning problems. We motivate our paradigm through a variational perspective, where the learning objective can be decomposed into two terms: task learning on the current task distribution, and curriculum update to a new task distribution. Local optimization over the second term suggests that the curriculum should gradually expand the training tasks from easy to hard. Our VACL algorithm implements this variational paradigm with two practical components, task expansion and entity progression, which produces training curricula over both the task configurations as well as the number of entities in the task. Experiment results show that VACL solves a collection of sparse-reward problems with a large number of agents. Particularly, using a single desktop machine, VACL achieves 98% coverage rate with 100 agents in the simple-spread benchmark and reproduces the ramp-use behavior originally shown in OpenAI's hide-and-seek project. Our project website is at https://sites.google.com/view/vacl-neurips-2021.  ( 2 min )
    NAS-NeRF: Generative Neural Architecture Search for Neural Radiance Fields. (arXiv:2309.14293v3 [cs.CV] UPDATED)
    Neural radiance fields (NeRFs) enable high-quality novel view synthesis, but their high computational complexity limits deployability. While existing neural-based solutions strive for efficiency, they use one-size-fits-all architectures regardless of scene complexity. The same architecture may be unnecessarily large for simple scenes but insufficient for complex ones. Thus, there is a need to dynamically optimize the neural network component of NeRFs to achieve a balance between computational complexity and specific targets for synthesis quality. We introduce NAS-NeRF, a generative neural architecture search strategy that generates compact, scene-specialized NeRF architectures by balancing architecture complexity and target synthesis quality metrics. Our method incorporates constraints on target metrics and budgets to guide the search towards architectures tailored for each scene. Experiments on the Blender synthetic dataset show the proposed NAS-NeRF can generate architectures up to 5.74$\times$ smaller, with 4.19$\times$ fewer FLOPs, and 1.93$\times$ faster on a GPU than baseline NeRFs, without suffering a drop in SSIM. Furthermore, we illustrate that NAS-NeRF can also achieve architectures up to 23$\times$ smaller, with 22$\times$ fewer FLOPs, and 4.7$\times$ faster than baseline NeRFs with only a 5.3% average SSIM drop. Our source code is also made publicly available at https://saeejithnair.github.io/NAS-NeRF.  ( 2 min )
    Self-Supervised Pre-Training for Precipitation Post-Processor. (arXiv:2310.20187v2 [cs.LG] UPDATED)
    Obtaining a sufficient forecast lead time for local precipitation is essential in preventing hazardous weather events. Global warming-induced climate change increases the challenge of accurately predicting severe precipitation events, such as heavy rainfall. In this paper, we propose a deep learning-based precipitation post-processor for numerical weather prediction (NWP) models. The precipitation post-processor consists of (i) employing self-supervised pre-training, where the parameters of the encoder are pre-trained on the reconstruction of the masked variables of the atmospheric physics domain; and (ii) conducting transfer learning on precipitation segmentation tasks (the target domain) from the pre-trained encoder. In addition, we introduced a heuristic labeling approach to effectively train class-imbalanced datasets. Our experiments on precipitation correction for regional NWP show that the proposed method outperforms other approaches.  ( 2 min )
    Quantifying & Modeling Multimodal Interactions: An Information Decomposition Framework. (arXiv:2302.12247v5 [cs.LG] UPDATED)
    The recent explosion of interest in multimodal applications has resulted in a wide selection of datasets and methods for representing and integrating information from different modalities. Despite these empirical advances, there remain fundamental research questions: How can we quantify the interactions that are necessary to solve a multimodal task? Subsequently, what are the most suitable multimodal models to capture these interactions? To answer these questions, we propose an information-theoretic approach to quantify the degree of redundancy, uniqueness, and synergy relating input modalities with an output task. We term these three measures as the PID statistics of a multimodal distribution (or PID for short), and introduce two new estimators for these PID statistics that scale to high-dimensional distributions. To validate PID estimation, we conduct extensive experiments on both synthetic datasets where the PID is known and on large-scale multimodal benchmarks where PID estimations are compared with human annotations. Finally, we demonstrate their usefulness in (1) quantifying interactions within multimodal datasets, (2) quantifying interactions captured by multimodal models, (3) principled approaches for model selection, and (4) three real-world case studies engaging with domain experts in pathology, mood prediction, and robotic perception where our framework helps to recommend strong multimodal models for each application.  ( 3 min )
    HGPROMPT: Bridging Homogeneous and Heterogeneous Graphs for Few-shot Prompt Learning. (arXiv:2312.01878v2 [cs.LG] UPDATED)
    Graph neural networks (GNNs) and heterogeneous graph neural networks (HGNNs) are prominent techniques for homogeneous and heterogeneous graph representation learning, yet their performance in an end-to-end supervised framework greatly depends on the availability of task-specific supervision. To reduce the labeling cost, pre-training on self-supervised pretext tasks has become a popular paradigm,but there is often a gap between the pre-trained model and downstream tasks, stemming from the divergence in their objectives. To bridge the gap, prompt learning has risen as a promising direction especially in few-shot settings, without the need to fully fine-tune the pre-trained model. While there has been some early exploration of prompt-based learning on graphs, they primarily deal with homogeneous graphs, ignoring the heterogeneous graphs that are prevalent in downstream applications. In this paper, we propose HGPROMPT, a novel pre-training and prompting framework to unify not only pre-training and downstream tasks but also homogeneous and heterogeneous graphs via a dual-template design. Moreover, we propose dual-prompt in HGPROMPT to assist a downstream task in locating the most relevant prior to bridge the gaps caused by not only feature variations but also heterogeneity differences across tasks. Finally, we thoroughly evaluate and analyze HGPROMPT through extensive experiments on three public datasets.  ( 2 min )
    Learning to be Simple. (arXiv:2312.05299v1 [cs.LG])
    In this work we employ machine learning to understand structured mathematical data involving finite groups and derive a theorem about necessary properties of generators of finite simple groups. We create a database of all 2-generated subgroups of the symmetric group on n-objects and conduct a classification of finite simple groups among them using shallow feed-forward neural networks. We show that this neural network classifier can decipher the property of simplicity with varying accuracies depending on the features. Our neural network model leads to a natural conjecture concerning the generators of a finite simple group. We subsequently prove this conjecture. This new toy theorem comments on the necessary properties of generators of finite simple groups. We show this explicitly for a class of sporadic groups for which the result holds. Our work further makes the case for a machine motivated study of algebraic structures in pure mathematics and highlights the possibility of generating new conjectures and theorems in mathematics with the aid of machine learning.  ( 2 min )
    Multi-dimensional Fair Federated Learning. (arXiv:2312.05551v1 [cs.LG])
    Federated learning (FL) has emerged as a promising collaborative and secure paradigm for training a model from decentralized data without compromising privacy. Group fairness and client fairness are two dimensions of fairness that are important for FL. Standard FL can result in disproportionate disadvantages for certain clients, and it still faces the challenge of treating different groups equitably in a population. The problem of privately training fair FL models without compromising the generalization capability of disadvantaged clients remains open. In this paper, we propose a method, called mFairFL, to address this problem and achieve group fairness and client fairness simultaneously. mFairFL leverages differential multipliers to construct an optimization objective for empirical risk minimization with fairness constraints. Before aggregating locally trained models, it first detects conflicts among their gradients, and then iteratively curates the direction and magnitude of gradients to mitigate these conflicts. Theoretical analysis proves mFairFL facilitates the fairness in model development. The experimental evaluations based on three benchmark datasets show significant advantages of mFairFL compared to seven state-of-the-art baselines.
    Consistency Models for Scalable and Fast Simulation-Based Inference. (arXiv:2312.05440v1 [cs.LG])
    Simulation-based inference (SBI) is constantly in search of more expressive algorithms for accurately inferring the parameters of complex models from noisy data. We present consistency models for neural posterior estimation (CMPE), a new free-form conditional sampler for scalable, fast, and amortized SBI with generative neural networks. CMPE combines the advantages of normalizing flows and flow matching methods into a single generative architecture: It essentially distills a continuous probability flow and enables rapid few-shot inference with an unconstrained architecture that can be tailored to the structure of the estimation problem. Our empirical evaluation demonstrates that CMPE not only outperforms current state-of-the-art algorithms on three hard low-dimensional problems, but also achieves competitive performance in a high-dimensional Bayesian denoising experiment and in estimating a computationally demanding multi-scale model of tumor spheroid growth.
    GANs and Closures: Micro-Macro Consistency in Multiscale Modeling. (arXiv:2208.10715v4 [cs.LG] UPDATED)
    Sampling the phase space of molecular systems -- and, more generally, of complex systems effectively modeled by stochastic differential equations -- is a crucial modeling step in many fields, from protein folding to materials discovery. These problems are often multiscale in nature: they can be described in terms of low-dimensional effective free energy surfaces parametrized by a small number of "slow" reaction coordinates; the remaining "fast" degrees of freedom populate an equilibrium measure on the reaction coordinate values. Sampling procedures for such problems are used to estimate effective free energy differences as well as ensemble averages with respect to the conditional equilibrium distributions; these latter averages lead to closures for effective reduced dynamic models. Over the years, enhanced sampling techniques coupled with molecular simulation have been developed. An intriguing analogy arises with the field of Machine Learning (ML), where Generative Adversarial Networks can produce high dimensional samples from low dimensional probability distributions. This sample generation returns plausible high dimensional space realizations of a model state, from information about its low-dimensional representation. In this work, we present an approach that couples physics-based simulations and biasing methods for sampling conditional distributions with ML-based conditional generative adversarial networks for the same task. The "coarse descriptors" on which we condition the fine scale realizations can either be known a priori, or learned through nonlinear dimensionality reduction. We suggest that this may bring out the best features of both approaches: we demonstrate that a framework that couples cGANs with physics-based enhanced sampling techniques can improve multiscale SDE dynamical systems sampling, and even shows promise for systems of increasing complexity.
    Runtime Stealthy Perception Attacks against DNN-based Adaptive Cruise Control Systems. (arXiv:2307.08939v2 [cs.CR] UPDATED)
    Adaptive Cruise Control (ACC) is a widely used driver assistance technology for maintaining the desired speed and safe distance to the leading vehicle. This paper evaluates the security of the deep neural network (DNN) based ACC systems under runtime stealthy perception attacks that strategically inject perturbations into camera data to cause forward collisions. We present a context-aware strategy for the selection of the most critical times for triggering the attacks and a novel optimization-based method for the adaptive generation of image perturbations at runtime. We evaluate the effectiveness of the proposed attack using a publicly available driving dataset, an actual vehicle, and a realistic simulation platform with the control software from a production ACC system, a physical-world driving simulator, and interventions by the human driver and safety features such as Advanced Emergency Braking System (AEBS). Experimental results show that the proposed attack achieves 142.9 times higher success rate in causing hazards and 89.6% higher evasion rate than baselines while being stealthy and robust to real-world factors and dynamic changes in the environment. This study highlights the role of human drivers and basic safety mechanisms in preventing attacks.
    A Survey of Deep Causal Models and Their Industrial Applications. (arXiv:2209.08860v5 [stat.ML] UPDATED)
    The notion of causality assumes a paramount position within the realm of human cognition. Over the past few decades, there has been significant advancement in the domain of causal effect estimation across various disciplines, including but not limited to computer science, medicine, economics, and industrial applications. Given the continued advancements in deep learning methodologies, there has been a notable surge in its utilization for the estimation of causal effects using counterfactual data. Typically, deep causal models map the characteristics of covariates to a representation space and then design various objective functions to estimate counterfactual data unbiasedly. Different from the existing surveys on causal models in machine learning, this review mainly focuses on the overview of the deep causal models, and its core contributions are as follows: 1) we cast insight on a comprehensive overview of deep causal models from both timeline of development and method classification perspectives; 2) we outline some typical applications of causal effect estimation to industry; 3) we also endeavor to present a detailed categorization and analysis on relevant datasets, source codes and experiments.
    Factorized Diffusion Architectures for Unsupervised Image Generation and Segmentation. (arXiv:2309.15726v2 [cs.CV] UPDATED)
    We develop a neural network architecture which, trained in an unsupervised manner as a denoising diffusion model, simultaneously learns to both generate and segment images. Learning is driven entirely by the denoising diffusion objective, without any annotation or prior knowledge about regions during training. A computational bottleneck, built into the neural architecture, encourages the denoising network to partition an input into regions, denoise them in parallel, and combine the results. Our trained model generates both synthetic images and, by simple examination of its internal predicted partitions, a semantic segmentation of those images. Without any finetuning, we directly apply our unsupervised model to the downstream task of segmenting real images via noising and subsequently denoising them. Experiments demonstrate that our model achieves accurate unsupervised image segmentation and high-quality synthetic image generation across multiple datasets.
    Bidirectional Attention as a Mixture of Continuous Word Experts. (arXiv:2307.04057v2 [cs.CL] UPDATED)
    Bidirectional attention $\unicode{x2013}$ composed of self-attention with positional encodings and the masked language model (MLM) objective $\unicode{x2013}$ has emerged as a key component of modern large language models (LLMs). Despite its empirical success, few studies have examined its statistical underpinnings: What statistical model is bidirectional attention implicitly fitting? What sets it apart from its non-attention predecessors? We explore these questions in this paper. The key observation is that fitting a single-layer single-head bidirectional attention, upon reparameterization, is equivalent to fitting a continuous bag of words (CBOW) model with mixture-of-experts (MoE) weights. Further, bidirectional attention with multiple heads and multiple layers is equivalent to stacked MoEs and a mixture of MoEs, respectively. This statistical viewpoint reveals the distinct use of MoE in bidirectional attention, which aligns with its practical effectiveness in handling heterogeneous data. It also suggests an immediate extension to categorical tabular data, if we view each word location in a sentence as a tabular feature. Across empirical studies, we find that this extension outperforms existing tabular extensions of transformers in out-of-distribution (OOD) generalization. Finally, this statistical perspective of bidirectional attention enables us to theoretically characterize when linear word analogies are present in its word embeddings. These analyses show that bidirectional attention can require much stronger assumptions to exhibit linear word analogies than its non-attention predecessors.
    Classification for everyone : Building geography agnostic models for fairer recognition. (arXiv:2312.02957v2 [cs.CV] UPDATED)
    In this paper, we analyze different methods to mitigate inherent geographical biases present in state of the art image classification models. We first quantitatively present this bias in two datasets - The Dollar Street Dataset and ImageNet, using images with location information. We then present different methods which can be employed to reduce this bias. Finally, we analyze the effectiveness of the different techniques on making these models more robust to geographical locations of the images.
    Exact Recovery for System Identification with More Corrupt Data than Clean Data. (arXiv:2305.10506v2 [cs.LG] UPDATED)
    In this paper, we study the system identification problem for linear discrete-time systems under adversaries and analyze two lasso-type estimators. We study both asymptotic and non-asymptotic properties of these estimators in two separate scenarios, corresponding to deterministic and stochastic models for the attack times. Since the samples collected from the system are correlated, the existing results on lasso are not applicable. We show that when the system is stable and the attacks are injected periodically, the sample complexity for the exact recovery of the system dynamics is O(n), where n is the dimension of the states. When the adversarial attacks occur at each time instance with probability p, the required sample complexity for the exact recovery scales as O(\log(n)p/(1-p)^2). This result implies the almost sure convergence to the true system dynamics under the asymptotic regime. As a by-product, even when more than half of the data is compromised, our estimators still learn the system correctly. This paper provides the first mathematical guarantee in the literature on learning from correlated data for dynamical systems in the case when there is less clean data than corrupt data.
    Robot Synesthesia: In-Hand Manipulation with Visuotactile Sensing. (arXiv:2312.01853v2 [cs.RO] UPDATED)
    Executing contact-rich manipulation tasks necessitates the fusion of tactile and visual feedback. However, the distinct nature of these modalities poses significant challenges. In this paper, we introduce a system that leverages visual and tactile sensory inputs to enable dexterous in-hand manipulation. Specifically, we propose Robot Synesthesia, a novel point cloud-based tactile representation inspired by human tactile-visual synesthesia. This approach allows for the simultaneous and seamless integration of both sensory inputs, offering richer spatial information and facilitating better reasoning about robot actions. The method, trained in a simulated environment and then deployed to a real robot, is applicable to various in-hand object rotation tasks. Comprehensive ablations are performed on how the integration of vision and touch can improve reinforcement learning and Sim2Real performance. Our project page is available at https://yingyuan0414.github.io/visuotactile/ .
    Partial-label Learning with Mixed Closed-set and Open-set Out-of-candidate Examples. (arXiv:2307.00553v2 [cs.LG] UPDATED)
    Partial-label learning (PLL) relies on a key assumption that the true label of each training example must be in the candidate label set. This restrictive assumption may be violated in complex real-world scenarios, and thus the true label of some collected examples could be unexpectedly outside the assigned candidate label set. In this paper, we term the examples whose true label is outside the candidate label set OOC (out-of-candidate) examples, and pioneer a new PLL study to learn with OOC examples. We consider two types of OOC examples in reality, i.e., the closed-set/open-set OOC examples whose true label is inside/outside the known label space. To solve this new PLL problem, we first calculate the wooden cross-entropy loss from candidate and non-candidate labels respectively, and dynamically differentiate the two types of OOC examples based on specially designed criteria. Then, for closed-set OOC examples, we conduct reversed label disambiguation in the non-candidate label set; for open-set OOC examples, we leverage them for training by utilizing an effective regularization strategy that dynamically assigns random candidate labels from the candidate label set. In this way, the two types of OOC examples can be differentiated and further leveraged for model training. Extensive experiments demonstrate that our proposed method outperforms state-of-the-art PLL methods.
    Meta-Value Learning: a General Framework for Learning with Learning Awareness. (arXiv:2307.08863v3 [cs.LG] UPDATED)
    Gradient-based learning in multi-agent systems is difficult because the gradient derives from a first-order model which does not account for the interaction between agents' learning processes. LOLA (arXiv:1709.04326) accounts for this by differentiating through one step of optimization. We propose to judge joint policies by their long-term prospects as measured by the meta-value, a discounted sum over the returns of future optimization iterates. We apply a form of Q-learning to the meta-game of optimization, in a way that avoids the need to explicitly represent the continuous action space of policy updates. The resulting method, MeVa, is consistent and far-sighted, and does not require REINFORCE estimators. We analyze the behavior of our method on a toy game and compare to prior work on repeated matrix games.
    The Role of Entropy and Reconstruction in Multi-View Self-Supervised Learning. (arXiv:2307.10907v2 [cs.LG] UPDATED)
    The mechanisms behind the success of multi-view self-supervised learning (MVSSL) are not yet fully understood. Contrastive MVSSL methods have been studied through the lens of InfoNCE, a lower bound of the Mutual Information (MI). However, the relation between other MVSSL methods and MI remains unclear. We consider a different lower bound on the MI consisting of an entropy and a reconstruction term (ER), and analyze the main MVSSL families through its lens. Through this ER bound, we show that clustering-based methods such as DeepCluster and SwAV maximize the MI. We also re-interpret the mechanisms of distillation-based approaches such as BYOL and DINO, showing that they explicitly maximize the reconstruction term and implicitly encourage a stable entropy, and we confirm this empirically. We show that replacing the objectives of common MVSSL methods with this ER bound achieves competitive performance, while making them stable when training with smaller batch sizes or smaller exponential moving average (EMA) coefficients. Github repo: https://github.com/apple/ml-entropy-reconstruction.
    AdaptCL: Adaptive Continual Learning for Tackling Heterogeneity in Sequential Datasets. (arXiv:2207.11005v3 [cs.LG] UPDATED)
    Managing heterogeneous datasets that vary in complexity, size, and similarity in continual learning presents a significant challenge. Task-agnostic continual learning is necessary to address this challenge, as datasets with varying similarity pose difficulties in distinguishing task boundaries. Conventional task-agnostic continual learning practices typically rely on rehearsal or regularization techniques. However, rehearsal methods may struggle with varying dataset sizes and regulating the importance of old and new data due to rigid buffer sizes. Meanwhile, regularization methods apply generic constraints to promote generalization but can hinder performance when dealing with dissimilar datasets lacking shared features, necessitating a more adaptive approach. In this paper, we propose AdaptCL, a novel adaptive continual learning method to tackle heterogeneity in sequential datasets. AdaptCL employs fine-grained data-driven pruning to adapt to variations in data complexity and dataset size. It also utilizes task-agnostic parameter isolation to mitigate the impact of varying degrees of catastrophic forgetting caused by differences in data similarity. Through a two-pronged case study approach, we evaluate AdaptCL on both datasets of MNIST Variants and DomainNet, as well as datasets from different domains. The latter include both large-scale, diverse binary-class datasets and few-shot, multi-class datasets. Across all these scenarios, AdaptCL consistently exhibits robust performance, demonstrating its flexibility and general applicability in handling heterogeneous datasets.
    L2MAC: Large Language Model Automatic Computer for Unbounded Code Generation. (arXiv:2310.02003v2 [cs.SE] UPDATED)
    Transformer-based large language models (LLMs) are constrained by the fixed context window of the underlying transformer architecture, hindering their ability to produce long and logically consistent code. Memory-augmented LLMs are a promising solution, but current approaches cannot handle long code generation tasks since they (1) only focus on reading memory and reduce its evolution to the concatenation of new memories or (2) use very specialized memories that cannot adapt to other domains. This paper presents L2MAC, the first practical LLM-based stored-program automatic computer for long and consistent code generation. Its memory has two components: the instruction registry, which is populated with a prompt program to solve the user-given task, and a file store, which will contain the final and intermediate outputs. Each instruction is executed by a separate LLM instance, whose context is managed by a control unit capable of precise memory reading and writing to ensure effective interaction with the file store. These components enable L2MAC to generate virtually unbounded code structures, bypassing the constraints of the finite context window while producing code that fulfills complex user-specified requirements. We empirically show that L2MAC succeeds in generating large code bases for system design tasks where other coding methods fall short in implementing user requirements and provide insight into the reasons for this performance gap.
    Quantum-Enhanced Forecasting: Leveraging Quantum Gramian Angular Field and CNNs for Stock Return Predictions. (arXiv:2310.07427v3 [cs.LG] UPDATED)
    We propose a time series forecasting method named Quantum Gramian Angular Field (QGAF). This approach merges the advantages of quantum computing technology with deep learning, aiming to enhance the precision of time series classification and forecasting. We successfully transformed stock return time series data into two-dimensional images suitable for Convolutional Neural Network (CNN) training by designing specific quantum circuits. Distinct from the classical Gramian Angular Field (GAF) approach, QGAF's uniqueness lies in eliminating the need for data normalization and inverse cosine calculations, simplifying the transformation process from time series data to two-dimensional images. To validate the effectiveness of this method, we conducted experiments on datasets from three major stock markets: the China A-share market, the Hong Kong stock market, and the US stock market. Experimental results revealed that compared to the classical GAF method, the QGAF approach significantly improved time series prediction accuracy, reducing prediction errors by an average of 25% for Mean Absolute Error (MAE) and 48% for Mean Squared Error (MSE). This research confirms the potential and promising prospects of integrating quantum computing with deep learning techniques in financial time series forecasting.
    Improving Parameter Training for VQEs by Sequential Hamiltonian Assembly. (arXiv:2312.05552v1 [quant-ph])
    A central challenge in quantum machine learning is the design and training of parameterized quantum circuits (PQCs). Similar to deep learning, vanishing gradients pose immense problems in the trainability of PQCs, which have been shown to arise from a multitude of sources. One such cause are non-local loss functions, that demand the measurement of a large subset of involved qubits. To facilitate the parameter training for quantum applications using global loss functions, we propose a Sequential Hamiltonian Assembly, which iteratively approximates the loss function using local components. Aiming for a prove of principle, we evaluate our approach using Graph Coloring problem with a Varational Quantum Eigensolver (VQE). Simulation results show, that our approach outperforms conventional parameter training by 29.99% and the empirical state of the art, Layerwise Learning, by 5.12% in the mean accuracy. This paves the way towards locality-aware learning techniques, allowing to evade vanishing gradients for a large class of practically relevant problems.
    Estimating Shape Distances on Neural Representations with Limited Samples. (arXiv:2310.05742v2 [stat.ML] UPDATED)
    Measuring geometric similarity between high-dimensional network representations is a topic of longstanding interest to neuroscience and deep learning. Although many methods have been proposed, only a few works have rigorously analyzed their statistical efficiency or quantified estimator uncertainty in data-limited regimes. Here, we derive upper and lower bounds on the worst-case convergence of standard estimators of shape distance$\unicode{x2014}$a measure of representational dissimilarity proposed by Williams et al. (2021).These bounds reveal the challenging nature of the problem in high-dimensional feature spaces. To overcome these challenges, we introduce a new method-of-moments estimator with a tunable bias-variance tradeoff. We show that this estimator achieves substantially lower bias than standard estimators in simulation and on neural data, particularly in high-dimensional settings. Thus, we lay the foundation for a rigorous statistical theory for high-dimensional shape analysis, and we contribute a new estimation method that is well-suited to practical scientific settings.
    Label Augmentation Method for Medical Landmark Detection in Hip Radiograph Images. (arXiv:2309.16066v2 [cs.LG] UPDATED)
    This work reports the empirical performance of an automated medical landmark detection method for predict clinical markers in hip radiograph images. Notably, the detection method was trained using a label-only augmentation scheme; our results indicate that this form of augmentation outperforms traditional data augmentation and produces highly sample efficient estimators. We train a generic U-Net-based architecture under a curriculum consisting of two phases: initially relaxing the landmarking task by enlarging the label points to regions, then gradually eroding these label regions back to the base task. We measure the benefits of this approach on six datasets of radiographs with gold-standard expert annotations.
    JADE: A Linguistics-based Safety Evaluation Platform for Large Language Models. (arXiv:2311.00286v3 [cs.CL] UPDATED)
    In this paper, we present JADE, a targeted linguistic fuzzing platform which strengthens the linguistic complexity of seed questions to simultaneously and consistently break a wide range of widely-used LLMs categorized in three groups: eight open-sourced Chinese, six commercial Chinese and four commercial English LLMs. JADE generates three safety benchmarks for the three groups of LLMs, which contain unsafe questions that are highly threatening: the questions simultaneously trigger harmful generation of multiple LLMs, with an average unsafe generation ratio of $70\%$ (please see the table below), while are still natural questions, fluent and preserving the core unsafe semantics. We release the benchmark demos generated for commercial English LLMs and open-sourced English LLMs in the following link: https://github.com/whitzard-ai/jade-db. For readers who are interested in evaluating on more questions generated by JADE, please contact us. JADE is based on Noam Chomsky's seminal theory of transformational-generative grammar. Given a seed question with unsafe intention, JADE invokes a sequence of generative and transformational rules to increment the complexity of the syntactic structure of the original question, until the safety guardrail is broken. Our key insight is: Due to the complexity of human language, most of the current best LLMs can hardly recognize the invariant evil from the infinite number of different syntactic structures which form an unbound example space that can never be fully covered. Technically, the generative/transformative rules are constructed by native speakers of the languages, and, once developed, can be used to automatically grow and transform the parse tree of a given question, until the guardrail is broken. For more evaluation results and demo, please check our website: https://whitzard-ai.github.io/jade.html.
    LogGPT: Log Anomaly Detection via GPT. (arXiv:2309.14482v2 [cs.LG] UPDATED)
    Detecting system anomalies based on log data is important for ensuring the security and reliability of computer systems. Recently, deep learning models have been widely used for log anomaly detection. The core idea is to model the log sequences as natural language and adopt deep sequential models, such as LSTM or Transformer, to encode the normal patterns in log sequences via language modeling. However, there is a gap between language modeling and anomaly detection as the objective of training a sequential model via a language modeling loss is not directly related to anomaly detection. To fill up the gap, we propose LogGPT, a novel framework that employs GPT for log anomaly detection. LogGPT is first trained to predict the next log entry based on the preceding sequence. To further enhance the performance of LogGPT, we propose a novel reinforcement learning strategy to finetune the model specifically for the log anomaly detection task. The experimental results on three datasets show that LogGPT significantly outperforms existing state-of-the-art approaches.
    Multi-Domain Causal Representation Learning via Weak Distributional Invariances. (arXiv:2310.02854v3 [cs.LG] UPDATED)
    Causal representation learning has emerged as the center of action in causal machine learning research. In particular, multi-domain datasets present a natural opportunity for showcasing the advantages of causal representation learning over standard unsupervised representation learning. While recent works have taken crucial steps towards learning causal representations, they often lack applicability to multi-domain datasets due to over-simplifying assumptions about the data; e.g. each domain comes from a different single-node perfect intervention. In this work, we relax these assumptions and capitalize on the following observation: there often exists a subset of latents whose certain distributional properties (e.g., support, variance) remain stable across domains; this property holds when, for example, each domain comes from a multi-node imperfect intervention. Leveraging this observation, we show that autoencoders that incorporate such invariances can provably identify the stable set of latents from the rest across different settings.
    MGAS: Multi-Granularity Architecture Search for Trade-Off Between Model Effectiveness and Efficiency. (arXiv:2310.15074v3 [cs.LG] UPDATED)
    Neural architecture search (NAS) has gained significant traction in automating the design of neural networks. To reduce the time cost, differentiable architecture search (DAS) transforms the traditional paradigm of discrete candidate sampling and evaluation into that of differentiable super-net optimization and discretization. However, existing DAS methods fail to trade off between model performance and model size. They either only conduct coarse-grained operation-level search, which results in redundant model parameters, or restrictively explore fine-grained filter-level and weight-level units with pre-defined remaining ratios, suffering from excessive pruning problem. Additionally, these methods compromise search quality to save memory during the search process. To tackle these issues, we introduce multi-granularity architecture search (MGAS), a unified framework which aims to discover both effective and efficient neural networks by comprehensively yet memory-efficiently exploring the multi-granularity search space. Specifically, we improve the existing DAS methods in two aspects. First, we balance the model unit numbers at different granularity levels with adaptive pruning. We learn discretization functions specific to each granularity level to adaptively determine the unit remaining ratio according to the evolving architecture. Second, we reduce the memory consumption without degrading the search quality using multi-stage search. We break down the super-net optimization and discretization into multiple sub-net stages, and perform progressive re-evaluation to allow for re-pruning and regrowing of previous units during subsequent stages, compensating for potential bias. Extensive experiments on CIFAR-10, CIFAR-100 and ImageNet demonstrate that MGAS outperforms other state-of-the-art methods in achieving a better trade-off between model performance and model size.
    Localisation of Regularised and Multiview Support Vector Machine Learning. (arXiv:2304.05655v2 [math.FA] UPDATED)
    We prove a few representer theorems for a localised version of the regularised and multiview support vector machine learning problem introduced by H.Q.~Minh, L.~Bazzani, and V.~Murino, \textit{Journal of Machine Learning Research}, \textbf{17}(2016) 1--72, that involves operator valued positive semidefinite kernels and their reproducing kernel Hilbert spaces. The results concern general cases when convex or nonconvex loss functions and finite or infinite dimensional input spaces are considered. We show that the general framework allows infinite dimensional input spaces and nonconvex loss functions for some special cases, in particular in case the loss functions are G\^ateaux differentiable. Detailed calculations are provided for the exponential least squares loss functions that leads to partially nonlinear problems.
    SAM as an Optimal Relaxation of Bayes. (arXiv:2210.01620v3 [cs.LG] UPDATED)
    Sharpness-aware minimization (SAM) and related adversarial deep-learning methods can drastically improve generalization, but their underlying mechanisms are not yet fully understood. Here, we establish SAM as a relaxation of the Bayes objective where the expected negative-loss is replaced by the optimal convex lower bound, obtained by using the so-called Fenchel biconjugate. The connection enables a new Adam-like extension of SAM to automatically obtain reasonable uncertainty estimates, while sometimes also improving its accuracy. By connecting adversarial and Bayesian methods, our work opens a new path to robustness.
    The Shaped Transformer: Attention Models in the Infinite Depth-and-Width Limit. (arXiv:2306.17759v2 [stat.ML] UPDATED)
    In deep learning theory, the covariance matrix of the representations serves as a proxy to examine the network's trainability. Motivated by the success of Transformers, we study the covariance matrix of a modified Softmax-based attention model with skip connections in the proportional limit of infinite-depth-and-width. We show that at initialization the limiting distribution can be described by a stochastic differential equation (SDE) indexed by the depth-to-width ratio. To achieve a well-defined stochastic limit, the Transformer's attention mechanism is modified by centering the Softmax output at identity, and scaling the Softmax logits by a width-dependent temperature parameter. We examine the stability of the network through the corresponding SDE, showing how the scale of both the drift and diffusion can be elegantly controlled with the aid of residual connections. The existence of a stable SDE implies that the covariance structure is well-behaved, even for very large depth and width, thus preventing the notorious issues of rank degeneracy in deep attention models. Finally, we show, through simulations, that the SDE provides a surprisingly good description of the corresponding finite-size model. We coin the name shaped Transformer for these architectural modifications.
    How the level sampling process impacts zero-shot generalisation in deep reinforcement learning. (arXiv:2310.03494v2 [cs.LG] UPDATED)
    A key limitation preventing the wider adoption of autonomous agents trained via deep reinforcement learning (RL) is their limited ability to generalise to new environments, even when these share similar characteristics with environments encountered during training. In this work, we investigate how a non-uniform sampling strategy of individual environment instances, or levels, affects the zero-shot generalisation (ZSG) ability of RL agents, considering two failure modes: overfitting and over-generalisation. As a first step, we measure the mutual information (MI) between the agent's internal representation and the set of training levels, which we find to be well-correlated to instance overfitting. In contrast to uniform sampling, adaptive sampling strategies prioritising levels based on their value loss are more effective at maintaining lower MI, which provides a novel theoretical justification for this class of techniques. We then turn our attention to unsupervised environment design (UED) methods, which adaptively generate new training levels and minimise MI more effectively than methods sampling from a fixed set. However, we find UED methods significantly shift the training distribution, resulting in over-generalisation and worse ZSG performance over the distribution of interest. To prevent both instance overfitting and over-generalisation, we introduce self-supervised environment design (SSED). SSED generates levels using a variational autoencoder, effectively reducing MI while minimising the shift with the distribution of interest, and leads to statistically significant improvements in ZSG over fixed-set level sampling strategies and UED methods.
    Generation of 3D Molecules in Pockets via Language Model. (arXiv:2305.10133v3 [cs.LG] UPDATED)
    Generative models for molecules based on sequential line notation (e.g. SMILES) or graph representation have attracted an increasing interest in the field of structure-based drug design, but they struggle to capture important 3D spatial interactions and often produce undesirable molecular structures. To address these challenges, we introduce Lingo3DMol, a pocket-based 3D molecule generation method that combines language models and geometric deep learning technology. A new molecular representation, fragment-based SMILES with local and global coordinates, was developed to assist the model in learning molecular topologies and atomic spatial positions. Additionally, we trained a separate noncovalent interaction predictor to provide essential binding pattern information for the generative model. Lingo3DMol can efficiently traverse drug-like chemical spaces, preventing the formation of unusual structures. The Directory of Useful Decoys-Enhanced (DUD-E) dataset was used for evaluation. Lingo3DMol outperformed state-of-the-art methods in terms of drug-likeness, synthetic accessibility, pocket binding mode, and molecule generation speed.
    Ignorance is Bliss: Robust Control via Information Gating. (arXiv:2303.06121v2 [cs.LG] UPDATED)
    Informational parsimony provides a useful inductive bias for learning representations that achieve better generalization by being robust to noise and spurious correlations. We propose \textit{information gating} as a way to learn parsimonious representations that identify the minimal information required for a task. When gating information, we can learn to reveal as little information as possible so that a task remains solvable, or hide as little information as possible so that a task becomes unsolvable. We gate information using a differentiable parameterization of the signal-to-noise ratio, which can be applied to arbitrary values in a network, e.g., erasing pixels at the input layer or activations in some intermediate layer. When gating at the input layer, our models learn which visual cues matter for a given task. When gating intermediate layers, our models learn which activations are needed for subsequent stages of computation. We call our approach \textit{InfoGating}. We apply InfoGating to various objectives such as multi-step forward and inverse dynamics models, Q-learning, and behavior cloning, highlighting how InfoGating can naturally help in discarding information not relevant for control. Results show that learning to identify and use minimal information can improve generalization in downstream tasks. Policies based on InfoGating are considerably more robust to irrelevant visual features, leading to improved pretraining and finetuning of RL models.
    Leveraging Multi-time Hamilton-Jacobi PDEs for Certain Scientific Machine Learning Problems. (arXiv:2303.12928v3 [cs.LG] UPDATED)
    Hamilton-Jacobi partial differential equations (HJ PDEs) have deep connections with a wide range of fields, including optimal control, differential games, and imaging sciences. By considering the time variable to be a higher dimensional quantity, HJ PDEs can be extended to the multi-time case. In this paper, we establish a novel theoretical connection between specific optimization problems arising in machine learning and the multi-time Hopf formula, which corresponds to a representation of the solution to certain multi-time HJ PDEs. Through this connection, we increase the interpretability of the training process of certain machine learning applications by showing that when we solve these learning problems, we also solve a multi-time HJ PDE and, by extension, its corresponding optimal control problem. As a first exploration of this connection, we develop the relation between the regularized linear regression problem and the Linear Quadratic Regulator (LQR). We then leverage our theoretical connection to adapt standard LQR solvers (namely, those based on the Riccati ordinary differential equations) to design new training approaches for machine learning. Finally, we provide some numerical examples that demonstrate the versatility and possible computational advantages of our Riccati-based approach in the context of continual learning, post-training calibration, transfer learning, and sparse dynamics identification.
    Dis-inhibitory neuronal circuits can control the sign of synaptic plasticity. (arXiv:2310.19614v2 [q-bio.NC] UPDATED)
    How neuronal circuits achieve credit assignment remains a central unsolved question in systems neuroscience. Various studies have suggested plausible solutions for back-propagating error signals through multi-layer networks. These purely functionally motivated models assume distinct neuronal compartments to represent local error signals that determine the sign of synaptic plasticity. However, this explicit error modulation is inconsistent with phenomenological plasticity models in which the sign depends primarily on postsynaptic activity. Here we show how a plausible microcircuit model and Hebbian learning rule derived within an adaptive control theory framework can resolve this discrepancy. Assuming errors are encoded in top-down dis-inhibitory synaptic afferents, we show that error-modulated learning emerges naturally at the circuit level when recurrent inhibition explicitly influences Hebbian plasticity. The same learning rule accounts for experimentally observed plasticity in the absence of inhibition and performs comparably to back-propagation of error (BP) on several non-linearly separable benchmarks. Our findings bridge the gap between functional and experimentally observed plasticity rules and make concrete predictions on inhibitory modulation of excitatory plasticity.
    KEEC: Embed to Control on An Equivariant Geometry. (arXiv:2312.01544v2 [cs.LG] UPDATED)
    This paper investigates how representation learning can enable optimal control in unknown and complex dynamics, such as chaotic and non-linear systems, without relying on prior domain knowledge of the dynamics. The core idea is to establish an equivariant geometry that is diffeomorphic to the manifold defined by a dynamical system and to perform optimal control within this corresponding geometry, which is a non-trivial task. To address this challenge, Koopman Embed to Equivariant Control (KEEC) is proposed for model learning and control. Inspired by Lie theory, KEEC begins by learning a non-linear dynamical system defined on a manifold and embedding trajectories into a Lie group. Subsequently, KEEC formulates an equivariant value function equation in reinforcement learning on the equivariant geometry, ensuring an invariant effect as the value function on the original manifold. By deriving analytical-form optimal actions on the equivariant value function, KEEC theoretically achieves quadratic convergence for the optimal equivariant value function by leveraging the differential information on the equivariant geometry. The effectiveness of KEEC is demonstrated in challenging dynamical systems, including chaotic ones like Lorenz-63. Notably, our results show that isometric functions, which maintain the compactness and completeness of geometry while preserving metric and differential information, consistently outperform loss functions lacking these characteristics.
    A Language-Agent Approach to Formal Theorem-Proving. (arXiv:2310.04353v2 [cs.LG] UPDATED)
    Language agents, which use a large language model (LLM) capable of in-context learning to interact with an external environment, have recently emerged as a promising approach to control tasks. We present the first language-agent approach to formal theorem-proving. Our method, COPRA, uses a high-capacity, black-box LLM (GPT-4) as part of a policy for a stateful backtracking search. During the search, the policy can select proof tactics and retrieve lemmas and definitions from an external database. Each selected tactic is executed in the underlying proof framework, and the execution feedback is used to build the prompt for the next policy invocation. The search also tracks selected information from its history and uses it to reduce hallucinations and unnecessary LLM queries. We evaluate our implementation of COPRA on the miniF2F benchmark for Lean and a set of Coq tasks from the Compcert project. On these benchmarks, COPRA significantly outperforms one-shot invocations of GPT-4, as well as state-of-the-art models fine-tuned on proof data, at finding correct proofs quickly. Our code and data are available at https://github.com/trishullab/copra.
    Diffusion Models for Reinforcement Learning: A Survey. (arXiv:2311.01223v2 [cs.LG] UPDATED)
    Diffusion models have emerged as a prominent class of generative models, surpassing previous methods regarding sample quality and training stability. Recent works have shown the advantages of diffusion models in improving reinforcement learning (RL) solutions, including as trajectory planners, expressive policy classes, data synthesizers, etc. This survey aims to provide an overview of the advancements in this emerging field and hopes to inspire new avenues of research. First, we examine several challenges encountered by current RL algorithms. Then, we present a taxonomy of existing methods based on the roles played by diffusion models in RL and explore how the existing challenges are addressed. We further outline successful applications of diffusion models in various RL-related tasks while discussing the limitations of current approaches. Finally, we conclude the survey and offer insights into future research directions, focusing on enhancing model performance and applying diffusion models to broader tasks. We are actively maintaining a GitHub repository for papers and other related resources in applying diffusion models in RL: https://github.com/apexrl/Diff4RLSurvey
    Granular-ball computing: an efficient, robust, and interpretable adaptive multi-granularity representation and computation method. (arXiv:2304.11171v3 [cs.LG] UPDATED)
    Human cognition operates on a "Global-first" cognitive mechanism, prioritizing information processing based on coarse-grained details. This mechanism inherently possesses an adaptive multi-granularity description capacity, resulting in computational traits such as efficiency, robustness, and interpretability. The analysis pattern reliance on the finest granularity and single-granularity makes most existing computational methods less efficient, robust, and interpretable, which is an important reason for the current lack of interpretability in neural networks. Multi-granularity granular-ball computing employs granular-balls of varying sizes to daptively represent and envelop the sample space, facilitating learning based on these granular-balls. Given that the number of coarse-grained "granular-balls" is fewer than sample points, granular-ball computing proves more efficient. Moreover, the inherent coarse-grained nature of granular-balls reduces susceptibility to fine-grained sample disturbances, enhancing robustness. The multi-granularity construct of granular-balls generates topological structures and coarse-grained descriptions, naturally augmenting interpretability. Granular-ball computing has successfully ventured into diverse AI domains, fostering the development of innovative theoretical methods, including granular-ball classifiers, clustering techniques, neural networks, rough sets, and evolutionary computing. This has notably ameliorated the efficiency, noise robustness, and interpretability of traditional methods. Overall, granular-ball computing is a rare and innovative theoretical approach in AI that can adaptively and simultaneously enhance efficiency, robustness, and interpretability. This article delves into the main application landscapes for granular-ball computing, aiming to equip future researchers with references and insights to refine and expand this promising theory.
    Tractability of approximation by general shallow networks. (arXiv:2308.03230v2 [cs.LG] UPDATED)
    In this paper, we present a sharper version of the results in the paper Dimension independent bounds for general shallow networks; Neural Networks, \textbf{123} (2020), 142-152. Let $\mathbb{X}$ and $\mathbb{Y}$ be compact metric spaces. We consider approximation of functions of the form $ x\mapsto\int_{\mathbb{Y}} G( x, y)d\tau( y)$, $ x\in\mathbb{X}$, by $G$-networks of the form $ x\mapsto \sum_{k=1}^n a_kG( x, y_k)$, $ y_1,\cdots, y_n\in\mathbb{Y}$, $a_1,\cdots, a_n\in\mathbb{R}$. Defining the dimensions of $\mathbb{X}$ and $\mathbb{Y}$ in terms of covering numbers, we obtain dimension independent bounds on the degree of approximation in terms of $n$, where also the constants involved are all dependent at most polynomially on the dimensions. Applications include approximation by power rectified linear unit networks, zonal function networks, certain radial basis function networks as well as the important problem of function extension to higher dimensional spaces.
    Policy Gradient for Rectangular Robust Markov Decision Processes. (arXiv:2301.13589v2 [cs.LG] UPDATED)
    Policy gradient methods have become a standard for training reinforcement learning agents in a scalable and efficient manner. However, they do not account for transition uncertainty, whereas learning robust policies can be computationally expensive. In this paper, we introduce robust policy gradient (RPG), a policy-based method that efficiently solves rectangular robust Markov decision processes (MDPs). We provide a closed-form expression for the worst occupation measure. Incidentally, we find that the worst kernel is a rank-one perturbation of the nominal. Combining the worst occupation measure with a robust Q-value estimation yields an explicit form of the robust gradient. Our resulting RPG can be estimated from data with the same time complexity as its non-robust equivalent. Hence, it relieves the computational burden of convex optimization problems required for training robust policies by current policy gradient approaches.
    Grad DFT: a software library for machine learning enhanced density functional theory. (arXiv:2309.15127v2 [physics.chem-ph] UPDATED)
    Density functional theory (DFT) stands as a cornerstone method in computational quantum chemistry and materials science due to its remarkable versatility and scalability. Yet, it suffers from limitations in accuracy, particularly when dealing with strongly correlated systems. To address these shortcomings, recent work has begun to explore how machine learning can expand the capabilities of DFT; an endeavor with many open questions and technical challenges. In this work, we present Grad DFT: a fully differentiable JAX-based DFT library, enabling quick prototyping and experimentation with machine learning-enhanced exchange-correlation energy functionals. Grad DFT employs a pioneering parametrization of exchange-correlation functionals constructed using a weighted sum of energy densities, where the weights are determined using neural networks. Moreover, Grad DFT encompasses a comprehensive suite of auxiliary functions, notably featuring a just-in-time compilable and fully differentiable self-consistent iterative procedure. To support training and benchmarking efforts, we additionally compile a curated dataset of experimental dissociation energies of dimers, half of which contain transition metal atoms characterized by strong electronic correlations. The software library is tested against experimental results to study the generalization capabilities of a neural functional across potential energy surfaces and atomic species, as well as the effect of training data noise on the resulting model accuracy.
    Causality Guided Disentanglement for Cross-Platform Hate Speech Detection. (arXiv:2308.02080v3 [cs.CL] UPDATED)
    Social media platforms, despite their value in promoting open discourse, are often exploited to spread harmful content. Current deep learning and natural language processing models used for detecting this harmful content overly rely on domain-specific terms affecting their capabilities to adapt to generalizable hate speech detection. This is because they tend to focus too narrowly on particular linguistic signals or the use of certain categories of words. Another significant challenge arises when platforms lack high-quality annotated data for training, leading to a need for cross-platform models that can adapt to different distribution shifts. Our research introduces a cross-platform hate speech detection model capable of being trained on one platform's data and generalizing to multiple unseen platforms. To achieve good generalizability across platforms, one way is to disentangle the input representations into invariant and platform-dependent features. We also argue that learning causal relationships, which remain constant across diverse environments, can significantly aid in understanding invariant representations in hate speech. By disentangling input into platform-dependent features (useful for predicting hate targets) and platform-independent features (used to predict the presence of hate), we learn invariant representations resistant to distribution shifts. These features are then used to predict hate speech across unseen platforms. Our extensive experiments across four platforms highlight our model's enhanced efficacy compared to existing state-of-the-art methods in detecting generalized hate speech.
    Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception. (arXiv:2305.06324v2 [cs.CV] UPDATED)
    We present Integrated Multimodal Perception (IMP), a simple and scalable multimodal multi-task training and modeling approach. IMP integrates multimodal inputs including image, video, text, and audio into a single Transformer encoder with minimal modality-specific components. IMP makes use of a novel design that combines Alternating Gradient Descent (AGD) and Mixture-of-Experts (MoE) for efficient model and task scaling. We conduct extensive empirical studies and reveal the following key insights: 1) Performing gradient descent updates by alternating on diverse modalities, loss functions, and tasks, with varying input resolutions, efficiently improves the model. 2) Sparsification with MoE on a single modality-agnostic encoder substantially improves the performance, outperforming dense models that use modality-specific encoders or additional fusion layers and greatly mitigates the conflicts between modalities. IMP achieves competitive performance on a wide range of downstream tasks including video classification, image classification, image-text, and video-text retrieval. Most notably, we train a sparse IMP-MoE-L variant focusing on video tasks that achieves new state-of-the-art in zero-shot video classification: 77.0% on Kinetics-400, 76.8% on Kinetics-600, and 68.3% on Kinetics-700, improving the previous state-of-the-art by +5%, +6.7%, and +5.8%, respectively, while using only 15% of their total training computational cost.
    PINNslope: seismic data interpolation and local slope estimation with physics informed neural networks. (arXiv:2305.15990v2 [physics.geo-ph] UPDATED)
    Interpolation of aliased seismic data constitutes a key step in a seismic processing workflow to obtain high quality velocity models and seismic images. Building on the idea of describing seismic wavefields as a superposition of local plane waves, we propose to interpolate seismic data by utilizing a physics informed neural network (PINN). In the proposed framework, two feed-forward neural networks are jointly trained using the local plane wave differential equation as well as the available data as two terms in the objective function: a primary network assisted by positional encoding is tasked with reconstructing the seismic data, whilst an auxiliary, smaller network estimates the associated local slopes. Results on synthetic and field data validate the effectiveness of the proposed method in handling aliased (coarsely sampled) data and data with large gaps. Our method compares favorably against a classic least-squares inversion approach regularized by the local plane-wave equation as well as a PINN-based approach with a single network and pre-computed local slopes. We find that introducing a second network to estimate the local slopes whilst at the same time interpolating the aliased data enhances the overall reconstruction capabilities and convergence behavior of the primary network. Moreover, an additional positional encoding layer embedded as the first layer of the wavefield network confers to the network the ability to converge faster improving the accuracy of the data term.
    DreamSim: Learning New Dimensions of Human Visual Similarity using Synthetic Data. (arXiv:2306.09344v3 [cs.CV] UPDATED)
    Current perceptual similarity metrics operate at the level of pixels and patches. These metrics compare images in terms of their low-level colors and textures, but fail to capture mid-level similarities and differences in image layout, object pose, and semantic content. In this paper, we develop a perceptual metric that assesses images holistically. Our first step is to collect a new dataset of human similarity judgments over image pairs that are alike in diverse ways. Critical to this dataset is that judgments are nearly automatic and shared by all observers. To achieve this we use recent text-to-image models to create synthetic pairs that are perturbed along various dimensions. We observe that popular perceptual metrics fall short of explaining our new data, and we introduce a new metric, DreamSim, tuned to better align with human perception. We analyze how our metric is affected by different visual attributes, and find that it focuses heavily on foreground objects and semantic content while also being sensitive to color and layout. Notably, despite being trained on synthetic data, our metric generalizes to real images, giving strong results on retrieval and reconstruction tasks. Furthermore, our metric outperforms both prior learned metrics and recent large vision models on these tasks.
    Bridging the Gaps: Learning Verifiable Model-Free Quadratic Programming Controllers Inspired by Model Predictive Control. (arXiv:2312.05332v1 [eess.SY])
    In this paper, we introduce a new class of parameterized controllers, drawing inspiration from Model Predictive Control (MPC). These controllers adopt a Quadratic Programming (QP) structure similar to linear MPC, with problem parameters being learned rather than derived from models. This approach may address the limitations of commonly learned controllers with Multi-Layer Perceptron (MLP) architecture in deep reinforcement learning, in terms of explainability and performance guarantees. The learned controllers not only possess verifiable properties like persistent feasibility and asymptotic stability akin to MPC, but they also empirically match MPC and MLP controllers in control performance. Moreover, they are more computationally efficient in implementation compared to MPC and require significantly fewer learnable policy parameters than MLP controllers. Practical application is demonstrated through a vehicle drift maneuvering task, showcasing the potential of these controllers in real-world scenarios.
    Supply-Side Equilibria in Recommender Systems. (arXiv:2206.13489v3 [cs.GT] UPDATED)
    Algorithmic recommender systems such as Spotify and Netflix affect not only consumer behavior but also producer incentives. Producers seek to create content that will be shown by the recommendation algorithm, which can impact both the diversity and quality of their content. In this work, we investigate the resulting supply-side equilibria in personalized content recommender systems. We model users and content as $D$-dimensional vectors, the recommendation algorithm as showing each user the content with highest dot product, and producers as maximizing the number of users who are recommended their content minus the cost of production. Two key features of our model are that the producer decision space is multi-dimensional and the user base is heterogeneous, which contrasts with classical low-dimensional models. Multi-dimensionality and heterogeneity create the potential for specialization, where different producers create different types of content at equilibrium. Using a duality argument, we derive necessary and sufficient conditions for whether specialization occurs: these conditions depend on the extent to which users are heterogeneous and to which producers can perform well on all dimensions at once without incurring a high cost. Then, we characterize the distribution of content at equilibrium in concrete settings with two populations of users. Lastly, we show that specialization can enable producers to achieve positive profit at equilibrium, which means that specialization can reduce the competitiveness of the marketplace. At a conceptual level, our analysis of supply-side competition takes a step towards elucidating how personalized recommendations shape the marketplace of digital goods, and towards understanding what new phenomena arise in multi-dimensional competitive settings.
    Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model. (arXiv:2305.15265v2 [cs.LG] UPDATED)
    With the rapid growth in model size, fine-tuning the large pre-trained language model has become increasingly difficult due to its extensive memory usage. Previous works usually focus on reducing the number of trainable parameters in the network. While the model parameters do contribute to memory usage, the primary memory bottleneck during training arises from storing feature maps, also known as activations, as they are crucial for gradient calculation. Notably, neural networks are usually trained using stochastic gradient descent. We argue that in stochastic optimization, models can handle noisy gradients as long as the gradient estimator is unbiased with reasonable variance. Following this motivation, we propose a new family of unbiased estimators called WTA-CRS, for matrix production with reduced variance, which only requires storing the sub-sampled activations for calculating the gradient. Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones. By replacing the linear operation with our approximated one in transformers, we can achieve up to 2.7$\times$ peak memory reduction with almost no accuracy drop and enables up to $6.4\times$ larger batch size. Under the same hardware, WTA-CRS enables better down-streaming task performance by applying larger models and/or faster training speed with larger batch sizes.
    Natural Actor-Critic for Robust Reinforcement Learning with Function Approximation. (arXiv:2307.08875v2 [cs.LG] UPDATED)
    We study robust reinforcement learning (RL) with the goal of determining a well-performing policy that is robust against model mismatch between the training simulator and the testing environment. Previous policy-based robust RL algorithms mainly focus on the tabular setting under uncertainty sets that facilitate robust policy evaluation, but are no longer tractable when the number of states scales up. To this end, we propose two novel uncertainty set formulations, one based on double sampling and the other on an integral probability metric. Both make large-scale robust RL tractable even when one only has access to a simulator. We propose a robust natural actor-critic (RNAC) approach that incorporates the new uncertainty sets and employs function approximation. We provide finite-time convergence guarantees for the proposed RNAC algorithm to the optimal robust policy within the function approximation error. Finally, we demonstrate the robust performance of the policy learned by our proposed RNAC approach in multiple MuJoCo environments and a real-world TurtleBot navigation task.
    Evaluating the Ebb and Flow: An In-depth Analysis of Question-Answering Trends across Diverse Platforms. (arXiv:2309.05961v3 [cs.SI] UPDATED)
    Community Question Answering (CQA) platforms steadily gain popularity as they provide users with fast responses to their queries. The swiftness of these responses is contingent on a mixture of query-specific and user-related elements. This paper scrutinizes these contributing factors within the context of six highly popular CQA platforms, identified through their standout answering speed. Our investigation reveals a correlation between the time taken to yield the first response to a question and several variables: the metadata, the formulation of the questions, and the level of interaction among users. Additionally, by employing conventional machine learning models to analyze these metadata and patterns of user interaction, we endeavor to predict which queries will receive their initial responses promptly.
    Discovering Dynamic Causal Space for DAG Structure Learning. (arXiv:2306.02822v3 [cs.LG] UPDATED)
    Discovering causal structure from purely observational data (i.e., causal discovery), aiming to identify causal relationships among variables, is a fundamental task in machine learning. The recent invention of differentiable score-based DAG learners is a crucial enabler, which reframes the combinatorial optimization problem into a differentiable optimization with a DAG constraint over directed graph space. Despite their great success, these cutting-edge DAG learners incorporate DAG-ness independent score functions to evaluate the directed graph candidates, lacking in considering graph structure. As a result, measuring the data fitness alone regardless of DAG-ness inevitably leads to discovering suboptimal DAGs and model vulnerabilities. Towards this end, we propose a dynamic causal space for DAG structure learning, coined CASPER, that integrates the graph structure into the score function as a new measure in the causal space to faithfully reflect the causal distance between estimated and ground truth DAG. CASPER revises the learning process as well as enhances the DAG structure learning via adaptive attention to DAG-ness. Grounded by empirical visualization, CASPER, as a space, satisfies a series of desired properties, such as structure awareness and noise robustness. Extensive experiments on both synthetic and real-world datasets clearly validate the superiority of our CASPER over the state-of-the-art causal discovery methods in terms of accuracy and robustness.
    MedShapeNet -- A Large-Scale Dataset of 3D Medical Shapes for Computer Vision. (arXiv:2308.16139v4 [cs.CV] UPDATED)
    Prior to the deep learning era, \textit{shape} was commonly used to describe the objects. Nowadays, state-of-the-art (SOTA) algorithms in medical imaging are predominantly diverging from computer vision, where voxel grids, meshes, point clouds, and implicit surface models are used. This is seen from numerous shape-related publications in premier vision conferences as well as the growing popularity of \textit{ShapeNet} (about 51,300 models) and \textit{Princeton ModelNet} (127,915 models). For the medical domain, we present a large collection of anatomical shapes (e.g., bones, organs, vessels) and 3D models of surgical instrument, called \textit{MedShapeNet}, created to facilitate the translation of data-driven vision algorithms to medical applications and to adapt SOTA vision algorithms to medical problems. As a unique feature, we directly model the majority of shapes on the imaging data of real patients. As of today, \textit{MedShapeNet} includes 23 dataset with more than 100,000 shapes that are paired with annotations (ground truth). Our data is freely accessible via a web interface and a Python application programming interface (API) and can be used for discriminative, reconstructive, and variational benchmarks as well as various applications in virtual, augmented, or mixed reality, and 3D printing. Exemplary, we present use cases in the fields of classification of brain tumors, facial and skull reconstructions, multi-class anatomy completion, education, and 3D printing. In future, we will extend the data and improve the interfaces. The project pages are: \url{https://medshapenet.ikim.nrw/} and \url{https://github.com/Jianningli/medshapenet-feedback}
    Enhancing the Accuracy of Predictors of Activity Sequences of Business Processes. (arXiv:2312.05560v1 [cs.LG])
    Predictive process monitoring is an evolving research field that studies how to train and use predictive models for operational decision-making. One of the problems studied in this field is that of predicting the sequence of upcoming activities in a case up to its completion, a.k.a. the case suffix. The prediction of case suffixes provides input to estimate short-term workloads and execution times under different resource schedules. Existing methods to address this problem often generate suffixes wherein some activities are repeated many times, whereas this pattern is not observed in the data. Closer examination shows that this shortcoming stems from the approach used to sample the successive activity instances to generate a case suffix. Accordingly, the paper introduces a sampling approach aimed at reducing repetitions of activities in the predicted case suffixes. The approach, namely Daemon action, strikes a balance between exploration and exploitation when generating the successive activity instances. We enhance a deep learning approach for case suffix predictions using this sampling approach, and experimentally show that the enhanced approach outperforms the unenhanced ones with respect to control-flow accuracy measures.
    Robust Nonparametric Regression under Poisoning Attack. (arXiv:2305.16771v2 [math.ST] UPDATED)
    This paper studies robust nonparametric regression, in which an adversarial attacker can modify the values of up to $q$ samples from a training dataset of size $N$. Our initial solution is an M-estimator based on Huber loss minimization. Compared with simple kernel regression, i.e. the Nadaraya-Watson estimator, this method can significantly weaken the impact of malicious samples on the regression performance. We provide the convergence rate as well as the corresponding minimax lower bound. The result shows that, with proper bandwidth selection, $\ell_\infty$ error is minimax optimal. The $\ell_2$ error is optimal with relatively small $q$, but is suboptimal with larger $q$. The reason is that this estimator is vulnerable if there are many attacked samples concentrating in a small region. To address this issue, we propose a correction method by projecting the initial estimate to the space of Lipschitz functions. The final estimate is nearly minimax optimal for arbitrary $q$, up to a $\ln N$ factor.
    Multi-Tier Hierarchical Federated Learning-assisted NTN for Intelligent IoT Services. (arXiv:2305.05463v2 [cs.NI] UPDATED)
    In the ever-expanding landscape of the IoT, managing the intricate network of interconnected devices presents a fundamental challenge. This leads us to ask: "What if we invite the IoT devices to collaboratively participate in real-time network management and IoT data-handling decisions?" This inquiry forms the foundation of our innovative approach, addressing the burgeoning complexities in IoT through the integration of NTN architecture, in particular, VHetNet, and an MT-HFL framework. VHetNets transcend traditional network paradigms by harmonizing terrestrial and non-terrestrial elements, thus ensuring expansive connectivity and resilience, especially crucial in areas with limited terrestrial infrastructure. The incorporation of MT-HFL further revolutionizes this architecture, distributing intelligent data processing across a multi-tiered network spectrum, from edge devices on the ground to aerial platforms and satellites above. This study explores MT-HFL's role in fostering a decentralized, collaborative learning environment, enabling IoT devices to not only contribute but also make informed decisions in network management. This methodology adeptly handles the challenges posed by the non-IID nature of IoT data and efficiently curtails communication overheads prevalent in extensive IoT networks. Significantly, MT-HFL enhances data privacy, a paramount aspect in IoT ecosystems, by facilitating local data processing and limiting the sharing of model updates instead of raw data. By evaluating a case-study, our findings demonstrate that the synergistic integration of MT-HFL within VHetNets creates an intelligent network architecture that is robust, scalable, and dynamically adaptive to the ever-changing demands of IoT environments. This setup ensures efficient data handling, advanced privacy and security measures, and responsive adaptability to fluctuating network conditions.
    ReLoRA: High-Rank Training Through Low-Rank Updates. (arXiv:2307.05695v4 [cs.CL] UPDATED)
    Despite the dominance and effectiveness of scaling, resulting in large networks with hundreds of billions of parameters, the necessity to train overparameterized models remains poorly understood, while training costs grow exponentially. In this paper, we explore parameter-efficient training techniques as an approach to training large neural networks. We introduce a novel method called ReLoRA, which utilizes low-rank updates to train high-rank networks. We apply ReLoRA to training transformer language models with up to 1.3B parameters and demonstrate comparable performance to regular neural network training. ReLoRA saves up to 5.5Gb of RAM per GPU and improves training speed by 9-40% depending on the model size and hardware setup. Our findings show the potential of parameter-efficient techniques for large-scale pre-training.
    On the Role of Entanglement and Statistics in Learning. (arXiv:2306.03161v2 [quant-ph] UPDATED)
    In this work we make progress in understanding the relationship between learning models with access to entangled, separable and statistical measurements in the quantum statistical query (QSQ) model. To this end, we show the following results. $\textbf{Entangled versus separable measurements.}$ The goal here is to learn an unknown $f$ from the concept class $C\subseteq \{f:\{0,1\}^n\rightarrow [k]\}$ given copies of $\frac{1}{\sqrt{2^n}}\sum_x \vert x,f(x)\rangle$. We show that, if $T$ copies suffice to learn $f$ using entangled measurements, then $O(nT^2)$ copies suffice to learn $f$ using just separable measurements. $\textbf{Entangled versus statistical measurements}$ The goal here is to learn a function $f \in C$ given access to separable measurements and statistical measurements. We exhibit a class $C$ that gives an exponential separation between QSQ learning and quantum learning with entangled measurements (even in the presence of noise). This proves the "quantum analogue" of the seminal result of Blum et al. [BKW'03]. that separates classical SQ and PAC learning with classification noise. $\textbf{QSQ lower bounds for learning states.}$ We introduce a quantum statistical query dimension (QSD), which we use to give lower bounds on the QSQ learning. With this we prove superpolynomial QSQ lower bounds for testing purity, shadow tomography, Abelian hidden subgroup problem, degree-$2$ functions, planted bi-clique states and output states of Clifford circuits of depth $\textsf{polylog}(n)$. $\textbf{Further applications.}$ We give and $\textit{unconditional}$ separation between weak and strong error mitigation and prove lower bounds for learning distributions in the QSQ model. Prior works by Quek et al. [QFK+'22], Hinsche et al. [HIN+'22], and Nietner et al. [NIS+'23] proved the analogous results $\textit{assuming}$ diagonal measurements and our work removes this assumption.
    Knowledge Distillation Performs Partial Variance Reduction. (arXiv:2305.17581v2 [cs.LG] UPDATED)
    Knowledge distillation is a popular approach for enhancing the performance of ''student'' models, with lower representational capacity, by taking advantage of more powerful ''teacher'' models. Despite its apparent simplicity and widespread use, the underlying mechanics behind knowledge distillation (KD) are still not fully understood. In this work, we shed new light on the inner workings of this method, by examining it from an optimization perspective. We show that, in the context of linear and deep linear models, KD can be interpreted as a novel type of stochastic variance reduction mechanism. We provide a detailed convergence analysis of the resulting dynamics, which hold under standard assumptions for both strongly-convex and non-convex losses, showing that KD acts as a form of partial variance reduction, which can reduce the stochastic gradient noise, but may not eliminate it completely, depending on the properties of the ''teacher'' model. Our analysis puts further emphasis on the need for careful parametrization of KD, in particular w.r.t. the weighting of the distillation loss, and is validated empirically on both linear models and deep neural networks.
    Transformers learn through gradual rank increase. (arXiv:2306.07042v2 [cs.LG] UPDATED)
    We identify incremental learning dynamics in transformers, where the difference between trained and initial weights progressively increases in rank. We rigorously prove this occurs under the simplifying assumptions of diagonal weight matrices and small initialization. Our experiments support the theory and also show that phenomenon can occur in practice without the simplifying assumptions.
    Learning Bayesian Networks with Heterogeneous Agronomic Data Sets via Mixed-Effect Models and Hierarchical Clustering. (arXiv:2308.06399v4 [stat.ML] UPDATED)
    Maize, a crucial crop globally cultivated across vast regions, especially in sub-Saharan Africa, Asia, and Latin America, occupies 197 million hectares as of 2021. Various statistical and machine learning models, including mixed-effect models, random coefficients models, random forests, and deep learning architectures, have been devised to predict maize yield. These models consider factors such as genotype, environment, genotype-environment interaction, and field management. However, the existing models often fall short of fully exploiting the complex network of causal relationships among these factors and the hierarchical structure inherent in agronomic data. This study introduces an innovative approach integrating random effects into Bayesian networks (BNs), leveraging their capacity to model causal and probabilistic relationships through directed acyclic graphs. Rooted in the linear mixed-effects models framework and tailored for hierarchical data, this novel approach demonstrates enhanced BN learning. Application to a real-world agronomic trial produces a model with improved interpretability, unveiling new causal connections. Notably, the proposed method significantly reduces the error rate in maize yield prediction from 28% to 17%. These results advocate for the preference of BNs in constructing practical decision support tools for hierarchical agronomic data, facilitating causal inference.
    Variance-Reduced Gradient Estimation via Noise-Reuse in Online Evolution Strategies. (arXiv:2304.12180v2 [cs.NE] UPDATED)
    Unrolled computation graphs are prevalent throughout machine learning but present challenges to automatic differentiation (AD) gradient estimation methods when their loss functions exhibit extreme local sensitivtiy, discontinuity, or blackbox characteristics. In such scenarios, online evolution strategies methods are a more capable alternative, while being more parallelizable than vanilla evolution strategies (ES) by interleaving partial unrolls and gradient updates. In this work, we propose a general class of unbiased online evolution strategies methods. We analytically and empirically characterize the variance of this class of gradient estimators and identify the one with the least variance, which we term Noise-Reuse Evolution Strategies (NRES). Experimentally, we show NRES results in faster convergence than existing AD and ES methods in terms of wall-clock time and number of unroll steps across a variety of applications, including learning dynamical systems, meta-training learned optimizers, and reinforcement learning.
    Communication and Energy Efficient Wireless Federated Learning with Intrinsic Privacy. (arXiv:2304.07460v2 [cs.LG] UPDATED)
    Federated Learning (FL) is a collaborative learning framework that enables edge devices to collaboratively learn a global model while keeping raw data locally. Although FL avoids leaking direct information from local datasets, sensitive information can still be inferred from the shared models. To address the privacy issue in FL, differential privacy (DP) mechanisms are leveraged to provide formal privacy guarantee. However, when deploying FL at the wireless edge with over-the-air computation, ensuring client-level DP faces significant challenges. In this paper, we propose a novel wireless FL scheme called private federated edge learning with sparsification (PFELS) to provide client-level DP guarantee with intrinsic channel noise while reducing communication and energy overhead and improving model accuracy. The key idea of PFELS is for each device to first compress its model update and then adaptively design the transmit power of the compressed model update according to the wireless channel status without any artificial noise addition. We provide a privacy analysis for PFELS and prove the convergence of PFELS under general non-convex and non-IID settings. Experimental results show that compared with prior work, PFELS can improve the accuracy with the same DP guarantee and save communication and energy costs simultaneously.
    Framework and Benchmarks for Combinatorial and Mixed-variable Bayesian Optimization. (arXiv:2306.09803v3 [cs.LG] UPDATED)
    This paper introduces a modular framework for Mixed-variable and Combinatorial Bayesian Optimization (MCBO) to address the lack of systematic benchmarking and standardized evaluation in the field. Current MCBO papers often introduce non-diverse or non-standard benchmarks to evaluate their methods, impeding the proper assessment of different MCBO primitives and their combinations. Additionally, papers introducing a solution for a single MCBO primitive often omit benchmarking against baselines that utilize the same methods for the remaining primitives. This omission is primarily due to the significant implementation overhead involved, resulting in a lack of controlled assessments and an inability to showcase the merits of a contribution effectively. To overcome these challenges, our proposed framework enables an effortless combination of Bayesian Optimization components, and provides a diverse set of synthetic and real-world benchmarking tasks. Leveraging this flexibility, we implement 47 novel MCBO algorithms and benchmark them against seven existing MCBO solvers and five standard black-box optimization algorithms on ten tasks, conducting over 4000 experiments. Our findings reveal a superior combination of MCBO primitives outperforming existing approaches and illustrate the significance of model fit and the use of a trust region. We make our MCBO library available under the MIT license at \url{https://github.com/huawei-noah/HEBO/tree/master/MCBO}.
    Trust Your $\nabla$: Gradient-based Intervention Targeting for Causal Discovery. (arXiv:2211.13715v3 [stat.ML] UPDATED)
    Inferring causal structure from data is a challenging task of fundamental importance in science. Observational data are often insufficient to identify a system's causal structure uniquely. While conducting interventions (i.e., experiments) can improve the identifiability, such samples are usually challenging and expensive to obtain. Hence, experimental design approaches for causal discovery aim to minimize the number of interventions by estimating the most informative intervention target. In this work, we propose a novel Gradient-based Intervention Targeting method, abbreviated GIT, that 'trusts' the gradient estimator of a gradient-based causal discovery framework to provide signals for the intervention acquisition function. We provide extensive experiments in simulated and real-world datasets and demonstrate that GIT performs on par with competitive baselines, surpassing them in the low-data regime.
    The Waymo Open Sim Agents Challenge. (arXiv:2305.12032v4 [cs.CV] UPDATED)
    Simulation with realistic, interactive agents represents a key task for autonomous vehicle software development. In this work, we introduce the Waymo Open Sim Agents Challenge (WOSAC). WOSAC is the first public challenge to tackle this task and propose corresponding metrics. The goal of the challenge is to stimulate the design of realistic simulators that can be used to evaluate and train a behavior model for autonomous driving. We outline our evaluation methodology, present results for a number of different baseline simulation agent methods, and analyze several submissions to the 2023 competition which ran from March 16, 2023 to May 23, 2023. The WOSAC evaluation server remains open for submissions and we discuss open problems for the task.
    Factorized Explainer for Graph Neural Networks. (arXiv:2312.05596v1 [cs.LG])
    Graph Neural Networks (GNNs) have received increasing attention due to their ability to learn from graph-structured data. To open the black-box of these deep learning models, post-hoc instance-level explanation methods have been proposed to understand GNN predictions. These methods seek to discover substructures that explain the prediction behavior of a trained GNN. In this paper, we show analytically that for a large class of explanation tasks, conventional approaches, which are based on the principle of graph information bottleneck (GIB), admit trivial solutions that do not align with the notion of explainability. Instead, we argue that a modified GIB principle may be used to avoid the aforementioned trivial solutions. We further introduce a novel factorized explanation model with theoretical performance guarantees. The modified GIB is used to analyze the structural properties of the proposed factorized explainer. We conduct extensive experiments on both synthetic and real-world datasets to validate the effectiveness of our proposed factorized explainer over existing approaches.
    Leveraging Neo4j and deep learning for traffic congestion simulation & optimization. (arXiv:2304.00192v2 [cs.AI] UPDATED)
    Traffic congestion has been a major challenge in many urban road networks. Extensive research studies have been conducted to highlight traffic-related congestion and address the issue using data-driven approaches. Currently, most traffic congestion analyses are done using simulation software that offers limited insight due to the limitations in the tools and utilities being used to render various traffic congestion scenarios. All that impacts the formulation of custom business problems which vary from place to place and country to country. By exploiting the power of the knowledge graph, we model a traffic congestion problem into the Neo4j graph and then use the load balancing, optimization algorithm to identify congestion-free road networks. We also show how traffic propagates backward in case of congestion or accident scenarios and its overall impact on other segments of the roads. We also train a sequential RNN-LSTM (Long Short-Term Memory) deep learning model on the real-time traffic data to assess the accuracy of simulation results based on a road-specific congestion. Our results show that graph-based traffic simulation, supplemented by AI ML-based traffic prediction can be more effective in estimating the congestion level in a road network.
    EHRSHOT: An EHR Benchmark for Few-Shot Evaluation of Foundation Models. (arXiv:2307.02028v3 [cs.LG] UPDATED)
    While the general machine learning (ML) community has benefited from public datasets, tasks, and models, the progress of ML in healthcare has been hampered by a lack of such shared assets. The success of foundation models creates new challenges for healthcare ML by requiring access to shared pretrained models to validate performance benefits. We help address these challenges through three contributions. First, we publish a new dataset, EHRSHOT, which contains deidentified structured data from the electronic health records (EHRs) of 6,739 patients from Stanford Medicine. Unlike MIMIC-III/IV and other popular EHR datasets, EHRSHOT is longitudinal and not restricted to ICU/ED patients. Second, we publish the weights of CLMBR-T-base, a 141M parameter clinical foundation model pretrained on the structured EHR data of 2.57M patients. We are one of the first to fully release such a model for coded EHR data; in contrast, most prior models released for clinical data (e.g. GatorTron, ClinicalBERT) only work with unstructured text and cannot process the rich, structured data within an EHR. We provide an end-to-end pipeline for the community to validate and build upon its performance. Third, we define 15 few-shot clinical prediction tasks, enabling evaluation of foundation models on benefits such as sample efficiency and task adaptation. Our model and dataset are available via a research data use agreement from our website: https://ehrshot.stanford.edu. Code to reproduce our results are available at our Github repo: https://github.com/som-shahlab/ehrshot-benchmark
    Characterizing Large Language Model Geometry Solves Toxicity Detection and Generation. (arXiv:2312.01648v2 [cs.AI] UPDATED)
    Large Language Models~(LLMs) drive current AI breakthroughs despite very little being known about their internal representations, e.g., how to extract a few informative features to solve various downstream tasks. To provide a practical and principled answer, we propose to characterize LLMs from a geometric perspective. We obtain in closed form (i) the intrinsic dimension in which the Multi-Head Attention embeddings are constrained to exist and (ii) the partition and per-region affine mappings of the per-layer feedforward networks. Our results are informative, do not rely on approximations, and are actionable. First, we show that, motivated by our geometric interpretation, we can bypass Llama$2$'s RLHF by controlling its embedding's intrinsic dimension through informed prompt manipulation. Second, we derive $7$ interpretable spline features that can be extracted from any (pre-trained) LLM layer, providing a rich abstract representation of their inputs. Those features alone ($224$ for Mistral-7B/Llama$2$-7B and $560$ for Llama$2$-70B) are sufficient to help solve toxicity detection, infer the domain of the prompt, and even tackle the Jigsaw challenge, which aims at characterizing the type of toxicity of various prompts. Our results demonstrate how, even in large-scale regimes, exact theoretical results can answer practical questions in language models. Code: \url{https://github.com/RandallBalestriero/SplineLLM}.
    Generalization Guarantee of Training Graph Convolutional Networks with Graph Topology Sampling. (arXiv:2207.03584v2 [cs.LG] UPDATED)
    Graph convolutional networks (GCNs) have recently achieved great empirical success in learning graph-structured data. To address its scalability issue due to the recursive embedding of neighboring features, graph topology sampling has been proposed to reduce the memory and computational cost of training GCNs, and it has achieved comparable test performance to those without topology sampling in many empirical studies. To the best of our knowledge, this paper provides the first theoretical justification of graph topology sampling in training (up to) three-layer GCNs for semi-supervised node classification. We formally characterize some sufficient conditions on graph topology sampling such that GCN training leads to a diminishing generalization error. Moreover, our method tackles the nonconvex interaction of weights across layers, which is under-explored in the existing theoretical analyses of GCNs. This paper characterizes the impact of graph structures and topology sampling on the generalization performance and sample complexity explicitly, and the theoretical findings are also justified through numerical experiments.
    Lassoed Tree Boosting. (arXiv:2205.10697v6 [stat.ML] UPDATED)
    Gradient boosting performs exceptionally in most prediction problems and scales well to large datasets. In this paper we prove that a ``lassoed'' gradient boosted tree algorithm with early stopping achieves faster than $n^{-1/4}$ L2 convergence in the large nonparametric space of cadlag functions of bounded sectional variation. This rate is remarkable because it does not depend on the dimension, sparsity, or smoothness. We use simulation and real data to confirm our theory and demonstrate empirical performance and scalability on par with standard boosting. Our convergence proofs are based on a novel, general theorem on early stopping with empirical loss minimizers of nested Donsker classes.
    Bounded Robustness in Reinforcement Learning via Lexicographic Objectives. (arXiv:2209.15320v2 [cs.LG] UPDATED)
    Policy robustness in Reinforcement Learning may not be desirable at any cost: the alterations caused by robustness requirements from otherwise optimal policies should be explainable, quantifiable and formally verifiable. In this work we study how policies can be maximally robust to arbitrary observational noise by analysing how they are altered by this noise through a stochastic linear operator interpretation of the disturbances, and establish connections between robustness and properties of the noise kernel and of the underlying MDPs. Then, we construct sufficient conditions for policy robustness, and propose a robustness-inducing scheme, applicable to any policy gradient algorithm, that formally trades off expected policy utility for robustness through lexicographic optimisation, while preserving convergence and sub-optimality in the policy synthesis.
    Bidirectional Contrastive Split Learning for Visual Question Answering. (arXiv:2208.11435v4 [cs.CV] UPDATED)
    Visual Question Answering (VQA) based on multi-modal data facilitates real-life applications such as home robots and medical diagnoses. One significant challenge is to devise a robust decentralized learning framework for various client models where centralized data collection is refrained due to confidentiality concerns. This work aims to tackle privacy-preserving VQA by decoupling a multi-modal model into representation modules and a contrastive module and leveraging inter-module gradients sharing and inter-client weight sharing. To this end, we propose Bidirectional Contrastive Split Learning (BiCSL) to train a global multi-modal model on the entire data distribution of decentralized clients. We employ the contrastive loss that enables a more efficient self-supervised learning of decentralized modules. Comprehensive experiments are conducted on the VQA-v2 dataset based on five SOTA VQA models, demonstrating the effectiveness of the proposed method. Furthermore, we inspect BiCSL's robustness against a dual-key backdoor attack on VQA. Consequently, BiCSL shows much better robustness to the multi-modal adversarial attack compared to the centralized learning method, which provides a promising approach to decentralized multi-modal learning.
    ClimaX: A foundation model for weather and climate. (arXiv:2301.10343v4 [cs.LG] UPDATED)
    Most state-of-the-art approaches for weather and climate modeling are based on physics-informed numerical models of the atmosphere. These approaches aim to model the non-linear dynamics and complex interactions between multiple variables, which are challenging to approximate. Additionally, many such numerical models are computationally intensive, especially when modeling the atmospheric phenomenon at a fine-grained spatial and temporal resolution. Recent data-driven approaches based on machine learning instead aim to directly solve a downstream forecasting or projection task by learning a data-driven functional mapping using deep neural networks. However, these networks are trained using curated and homogeneous climate datasets for specific spatiotemporal tasks, and thus lack the generality of numerical models. We develop and demonstrate ClimaX, a flexible and generalizable deep learning model for weather and climate science that can be trained using heterogeneous datasets spanning different variables, spatio-temporal coverage, and physical groundings. ClimaX extends the Transformer architecture with novel encoding and aggregation blocks that allow effective use of available compute while maintaining general utility. ClimaX is pre-trained with a self-supervised learning objective on climate datasets derived from CMIP6. The pre-trained ClimaX can then be fine-tuned to address a breadth of climate and weather tasks, including those that involve atmospheric variables and spatio-temporal scales unseen during pretraining. Compared to existing data-driven baselines, we show that this generality in ClimaX results in superior performance on benchmarks for weather forecasting and climate projections, even when pretrained at lower resolutions and compute budgets. The source code is available at https://github.com/microsoft/ClimaX.
    Guaranteed Trust Region Optimization via Two-Phase KL Penalization. (arXiv:2312.05405v1 [cs.LG])
    On-policy reinforcement learning (RL) has become a popular framework for solving sequential decision problems due to its computational efficiency and theoretical simplicity. Some on-policy methods guarantee every policy update is constrained to a trust region relative to the prior policy to ensure training stability. These methods often require computationally intensive non-linear optimization or require a particular form of action distribution. In this work, we show that applying KL penalization alone is nearly sufficient to enforce such trust regions. Then, we show that introducing a "fixup" phase is sufficient to guarantee a trust region is enforced on every policy update while adding fewer than 5% additional gradient steps in practice. The resulting algorithm, which we call FixPO, is able to train a variety of policy architectures and action spaces, is easy to implement, and produces results competitive with other trust region methods.
    Ethical Considerations for Responsible Data Curation. (arXiv:2302.03629v3 [cs.CV] UPDATED)
    Human-centric computer vision (HCCV) data curation practices often neglect privacy and bias concerns, leading to dataset retractions and unfair models. HCCV datasets constructed through nonconsensual web scraping lack crucial metadata for comprehensive fairness and robustness evaluations. Current remedies are post hoc, lack persuasive justification for adoption, or fail to provide proper contextualization for appropriate application. Our research focuses on proactive, domain-specific recommendations, covering purpose, privacy and consent, and diversity, for curating HCCV evaluation datasets, addressing privacy and bias concerns. We adopt an ante hoc reflective perspective, drawing from current practices, guidelines, dataset withdrawals, and audits, to inform our considerations and recommendations.
    Generative Network Layer for Communication Systems with Artificial Intelligence. (arXiv:2312.05398v1 [cs.IT])
    The traditional role of the network layer is the transfer of packet replicas from source to destination through intermediate network nodes. We present a generative network layer that uses Generative AI (GenAI) at intermediate or edge network nodes and analyze its impact on the required data rates in the network. We conduct a case study where the GenAI-aided nodes generate images from prompts that consist of substantially compressed latent representations. The results from network flow analyses under image quality constraints show that the generative network layer can achieve an improvement of more than 100% in terms of the required data rate.
    Solving Bilevel Knapsack Problem using Graph Neural Networks. (arXiv:2211.13436v3 [cs.AI] UPDATED)
    The Bilevel Optimization Problem is a hierarchical optimization problem with two agents, a leader and a follower. The leader make their own decisions first, and the followers make the best choices accordingly. The leader knows the information of the followers, and the goal of the problem is to find the optimal solution by considering the reactions of the followers from the leader's point of view. For the Bilevel Optimization Problem, there are no general and efficient algorithms or commercial solvers to get an optimal solution, and it is very difficult to get a good solution even for a simple problem. In this paper, we propose a deep learning approach using Graph Neural Networks to solve the bilevel knapsack problem. We train the model to predict the leader's solution and use it to transform the hierarchical optimization problem into a single-level optimization problem to get the solution. Our model found the feasible solution that was about 500 times faster than the exact algorithm with $1.7\%$ optimal gap. Also, our model performed well on problems of different size from the size it was trained on.
    Neither hype nor gloom do DNNs justice. (arXiv:2312.05355v1 [cs.LG])
    Neither the hype exemplified in some exaggerated claims about deep neural networks (DNNs), nor the gloom expressed by Bowers et al. do DNNs as models in vision science justice: DNNs rapidly evolve, and today's limitations are often tomorrow's successes. In addition, providing explanations as well as prediction and image-computability are model desiderata; one should not be favoured at the expense of the other.
    FedAVO: Improving Communication Efficiency in Federated Learning with African Vultures Optimizer. (arXiv:2305.01154v3 [cs.LG] UPDATED)
    Federated Learning (FL), a distributed machine learning technique has recently experienced tremendous growth in popularity due to its emphasis on user data privacy. However, the distributed computations of FL can result in constrained communication and drawn-out learning processes, necessitating the client-server communication cost optimization. The ratio of chosen clients and the quantity of local training passes are two hyperparameters that have a significant impact on FL performance. Due to different training preferences across various applications, it can be difficult for FL practitioners to manually select such hyperparameters. In our research paper, we introduce FedAVO, a novel FL algorithm that enhances communication effectiveness by selecting the best hyperparameters leveraging the African Vulture Optimizer (AVO). Our research demonstrates that the communication costs associated with FL operations can be substantially reduced by adopting AVO for FL hyperparameter adjustment. Through extensive evaluations of FedAVO on benchmark datasets, we show that FedAVO achieves significant improvement in terms of model accuracy and communication round, particularly with realistic cases of Non-IID datasets. Our extensive evaluation of the FedAVO algorithm identifies the optimal hyperparameters that are appropriately fitted for the benchmark datasets, eventually increasing global model accuracy by 6% in comparison to the state-of-the-art FL algorithms (such as FedAvg, FedProx, FedPSO, etc.).
    A Review of Machine Learning Methods Applied to Video Analysis Systems. (arXiv:2312.05352v1 [cs.CV])
    The paper provides a survey of the development of machine-learning techniques for video analysis. The survey provides a summary of the most popular deep learning methods used for human activity recognition. We discuss how popular architectures perform on standard datasets and highlight the differences from real-life datasets dominated by multiple activities performed by multiple participants over long periods. For real-life datasets, we describe the use of low-parameter models (with 200X or 1,000X fewer parameters) that are trained to detect a single activity after the relevant objects have been successfully detected. Our survey then turns to a summary of machine learning methods that are specifically developed for working with a small number of labeled video samples. Our goal here is to describe modern techniques that are specifically designed so as to minimize the amount of ground truth that is needed for training and testing video analysis systems. We provide summaries of the development of self-supervised learning, semi-supervised learning, active learning, and zero-shot learning for applications in video analysis. For each method, we provide representative examples.
    Learning Confident Classifiers in the Presence of Label Noise. (arXiv:2301.00524v2 [cs.CV] UPDATED)
    The success of Deep Neural Network (DNN) models significantly depends on the quality of provided annotations. In medical image segmentation, for example, having multiple expert annotations for each data point is common to minimize subjective annotation bias. Then, the goal of estimation is to filter out the label noise and recover the ground-truth masks, which are not explicitly given. This paper proposes a probabilistic model for noisy observations that allows us to build a confident classification and segmentation models. To accomplish it, we explicitly model label noise and introduce a new information-based regularization that pushes the network to recover the ground-truth labels. In addition, for segmentation task we adjust the loss function by prioritizing learning in high-confidence regions where all the annotators agree on labeling. We evaluate the proposed method on a series of classification tasks such as noisy versions of MNIST, CIFAR-10, Fashion-MNIST datasets as well as CIFAR-10N, which is real-world dataset with noisy human annotations. Additionally, for segmentation task, we consider several medical imaging datasets, such as, LIDC and RIGA that reflect real-world inter-variability among multiple annotators. Our experiments show that our algorithm outperforms state-of-the-art solutions for the considered classification and segmentation problems.
    Stateful Large Language Model Serving with Pensieve. (arXiv:2312.05516v1 [cs.LG])
    Large Language Models (LLMs) have recently experienced great success, as evident in the widespread popularity of ChatGPT. Existing LLM serving systems are stateless across requests. Consequently, when LLMs are used in the common setting of multi-turn conversations, a growing log of the conversation history must be processed alongside any request by the serving system at each turn, resulting in repeated history processing. In this paper, we design $Pensieve$, a system optimized for multi-turn conversation LLM serving. $Pensieve$ maintains the conversation state across requests by caching previously processed history to avoid duplicate processing. $Pensieve$'s multi-tier caching strategy can utilize both GPU and CPU memory to efficiently store and retrieve cached data. $Pensieve$ also generalizes the recent PagedAttention kernel to support attention between multiple input tokens with a GPU cache spread over non-contiguous memory. Our evaluation shows that $Pensieve$ is able to achieve 1.51-1.95x throughput compared to vLLM and reduce latency by 60-75%.
    Mitigating Nonlinear Algorithmic Bias in Binary Classification. (arXiv:2312.05429v1 [cs.LG])
    This paper proposes the use of causal modeling to detect and mitigate algorithmic bias that is nonlinear in the protected attribute. We provide a general overview of our approach. We use the German Credit data set, which is available for download from the UC Irvine Machine Learning Repository, to develop (1) a prediction model, which is treated as a black box, and (2) a causal model for bias mitigation. In this paper, we focus on age bias and the problem of binary classification. We show that the probability of getting correctly classified as "low risk" is lowest among young people. The probability increases with age nonlinearly. To incorporate the nonlinearity into the causal model, we introduce a higher order polynomial term. Based on the fitted causal model, the de-biased probability estimates are computed, showing improved fairness with little impact on overall classification accuracy. Causal modeling is intuitive and, hence, its use can enhance explicability and promotes trust among different stakeholders of AI.
    On the Performance of Temporal Difference Learning With Neural Networks. (arXiv:2312.05397v1 [cs.LG])
    Neural Temporal Difference (TD) Learning is an approximate temporal difference method for policy evaluation that uses a neural network for function approximation. Analysis of Neural TD Learning has proven to be challenging. In this paper we provide a convergence analysis of Neural TD Learning with a projection onto $B(\theta_0, \omega)$, a ball of fixed radius $\omega$ around the initial point $\theta_0$. We show an approximation bound of $O(\epsilon) + \tilde{O} (1/\sqrt{m})$ where $\epsilon$ is the approximation quality of the best neural network in $B(\theta_0, \omega)$ and $m$ is the width of all hidden layers in the network.
    D3A-TS: Denoising-Driven Data Augmentation in Time Series. (arXiv:2312.05550v1 [cs.AI])
    It has been demonstrated that the amount of data is crucial in data-driven machine learning methods. Data is always valuable, but in some tasks, it is almost like gold. This occurs in engineering areas where data is scarce or very expensive to obtain, such as predictive maintenance, where faults are rare. In this context, a mechanism to generate synthetic data can be very useful. While in fields such as Computer Vision or Natural Language Processing synthetic data generation has been extensively explored with promising results, in other domains such as time series it has received less attention. This work specifically focuses on studying and analyzing the use of different techniques for data augmentation in time series for classification and regression problems. The proposed approach involves the use of diffusion probabilistic models, which have recently achieved successful results in the field of Image Processing, for data augmentation in time series. Additionally, the use of meta-attributes to condition the data augmentation process is investigated. The results highlight the high utility of this methodology in creating synthetic data to train classification and regression models. To assess the results, six different datasets from diverse domains were employed, showcasing versatility in terms of input size and output types. Finally, an extensive ablation study is conducted to further support the obtained outcomes.
    A Lightweight and Gradient-Stable Neural Layer. (arXiv:2106.04088v3 [cs.LG] UPDATED)
    We propose a neural-layer architecture based on Householder weighting and absolute-value activating, hence called Householder-absolute neural layer or simply Han-layer. Compared to a fully connected layer with $d$-neurons and $d$ outputs, a Han-layer reduces the number of parameters and the corresponding complexity from $O(d^2)$ to $O(d)$. The Han-layer structure guarantees two desirable properties: (1) gradient stability (free of vanishing or exploding gradient), and (2) 1-Lipschitz continuity. Extensive numerical experiments show that one can strategically use Han-layers to replace fully connected (FC) layers, reducing the number of model parameters while maintaining or even improving the generalization performance. We will also showcase the capabilities of the Han-layer architecture on a few small stylized models, and discuss its current limitations.
    Model Copyright Protection in Buyer-seller Environment. (arXiv:2312.05262v1 [cs.CR])
    Training a deep neural network (DNN) requires a high computational cost. Buying models from sellers with a large number of computing resources has become prevailing. However, the buyer-seller environment is not always trusted. To protect the neural network models from leaking in an untrusted environment, we propose a novel copyright protection scheme for DNN using an input-sensitive neural network (ISNN). The main idea of ISNN is to make a DNN sensitive to the key and copyright information. Therefore, only the buyer with a correct key can utilize the ISNN. During the training phase, we add a specific perturbation to the clean images and mark them as legal inputs, while the other inputs are treated as illegal input. We design a loss function to make the outputs of legal inputs close to the true ones, while the illegal inputs are far away from true results. Experimental results demonstrate that the proposed scheme is effective, valid, and secure.
    Aligner: One Global Token is Worth Millions of Parameters When Aligning Large Language Models. (arXiv:2312.05503v1 [cs.CL])
    We introduce Aligner, a novel Parameter-Efficient Fine-Tuning (PEFT) method for aligning multi-billion-parameter-sized Large Language Models (LLMs). Aligner employs a unique design that constructs a globally shared set of tunable tokens that modify the attention of every layer. Remarkably with this method, even when using one token accounting for a mere 5,000 parameters, Aligner can still perform comparably well to state-of-the-art LLM adaptation methods like LoRA that require millions of parameters. This capacity is substantiated in both instruction following and value alignment tasks. Besides the multiple order-of-magnitude improvement in parameter efficiency, the insight Aligner provides into the internal mechanisms of LLMs is also valuable. The architectural features and efficacy of our method, in addition to our experiments demonstrate that an LLM separates its internal handling of "form" and "knowledge" in a somewhat orthogonal manner. This finding promises to motivate new research into LLM mechanism understanding and value alignment.
    Better Neural PDE Solvers Through Data-Free Mesh Movers. (arXiv:2312.05583v1 [cs.LG])
    Recently, neural networks have been extensively employed to solve partial differential equations (PDEs) in physical system modeling. While major studies focus on learning system evolution on predefined static mesh discretizations, some methods utilize reinforcement learning or supervised learning techniques to create adaptive and dynamic meshes, due to the dynamic nature of these systems. However, these approaches face two primary challenges: (1) the need for expensive optimal mesh data, and (2) the change of the solution space's degree of freedom and topology during mesh refinement. To address these challenges, this paper proposes a neural PDE solver with a neural mesh adapter. To begin with, we introduce a novel data-free neural mesh adaptor, called Data-free Mesh Mover (DMM), with two main innovations. Firstly, it is an operator that maps the solution to adaptive meshes and is trained using the Monge-Ampere equation without optimal mesh data. Secondly, it dynamically changes the mesh by moving existing nodes rather than adding or deleting nodes and edges. Theoretical analysis shows that meshes generated by DMM have the lowest interpolation error bound. Based on DMM, to efficiently and accurately model dynamic systems, we develop a moving mesh based neural PDE solver (MM-PDE) that embeds the moving mesh with a two-branch architecture and a learnable interpolation framework to preserve information within the data. Empirical experiments demonstrate that our method generates suitable meshes and considerably enhances accuracy when modeling widely considered PDE systems.
    Federated Causality Learning with Explainable Adaptive Optimization. (arXiv:2312.05540v1 [cs.LG])
    Discovering the causality from observational data is a crucial task in various scientific domains. With increasing awareness of privacy, data are not allowed to be exposed, and it is very hard to learn causal graphs from dispersed data, since these data may have different distributions. In this paper, we propose a federated causal discovery strategy (FedCausal) to learn the unified global causal graph from decentralized heterogeneous data. We design a global optimization formula to naturally aggregate the causal graphs from client data and constrain the acyclicity of the global graph without exposing local data. Unlike other federated causal learning algorithms, FedCausal unifies the local and global optimizations into a complete directed acyclic graph (DAG) learning process with a flexible optimization objective. We prove that this optimization objective has a high interpretability and can adaptively handle homogeneous and heterogeneous data. Experimental results on synthetic and real datasets show that FedCausal can effectively deal with non-independently and identically distributed (non-iid) data and has a superior performance.
    Revisiting RIP guarantees for sketching operators on mixture models. (arXiv:2312.05573v1 [stat.ML])
    In the context of sketching for compressive mixture modeling, we revisit existing proofs of the Restricted Isometry Property of sketching operators with respect to certain mixtures models. After examining the shortcomings of existing guarantees, we propose an alternative analysis that circumvents the need to assume importance sampling when drawing random Fourier features to build random sketching operators. Our analysis is based on new deterministic bounds on the restricted isometry constant that depend solely on the set of frequencies used to define the sketching operator; then we leverage these bounds to establish concentration inequalities for random sketching operators that lead to the desired RIP guarantees. Our analysis also opens the door to theoretical guarantees for structured sketching with frequencies associated to fast random linear operators.
    Improving Adversarial Robust Fairness via Anti-Bias Soft Label Distillation. (arXiv:2312.05508v1 [cs.LG])
    Adversarial Training (AT) has been widely proved to be an effective method to improve the adversarial robustness against adversarial examples for Deep Neural Networks (DNNs). As a variant of AT, Adversarial Robustness Distillation (ARD) has demonstrated its superior performance in improving the robustness of small student models with the guidance of large teacher models. However, both AT and ARD encounter the robust fairness problem: these models exhibit strong robustness when facing part of classes (easy class), but weak robustness when facing others (hard class). In this paper, we give an in-depth analysis of the potential factors and argue that the smoothness degree of samples' soft labels for different classes (i.e., hard class or easy class) will affect the robust fairness of DNN models from both empirical observation and theoretical analysis. Based on the above finding, we propose an Anti-Bias Soft Label Distillation (ABSLD) method to mitigate the adversarial robust fairness problem within the framework of Knowledge Distillation (KD). Specifically, ABSLD adaptively reduces the student's error risk gap between different classes to achieve fairness by adjusting the class-wise smoothness degree of samples' soft labels during the training process, and the smoothness degree of soft labels is controlled by assigning different temperatures in KD to different classes. Extensive experiments demonstrate that ABSLD outperforms state-of-the-art AT, ARD, and robust fairness methods in terms of overall performance of robustness and fairness.
    Deeper Understanding of Black-box Predictions via Generalized Influence Functions. (arXiv:2312.05586v1 [cs.LG])
    Influence functions (IFs) elucidate how learning data affects model behavior. However, growing non-convexity and the number of parameters in modern large-scale models lead to imprecise influence approximation and instability in computations. We highly suspect that the first-order approximation in large models causes such fragility, as IFs change all parameters including possibly nuisance parameters that are irrelevant to the examined data. Thus, we attempt to selectively analyze parameters associated with the data. However, simply computing influence from the chosen parameters can be misleading, as it fails to nullify the subliminal impact of unselected parameters. Our approach introduces generalized IFs, precisely estimating target parameters' influence while considering fixed parameters' effects. Unlike the classic IFs, we newly adopt a method to identify pertinent target parameters closely associated with the analyzed data. Furthermore, we tackle computational instability with a robust inverse-Hessian-vector product approximation. Remarkably, the proposed approximation algorithm guarantees convergence regardless of the network configurations. We evaluated our approach on ResNet-18 and VGG-11 for class removal and backdoor model recovery. Modifying just 10\% of the network yields results comparable to the network retrained from scratch. Aligned with our first guess, we also confirm that modifying an excessive number of parameters results in a decline in network utility. We believe our proposal can become a versatile tool for model analysis across various AI domains, appealing to both specialists and general readers. Codes are available at https://github.com/hslyu/GIF.
    A Unified Multi-Phase CT Synthesis and Classification Framework for Kidney Cancer Diagnosis with Incomplete Data. (arXiv:2312.05548v1 [eess.IV])
    Multi-phase CT is widely adopted for the diagnosis of kidney cancer due to the complementary information among phases. However, the complete set of multi-phase CT is often not available in practical clinical applications. In recent years, there have been some studies to generate the missing modality image from the available data. Nevertheless, the generated images are not guaranteed to be effective for the diagnosis task. In this paper, we propose a unified framework for kidney cancer diagnosis with incomplete multi-phase CT, which simultaneously recovers missing CT images and classifies cancer subtypes using the completed set of images. The advantage of our framework is that it encourages a synthesis model to explicitly learn to generate missing CT phases that are helpful for classifying cancer subtypes. We further incorporate lesion segmentation network into our framework to exploit lesion-level features for effective cancer classification in the whole CT volumes. The proposed framework is based on fully 3D convolutional neural networks to jointly optimize both synthesis and classification of 3D CT volumes. Extensive experiments on both in-house and external datasets demonstrate the effectiveness of our framework for the diagnosis with incomplete data compared with state-of-the-art baselines. In particular, cancer subtype classification using the completed CT data by our method achieves higher performance than the classification using the given incomplete data.
    A sampling criterion for constrained Bayesian optimization with uncertainties. (arXiv:2103.05706v4 [stat.ML] UPDATED)
    We consider the problem of chance constrained optimization where it is sought to optimize a function and satisfy constraints, both of which are affected by uncertainties. The real world declinations of this problem are particularly challenging because of their inherent computational cost. To tackle such problems, we propose a new Bayesian optimization method. It applies to the situation where the uncertainty comes from some of the inputs, so that it becomes possible to define an acquisition criterion in the joint controlled-uncontrolled input space. The main contribution of this work is an acquisition criterion that accounts for both the average improvement in objective function and the constraint reliability. The criterion is derived following the Stepwise Uncertainty Reduction logic and its maximization provides both optimal controlled and uncontrolled parameters. Analytical expressions are given to efficiently calculate the criterion. Numerical studies on test functions are presented. It is found through experimental comparisons with alternative sampling criteria that the adequation between the sampling criterion and the problem contributes to the efficiency of the overall optimization. As a side result, an expression for the variance of the improvement is given.
    3D Copy-Paste: Physically Plausible Object Insertion for Monocular 3D Detection. (arXiv:2312.05277v1 [cs.CV])
    A major challenge in monocular 3D object detection is the limited diversity and quantity of objects in real datasets. While augmenting real scenes with virtual objects holds promise to improve both the diversity and quantity of the objects, it remains elusive due to the lack of an effective 3D object insertion method in complex real captured scenes. In this work, we study augmenting complex real indoor scenes with virtual objects for monocular 3D object detection. The main challenge is to automatically identify plausible physical properties for virtual assets (e.g., locations, appearances, sizes, etc.) in cluttered real scenes. To address this challenge, we propose a physically plausible indoor 3D object insertion approach to automatically copy virtual objects and paste them into real scenes. The resulting objects in scenes have 3D bounding boxes with plausible physical locations and appearances. In particular, our method first identifies physically feasible locations and poses for the inserted objects to prevent collisions with the existing room layout. Subsequently, it estimates spatially-varying illumination for the insertion location, enabling the immersive blending of the virtual objects into the original scene with plausible appearances and cast shadows. We show that our augmentation method significantly improves existing monocular 3D object models and achieves state-of-the-art performance. For the first time, we demonstrate that a physically plausible 3D object insertion, serving as a generative data augmentation technique, can lead to significant improvements for discriminative downstream tasks such as monocular 3D object detection. Project website: https://gyhandy.github.io/3D-Copy-Paste/
    Multi-granularity Causal Structure Learning. (arXiv:2312.05549v1 [cs.LG])
    Unveil, model, and comprehend the causal mechanisms underpinning natural phenomena stand as fundamental endeavors across myriad scientific disciplines. Meanwhile, new knowledge emerges when discovering causal relationships from data. Existing causal learning algorithms predominantly focus on the isolated effects of variables, overlook the intricate interplay of multiple variables and their collective behavioral patterns. Furthermore, the ubiquity of high-dimensional data exacts a substantial temporal cost for causal algorithms. In this paper, we develop a novel method called MgCSL (Multi-granularity Causal Structure Learning), which first leverages sparse auto-encoder to explore coarse-graining strategies and causal abstractions from micro-variables to macro-ones. MgCSL then takes multi-granularity variables as inputs to train multilayer perceptrons and to delve the causality between variables. To enhance the efficacy on high-dimensional data, MgCSL introduces a simplified acyclicity constraint to adeptly search the directed acyclic graph among variables. Experimental results show that MgCSL outperforms competitive baselines, and finds out explainable causal connections on fMRI datasets.
    Boosting the Cross-Architecture Generalization of Dataset Distillation through an Empirical Study. (arXiv:2312.05598v1 [cs.LG])
    The poor cross-architecture generalization of dataset distillation greatly weakens its practical significance. This paper attempts to mitigate this issue through an empirical study, which suggests that the synthetic datasets undergo an inductive bias towards the distillation model. Therefore, the evaluation model is strictly confined to having similar architectures of the distillation model. We propose a novel method of EvaLuation with distillation Feature (ELF), which utilizes features from intermediate layers of the distillation model for the cross-architecture evaluation. In this manner, the evaluation model learns from bias-free knowledge therefore its architecture becomes unfettered while retaining performance. By performing extensive experiments, we successfully prove that ELF can well enhance the cross-architecture generalization of current DD methods. Code of this project is at \url{https://github.com/Lirui-Zhao/ELF}.
    Transition Path Sampling with Boltzmann Generator-based MCMC Moves. (arXiv:2312.05340v1 [q-bio.QM])
    Sampling all possible transition paths between two 3D states of a molecular system has various applications ranging from catalyst design to drug discovery. Current approaches to sample transition paths use Markov chain Monte Carlo and rely on time-intensive molecular dynamics simulations to find new paths. Our approach operates in the latent space of a normalizing flow that maps from the molecule's Boltzmann distribution to a Gaussian, where we propose new paths without requiring molecular simulations. Using alanine dipeptide, we explore Metropolis-Hastings acceptance criteria in the latent space for exact sampling and investigate different latent proposal mechanisms.
    Spectroscopy-Guided Discovery of Three-Dimensional Structures of Disordered Materials with Diffusion Models. (arXiv:2312.05472v1 [cond-mat.mtrl-sci])
    The ability to rapidly develop materials with desired properties has a transformative impact on a broad range of emerging technologies. In this work, we introduce a new framework based on the diffusion model, a recent generative machine learning method to predict 3D structures of disordered materials from a target property. For demonstration, we apply the model to identify the atomic structures of amorphous carbons ($a$-C) as a representative material system from the target X-ray absorption near edge structure (XANES) spectra--a common experimental technique to probe atomic structures of materials. We show that conditional generation guided by XANES spectra reproduces key features of the target structures. Furthermore, we show that our model can steer the generative process to tailor atomic arrangements for a specific XANES spectrum. Finally, our generative model exhibits a remarkable scale-agnostic property, thereby enabling generation of realistic, large-scale structures through learning from a small-scale dataset (i.e., with small unit cells). Our work represents a significant stride in bridging the gap between materials characterization and atomic structure determination; in addition, it can be leveraged for materials discovery in exploring various material properties as targeted.
    Explainable Identification of Hate Speech towards Islam using Graph Neural Networks. (arXiv:2311.04916v2 [cs.CL] UPDATED)
    Islamophobic language is a prevalent challenge on online social interaction platforms. Identifying and eliminating such hatred is a crucial step towards a future of harmony and peace. This study presents a novel paradigm for identifying and explaining hate speech towards Islam using graph neural networks. Utilizing the intrinsic ability of graph neural networks to find, extract, and use relationships across disparate data points, our model consistently achieves outstanding performance while offering explanations for the underlying correlations and causation.
    Model Extraction Attacks Revisited. (arXiv:2312.05386v1 [cs.LG])
    Model extraction (ME) attacks represent one major threat to Machine-Learning-as-a-Service (MLaaS) platforms by ``stealing'' the functionality of confidential machine-learning models through querying black-box APIs. Over seven years have passed since ME attacks were first conceptualized in the seminal work. During this period, substantial advances have been made in both ME attacks and MLaaS platforms, raising the intriguing question: How has the vulnerability of MLaaS platforms to ME attacks been evolving? In this work, we conduct an in-depth study to answer this critical question. Specifically, we characterize the vulnerability of current, mainstream MLaaS platforms to ME attacks from multiple perspectives including attack strategies, learning techniques, surrogate-model design, and benchmark tasks. Many of our findings challenge previously reported results, suggesting emerging patterns of ME vulnerability. Further, by analyzing the vulnerability of the same MLaaS platforms using historical datasets from the past four years, we retrospectively characterize the evolution of ME vulnerability over time, leading to a set of interesting findings. Finally, we make suggestions about improving the current practice of MLaaS in terms of attack robustness. Our study sheds light on the current state of ME vulnerability in the wild and points to several promising directions for future research.
    Automated Small Kidney Cancer Detection in Non-Contrast Computed Tomography. (arXiv:2312.05258v1 [eess.IV])
    This study introduces an automated pipeline for renal cancer (RC) detection in non-contrast computed tomography (NCCT). In the development of our pipeline, we test three detections models: a shape model, a 2D-, and a 3D axial-sample model. Training (n=1348) and testing (n=64) data were gathered from open sources (KiTS23, Abdomen1k, CT-ORG) and Cambridge University Hospital (CUH). Results from cross-validation and testing revealed that the 2D axial sample model had the highest small ($\leq$40mm diameter) RC detection area under the curve (AUC) of 0.804. Our pipeline achieves 61.9\% sensitivity and 92.7\% specificity for small kidney cancers on unseen test data. Our results are much more accurate than previous attempts to automatically detect small renal cancers in NCCT, the most likely imaging modality for RC screening. This pipeline offers a promising advance that may enable screening for kidney cancers.
    A Closer Look at Advantage-Filtered Behavioral Cloning in High-Noise Datasets. (arXiv:2110.04698v2 [cs.LG] UPDATED)
    Recent Offline Reinforcement Learning methods have succeeded in learning high-performance policies from fixed datasets of experience. A particularly effective approach learns to first identify and then mimic optimal decision-making strategies. Our work evaluates this method's ability to scale to vast datasets consisting almost entirely of sub-optimal noise. A thorough investigation on a custom benchmark helps identify several key challenges involved in learning from high-noise datasets. We re-purpose prioritized experience sampling to locate expert-level demonstrations among millions of low-performance samples. This modification enables offline agents to learn state-of-the-art policies in benchmark tasks using datasets where expert actions are outnumbered nearly 65:1.
    How to Backdoor HyperNetwork in Personalized Federated Learning?. (arXiv:2201.07063v3 [cs.LG] UPDATED)
    This paper explores previously unknown backdoor risks in HyperNet-based personalized federated learning (HyperNetFL) through poisoning attacks. Based upon that, we propose a novel model transferring attack (called HNTroj), i.e., the first of its kind, to transfer a local backdoor infected model to all legitimate and personalized local models, which are generated by the HyperNetFL model, through consistent and effective malicious local gradients computed across all compromised clients in the whole training process. As a result, HNTroj reduces the number of compromised clients needed to successfully launch the attack without any observable signs of sudden shifts or degradation regarding model utility on legitimate data samples making our attack stealthy. To defend against HNTroj, we adapted several backdoor-resistant FL training algorithms into HyperNetFL. An extensive experiment that is carried out using several benchmark datasets shows that HNTroj significantly outperforms data poisoning and model replacement attacks and bypasses robust training algorithms even with modest numbers of compromised clients.
    ESPN: Memory-Efficient Multi-Vector Information Retrieval. (arXiv:2312.05417v1 [cs.IR])
    Recent advances in large language models have demonstrated remarkable effectiveness in information retrieval (IR) tasks. While many neural IR systems encode queries and documents into single-vector representations, multi-vector models elevate the retrieval quality by producing multi-vector representations and facilitating similarity searches at the granularity of individual tokens. However, these models significantly amplify memory and storage requirements for retrieval indices by an order of magnitude. This escalation in index size renders the scalability of multi-vector IR models progressively challenging due to their substantial memory demands. We introduce Embedding from Storage Pipelined Network (ESPN) where we offload the entire re-ranking embedding tables to SSDs and reduce the memory requirements by 5-16x. We design a software prefetcher with hit rates exceeding 90%, improving SSD based retrieval up to 6.4x, and demonstrate that we can maintain near memory levels of query latency even for large query batch sizes.
    FreeFlow: A Comprehensive Understanding on Diffusion Probabilistic Models via Optimal Transport. (arXiv:2312.05486v1 [cs.AI])
    The blooming diffusion probabilistic models (DPMs) have garnered significant interest due to their impressive performance and the elegant inspiration they draw from physics. While earlier DPMs relied upon the Markovian assumption, recent methods based on differential equations have been rapidly applied to enhance the efficiency and capabilities of these models. However, a theoretical interpretation encapsulating these diverse algorithms is insufficient yet pressingly required to guide further development of DPMs. In response to this need, we present FreeFlow, a framework that provides a thorough explanation of the diffusion formula as time-dependent optimal transport, where the evolutionary pattern of probability density is given by the gradient flows of a functional defined in Wasserstein space. Crucially, our framework necessitates a unified description that not only clarifies the subtle mechanism of DPMs but also indicates the roots of some defects through creative involvement of Lagrangian and Eulerian views to understand the evolution of probability flow. We particularly demonstrate that the core equation of FreeFlow condenses all stochastic and deterministic DPMs into a single case, showcasing the expansibility of our method. Furthermore, the Riemannian geometry employed in our work has the potential to bridge broader subjects in mathematics, which enable the involvement of more profound tools for the establishment of more outstanding and generalized models in the future.
    All Rivers Run to the Sea: Private Learning with Asymmetric Flows. (arXiv:2312.05264v1 [cs.CR])
    Data privacy is of great concern in cloud machine-learning service platforms, when sensitive data are exposed to service providers. While private computing environments (e.g., secure enclaves), and cryptographic approaches (e.g., homomorphic encryption) provide strong privacy protection, their computing performance still falls short compared to cloud GPUs. To achieve privacy protection with high computing performance, we propose Delta, a new private training and inference framework, with comparable model performance as non-private centralized training. Delta features two asymmetric data flows: the main information-sensitive flow and the residual flow. The main part flows into a small model while the residuals are offloaded to a large model. Specifically, Delta embeds the information-sensitive representations into a low-dimensional space while pushing the information-insensitive part into high-dimension residuals. To ensure privacy protection, the low-dimensional information-sensitive part is secured and fed to a small model in a private environment. On the other hand, the residual part is sent to fast cloud GPUs, and processed by a large model. To further enhance privacy and reduce the communication cost, Delta applies a random binary quantization technique along with a DP-based technique to the residuals before sharing them with the public platform. We theoretically show that Delta guarantees differential privacy in the public environment and greatly reduces the complexity in the private environment. We conduct empirical analyses on CIFAR-10, CIFAR-100 and ImageNet datasets and ResNet-18 and ResNet-34, showing that Delta achieves strong privacy protection, fast training, and inference without significantly compromising the model utility.
    Making Large Language Models Better Knowledge Miners for Online Marketing with Progressive Prompting Augmentation. (arXiv:2312.05276v1 [cs.AI])
    Nowadays, the rapid development of mobile economy has promoted the flourishing of online marketing campaigns, whose success greatly hinges on the efficient matching between user preferences and desired marketing campaigns where a well-established Marketing-oriented Knowledge Graph (dubbed as MoKG) could serve as the critical "bridge" for preference propagation. In this paper, we seek to carefully prompt a Large Language Model (LLM) with domain-level knowledge as a better marketing-oriented knowledge miner for marketing-oriented knowledge graph construction, which is however non-trivial, suffering from several inevitable issues in real-world marketing scenarios, i.e., uncontrollable relation generation of LLMs,insufficient prompting ability of a single prompt, the unaffordable deployment cost of LLMs. To this end, we propose PAIR, a novel Progressive prompting Augmented mIning fRamework for harvesting marketing-oriented knowledge graph with LLMs. In particular, we reduce the pure relation generation to an LLM based adaptive relation filtering process through the knowledge-empowered prompting technique. Next, we steer LLMs for entity expansion with progressive prompting augmentation,followed by a reliable aggregation with comprehensive consideration of both self-consistency and semantic relatedness. In terms of online serving, we specialize in a small and white-box PAIR (i.e.,LightPAIR),which is fine-tuned with a high-quality corpus provided by a strong teacher-LLM. Extensive experiments and practical applications in audience targeting verify the effectiveness of the proposed (Light)PAIR.
    Model Evaluation for Domain Identification of Unknown Classes in Open-World Recognition: A Proposal. (arXiv:2312.05454v1 [cs.CV])
    Open-World Recognition (OWR) is an emerging field that makes a machine learning model competent in rejecting the unknowns, managing them, and incrementally adding novel samples to the base knowledge. However, this broad objective is not practical for an agent that works on a specific task. Not all rejected samples will be used for learning continually in the future. Some novel images in the open environment may not belong to the domain of interest. Hence, identifying the unknown in the domain of interest is essential for a machine learning model to learn merely the important samples. In this study, we propose an evaluation protocol for estimating a model's capability in separating unknown in-domain (ID) and unknown out-of-domain (OOD). We evaluated using three approaches with an unknown domain and demonstrated the possibility of identifying the domain of interest using the pre-trained parameters through traditional transfer learning, Automated Machine Learning (AutoML), and Nearest Class Mean (NCM) classifier with First Integer Neighbor Clustering Hierarchy (FINCH). We experimented with five different domains: garbage, food, dogs, plants, and birds. The results show that all approaches can be used as an initial baseline yielding a good accuracy. In addition, a Balanced Accuracy (BACCU) score from a pre-trained model indicates a tendency to excel in one or more domains of interest. We observed that MobileNetV3 yielded the highest BACCU score for the garbage domain and surpassed complex models such as the transformer network. Meanwhile, our results also suggest that a strong representation in the pre-trained model is important for identifying unknown classes in the same domain. This study could open the bridge toward open-world recognition in domain-specific tasks where the relevancy of the unknown classes is vital.
    AI Competitions and Benchmarks: The life cycle of challenges and benchmarks. (arXiv:2312.05296v1 [cs.LG])
    Data Science research is undergoing a revolution fueled by the transformative power of technology, the Internet, and an ever increasing computational capacity. The rate at which sophisticated algorithms can be developed is unprecedented, yet they remain outpaced by the massive amounts of data that are increasingly available to researchers. Here we argue for the need to creatively leverage the scientific research and algorithm development community as an axis of robust innovation. Engaging these communities in the scientific discovery enterprise by critical assessments, community experiments, and/or crowdsourcing will multiply opportunities to develop new data driven, reproducible and well benchmarked algorithmic solutions to fundamental and applied problems of current interest. Coordinated community engagement in the analysis of highly complex and massive data has emerged as one approach to find robust methodologies that best address these challenges. When community engagement is done in the form of competitions, also known as challenges, the validation of the analytical methodology is inherently addressed, establishing performance benchmarks. Finally, challenges foster open innovation across multiple disciplines to create communities that collaborate directly or indirectly to address significant scientific gaps. Together, participants can solve important problems as varied as health research, climate change, and social equity. Ultimately, challenges can catalyze and accelerate the synthesis of complex data into knowledge or actionable information, and should be viewed a powerful tool to make lasting social and research contributions.
    Rethinking materials simulations: Blending direct numerical simulations with neural operators. (arXiv:2312.05410v1 [cs.LG])
    Direct numerical simulations (DNS) are accurate but computationally expensive for predicting materials evolution across timescales, due to the complexity of the underlying evolution equations, the nature of multiscale spatio-temporal interactions, and the need to reach long-time integration. We develop a new method that blends numerical solvers with neural operators to accelerate such simulations. This methodology is based on the integration of a community numerical solver with a U-Net neural operator, enhanced by a temporal-conditioning mechanism that enables accurate extrapolation and efficient time-to-solution predictions of the dynamics. We demonstrate the effectiveness of this framework on simulations of microstructure evolution during physical vapor deposition modeled via the phase-field method. Such simulations exhibit high spatial gradients due to the co-evolution of different material phases with simultaneous slow and fast materials dynamics. We establish accurate extrapolation of the coupled solver with up to 16.5$\times$ speed-up compared to DNS. This methodology is generalizable to a broad range of evolutionary models, from solid mechanics, to fluid dynamics, geophysics, climate, and more.
    Frugal LMs Trained to Invoke Symbolic Solvers Achieve Parameter-Efficient Arithmetic Reasoning. (arXiv:2312.05571v1 [cs.AI])
    Large Language Models (LLM) exhibit zero-shot mathematical reasoning capacity as a behavior emergent with scale, commonly manifesting as chain-of-thoughts (CoT) reasoning. However, multiple empirical findings suggest that this prowess is exclusive to LLMs with exorbitant sizes (beyond 50 billion parameters). Meanwhile, educational neuroscientists suggest that symbolic algebraic manipulation be introduced around the same time as arithmetic word problems to modularize language-to-formulation, symbolic manipulation of the formulation, and endgame arithmetic. In this paper, we start with the hypothesis that much smaller LMs, which are weak at multi-step reasoning, can achieve reasonable arithmetic reasoning if arithmetic word problems are posed as a formalize-then-solve task. In our architecture, which we call SYRELM, the LM serves the role of a translator to map natural language arithmetic questions into a formal language (FL) description. A symbolic solver then evaluates the FL expression to obtain the answer. A small frozen LM, equipped with an efficient low-rank adapter, is capable of generating FL expressions that incorporate natural language descriptions of the arithmetic problem (e.g., variable names and their purposes, formal expressions combining variables, etc.). We adopt policy-gradient reinforcement learning to train the adapted LM, informed by the non-differentiable symbolic solver. This marks a sharp departure from the recent development in tool-augmented LLMs, in which the external tools (e.g., calculator, Web search, etc.) are essentially detached from the learning phase of the LM. SYRELM shows massive improvements (e.g., +30.65 absolute point improvement in accuracy on the SVAMP dataset using GPT-J 6B model) over base LMs, while keeping our testbed easy to diagnose, interpret and within reach of most researchers.
    Knowledge Transfer from High-Resource to Low-Resource Programming Languages for Code LLMs. (arXiv:2308.09895v3 [cs.PL] UPDATED)
    Over the past few years, Large Language Models of Code (Code LLMs) have started to have a significant impact on programming practice. Code LLMs are also emerging as building blocks for research in programming languages and software engineering. However, Code LLMs produce impressive results on programming languages that are well represented in their training data (e.g., Java, Python, or JavaScript), but struggle with low-resource languages that have limited training data available. Low resource languages include OCaml, Racket, and several others. This paper presents an effective approach for boosting the performance of Code LLMs on low-resource languages using semi-synthetic data. Our approach, MultiPL-T, translates training data from high-resource languages into training data for low-resource languages in the following way. 1) We use a Code LLM to synthesize tests for commented code from a high-resource language, filtering out faulty tests and code with low test coverage. 2) We use a Code LLM to translate Python code to a target low-resource language, and use tests to validate the translation. We apply this approach to generate tens of thousands of validated training items for Julia, Lua, OCaml, R, and Racket. Furthermore, we use an open model (StarCoderBase) with open training data (The Stack), which allows us to decontaminate benchmarks, train models without violating licenses, and run experiments that could not otherwise be done. With MultiPL-T generated data, we present fine-tuned versions of StarCoderBase and Code Llama for Julia, Lua, OCaml, R, and Racket. On established benchmarks (MultiPL-E), these models outperform other open Code LLMs. The MultiPL-T approach is easy to apply to new languages, and is significantly more efficient and effective than alternatives such as training longer.
    Neuron Patching: Neuron-level Model Editing on Code Generation and LLMs. (arXiv:2312.05356v1 [cs.SE])
    Large Language Models are successfully adopted in software engineering, especially in code generation. Updating these models with new knowledge is very expensive, and is often required to fully realize their value. In this paper, we propose a novel and effective model editing approach, \textsc{MENT}, to patch LLMs in coding tasks. Based on the mechanism of generative LLMs, \textsc{MENT} enables model editing in next-token predictions, and further supports common coding tasks. \textsc{MENT} is effective, efficient, and reliable. It can correct a neural model by patching 1 or 2 neurons. As the pioneer work on neuron-level model editing of generative models, we formalize the editing process and introduce the involved concepts. Besides, we also introduce new measures to evaluate its generalization ability, and build a benchmark for further study. Our approach is evaluated on three coding tasks, including API-seq recommendation, line-level code generation, and pseudocode-to-code transaction. It outperforms the state-of-the-art by a significant margin on both effectiveness and efficiency measures. In addition, we demonstrate the usages of \textsc{MENT} for LLM reasoning in software engineering. By editing the LLM knowledge with \textsc{MENT}, the directly or indirectly dependent behaviors in the chain-of-thought change accordingly and automatically.
    VLTSeg: Simple Transfer of CLIP-Based Vision-Language Representations for Domain Generalized Semantic Segmentation. (arXiv:2312.02021v2 [cs.CV] UPDATED)
    Domain generalization (DG) remains a significant challenge for perception based on deep neural networks (DNN), where domain shifts occur due to lighting, weather, or geolocation changes. In this work, we propose VLTSeg to enhance domain generalization in semantic segmentation, where the network is solely trained on the source domain and evaluated on unseen target domains. Our method leverages the inherent semantic robustness of vision-language models. First, by substituting traditional vision-only backbones with pre-trained encoders from CLIP and EVA-CLIP as transfer learning setting we find that in the field of DG, vision-language pre-training significantly outperforms supervised and self-supervised vision pre-training. We thus propose a new vision-language approach for domain generalized segmentation, which improves the domain generalization SOTA by 7.6% mIoU when training on the synthetic GTA5 dataset. We further show the superior generalization capabilities of vision-language segmentation models by reaching 76.48% mIoU on the popular Cityscapes-to-ACDC benchmark, outperforming the previous SOTA approach by 6.9% mIoU on the test set at the time of writing. Additionally, our approach shows strong in-domain generalization capabilities indicated by 86.1% mIoU on the Cityscapes test set, resulting in a shared first place with the previous SOTA on the current leaderboard at the time of submission.
    Identifying and Mitigating Model Failures through Few-shot CLIP-aided Diffusion Generation. (arXiv:2312.05464v1 [cs.CV])
    Deep learning models can encounter unexpected failures, especially when dealing with challenging sub-populations. One common reason for these failures is the occurrence of objects in backgrounds that are rarely seen during training. To gain a better understanding of these failure modes, human-interpretable descriptions are crucial for further analysis and improvement which is expensive. In this study, we propose an end-to-end framework that utilizes the capabilities of large language models (ChatGPT) and vision-language deep models (CLIP) to generate text descriptions of failure modes associated with spurious correlations (e.g. rarely seen backgrounds) without human-in-the-loop intervention. These descriptions can be used to generate synthetic data using generative models, such as diffusion models. The model can now use this generated data to learn from its weaknesses and enhance its performance on backgrounds that are uncommon for each class of data. Our approach serves as a broad solution, promising progress in comprehending model failure modes and strengthening deep learning models across a wide range of failure scenarios (e.g. bacckgrounds, colors) automatically in a few-shot manner. Our experiments have shown remarkable \textbf{improvements in accuracy ($\sim \textbf{21%}$)} on hard sub-populations (particularly for wrong background association) across $40$ different models, such as ResNets, EfficientNets, DenseNets, Vision Transformer (ViT), SwAVs, MoCos, DINOs, and CLIPs on various datasets such as ImageNet-1000, CIFAR-10, and CIFAR-100.
    Misclassification in Automated Content Analysis Causes Bias in Regression. Can We Fix It? Yes We Can!. (arXiv:2307.06483v2 [cs.LG] UPDATED)
    Automated classifiers (ACs), often built via supervised machine learning (SML), can categorize large, statistically powerful samples of data ranging from text to images and video, and have become widely popular measurement devices in communication science and related fields. Despite this popularity, even highly accurate classifiers make errors that cause misclassification bias and misleading results in downstream analyses-unless such analyses account for these errors. As we show in a systematic literature review of SML applications, communication scholars largely ignore misclassification bias. In principle, existing statistical methods can use "gold standard" validation data, such as that created by human annotators, to correct misclassification bias and produce consistent estimates. We introduce and test such methods, including a new method we design and implement in the R package misclassificationmodels, via Monte Carlo simulations designed to reveal each method's limitations, which we also release. Based on our results, we recommend our new error correction method as it is versatile and efficient. In sum, automated classifiers, even those below common accuracy standards or making systematic misclassifications, can be useful for measurement with careful study design and appropriate error correction methods.
    Target to Source: Guidance-Based Diffusion Model for Test-Time Adaptation. (arXiv:2312.05274v1 [cs.LG])
    Most recent works of test-time adaptation (TTA) aim to alleviate domain shift problems by re-training source classifiers in each domain. On the other hand, the emergence of the diffusion model provides another solution to TTA, which directly maps the test data from the target domain to the source domain based on a diffusion model pre-trained in the source domain. The source classifier does not need to be fine-tuned. However, 1) the semantic information loss from test data to the source domain and 2) the model shift between the source classifier and diffusion model would prevent the diffusion model from mapping the test data back to the source domain correctly. In this paper, we propose a novel guidance-based diffusion-driven adaptation (GDDA) to overcome the data shift and let the diffusion model find a better way to go back to the source. Concretely, we first propose detail and global guidance to better keep the common semantics of the test and source data. The two guidance include a contrastive loss and mean squared error to alleviate the information loss by fully exploring the diffusion model and the test data. Meanwhile, we propose a classifier-aware guidance to reduce the bias caused by the model shift, which can incorporate the source classifier's information into the generation process of the diffusion model. Extensive experiments on three image datasets with three classifier backbones demonstrate that GDDA significantly performs better than the state-of-the-art baselines. On CIFAR-10C, CIFAR-100C, and ImageNetC, GDDA achieves 11.54\%, 19.05\%, and 11.63\% average accuracy improvements, respectively. GDDA even achieves equal performance compared with methods of re-training classifiers. The code is available in the supplementary material.
    Learning 3D Particle-based Simulators from RGB-D Videos. (arXiv:2312.05359v1 [cs.LG])
    Realistic simulation is critical for applications ranging from robotics to animation. Traditional analytic simulators sometimes struggle to capture sufficiently realistic simulation which can lead to problems including the well known "sim-to-real" gap in robotics. Learned simulators have emerged as an alternative for better capturing real-world physical dynamics, but require access to privileged ground truth physics information such as precise object geometry or particle tracks. Here we propose a method for learning simulators directly from observations. Visual Particle Dynamics (VPD) jointly learns a latent particle-based representation of 3D scenes, a neural simulator of the latent particle dynamics, and a renderer that can produce images of the scene from arbitrary views. VPD learns end to end from posed RGB-D videos and does not require access to privileged information. Unlike existing 2D video prediction models, we show that VPD's 3D structure enables scene editing and long-term predictions. These results pave the way for downstream applications ranging from video editing to robotic planning.
    VerilogEval: Evaluating Large Language Models for Verilog Code Generation. (arXiv:2309.07544v2 [cs.LG] UPDATED)
    The increasing popularity of large language models (LLMs) has paved the way for their application in diverse domains. This paper proposes a benchmarking framework tailored specifically for evaluating LLM performance in the context of Verilog code generation for hardware design and verification. We present a comprehensive evaluation dataset consisting of 156 problems from the Verilog instructional website HDLBits. The evaluation set consists of a diverse set of Verilog code generation tasks, ranging from simple combinational circuits to complex finite state machines. The Verilog code completions can be automatically tested for functional correctness by comparing the transient simulation outputs of the generated design with a golden solution. We also demonstrate that the Verilog code generation capability of pretrained language models could be improved with supervised fine-tuning by bootstrapping with LLM generated synthetic problem-code pairs.
    Multi-source domain adaptation for regression. (arXiv:2312.05460v1 [stat.ML])
    Multi-source domain adaptation (DA) aims at leveraging information from more than one source domain to make predictions in a target domain, where different domains may have different data distributions. Most existing methods for multi-source DA focus on classification problems while there is only limited investigation in the regression settings. In this paper, we fill in this gap through a two-step procedure. First, we extend a flexible single-source DA algorithm for classification through outcome-coarsening to enable its application to regression problems. We then augment our single-source DA algorithm for regression with ensemble learning to achieve multi-source DA. We consider three learning paradigms in the ensemble algorithm, which combines linearly the target-adapted learners trained with each source domain: (i) a multi-source stacking algorithm to obtain the ensemble weights; (ii) a similarity-based weighting where the weights reflect the quality of DA of each target-adapted learner; and (iii) a combination of the stacking and similarity weights. We illustrate the performance of our algorithms with simulations and a data application where the goal is to predict High-density lipoprotein (HDL) cholesterol levels using gut microbiome. We observe a consistent improvement in prediction performance of our multi-source DA algorithm over the routinely used methods in all these scenarios.
    CMMD: Contrastive Multi-Modal Diffusion for Video-Audio Conditional Modeling. (arXiv:2312.05412v1 [cs.LG])
    We introduce a multi-modal diffusion model tailored for the bi-directional conditional generation of video and audio. Recognizing the importance of accurate alignment between video and audio events in multi-modal generation tasks, we propose a joint contrastive training loss to enhance the synchronization between visual and auditory occurrences. Our research methodology involves conducting comprehensive experiments on multiple datasets to thoroughly evaluate the efficacy of our proposed model. The assessment of generation quality and alignment performance is carried out from various angles, encompassing both objective and subjective metrics. Our findings demonstrate that the proposed model outperforms the baseline, substantiating its effectiveness and efficiency. Notably, the incorporation of the contrastive loss results in improvements in audio-visual alignment, particularly in the high-correlation video-to-audio generation task. These results indicate the potential of our proposed model as a robust solution for improving the quality and alignment of multi-modal generation, thereby contributing to the advancement of video and audio conditional generation systems.
    On Self-Supervised Dynamic Incremental Regularised Adaptation. (arXiv:2311.07461v2 [cs.LG] UPDATED)
    In this paper, we give an overview of a recently developed method for dynamic domain adaptation, named DIRA, which relies on a few samples in addition to a regularisation approach, named elastic weight consolidation, to achieve state-of-the-art (SOTA) domain adaptation results. DIRA has been previously shown to perform competitively with SOTA unsupervised adaption techniques. However, a limitation of DIRA is that it relies on labels to be provided for the few samples used in adaption. This makes it a supervised technique. In this paper, we propose a modification to the DIRA method to make it self-supervised i.e. remove the need for providing labels. Our proposed approach will be evaluated experimentally in future work.
    Enhancing Medical Specialty Assignment to Patients using NLP Techniques. (arXiv:2312.05585v1 [cs.CL])
    The introduction of Large Language Models (LLMs), and the vast volume of publicly available medical data, amplified the application of NLP to the medical domain. However, LLMs are pretrained on data that are not explicitly relevant to the domain that are applied to and are often biased towards the original data they were pretrained upon. Even when pretrained on domainspecific data, these models typically require time-consuming fine-tuning to achieve good performance for a specific task. To address these limitations, we propose an alternative approach that achieves superior performance while being computationally efficient. Specifically, we utilize keywords to train a deep learning architecture that outperforms a language model pretrained on a large corpus of text. Our proposal does not require pretraining nor fine-tuning and can be applied directly to a specific setting for performing multi-label classification. Our objective is to automatically assign a new patient to the specialty of the medical professional they require, using a dataset that contains medical transcriptions and relevant keywords. To this end, we fine-tune the PubMedBERT model on this dataset, which serves as the baseline for our experiments. We then twice train/fine-tune a DNN and the RoBERTa language model, using both the keywords and the full transcriptions as input. We compare the performance of these approaches using relevant metrics. Our results demonstrate that utilizing keywords for text classification significantly improves classification performance, for both a basic DL architecture and a large language model. Our approach represents a promising and efficient alternative to traditional methods for finetuning language models on domain-specific data and has potential applications in various medical domains
    Compressive Recovery of Sparse Precision Matrices. (arXiv:2311.04673v2 [stat.ML] UPDATED)
    We consider the problem of learning a graph modeling the statistical relations of the $d$ variables from a dataset with $n$ samples $X \in \mathbb{R}^{n \times d}$. Standard approaches amount to searching for a precision matrix $\Theta$ representative of a Gaussian graphical model that adequately explains the data. However, most maximum likelihood-based estimators usually require storing the $d^{2}$ values of the empirical covariance matrix, which can become prohibitive in a high-dimensional setting. In this work, we adopt a compressive viewpoint and aim to estimate a sparse $\Theta$ from a \emph{sketch} of the data, i.e. a low-dimensional vector of size $m \ll d^{2}$ carefully designed from $X$ using non-linear random features. Under certain assumptions on the spectrum of $\Theta$ (or its condition number), we show that it is possible to estimate it from a sketch of size $m=\Omega\left((d+2k)\log(d)\right)$ where $k$ is the maximal number of edges of the underlying graph. These information-theoretic guarantees are inspired by compressed sensing theory and involve restricted isometry properties and instance optimal decoders. We investigate the possibility of achieving practical recovery with an iterative algorithm based on the graphical lasso, viewed as a specific denoiser. We compare our approach and graphical lasso on synthetic datasets, demonstrating its favorable performance even when the dataset is compressed.
    Fusing Multiple Algorithms for Heterogeneous Online Learning. (arXiv:2312.05432v1 [cs.LG])
    This study addresses the challenge of online learning in contexts where agents accumulate disparate data, face resource constraints, and use different local algorithms. This paper introduces the Switched Online Learning Algorithm (SOLA), designed to solve the heterogeneous online learning problem by amalgamating updates from diverse agents through a dynamic switching mechanism contingent upon their respective performance and available resources. We theoretically analyze the design of the selecting mechanism to ensure that the regret of SOLA is bounded. Our findings show that the number of changes in selection needs to be bounded by a parameter dependent on the performance of the different local algorithms. Additionally, two test cases are presented to emphasize the effectiveness of SOLA, first on an online linear regression problem and then on an online classification problem with the MNIST dataset.
    The Rashomon Importance Distribution: Getting RID of Unstable, Single Model-based Variable Importance. (arXiv:2309.13775v3 [cs.LG] UPDATED)
    Quantifying variable importance is essential for answering high-stakes questions in fields like genetics, public policy, and medicine. Current methods generally calculate variable importance for a given model trained on a given dataset. However, for a given dataset, there may be many models that explain the target outcome equally well; without accounting for all possible explanations, different researchers may arrive at many conflicting yet equally valid conclusions given the same data. Additionally, even when accounting for all possible explanations for a given dataset, these insights may not generalize because not all good explanations are stable across reasonable data perturbations. We propose a new variable importance framework that quantifies the importance of a variable across the set of all good models and is stable across the data distribution. Our framework is extremely flexible and can be integrated with most existing model classes and global variable importance metrics. We demonstrate through experiments that our framework recovers variable importance rankings for complex simulation setups where other methods fail. Further, we show that our framework accurately estimates the true importance of a variable for the underlying data distribution. We provide theoretical guarantees on the consistency and finite sample error rates for our estimator. Finally, we demonstrate its utility with a real-world case study exploring which genes are important for predicting HIV load in persons with HIV, highlighting an important gene that has not previously been studied in connection with HIV. Code is available at https://github.com/jdonnelly36/Rashomon_Importance_Distribution.
    Exciton-Polariton Condensates: A Fourier Neural Operator Approach. (arXiv:2309.15593v2 [cond-mat.quant-gas] UPDATED)
    Advancements in semiconductor fabrication over the past decade have catalyzed extensive research into all-optical devices driven by exciton-polariton condensates. Preliminary validations of such devices, including transistors, have shown encouraging results even under ambient conditions. A significant challenge still remains for large scale application however: the lack of a robust solver that can be used to simulate complex nonlinear systems which require an extended period of time to stabilize. Addressing this need, we propose the application of a machine-learning-based Fourier Neural Operator approach to find the solution to the Gross-Pitaevskii equations coupled with extra exciton rate equations. This work marks the first direct application of Neural Operators to an exciton-polariton condensate system. Our findings show that the proposed method can predict final-state solutions to a high degree of accuracy almost 1000 times faster than CUDA-based GPU solvers. Moreover, this paves the way for potential all-optical chip design workflows by integrating experimental data.
    Graph Condensation for Inductive Node Representation Learning. (arXiv:2307.15967v2 [cs.LG] UPDATED)
    Graph neural networks (GNNs) encounter significant computational challenges when handling large-scale graphs, which severely restricts their efficacy across diverse applications. To address this limitation, graph condensation has emerged as a promising technique, which constructs a small synthetic graph for efficiently training GNNs while retaining performance. However, due to the topology structure among nodes, graph condensation is limited to condensing only the observed training nodes and their corresponding structure, thus lacking the ability to effectively handle the unseen data. Consequently, the original large graph is still required in the inference stage to perform message passing to inductive nodes, resulting in substantial computational demands. To overcome this issue, we propose mapping-aware graph condensation (MCond), explicitly learning the one-to-many node mapping from original nodes to synthetic nodes to seamlessly integrate new nodes into the synthetic graph for inductive representation learning. This enables direct information propagation on the synthetic graph, which is much more efficient than on the original large graph. Specifically, MCond employs an alternating optimization scheme with innovative loss terms from transductive and inductive perspectives, facilitating the mutual promotion between graph condensation and node mapping learning. Extensive experiments demonstrate the efficacy of our approach in inductive inference. On the Reddit dataset, MCond achieves up to 121.5x inference speedup and 55.9x reduction in storage requirements compared with counterparts based on the original graph.
    UQ for Credit Risk Management: A deep evidence regression approach. (arXiv:2305.04967v2 [q-fin.RM] CROSS LISTED)
    Machine Learning has invariantly found its way into various Credit Risk applications. Due to the intrinsic nature of Credit Risk, quantifying the uncertainty of the predicted risk metrics is essential, and applying uncertainty-aware deep learning models to credit risk settings can be very helpful. In this work, we have explored the application of a scalable UQ-aware deep learning technique, Deep Evidence Regression and applied it to predicting Loss Given Default. We contribute to the literature by extending the Deep Evidence Regression methodology to learning target variables generated by a Weibull process and provide the relevant learning framework. We demonstrate the application of our approach to both simulated and real-world data.
    Large-Scale Quantum Separability Through a Reproducible Machine Learning Lens. (arXiv:2306.09444v2 [quant-ph] UPDATED)
    The quantum separability problem consists in deciding whether a bipartite density matrix is entangled or separable. In this work, we propose a machine learning pipeline for finding approximate solutions for this NP-hard problem in large-scale scenarios. We provide an efficient Frank-Wolfe-based algorithm to approximately seek the nearest separable density matrix and derive a systematic way for labeling density matrices as separable or entangled, allowing us to treat quantum separability as a classification problem. Our method is applicable to any two-qudit mixed states. Numerical experiments with quantum states of 3- and 7-dimensional qudits validate the efficiency of the proposed procedure, and demonstrate that it scales up to thousands of density matrices with a high quantum entanglement detection accuracy. This takes a step towards benchmarking quantum separability to support the development of more powerful entanglement detection techniques.
    Molecular De Novo Design through Transformer-based Reinforcement Learning. (arXiv:2310.05365v3 [cs.LG] UPDATED)
    In this work, we introduce a method to fine-tune a Transformer-based generative model for molecular de novo design. Leveraging the superior sequence learning capacity of Transformers over Recurrent Neural Networks (RNNs), our model can generate molecular structures with desired properties effectively. In contrast to the traditional RNN-based models, our proposed method exhibits superior performance in generating compounds predicted to be active against various biological targets, capturing long-term dependencies in the molecular structure sequence. The model's efficacy is demonstrated across numerous tasks, including generating analogues to a query structure and producing compounds with particular attributes, outperforming the baseline RNN-based methods. Our approach can be used for scaffold hopping, library expansion starting from a single molecule, and generating compounds with high predicted activity against biological targets.
    Large Language Models for Biomedical Knowledge Graph Construction: Information extraction from EMR notes. (arXiv:2301.12473v2 [cs.CL] UPDATED)
    The automatic construction of knowledge graphs (KGs) is an important research area in medicine, with far-reaching applications spanning drug discovery and clinical trial design. These applications hinge on the accurate identification of interactions among medical and biological entities. In this study, we propose an end-to-end machine learning solution based on large language models (LLMs) that utilize electronic medical record notes to construct KGs. The entities used in the KG construction process are diseases, factors, treatments, as well as manifestations that coexist with the patient while experiencing the disease. Given the critical need for high-quality performance in medical applications, we embark on a comprehensive assessment of 12 LLMs of various architectures, evaluating their performance and safety attributes. To gauge the quantitative efficacy of our approach by assessing both precision and recall, we manually annotate a dataset provided by the Macula and Retina Institute. We also assess the qualitative performance of LLMs, such as the ability to generate structured outputs or the tendency to hallucinate. The results illustrate that in contrast to encoder-only and encoder-decoder, decoder-only LLMs require further investigation. Additionally, we provide guided prompt design to utilize such LLMs. The application of the proposed methodology is demonstrated on age-related macular degeneration.
    MTP-GO: Graph-Based Probabilistic Multi-Agent Trajectory Prediction with Neural ODEs. (arXiv:2302.00735v4 [cs.RO] UPDATED)
    Enabling resilient autonomous motion planning requires robust predictions of surrounding road users' future behavior. In response to this need and the associated challenges, we introduce our model titled MTP-GO. The model encodes the scene using temporal graph neural networks to produce the inputs to an underlying motion model. The motion model is implemented using neural ordinary differential equations where the state-transition functions are learned with the rest of the model. Multimodal probabilistic predictions are obtained by combining the concept of mixture density networks and Kalman filtering. The results illustrate the predictive capabilities of the proposed model across various data sets, outperforming several state-of-the-art methods on a number of metrics.
    Exact and rapid linear clustering of networks with dynamic programming. (arXiv:2301.10403v2 [cs.SI] UPDATED)
    We study the problem of clustering networks whose nodes have imputed or physical positions in a single dimension, for example prestige hierarchies or the similarity dimension of hyperbolic embeddings. Existing algorithms, such as the critical gap method and other greedy strategies, only offer approximate solutions to this problem. Here, we introduce a dynamic programming approach that returns provably optimal solutions in polynomial time -- O(n^2) steps -- for a broad class of clustering objectives. We demonstrate the algorithm through applications to synthetic and empirical networks and show that it outperforms existing heuristics by a significant margin, with a similar execution time.
    Efficient Parallelization Layouts for Large-Scale Distributed Model Training. (arXiv:2311.05610v2 [cs.LG] UPDATED)
    Efficiently training large language models requires parallelizing across hundreds of hardware accelerators and invoking various compute and memory optimizations. When combined, many of these strategies have complex interactions regarding the final training efficiency. Prior work tackling this problem did not have access to the latest set of optimizations, such as FlashAttention or sequence parallelism. In this work, we conduct a comprehensive ablation study of possible training configurations for large language models. We distill this large study into several key recommendations for the most efficient training. For instance, we find that using a micro-batch size of 1 usually enables the most efficient training layouts. Larger micro-batch sizes necessitate activation checkpointing or higher degrees of model parallelism and also lead to larger pipeline bubbles. Our most efficient configurations enable us to achieve state-of-the-art training efficiency results over a range of model sizes, most notably a Model FLOPs utilization of 70.5% when training a Llama 13B model.
    Individual Fairness under Uncertainty. (arXiv:2302.08015v2 [cs.LG] UPDATED)
    Algorithmic fairness, the research field of making machine learning (ML) algorithms fair, is an established area in ML. As ML technologies expand their application domains, including ones with high societal impact, it becomes essential to take fairness into consideration during the building of ML systems. Yet, despite its wide range of socially sensitive applications, most work treats the issue of algorithmic bias as an intrinsic property of supervised learning, i.e., the class label is given as a precondition. Unlike prior studies in fairness, we propose an individual fairness measure and a corresponding algorithm that deal with the challenges of uncertainty arising from censorship in class labels, while enforcing similar individuals to be treated similarly from a ranking perspective, free of the Lipschitz condition in the conventional individual fairness definition. We argue that this perspective represents a more realistic model of fairness research for real-world application deployment and show how learning with such a relaxed precondition draws new insights that better explains algorithmic fairness. We conducted experiments on four real-world datasets to evaluate our proposed method compared to other fairness models, demonstrating its superiority in minimizing discrimination while maintaining predictive performance with uncertainty present.
    Poisoning $\times$ Evasion: Symbiotic Adversarial Robustness for Graph Neural Networks. (arXiv:2312.05502v1 [cs.LG])
    It is well-known that deep learning models are vulnerable to small input perturbations. Such perturbed instances are called adversarial examples. Adversarial examples are commonly crafted to fool a model either at training time (poisoning) or test time (evasion). In this work, we study the symbiosis of poisoning and evasion. We show that combining both threat models can substantially improve the devastating efficacy of adversarial attacks. Specifically, we study the robustness of Graph Neural Networks (GNNs) under structure perturbations and devise a memory-efficient adaptive end-to-end attack for the novel threat model using first-order optimization.
    Procedural generation of meta-reinforcement learning tasks. (arXiv:2302.05583v2 [cs.LG] UPDATED)
    Open-endedness stands to benefit from the ability to generate an infinite variety of diverse, challenging environments. One particularly interesting type of challenge is meta-learning ("learning-to-learn"), a hallmark of intelligent behavior. However, the number of meta-learning environments in the literature is limited. Here we describe a parametrized space for simple meta-reinforcement learning (meta-RL) tasks with arbitrary stimuli. The parametrization allows us to randomly generate an arbitrary number of novel simple meta-learning tasks. The parametrization is expressive enough to include many well-known meta-RL tasks, such as bandit problems, the Harlow task, T-mazes, the Daw two-step task and others. Simple extensions allow it to capture tasks based on two-dimensional topological spaces, such as full mazes or find-the-spot domains. We describe a number of randomly generated meta-RL domains of varying complexity and discuss potential issues arising from random generation.
    On the calibration of compartmental epidemiological models. (arXiv:2312.05456v1 [cs.LG])
    Epidemiological compartmental models are useful for understanding infectious disease propagation and directing public health policy decisions. Calibration of these models is an important step in offering accurate forecasts of disease dynamics and the effectiveness of interventions. In this study, we present an overview of calibrating strategies that can be employed, including several optimization methods and reinforcement learning (RL). We discuss the benefits and drawbacks of these methods and highlight relevant practical conclusions from our experiments. Optimization methods iteratively adjust the parameters of the model until the model output matches the available data, whereas RL uses trial and error to learn the optimal set of parameters by maximizing a reward signal. Finally, we discuss how the calibration of parameters of epidemiological compartmental models is an emerging field that has the potential to improve the accuracy of disease modeling and public health decision-making. Further research is needed to validate the effectiveness and scalability of these approaches in different epidemiological contexts. All codes and resources are available on \url{https://github.com/Nikunj-Gupta/On-the-Calibration-of-Compartmental-Epidemiological-Models}. We hope this work can facilitate related research.
    A Data-Driven Framework for Improving Public EV Charging Infrastructure: Modeling and Forecasting. (arXiv:2312.05333v1 [cs.LG])
    This work presents an investigation and assessment framework, which, supported by realistic data, aims at provisioning operators with in-depth insights into the consumer-perceived Quality-of-Experience (QoE) at public Electric Vehicle (EV) charging infrastructures. Motivated by the unprecedented EV market growth, it is suspected that the existing charging infrastructure will soon be no longer capable of sustaining the rapidly growing charging demands; let alone that the currently adopted ad hoc infrastructure expansion strategies seem to be far from contributing any quality service sustainability solutions that tangibly reduce (ultimately mitigate) the severity of this problem. Without suitable QoE metrics, operators, today, face remarkable difficulty in assessing the performance of EV Charging Stations (EVCSs) in this regard. This paper aims at filling this gap through the formulation of novel and original critical QoE performance metrics that provide operators with visibility into the per-EVCS operational dynamics and allow for the optimization of these stations' respective utilization. Such metrics shall then be used as inputs to a Machine Learning model finely tailored and trained using recent real-world data sets for the purpose of forecasting future long-term EVCS loads. This will, in turn, allow for making informed optimal EV charging infrastructure expansions that will be capable of reliably coping with the rising EV charging demands and maintaining acceptable QoE levels. The model's accuracy has been tested and extensive simulations are conducted to evaluate the achieved performance in terms of the above listed metrics and show the suitability of the recommended infrastructure expansions.
    Restless Bandits with Average Reward: Breaking the Uniform Global Attractor Assumption. (arXiv:2306.00196v2 [cs.LG] UPDATED)
    We study the infinite-horizon Restless Bandit problem with the average reward criterion, under both discrete-time and continuous-time settings. A fundamental goal is to design computationally efficient policies that achieve a diminishing optimality gap as the number of arms, $N$, grows large. Existing results on asymptotic optimality all rely on the uniform global attractor property (UGAP), a complex and challenging-to-verify assumption. In this paper, we propose a general, simulation-based framework, Follow-the-Virtual-Advice, that converts any single-armed policy into a policy for the original $N$-armed problem. This is done by simulating the single-armed policy on each arm and carefully steering the real state towards the simulated state. Our framework can be instantiated to produce a policy with an $O(1/\sqrt{N})$ optimality gap. In the discrete-time setting, our result holds under a simpler synchronization assumption, which covers some problem instances that violate UGAP. More notably, in the continuous-time setting, we do not require any additional assumptions beyond the standard unichain condition. In both settings, our work is the first asymptotic optimality result that does not require UGAP.
    Diversity from Human Feedback. (arXiv:2310.06648v2 [cs.LG] UPDATED)
    Diversity plays a significant role in many problems, such as ensemble learning, reinforcement learning, and combinatorial optimization. How to define the diversity measure is a longstanding problem. Many methods rely on expert experience to define a proper behavior space and then obtain the diversity measure, which is, however, challenging in many scenarios. In this paper, we propose the problem of learning a behavior space from human feedback and present a general method called Diversity from Human Feedback (DivHF) to solve it. DivHF learns a behavior descriptor consistent with human preference by querying human feedback. The learned behavior descriptor can be combined with any distance measure to define a diversity measure. We demonstrate the effectiveness of DivHF by integrating it with the Quality-Diversity optimization algorithm MAP-Elites and conducting experiments on the QDax suite. The results show that DivHF learns a behavior space that aligns better with human requirements compared to direct data-driven approaches and leads to more diverse solutions under human preference. Our contributions include formulating the problem, proposing the DivHF method, and demonstrating its effectiveness through experiments.
    Targeted and Troublesome: Tracking and Advertising on Children's Websites. (arXiv:2308.04887v2 [cs.CY] UPDATED)
    On the modern web, trackers and advertisers frequently construct and monetize users' detailed behavioral profiles without consent. Despite various studies on web tracking mechanisms and advertisements, there has been no rigorous study focusing on websites targeted at children. To address this gap, we present a measurement of tracking and (targeted) advertising on websites directed at children. Motivated by lacking a comprehensive list of child-directed (i.e., targeted at children) websites, we first build a multilingual classifier based on web page titles and descriptions. Applying this classifier to over two million pages, we compile a list of two thousand child-directed websites. Crawling these sites from five vantage points, we measure the prevalence of trackers, fingerprinting scripts, and advertisements. Our crawler detects ads displayed on child-directed websites and determines if ad targeting is enabled by scraping ad disclosure pages whenever available. Our results show that around 90% of child-directed websites embed one or more trackers, and about 27% contain targeted advertisements--a practice that should require verifiable parental consent. Next, we identify improper ads on child-directed websites by developing an ML pipeline that processes both images and text extracted from ads. The pipeline allows us to run semantic similarity queries for arbitrary search terms, revealing ads that promote services related to dating, weight loss, and mental health; as well as ads for sex toys and flirting chat services. Some of these ads feature repulsive and sexually explicit imagery. In summary, our findings indicate a trend of non-compliance with privacy regulations and troubling ad safety practices among many advertisers and child-directed websites. To protect children and create a safer online environment, regulators and stakeholders must adopt and enforce more stringent measures.
    Towards Stability of Autoregressive Neural Operators. (arXiv:2306.10619v2 [cs.LG] UPDATED)
    Neural operators have proven to be a promising approach for modeling spatiotemporal systems in the physical sciences. However, training these models for large systems can be quite challenging as they incur significant computational and memory expense -- these systems are often forced to rely on autoregressive time-stepping of the neural network to predict future temporal states. While this is effective in managing costs, it can lead to uncontrolled error growth over time and eventual instability. We analyze the sources of this autoregressive error growth using prototypical neural operator models for physical systems and explore ways to mitigate it. We introduce architectural and application-specific improvements that allow for careful control of instability-inducing operations within these models without inflating the compute/memory expense. We present results on several scientific systems that include Navier-Stokes fluid flow, rotating shallow water, and a high-resolution global weather forecasting system. We demonstrate that applying our design principles to neural operators leads to significantly lower errors for long-term forecasts as well as longer time horizons without qualitative signs of divergence compared to the original models for these systems. We open-source our \href{https://github.com/mikemccabe210/stabilizing_neural_operators}{code} for reproducibility.
    Weisfeiler and Lehman Go Paths: Learning Topological Features via Path Complexes. (arXiv:2308.06838v5 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs), despite achieving remarkable performance across different tasks, are theoretically bounded by the 1-Weisfeiler-Lehman test, resulting in limitations in terms of graph expressivity. Even though prior works on topological higher-order GNNs overcome that boundary, these models often depend on assumptions about sub-structures of graphs. Specifically, topological GNNs leverage the prevalence of cliques, cycles, and rings to enhance the message-passing procedure. Our study presents a novel perspective by focusing on simple paths within graphs during the topological message-passing process, thus liberating the model from restrictive inductive biases. We prove that by lifting graphs to path complexes, our model can generalize the existing works on topology while inheriting several theoretical results on simplicial complexes and regular cell complexes. Without making prior assumptions about graph sub-structures, our method outperforms earlier works in other topological domains and achieves state-of-the-art results on various benchmarks.
    Flexible Cross-Modal Steganography via Implicit Representations. (arXiv:2312.05496v1 [cs.CR])
    We present INRSteg, an innovative lossless steganography framework based on a novel data form Implicit Neural Representations (INR) that is modal-agnostic. Our framework is considered for effectively hiding multiple data without altering the original INR ensuring high-quality stego data. The neural representations of secret data are first concatenated to have independent paths that do not overlap, then weight freezing techniques are applied to the diagonal blocks of the weight matrices for the concatenated network to preserve the weights of secret data while additional free weights in the off-diagonal blocks of weight matrices are fitted to the cover data. Our framework can perform unexplored cross-modal steganography for various modalities including image, audio, video, and 3D shapes, and it achieves state-of-the-art performance compared to previous intra-modal steganographic methods.
    Sparse Variational Student-t Processes. (arXiv:2312.05568v1 [cs.LG])
    The theory of Bayesian learning incorporates the use of Student-t Processes to model heavy-tailed distributions and datasets with outliers. However, despite Student-t Processes having a similar computational complexity as Gaussian Processes, there has been limited emphasis on the sparse representation of this model. This is mainly due to the increased difficulty in modeling and computation compared to previous sparse Gaussian Processes. Our motivation is to address the need for a sparse representation framework that reduces computational complexity, allowing Student-t Processes to be more flexible for real-world datasets. To achieve this, we leverage the conditional distribution of Student-t Processes to introduce sparse inducing points. Bayesian methods and variational inference are then utilized to derive a well-defined lower bound, facilitating more efficient optimization of our model through stochastic gradient descent. We propose two methods for computing the variational lower bound, one utilizing Monte Carlo sampling and the other employing Jensen's inequality to compute the KL regularization term in the loss function. We propose adopting these approaches as viable alternatives to Gaussian processes when the data might contain outliers or exhibit heavy-tailed behavior, and we provide specific recommendations for their applicability. We evaluate the two proposed approaches on various synthetic and real-world datasets from UCI and Kaggle, demonstrating their effectiveness compared to baseline methods in terms of computational complexity and accuracy, as well as their robustness to outliers.
    LifelongMemory: Leveraging LLMs for Answering Queries in Egocentric Videos. (arXiv:2312.05269v1 [cs.CV])
    The egocentric video natural language query (NLQ) task involves localizing a temporal window in an egocentric video that provides an answer to a posed query, which has wide applications in building personalized AI assistants. Prior methods for this task have focused on improvements of network architecture and leveraging pre-training for enhanced image and video features, but have struggled with capturing long-range temporal dependencies in lengthy videos, and cumbersome end-to-end training. Motivated by recent advancements in Large Language Models (LLMs) and vision language models, we introduce LifelongMemory, a novel framework that utilizes multiple pre-trained models to answer queries from extensive egocentric video content. We address the unique challenge by employing a pre-trained captioning model to create detailed narratives of the videos. These narratives are then used to prompt a frozen LLM to generate coarse-grained temporal window predictions, which are subsequently refined using a pre-trained NLQ model. Empirical results demonstrate that our method achieves competitive performance against existing supervised end-to-end learning methods, underlining the potential of integrating multiple pre-trained multimodal large language models in complex vision-language tasks. We provide a comprehensive analysis of key design decisions and hyperparameters in our pipeline, offering insights and practical guidelines.
    TSMixer: Lightweight MLP-Mixer Model for Multivariate Time Series Forecasting. (arXiv:2306.09364v4 [cs.LG] UPDATED)
    Transformers have gained popularity in time series forecasting for their ability to capture long-sequence interactions. However, their high memory and computing requirements pose a critical bottleneck for long-term forecasting. To address this, we propose TSMixer, a lightweight neural architecture exclusively composed of multi-layer perceptron (MLP) modules for multivariate forecasting and representation learning on patched time series. Inspired by MLP-Mixer's success in computer vision, we adapt it for time series, addressing challenges and introducing validated components for enhanced accuracy. This includes a novel design paradigm of attaching online reconciliation heads to the MLP-Mixer backbone, for explicitly modeling the time-series properties such as hierarchy and channel-correlations. We also propose a novel Hybrid channel modeling and infusion of a simple gating approach to effectively handle noisy channel interactions and generalization across diverse datasets. By incorporating these lightweight components, we significantly enhance the learning capability of simple MLP structures, outperforming complex Transformer models with minimal computing usage. Moreover, TSMixer's modular design enables compatibility with both supervised and masked self-supervised learning methods, making it a promising building block for time-series Foundation Models. TSMixer outperforms state-of-the-art MLP and Transformer models in forecasting by a considerable margin of 8-60%. It also outperforms the latest strong benchmarks of Patch-Transformer models (by 1-2%) with a significant reduction in memory and runtime (2-3X). The source code of our model is officially released as PatchTSMixer in the HuggingFace. Model: https://huggingface.co/docs/transformers/main/en/model_doc/patchtsmixer Examples: https://github.com/ibm/tsfm/#notebooks-links
    Mitigating Communications Threats in Decentralized Federated Learning through Moving Target Defense. (arXiv:2307.11730v2 [cs.CR] UPDATED)
    The rise of Decentralized Federated Learning (DFL) has enabled the training of machine learning models across federated participants, fostering decentralized model aggregation and reducing dependence on a server. However, this approach introduces unique communication security challenges that have yet to be thoroughly addressed in the literature. These challenges primarily originate from the decentralized nature of the aggregation process, the varied roles and responsibilities of the participants, and the absence of a central authority to oversee and mitigate threats. Addressing these challenges, this paper first delineates a comprehensive threat model focused on DFL communications. In response to these identified risks, this work introduces a security module to counter communication-based attacks for DFL platforms. The module combines security techniques such as symmetric and asymmetric encryption with Moving Target Defense (MTD) techniques, including random neighbor selection and IP/port switching. The security module is implemented in a DFL platform, Fedstellar, allowing the deployment and monitoring of the federation. A DFL scenario with physical and virtual deployments have been executed, encompassing three security configurations: (i) a baseline without security, (ii) an encrypted configuration, and (iii) a configuration integrating both encryption and MTD techniques. The effectiveness of the security module is validated through experiments with the MNIST dataset and eclipse attacks. The results showed an average F1 score of 95%, with the most secure configuration resulting in CPU usage peaking at 68% (+-9%) in virtual deployments and network traffic reaching 480.8 MB (+-18 MB), effectively mitigating risks associated with eavesdropping or eclipse attacks.
    Staleness-Alleviated Distributed GNN Training via Online Dynamic-Embedding Prediction. (arXiv:2308.13466v2 [cs.LG] UPDATED)
    Despite the recent success of Graph Neural Networks (GNNs), it remains challenging to train GNNs on large-scale graphs due to neighbor explosions. As a remedy, distributed computing becomes a promising solution by leveraging abundant computing resources (e.g., GPU). However, the node dependency of graph data increases the difficulty of achieving high concurrency in distributed GNN training, which suffers from the massive communication overhead. To address it, Historical value approximation is deemed a promising class of distributed training techniques. It utilizes an offline memory to cache historical information (e.g., node embedding) as an affordable approximation of the exact value and achieves high concurrency. However, such benefits come at the cost of involving dated training information, leading to staleness, imprecision, and convergence issues. To overcome these challenges, this paper proposes SAT (Staleness-Alleviated Training), a novel and scalable distributed GNN training framework that reduces the embedding staleness adaptively. The key idea of SAT is to model the GNN's embedding evolution as a temporal graph and build a model upon it to predict future embedding, which effectively alleviates the staleness of the cached historical embedding. We propose an online algorithm to train the embedding predictor and the distributed GNN alternatively and further provide a convergence analysis. Empirically, we demonstrate that SAT can effectively reduce embedding staleness and thus achieve better performance and convergence speed on multiple large-scale graph datasets.
    On Comparing Fair Classifiers under Data Bias. (arXiv:2302.05906v2 [cs.LG] UPDATED)
    In this paper, we consider a theoretical model for injecting data bias, namely, under-representation and label bias (Blum & Stangl, 2019). We empirically study the effect of varying data biases on the accuracy and fairness of fair classifiers. Through extensive experiments on both synthetic and real-world datasets (e.g., Adult, German Credit, Bank Marketing, COMPAS), we empirically audit pre-, in-, and post-processing fair classifiers from standard fairness toolkits for their fairness and accuracy by injecting varying amounts of under-representation and label bias in their training data (but not the test data). Our main observations are: 1. The fairness and accuracy of many standard fair classifiers degrade severely as the bias injected in their training data increases, 2. A simple logistic regression model trained on the right data can often outperform, in both accuracy and fairness, most fair classifiers trained on biased training data, and 3. A few, simple fairness techniques (e.g., reweighing, exponentiated gradients) seem to offer stable accuracy and fairness guarantees even when their training data is injected with under-representation and label bias. Our experiments also show how to integrate a measure of data bias risk in the existing fairness dashboards for real-world deployments.
    Reinforcement Learning in Non-Markovian Environments. (arXiv:2211.01595v3 [eess.SY] UPDATED)
    Motivated by the novel paradigm developed by Van Roy and coauthors for reinforcement learning in arbitrary non-Markovian environments, we propose a related formulation and explicitly pin down the error caused by non-Markovianity of observations when the Q-learning algorithm is applied on this formulation. Based on this observation, we propose that the criterion for agent design should be to seek good approximations for certain conditional laws. Inspired by classical stochastic control, we show that our problem reduces to that of recursive computation of approximate sufficient statistics. This leads to an autoencoder-based scheme for agent design which is then numerically tested on partially observed reinforcement learning environments.
    Contraction-Guided Adaptive Partitioning for Reachability Analysis of Neural Network Controlled Systems. (arXiv:2304.03671v2 [eess.SY] UPDATED)
    In this paper, we present a contraction-guided adaptive partitioning algorithm for improving interval-valued robust reachable set estimates in a nonlinear feedback loop with a neural network controller and disturbances. Based on an estimate of the contraction rate of over-approximated intervals, the algorithm chooses when and where to partition. Then, by leveraging a decoupling of the neural network verification step and reachability partitioning layers, the algorithm can provide accuracy improvements for little computational cost. This approach is applicable with any sufficiently accurate open-loop interval-valued reachability estimation technique and any method for bounding the input-output behavior of a neural network. Using contraction-based robustness analysis, we provide guarantees of the algorithm's performance with mixed monotone reachability. Finally, we demonstrate the algorithm's performance through several numerical simulations and compare it with existing methods in the literature. In particular, we report a sizable improvement in the accuracy of reachable set estimation in a fraction of the runtime as compared to state-of-the-art methods.
    Signatures Meet Dynamic Programming: Generalizing Bellman Equations for Trajectory Following. (arXiv:2312.05547v1 [eess.SY])
    Path signatures have been proposed as a powerful representation of paths that efficiently captures the path's analytic and geometric characteristics, having useful algebraic properties including fast concatenation of paths through tensor products. Signatures have recently been widely adopted in machine learning problems for time series analysis. In this work we establish connections between value functions typically used in optimal control and intriguing properties of path signatures. These connections motivate our novel control framework with signature transforms that efficiently generalizes the Bellman equation to the space of trajectories. We analyze the properties and advantages of the framework, termed signature control. In particular, we demonstrate that (i) it can naturally deal with varying/adaptive time steps; (ii) it propagates higher-level information more efficiently than value function updates; (iii) it is robust to dynamical system misspecification over long rollouts. As a specific case of our framework, we devise a model predictive control method for path tracking. This method generalizes integral control, being suitable for problems with unknown disturbances. The proposed algorithms are tested in simulation, with differentiable physics models including typical control and robotics tasks such as point-mass, curve following for an ant model, and a robotic manipulator.
    HiFi++: a Unified Framework for Bandwidth Extension and Speech Enhancement. (arXiv:2203.13086v4 [cs.SD] UPDATED)
    Generative adversarial networks have recently demonstrated outstanding performance in neural vocoding outperforming best autoregressive and flow-based models. In this paper, we show that this success can be extended to other tasks of conditional audio generation. In particular, building upon HiFi vocoders, we propose a novel HiFi++ general framework for bandwidth extension and speech enhancement. We show that with the improved generator architecture, HiFi++ performs better or comparably with the state-of-the-art in these tasks while spending significantly less computational resources. The effectiveness of our approach is validated through a series of extensive experiments.
    Reinforcement Neighborhood Selection for Unsupervised Graph Anomaly Detection. (arXiv:2312.05526v1 [cs.LG])
    Unsupervised graph anomaly detection is crucial for various practical applications as it aims to identify anomalies in a graph that exhibit rare patterns deviating significantly from the majority of nodes. Recent advancements have utilized Graph Neural Networks (GNNs) to learn high-quality node representations for anomaly detection by aggregating information from neighborhoods. However, the presence of anomalies may render the observed neighborhood unreliable and result in misleading information aggregation for node representation learning. Selecting the proper neighborhood is critical for graph anomaly detection but also challenging due to the absence of anomaly-oriented guidance and the interdependence with representation learning. To address these issues, we utilize the advantages of reinforcement learning in adaptively learning in complex environments and propose a novel method that incorporates Reinforcement neighborhood selection for unsupervised graph ANomaly Detection (RAND). RAND begins by enriching the candidate neighbor pool of the given central node with multiple types of indirect neighbors. Next, RAND designs a tailored reinforcement anomaly evaluation module to assess the reliability and reward of considering the given neighbor. Finally, RAND selects the most reliable subset of neighbors based on these rewards and introduces an anomaly-aware aggregator to amplify messages from reliable neighbors while diminishing messages from unreliable ones. Extensive experiments on both three synthetic and two real-world datasets demonstrate that RAND outperforms the state-of-the-art methods.
    QAGCN: Answering Multi-Relation Questions via Single-Step Implicit Reasoning over Knowledge Graphs. (arXiv:2206.01818v2 [cs.AI] UPDATED)
    Multi-relation question answering (QA) is a challenging task, where given questions usually require long reasoning chains in KGs that consist of multiple relations. Recently, methods with explicit multi-step reasoning over KGs have been prominently used in this task and have demonstrated promising performance. Examples include methods that perform stepwise label propagation through KG triples and methods that navigate over KG triples based on reinforcement learning. A main weakness of these methods is that their reasoning mechanisms are usually complex and difficult to implement or train. In this paper, we argue that multi-relation QA can be achieved via end-to-end single-step implicit reasoning, which is simpler, more efficient, and easier to adopt. We propose QAGCN -- a Question-Aware Graph Convolutional Network (GCN)-based method that includes a novel GCN architecture with controlled question-dependent message propagation for the implicit reasoning. Extensive experiments have been conducted, where QAGCN achieved competitive and even superior performance compared to state-of-the-art explicit-reasoning methods.
    Dynamic Pricing and Learning with Bayesian Persuasion. (arXiv:2304.14385v2 [cs.GT] UPDATED)
    We consider a novel dynamic pricing and learning setting where in addition to setting prices of products in sequential rounds, the seller also ex-ante commits to 'advertising schemes'. That is, in the beginning of each round the seller can decide what kind of signal they will provide to the buyer about the product's quality upon realization. Using the popular Bayesian persuasion framework to model the effect of these signals on the buyers' valuation and purchase responses, we formulate the problem of finding an optimal design of the advertising scheme along with a pricing scheme that maximizes the seller's expected revenue. Without any apriori knowledge of the buyers' demand function, our goal is to design an online algorithm that can use past purchase responses to adaptively learn the optimal pricing and advertising strategy. We study the regret of the algorithm when compared to the optimal clairvoyant price and advertising scheme. Our main result is a computationally efficient online algorithm that achieves an $O(T^{2/3}(m\log T)^{1/3})$ regret bound when the valuation function is linear in the product quality. Here $m$ is the cardinality of the discrete product quality domain and $T$ is the time horizon. This result requires some natural monotonicity and Lipschitz assumptions on the valuation function, but no Lipschitz or smoothness assumption on the buyers' demand function. For constant $m$, our result matches the regret lower bound for dynamic pricing within logarithmic factors, which is a special case of our problem. We also obtain several improved results for the widely considered special case of additive valuations, including an $\tilde{O}(T^{2/3})$ regret bound independent of $m$ when $m\le T^{1/3}$.
    Nonconvex Zeroth-Order Stochastic ADMM Methods with Lower Function Query Complexity. (arXiv:1907.13463v4 [math.OC] UPDATED)
    Zeroth-order (a.k.a, derivative-free) methods are a class of effective optimization methods for solving complex machine learning problems, where gradients of the objective functions are not available or computationally prohibitive. Recently, although many zeroth-order methods have been developed, these approaches still have two main drawbacks: 1) high function query complexity; 2) not being well suitable for solving the problems with complex penalties and constraints. To address these challenging drawbacks, in this paper, we propose a class of faster zeroth-order stochastic alternating direction method of multipliers (ADMM) methods (ZO-SPIDER-ADMM) to solve the nonconvex finite-sum problems with multiple nonsmooth penalties. Moreover, we prove that the ZO-SPIDER-ADMM methods can achieve a lower function query complexity of $O(nd+dn^{\frac{1}{2}}\epsilon^{-1})$ for finding an $\epsilon$-stationary point, which improves the existing best nonconvex zeroth-order ADMM methods by a factor of $O(d^{\frac{1}{3}}n^{\frac{1}{6}})$, where $n$ and $d$ denote the sample size and data dimension, respectively. At the same time, we propose a class of faster zeroth-order online ADMM methods (ZOO-ADMM+) to solve the nonconvex online problems with multiple nonsmooth penalties. We also prove that the proposed ZOO-ADMM+ methods achieve a lower function query complexity of $O(d\epsilon^{-\frac{3}{2}})$, which improves the existing best result by a factor of $O(\epsilon^{-\frac{1}{2}})$. Extensive experimental results on the structure adversarial attack on black-box deep neural networks demonstrate the efficiency of our new algorithms.
    AdapterGNN: Parameter-Efficient Fine-Tuning Improves Generalization in GNNs. (arXiv:2304.09595v2 [cs.LG] UPDATED)
    Fine-tuning pre-trained models has recently yielded remarkable performance gains in graph neural networks (GNNs). In addition to pre-training techniques, inspired by the latest work in the natural language fields, more recent work has shifted towards applying effective fine-tuning approaches, such as parameter-efficient fine-tuning (PEFT). However, given the substantial differences between GNNs and transformer-based models, applying such approaches directly to GNNs proved to be less effective. In this paper, we present a comprehensive comparison of PEFT techniques for GNNs and propose a novel PEFT method specifically designed for GNNs, called AdapterGNN. AdapterGNN preserves the knowledge of the large pre-trained model and leverages highly expressive adapters for GNNs, which can adapt to downstream tasks effectively with only a few parameters, while also improving the model's generalization ability. Extensive experiments show that AdapterGNN achieves higher performance than other PEFT methods and is the only one consistently surpassing full fine-tuning (outperforming it by 1.6% and 5.7% in the chemistry and biology domains respectively, with only 5% and 4% of its parameters tuned) with lower generalization gaps. Moreover, we empirically show that a larger GNN model can have a worse generalization ability, which differs from the trend observed in large transformer-based models. Building upon this, we provide a theoretical justification for PEFT can improve generalization of GNNs by applying generalization bounds. Our code is available at https://github.com/Lucius-lsr/AdapterGNN.
    Artificial Neural Nets and the Representation of Human Concepts. (arXiv:2312.05337v1 [cs.LG])
    What do artificial neural networks (ANNs) learn? The machine learning (ML) community shares the narrative that ANNs must develop abstract human concepts to perform complex tasks. Some go even further and believe that these concepts are stored in individual units of the network. Based on current research, I systematically investigate the assumptions underlying this narrative. I conclude that ANNs are indeed capable of performing complex prediction tasks, and that they may learn human and non-human concepts to do so. However, evidence indicates that ANNs do not represent these concepts in individual units.
    Cross Domain Generative Augmentation: Domain Generalization with Latent Diffusion Models. (arXiv:2312.05387v1 [cs.LG])
    Despite the huge effort in developing novel regularizers for Domain Generalization (DG), adding simple data augmentation to the vanilla ERM which is a practical implementation of the Vicinal Risk Minimization principle (VRM) \citep{chapelle2000vicinal} outperforms or stays competitive with many of the proposed regularizers. The VRM reduces the estimation error in ERM by replacing the point-wise kernel estimates with a more precise estimation of true data distribution that reduces the gap between data points \textbf{within each domain}. However, in the DG setting, the estimation error of true data distribution by ERM is mainly caused by the distribution shift \textbf{between domains} which cannot be fully addressed by simple data augmentation techniques within each domain. Inspired by this limitation of VRM, we propose a novel data augmentation named Cross Domain Generative Augmentation (CDGA) that replaces the pointwise kernel estimates in ERM with new density estimates in the \textbf{vicinity of domain pairs} so that the gap between domains is further reduced. To this end, CDGA, which is built upon latent diffusion models (LDM), generates synthetic images to fill the gap between all domains and as a result, reduces the non-iidness. We show that CDGA outperforms SOTA DG methods under the Domainbed benchmark. To explain the effectiveness of CDGA, we generate more than 5 Million synthetic images and perform extensive ablation studies including data scaling laws, distribution visualization, domain shift quantification, adversarial robustness, and loss landscape analysis.
    Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving. (arXiv:2312.05385v1 [cs.DC])
    Machine learning (ML) inference platforms are tasked with balancing two competing goals: ensuring high throughput given many requests, and delivering low-latency responses to support interactive applications. Unfortunately, existing platform knobs (e.g., batch sizes) fail to ease this fundamental tension, and instead only enable users to harshly trade off one property for the other. This paper explores an alternate strategy to taming throughput-latency tradeoffs by changing the granularity at which inference is performed. We present Apparate, a system that automatically applies and manages early exits (EEs) in ML models, whereby certain inputs can exit with results at intermediate layers. To cope with the time-varying overhead and accuracy challenges that EEs bring, Apparate repurposes exits to provide continual feedback that powers several novel runtime monitoring and adaptation strategies. Apparate lowers median response latencies by 40.5-91.5% and 10.0-24.2% for diverse CV and NLP workloads, respectively, without affecting throughputs or violating tight accuracy constraints.
    Repairing Regressors for Fair Binary Classification at Any Decision Threshold. (arXiv:2203.07490v4 [cs.LG] UPDATED)
    We study the problem of post-processing a supervised machine-learned regressor to maximize fair binary classification at all decision thresholds. By decreasing the statistical distance between each group's score distributions, we show that we can increase fair performance across all thresholds at once, and that we can do so without a large decrease in accuracy. To this end, we introduce a formal measure of Distributional Parity, which captures the degree of similarity in the distributions of classifications for different protected groups. Our main result is to put forward a novel post-processing algorithm based on optimal transport, which provably maximizes Distributional Parity, thereby attaining common notions of group fairness like Equalized Odds or Equal Opportunity at all thresholds. We demonstrate on two fairness benchmarks that our technique works well empirically, while also outperforming and generalizing similar techniques from related work.
    Toward Scalable and Transparent Multimodal Analytics to Study Standard Medical Procedures: Linking Hand Movement, Proximity, and Gaze Data. (arXiv:2312.05368v1 [cs.AI])
    This study employed multimodal learning analytics (MMLA) to analyze behavioral dynamics during the ABCDE procedure in nursing education, focusing on gaze entropy, hand movement velocities, and proximity measures. Utilizing accelerometers and eye-tracking techniques, behaviorgrams were generated to depict various procedural phases. Results identified four primary phases characterized by distinct patterns of visual attention, hand movements, and proximity to the patient or instruments. The findings suggest that MMLA can offer valuable insights into procedural competence in medical education. This research underscores the potential of MMLA to provide detailed, objective evaluations of clinical procedures and their inherent complexities.
    Higher-Order Equivariant Neural Networks for Charge Density Prediction in Materials. (arXiv:2312.05388v1 [physics.comp-ph])
    The calculation of electron density distribution using density functional theory (DFT) in materials and molecules is central to the study of their quantum and macro-scale properties, yet accurate and efficient calculation remains a long-standing challenge in the field of material science. This work introduces ChargE3Net, an E(3)-equivariant graph neural network for predicting electron density in atomic systems. ChargE3Net achieves equivariance through the use of higher-order tensor representations, and directly predicts the charge density at any arbitrary point in the system. We show that our method achieves greater performance than prior work on large and diverse sets of molecules and materials, and scales to larger systems than what is feasible to compute with DFT. Using predicted electron densities as an initialization, we show that fewer self-consistent iterations are required to converge DFT over the default initialization. In addition, we show that non-self-consistent calculations using the predicted electron densities can predict electronic and thermodynamic properties of materials at near-DFT accuracy.
    Existence and Minimax Theorems for Adversarial Surrogate Risks in Binary Classification. (arXiv:2206.09098v4 [cs.LG] UPDATED)
    Adversarial training is one of the most popular methods for training methods robust to adversarial attacks, however, it is not well-understood from a theoretical perspective. We prove and existence, regularity, and minimax theorems for adversarial surrogate risks. Our results explain some empirical observations on adversarial robustness from prior work and suggest new directions in algorithm development. Furthermore, our results extend previously known existence and minimax theorems for the adversarial classification risk to surrogate risks.
    Exploring Sparsity in Graph Transformers. (arXiv:2312.05479v1 [cs.LG])
    Graph Transformers (GTs) have achieved impressive results on various graph-related tasks. However, the huge computational cost of GTs hinders their deployment and application, especially in resource-constrained environments. Therefore, in this paper, we explore the feasibility of sparsifying GTs, a significant yet under-explored topic. We first discuss the redundancy of GTs based on the characteristics of existing GT models, and then propose a comprehensive \textbf{G}raph \textbf{T}ransformer \textbf{SP}arsification (GTSP) framework that helps to reduce the computational complexity of GTs from four dimensions: the input graph data, attention heads, model layers, and model weights. Specifically, GTSP designs differentiable masks for each individual compressible component, enabling effective end-to-end pruning. We examine our GTSP through extensive experiments on prominent GTs, including GraphTrans, Graphormer, and GraphGPS. The experimental results substantiate that GTSP effectively cuts computational costs, accompanied by only marginal decreases in accuracy or, in some cases, even improvements. For instance, GTSP yields a reduction of 30\% in Floating Point Operations while contributing to a 1.8\% increase in Area Under the Curve accuracy on OGBG-HIV dataset. Furthermore, we provide several insights on the characteristics of attention heads and the behavior of attention mechanisms, all of which have immense potential to inspire future research endeavors in this domain.
    Boosting Federated Learning in Resource-Constrained Networks. (arXiv:2110.11486v2 [cs.LG] UPDATED)
    Federated learning (FL) enables a set of client devices to collaboratively train a model without sharing raw data. This process, though, operates under the constrained computation and communication resources of edge devices. These constraints combined with systems heterogeneity force some participating clients to perform fewer local updates than expected by the server, thus slowing down convergence. Exhaustive tuning of hyperparameters in FL, furthermore, can be resource-intensive, without which the convergence is adversely affected. In this work, we propose GeL, the guess and learn algorithm. GeL enables constrained edge devices to perform additional learning through guessed updates on top of gradient-based steps. These guesses are gradientless, i.e., participating clients leverage them for free. Our generic guessing algorithm (i) can be flexibly combined with several state-of-the-art algorithms including FedProx, FedNova or FedYogi; and (ii) achieves significantly improved performance when the learning rates are not best tuned. We conduct extensive experiments and show that GeL can boost empirical convergence by up to 40% in resource-constrained networks while relieving the need for exhaustive learning rate tuning.
    Conditional Stochastic Interpolation for Generative Learning. (arXiv:2312.05579v1 [stat.ML])
    We propose a conditional stochastic interpolation (CSI) approach to learning conditional distributions. CSI learns probability flow equations or stochastic differential equations that transport a reference distribution to the target conditional distribution. This is achieved by first learning the drift function and the conditional score function based on conditional stochastic interpolation, which are then used to construct a deterministic process governed by an ordinary differential equation or a diffusion process for conditional sampling. In our proposed CSI model, we incorporate an adaptive diffusion term to address the instability issues arising during the training process. We provide explicit forms of the conditional score function and the drift function in terms of conditional expectations under mild conditions, which naturally lead to an nonparametric regression approach to estimating these functions. Furthermore, we establish non-asymptotic error bounds for learning the target conditional distribution via conditional stochastic interpolation in terms of KL divergence, taking into account the neural network approximation error. We illustrate the application of CSI on image generation using a benchmark image dataset.
    Disentangled Latent Representation Learning for Tackling the Confounding M-Bias Problem in Causal Inference. (arXiv:2312.05404v1 [cs.LG])
    In causal inference, it is a fundamental task to estimate the causal effect from observational data. However, latent confounders pose major challenges in causal inference in observational data, for example, confounding bias and M-bias. Recent data-driven causal effect estimators tackle the confounding bias problem via balanced representation learning, but assume no M-bias in the system, thus they fail to handle the M-bias. In this paper, we identify a challenging and unsolved problem caused by a variable that leads to confounding bias and M-bias simultaneously. To address this problem with co-occurring M-bias and confounding bias, we propose a novel Disentangled Latent Representation learning framework for learning latent representations from proxy variables for unbiased Causal effect Estimation (DLRCE) from observational data. Specifically, DLRCE learns three sets of latent representations from the measured proxy variables to adjust for the confounding bias and M-bias. Extensive experiments on both synthetic and three real-world datasets demonstrate that DLRCE significantly outperforms the state-of-the-art estimators in the case of the presence of both confounding bias and M-bias.
    STREAMLINE: An Automated Machine Learning Pipeline for Biomedicine Applied to Examine the Utility of Photography-Based Phenotypes for OSA Prediction Across International Sleep Centers. (arXiv:2312.05461v1 [cs.LG])
    While machine learning (ML) includes a valuable array of tools for analyzing biomedical data, significant time and expertise is required to assemble effective, rigorous, and unbiased pipelines. Automated ML (AutoML) tools seek to facilitate ML application by automating a subset of analysis pipeline elements. In this study we develop and validate a Simple, Transparent, End-to-end Automated Machine Learning Pipeline (STREAMLINE) and apply it to investigate the added utility of photography-based phenotypes for predicting obstructive sleep apnea (OSA); a common and underdiagnosed condition associated with a variety of health, economic, and safety consequences. STREAMLINE is designed to tackle biomedical binary classification tasks while adhering to best practices and accommodating complexity, scalability, reproducibility, customization, and model interpretation. Benchmarking analyses validated the efficacy of STREAMLINE across data simulations with increasingly complex patterns of association. Then we applied STREAMLINE to evaluate the utility of demographics (DEM), self-reported comorbidities (DX), symptoms (SYM), and photography-based craniofacial (CF) and intraoral (IO) anatomy measures in predicting any OSA or moderate/severe OSA using 3,111 participants from Sleep Apnea Global Interdisciplinary Consortium (SAGIC). OSA analyses identified a significant increase in ROC-AUC when adding CF to DEM+DX+SYM to predict moderate/severe OSA. A consistent but non-significant increase in PRC-AUC was observed with the addition of each subsequent feature set to predict any OSA, with CF and IO yielding minimal improvements. Application of STREAMLINE to OSA data suggests that CF features provide additional value in predicting moderate/severe OSA, but neither CF nor IO features meaningfully improved the prediction of any OSA beyond established demographics, comorbidity and symptom characteristics.
    On Task-Relevant Loss Functions in Meta-Reinforcement Learning and Online LQR. (arXiv:2312.05465v1 [cs.LG])
    Designing a competent meta-reinforcement learning (meta-RL) algorithm in terms of data usage remains a central challenge to be tackled for its successful real-world applications. In this paper, we propose a sample-efficient meta-RL algorithm that learns a model of the system or environment at hand in a task-directed manner. As opposed to the standard model-based approaches to meta-RL, our method exploits the value information in order to rapidly capture the decision-critical part of the environment. The key component of our method is the loss function for learning the task inference module and the system model that systematically couples the model discrepancy and the value estimate, thereby facilitating the learning of the policy and the task inference module with a significantly smaller amount of data compared to the existing meta-RL algorithms. The idea is also extended to a non-meta-RL setting, namely an online linear quadratic regulator (LQR) problem, where our method can be simplified to reveal the essence of the strategy. The proposed method is evaluated in high-dimensional robotic control and online LQR problems, empirically verifying its effectiveness in extracting information indispensable for solving the tasks from observations in a sample efficient manner.
    Towards On-device Learning on the Edge: Ways to Select Neurons to Update under a Budget Constraint. (arXiv:2312.05282v1 [cs.LG])
    In the realm of efficient on-device learning under extreme memory and computation constraints, a significant gap in successful approaches persists. Although considerable effort has been devoted to efficient inference, the main obstacle to efficient learning is the prohibitive cost of backpropagation. The resources required to compute gradients and update network parameters often exceed the limits of tightly constrained memory budgets. This paper challenges conventional wisdom and proposes a series of experiments that reveal the existence of superior sub-networks. Furthermore, we hint at the potential for substantial gains through a dynamic neuron selection strategy when fine-tuning a target task. Our efforts extend to the adaptation of a recent dynamic neuron selection strategy pioneered by Bragagnolo et al. (NEq), revealing its effectiveness in the most stringent scenarios. Our experiments demonstrate, in the average case, the superiority of a NEq-inspired approach over a random selection. This observation prompts a compelling avenue for further exploration in the area, highlighting the opportunity to design a new class of algorithms designed to facilitate parameter update selection. Our findings usher in a new era of possibilities in the field of on-device learning under extreme constraints and encourage the pursuit of innovative strategies for efficient, resource-friendly model fine-tuning.
    Dynamic Adjustment of Matching Radii under the Broadcasting Mode: A Novel Multitask Learning Strategy and Temporal Modeling Approach. (arXiv:2312.05576v1 [cs.AI])
    As ride-hailing services have experienced significant growth, the majority of research has concentrated on the dispatching mode, where drivers must adhere to the platform's assigned routes. However, the broadcasting mode, in which drivers can freely choose their preferred orders from those broadcast by the platform, has received less attention. One important but challenging task in such a system is the determination of the optimal matching radius, which usually varies across space, time, and real-time supply/demand characteristics. This study develops a Transformer-Encoder-Based (TEB) model that predicts key system performance metrics for a range of matching radii, which enables the ride-hailing platform to select an optimal matching radius that maximizes overall system performance according to real-time supply and demand information. To simultaneously maximize multiple system performance metrics for matching radius determination, we devise a novel multi-task learning algorithm that enhances convergence speed of each task (corresponding to the optimization of one metric) and delivers more accurate overall predictions. We evaluate our methods in a simulation environment specifically designed for broadcasting-mode-based ride-hailing service. Our findings reveal that dynamically adjusting matching radii based on our proposed predict-then-optimize approach significantly improves system performance, e.g., increasing platform revenue by 7.55% and enhancing order fulfillment rate by 13% compared to benchmark algorithms.
    Uncertainty-aware Surrogate Models for Airfoil Flow Simulations with Denoising Diffusion Probabilistic Models. (arXiv:2312.05320v1 [physics.flu-dyn])
    Leveraging neural networks as surrogate models for turbulence simulation is a topic of growing interest. At the same time, embodying the inherent uncertainty of simulations in the predictions of surrogate models remains very challenging. The present study makes a first attempt to use denoising diffusion probabilistic models (DDPMs) to train an uncertainty-aware surrogate model for turbulence simulations. Due to its prevalence, the simulation of flows around airfoils with various shapes, Reynolds numbers, and angles of attack is chosen as the learning objective. Our results show that DDPMs can successfully capture the whole distribution of solutions and, as a consequence, accurately estimate the uncertainty of the simulations. The performance of DDPMs is also compared with varying baselines in the form of Bayesian neural networks and heteroscedastic models. Experiments demonstrate that DDPMs outperform the other methods regarding a variety of accuracy metrics. Besides, it offers the advantage of providing access to the complete distributions of uncertainties rather than providing a set of parameters. As such, it can yield realistic and detailed samples from the distribution of solutions. All source codes and datasets utilized in this study are publicly available.
    Score Operator Newton transport. (arXiv:2305.09792v2 [math.ST] UPDATED)
    We propose a new approach for sampling and Bayesian computation that uses the score of the target distribution to construct a transport from a given reference distribution to the target. Our approach is an infinite-dimensional Newton method, involving an elliptic PDE, for finding a zero of a ``score-residual'' operator. We use classical elliptic PDE theory to prove convergence to a valid transport map. Our Newton iterates can be computed by exploiting fast solvers for elliptic PDEs, resulting in new algorithms for Bayesian inference and other sampling tasks. We identify elementary settings where score-operator Newton transport achieves fast convergence while avoiding mode collapse.
    Real-time Inference and Extrapolation via a Diffusion-inspired Temporal Transformer Operator (DiTTO). (arXiv:2307.09072v2 [cs.LG] UPDATED)
    Extrapolation remains a grand challenge in deep neural networks across all application domains. We propose an operator learning method to solve time-dependent partial differential equations (PDEs) continuously and with extrapolation in time without any temporal discretization. The proposed method, named Diffusion-inspired Temporal Transformer Operator (DiTTO), is inspired by latent diffusion models and their conditioning mechanism, which we use to incorporate the temporal evolution of the PDE, in combination with elements from the transformer architecture to improve its capabilities. Upon training, DiTTO can make inferences in real-time. We demonstrate its extrapolation capability on a climate problem by estimating the temperature around the globe for several years, and also in modeling hypersonic flows around a double-cone. We propose different training strategies involving temporal-bundling and sub-sampling and demonstrate performance improvements for several benchmarks, performing extrapolation for long time intervals as well as zero-shot super-resolution in time.
    PromptCast: A New Prompt-based Learning Paradigm for Time Series Forecasting. (arXiv:2210.08964v5 [stat.ME] UPDATED)
    This paper presents a new perspective on time series forecasting. In existing time series forecasting methods, the models take a sequence of numerical values as input and yield numerical values as output. The existing SOTA models are largely based on the Transformer architecture, modified with multiple encoding mechanisms to incorporate the context and semantics around the historical data. Inspired by the successes of pre-trained language foundation models, we pose a question about whether these models can also be adapted to solve time-series forecasting. Thus, we propose a new forecasting paradigm: prompt-based time series forecasting (PromptCast). In this novel task, the numerical input and output are transformed into prompts and the forecasting task is framed in a sentence-to-sentence manner, making it possible to directly apply language models for forecasting purposes. To support and facilitate the research of this task, we also present a large-scale dataset (PISA) that includes three real-world forecasting scenarios. We evaluate different SOTA numerical-based forecasting methods and language generation models. The benchmark results with various forecasting settings demonstrate the proposed PromptCast with language generation models is a promising research direction. Additionally, in comparison to conventional numerical-based forecasting, PromptCast shows a much better generalization ability under the zero-shot setting.
    A simple connection from loss flatness to compressed representations in neural networks. (arXiv:2310.01770v2 [cs.LG] UPDATED)
    The generalization capacity of deep neural networks has been studied in a variety of ways, including at least two distinct categories of approach: one based on the shape of the loss landscape in parameter space, and the other based on the structure of the representation manifold in feature space (that is, in the space of unit activities). Although these two approaches are related, they are rarely studied together in an explicit connection. Here, we present a simple analysis that makes such a connection. We show that, in the last phase of learning of deep neural networks, compression of the manifold of neural representations correlates with the flatness of the loss around the minima explored by SGD. We show that this is predicted by a relatively simple mathematical relationship: a flatter loss corresponds to a lower upper-bound on the compression of neural representations. Our results closely build on the prior work of Ma and Ying, who demonstrated how flatness, characterized by small eigenvalues of the loss Hessian, develops in late learning phases and contributes to robustness against perturbations in network inputs. Moreover, we show a lack of a similarly direct connection between local dimensionality and sharpness, suggesting that this property may be controlled by different mechanisms than volume and hence may play a complementary role in neural representations. Overall, we advance a dual perspective on generalization in neural networks in both parameter and feature space.
    CoSMo: a Framework to Instantiate Conditioned Process Simulation Models. (arXiv:2303.17879v2 [cs.AI] UPDATED)
    Process simulation is gaining attention for its ability to assess potential performance improvements and risks associated with business process changes. The existing literature presents various techniques, generally grounded in process models discovered from event logs or built upon deep learning algorithms. These techniques have specific strengths and limitations. Traditional approaches rooted in process models offer increased interpretability, while those using deep learning excel at generalizing changes across large event logs. However, the practical application of deep learning faces challenges related to managing stochasticity and integrating information for what-if analysis. This paper introduces a novel recurrent neural architecture tailored to discover COnditioned process Simulation MOdels (CoSMo) based on user-based constraints or any other nature of a-priori knowledge. This architecture facilitates the simulation of event logs that adhere to specific constraints by incorporating declarative-based rules into the learning phase as an attempt to fill the gap of incorporating information into deep learning models to perform what-if analysis. Experimental validation illustrates CoSMo's efficacy in simulating event logs while adhering to predefined declarative conditions, emphasizing both control-flow and data-flow perspectives.
    Not All Data Matters: An End-to-End Adaptive Dataset Pruning Framework for Enhancing Model Performance and Efficiency. (arXiv:2312.05599v1 [cs.AI])
    While deep neural networks have demonstrated remarkable performance across various tasks, they typically require massive training data. Due to the presence of redundancies and biases in real-world datasets, not all data in the training dataset contributes to the model performance. To address this issue, dataset pruning techniques have been introduced to enhance model performance and efficiency by eliminating redundant training samples and reducing computational and memory overhead. However, previous works most rely on manually crafted scalar scores, limiting their practical performance and scalability across diverse deep networks and datasets. In this paper, we propose AdaPruner, an end-to-end Adaptive DAtaset PRUNing framEwoRk. AdaPruner can perform effective dataset pruning without the need for explicitly defined metrics. Our framework jointly prunes training data and fine-tunes models with task-specific optimization objectives. AdaPruner leverages (1) An adaptive dataset pruning (ADP) module, which iteratively prunes redundant samples to an expected pruning ratio; and (2) A pruning performance controller (PPC) module, which optimizes the model performance for accurate pruning. Therefore, AdaPruner exhibits high scalability and compatibility across various datasets and deep networks, yielding improved dataset distribution and enhanced model performance. AdaPruner can still significantly enhance model performance even after pruning up to 10-30\% of the training data. Notably, these improvements are accompanied by substantial savings in memory and computation costs. Qualitative and quantitative experiments suggest that AdaPruner outperforms other state-of-the-art dataset pruning methods by a large margin.
    LabelBench: A Comprehensive Framework for Benchmarking Adaptive Label-Efficient Learning. (arXiv:2306.09910v2 [cs.LG] UPDATED)
    Labeled data are critical to modern machine learning applications, but obtaining labels can be expensive. To mitigate this cost, machine learning methods, such as transfer learning, semi-supervised learning and active learning, aim to be label-efficient: achieving high predictive performance from relatively few labeled examples. While obtaining the best label-efficiency in practice often requires combinations of these techniques, existing benchmark and evaluation frameworks do not capture a concerted combination of all such techniques. This paper addresses this deficiency by introducing LabelBench, a new computationally-efficient framework for joint evaluation of multiple label-efficient learning techniques. As an application of LabelBench, we introduce a novel benchmark of state-of-the-art active learning methods in combination with semi-supervised learning for fine-tuning pretrained vision transformers. Our benchmark demonstrates better label-efficiencies than previously reported in active learning. LabelBench's modular codebase is open-sourced for the broader community to contribute label-efficient learning methods and benchmarks. The repository can be found at: https://github.com/EfficientTraining/LabelBench.
    Extracting Reward Functions from Diffusion Models. (arXiv:2306.01804v2 [cs.LG] UPDATED)
    Diffusion models have achieved remarkable results in image generation, and have similarly been used to learn high-performing policies in sequential decision-making tasks. Decision-making diffusion models can be trained on lower-quality data, and then be steered with a reward function to generate near-optimal trajectories. We consider the problem of extracting a reward function by comparing a decision-making diffusion model that models low-reward behavior and one that models high-reward behavior; a setting related to inverse reinforcement learning. We first define the notion of a relative reward function of two diffusion models and show conditions under which it exists and is unique. We then devise a practical learning algorithm for extracting it by aligning the gradients of a reward function -- parametrized by a neural network -- to the difference in outputs of both diffusion models. Our method finds correct reward functions in navigation environments, and we demonstrate that steering the base model with the learned reward functions results in significantly increased performance in standard locomotion benchmarks. Finally, we demonstrate that our approach generalizes beyond sequential decision-making by learning a reward-like function from two large-scale image generation diffusion models. The extracted reward function successfully assigns lower rewards to harmful images.
    Residual Diffusion Modeling for Km-scale Atmospheric Downscaling. (arXiv:2309.15214v3 [cs.LG] UPDATED)
    Predictions of weather hazard require expensive km-scale simulations driven by coarser global inputs. Here, a cost-effective stochastic downscaling model is trained from a high-resolution 2-km weather model over Taiwan conditioned on 25-km ERA5 reanalysis. To address the multi-scale machine learning challenges of weather data, we employ a two-step approach Corrector Diffusion (\textit{CorrDiff}), where a UNet prediction of the mean is corrected by a diffusion step. Akin to Reynolds decomposition in fluid dynamics, this isolates generative learning to the stochastic scales. \textit{CorrDiff} exhibits skillful RMSE and CRPS and faithfully recovers spectra and distributions even for extremes. Case studies of coherent weather phenomena reveal appropriate multivariate relationships reminiscent of learnt physics: the collocation of intense rainfall and sharp gradients in fronts and extreme winds and rainfall bands near the eyewall of typhoons. Downscaling global forecasts successfully retains many of these benefits, foreshadowing the potential of end-to-end, global-to-km-scales machine learning weather predictions.
    Large-scale Training of Foundation Models for Wearable Biosignals. (arXiv:2312.05409v1 [cs.LG])
    Tracking biosignals is crucial for monitoring wellness and preempting the development of severe medical conditions. Today, wearable devices can conveniently record various biosignals, creating the opportunity to monitor health status without disruption to one's daily routine. Despite widespread use of wearable devices and existing digital biomarkers, the absence of curated data with annotated medical labels hinders the development of new biomarkers to measure common health conditions. In fact, medical datasets are usually small in comparison to other domains, which is an obstacle for developing neural network models for biosignals. To address this challenge, we have employed self-supervised learning using the unlabeled sensor data collected under informed consent from the large longitudinal Apple Heart and Movement Study (AHMS) to train foundation models for two common biosignals: photoplethysmography (PPG) and electrocardiogram (ECG) recorded on Apple Watch. We curated PPG and ECG datasets from AHMS that include data from ~141K participants spanning ~3 years. Our self-supervised learning framework includes participant level positive pair selection, stochastic augmentation module and a regularized contrastive loss optimized with momentum training, and generalizes well to both PPG and ECG modalities. We show that the pre-trained foundation models readily encode information regarding participants' demographics and health conditions. To the best of our knowledge, this is the first study that builds foundation models using large-scale PPG and ECG data collected via wearable consumer devices $\unicode{x2013}$ prior works have commonly used smaller-size datasets collected in clinical and experimental settings. We believe PPG and ECG foundation models can enhance future wearable devices by reducing the reliance on labeled data and hold the potential to help the users improve their health.
    Multimodal Group Emotion Recognition In-the-wild Using Privacy-Compliant Features. (arXiv:2312.05265v1 [cs.AI])
    This paper explores privacy-compliant group-level emotion recognition ''in-the-wild'' within the EmotiW Challenge 2023. Group-level emotion recognition can be useful in many fields including social robotics, conversational agents, e-coaching and learning analytics. This research imposes itself using only global features avoiding individual ones, i.e. all features that can be used to identify or track people in videos (facial landmarks, body poses, audio diarization, etc.). The proposed multimodal model is composed of a video and an audio branches with a cross-attention between modalities. The video branch is based on a fine-tuned ViT architecture. The audio branch extracts Mel-spectrograms and feed them through CNN blocks into a transformer encoder. Our training paradigm includes a generated synthetic dataset to increase the sensitivity of our model on facial expression within the image in a data-driven way. The extensive experiments show the significance of our methodology. Our privacy-compliant proposal performs fairly on the EmotiW challenge, with 79.24% and 75.13% of accuracy respectively on validation and test set for the best models. Noticeably, our findings highlight that it is possible to reach this accuracy level with privacy-compliant features using only 5 frames uniformly distributed on the video.
    Data-Centric Machine Learning for Geospatial Remote Sensing Data. (arXiv:2312.05327v1 [cs.LG])
    Recent developments and research in modern machine learning have led to substantial improvements in the geospatial field. Although numerous deep learning models have been proposed, the majority of them have been developed on benchmark datasets that lack strong real-world relevance. Furthermore, the performance of many methods has already saturated on these datasets. We argue that shifting the focus towards a complementary data-centric perspective is necessary to achieve further improvements in accuracy, generalization ability, and real impact in end-user applications. This work presents a definition and precise categorization of automated data-centric learning approaches for geospatial data. It highlights the complementary role of data-centric learning with respect to model-centric in the larger machine learning deployment cycle. We review papers across the entire geospatial field and categorize them into different groups. A set of representative experiments shows concrete implementation examples. These examples provide concrete steps to act on geospatial data with data-centric machine learning approaches.
    Isomorphic-Consistent Variational Graph Auto-Encoders for Multi-Level Graph Representation Learning. (arXiv:2312.05519v1 [cs.LG])
    Graph representation learning is a fundamental research theme and can be generalized to benefit multiple downstream tasks from the node and link levels to the higher graph level. In practice, it is desirable to develop task-agnostic general graph representation learning methods that are typically trained in an unsupervised manner. Related research reveals that the power of graph representation learning methods depends on whether they can differentiate distinct graph structures as different embeddings and map isomorphic graphs to consistent embeddings (i.e., the isomorphic consistency of graph models). However, for task-agnostic general graph representation learning, existing unsupervised graph models, represented by the variational graph auto-encoders (VGAEs), can only keep the isomorphic consistency within the subgraphs of 1-hop neighborhoods and thus usually manifest inferior performance on the more difficult higher-level tasks. To overcome the limitations of existing unsupervised methods, in this paper, we propose the Isomorphic-Consistent VGAE (IsoC-VGAE) for multi-level task-agnostic graph representation learning. We first devise a decoding scheme to provide a theoretical guarantee of keeping the isomorphic consistency under the settings of unsupervised learning. We then propose the Inverse Graph Neural Network (Inv-GNN) decoder as its intuitive realization, which trains the model via reconstructing the GNN node embeddings with multi-hop neighborhood information, so as to maintain the high-order isomorphic consistency within the VGAE framework. We conduct extensive experiments on the representative graph learning tasks at different levels, including node classification, link prediction and graph classification, and the results verify that our proposed model generally outperforms both the state-of-the-art unsupervised methods and representative supervised methods.
    Can Learning Be Explained By Local Optimality In Low-rank Matrix Recovery?. (arXiv:2302.10963v2 [cs.LG] UPDATED)
    We explore the local landscape of low-rank matrix recovery, aiming to reconstruct a $d_1\times d_2$ matrix with rank $r$ from $m$ linear measurements, some potentially noisy. When the true rank is unknown, overestimation is common, yielding an over-parameterized model with rank $k\geq r$. Recent findings suggest that first-order methods with the robust $\ell_1$-loss can recover the true low-rank solution even when the rank is overestimated and measurements are noisy, implying that true solutions might emerge as local or global minima. Our paper challenges this notion, demonstrating that, under mild conditions, true solutions manifest as \textit{strict saddle points}. We study two categories of low-rank matrix recovery, matrix completion and matrix sensing, both with the robust $\ell_1$-loss. For matrix sensing, we uncover two critical transitions. With $m$ in the range of $\max\{d_1,d_2\}r\lesssim m\lesssim \max\{d_1,d_2\}k$, none of the true solutions are local or global minima, but some become strict saddle points. As $m$ surpasses $\max\{d_1,d_2\}k$, all true solutions become unequivocal global minima. In matrix completion, even with slight rank overestimation and mild noise, true solutions either emerge as non-critical or strict saddle points.
    Analyzing Behaviors of Mixed Traffic via Reinforcement Learning at Unsignalized Intersections. (arXiv:2312.05325v1 [cs.RO])
    In this report, we delve into two critical research inquiries. Firstly, we explore the extent to which Reinforcement Learning (RL) agents exhibit multimodal distributions in the context of stop-and-go traffic scenarios. Secondly, we investigate how RL-controlled Robot Vehicles (RVs) effectively navigate their direction and coordinate with other vehicles in complex traffic environments. Our analysis encompasses an examination of multimodality within queue length, outflow, and platoon size distributions for both Robot and Human-driven Vehicles (HVs). Additionally, we assess the Pearson coefficient correlation, shedding light on relationships between queue length and outflow, considering both identical and differing travel directions. Furthermore, we delve into causal inference models, shedding light on the factors influencing queue length across scenarios involving varying travel directions. Through these investigations, this report contributes valuable insights into the behaviors of mixed traffic (RVs and HVs) in traffic management and coordination.
    Efficient sampling from the Bingham distribution. (arXiv:2010.00137v2 [cs.LG] UPDATED)
    We give a algorithm for exact sampling from the Bingham distribution $p(x)\propto \exp(x^\top A x)$ on the sphere $\mathcal S^{d-1}$ with expected runtime of $\operatorname{poly}(d, \lambda_{\max}(A)-\lambda_{\min}(A))$. The algorithm is based on rejection sampling, where the proposal distribution is a polynomial approximation of the pdf, and can be sampled from by explicitly evaluating integrals of polynomials over the sphere. Our algorithm gives exact samples, assuming exact computation of an inverse function of a polynomial. This is in contrast with Markov Chain Monte Carlo algorithms, which are not known to enjoy rapid mixing on this problem, and only give approximate samples. As a direct application, we use this to sample from the posterior distribution of a rank-1 matrix inference problem in polynomial time.
  • Open

    Interpretable Long Term Waypoint-Based Trajectory Prediction Model. (arXiv:2312.06219v1 [cs.AI])
    Predicting the future trajectories of dynamic agents in complex environments is crucial for a variety of applications, including autonomous driving, robotics, and human-computer interaction. It is a challenging task as the behavior of the agent is unknown and intrinsically multimodal. Our key insight is that the agents behaviors are influenced not only by their past trajectories and their interaction with their immediate environment but also largely with their long term waypoint (LTW). In this paper, we study the impact of adding a long-term goal on the performance of a trajectory prediction framework. We present an interpretable long term waypoint-driven prediction framework (WayDCM). WayDCM first predict an agent's intermediate goal (IG) by encoding his interactions with the environment as well as his LTW using a combination of a Discrete choice Model (DCM) and a Neural Network model (NN). Then, our model predicts the corresponding trajectories. This is in contrast to previous work which does not consider the ultimate intent of the agent to predict his trajectory. We evaluate and show the effectiveness of our approach on the Waymo Open dataset.
    Nonconvex Zeroth-Order Stochastic ADMM Methods with Lower Function Query Complexity. (arXiv:1907.13463v4 [math.OC] UPDATED)
    Zeroth-order (a.k.a, derivative-free) methods are a class of effective optimization methods for solving complex machine learning problems, where gradients of the objective functions are not available or computationally prohibitive. Recently, although many zeroth-order methods have been developed, these approaches still have two main drawbacks: 1) high function query complexity; 2) not being well suitable for solving the problems with complex penalties and constraints. To address these challenging drawbacks, in this paper, we propose a class of faster zeroth-order stochastic alternating direction method of multipliers (ADMM) methods (ZO-SPIDER-ADMM) to solve the nonconvex finite-sum problems with multiple nonsmooth penalties. Moreover, we prove that the ZO-SPIDER-ADMM methods can achieve a lower function query complexity of $O(nd+dn^{\frac{1}{2}}\epsilon^{-1})$ for finding an $\epsilon$-stationary point, which improves the existing best nonconvex zeroth-order ADMM methods by a factor of $O(d^{\frac{1}{3}}n^{\frac{1}{6}})$, where $n$ and $d$ denote the sample size and data dimension, respectively. At the same time, we propose a class of faster zeroth-order online ADMM methods (ZOO-ADMM+) to solve the nonconvex online problems with multiple nonsmooth penalties. We also prove that the proposed ZOO-ADMM+ methods achieve a lower function query complexity of $O(d\epsilon^{-\frac{3}{2}})$, which improves the existing best result by a factor of $O(\epsilon^{-\frac{1}{2}})$. Extensive experimental results on the structure adversarial attack on black-box deep neural networks demonstrate the efficiency of our new algorithms.
    Estimating Shape Distances on Neural Representations with Limited Samples. (arXiv:2310.05742v2 [stat.ML] UPDATED)
    Measuring geometric similarity between high-dimensional network representations is a topic of longstanding interest to neuroscience and deep learning. Although many methods have been proposed, only a few works have rigorously analyzed their statistical efficiency or quantified estimator uncertainty in data-limited regimes. Here, we derive upper and lower bounds on the worst-case convergence of standard estimators of shape distance$\unicode{x2014}$a measure of representational dissimilarity proposed by Williams et al. (2021).These bounds reveal the challenging nature of the problem in high-dimensional feature spaces. To overcome these challenges, we introduce a new method-of-moments estimator with a tunable bias-variance tradeoff. We show that this estimator achieves substantially lower bias than standard estimators in simulation and on neural data, particularly in high-dimensional settings. Thus, we lay the foundation for a rigorous statistical theory for high-dimensional shape analysis, and we contribute a new estimation method that is well-suited to practical scientific settings.
    Mean estimation in the add-remove model of differential privacy. (arXiv:2312.06658v1 [cs.DS])
    Differential privacy is often studied under two different models of neighboring datasets: the add-remove model and the swap model. While the swap model is used extensively in the academic literature, many practical libraries use the more conservative add-remove model. However, analysis under the add-remove model can be cumbersome, and obtaining results with tight constants requires some additional work. Here, we study the problem of one-dimensional mean estimation under the add-remove model of differential privacy. We propose a new algorithm and show that it is min-max optimal, that it has the correct constant in the leading term of the mean squared error, and that this constant is the same as the optimal algorithm in the swap model. Our results show that, for mean estimation, the add-remove and swap model give nearly identical error even though the add-remove model cannot treat the size of the dataset as public information. In addition, we demonstrate empirically that our proposed algorithm yields a factor of two improvement in mean squared error over algorithms often used in practice.
    Hacking Task Confounder in Meta-Learning. (arXiv:2312.05771v1 [cs.LG])
    Meta-learning enables rapid generalization to new tasks by learning meta-knowledge from a variety of tasks. It is intuitively assumed that the more tasks a model learns in one training batch, the richer knowledge it acquires, leading to better generalization performance. However, contrary to this intuition, our experiments reveal an unexpected result: adding more tasks within a single batch actually degrades the generalization performance. To explain this unexpected phenomenon, we conduct a Structural Causal Model (SCM) for causal analysis. Our investigation uncovers the presence of spurious correlations between task-specific causal factors and labels in meta-learning. Furthermore, the confounding factors differ across different batches. We refer to these confounding factors as ``Task Confounders". Based on this insight, we propose a plug-and-play Meta-learning Causal Representation Learner (MetaCRL) to eliminate task confounders. It encodes decoupled causal factors from multiple tasks and utilizes an invariant-based bi-level optimization mechanism to ensure their causality for meta-learning. Extensive experiments on various benchmark datasets demonstrate that our work achieves state-of-the-art (SOTA) performance.
    A sampling criterion for constrained Bayesian optimization with uncertainties. (arXiv:2103.05706v4 [stat.ML] UPDATED)
    We consider the problem of chance constrained optimization where it is sought to optimize a function and satisfy constraints, both of which are affected by uncertainties. The real world declinations of this problem are particularly challenging because of their inherent computational cost. To tackle such problems, we propose a new Bayesian optimization method. It applies to the situation where the uncertainty comes from some of the inputs, so that it becomes possible to define an acquisition criterion in the joint controlled-uncontrolled input space. The main contribution of this work is an acquisition criterion that accounts for both the average improvement in objective function and the constraint reliability. The criterion is derived following the Stepwise Uncertainty Reduction logic and its maximization provides both optimal controlled and uncontrolled parameters. Analytical expressions are given to efficiently calculate the criterion. Numerical studies on test functions are presented. It is found through experimental comparisons with alternative sampling criteria that the adequation between the sampling criterion and the problem contributes to the efficiency of the overall optimization. As a side result, an expression for the variance of the improvement is given.
    The Rashomon Importance Distribution: Getting RID of Unstable, Single Model-based Variable Importance. (arXiv:2309.13775v3 [cs.LG] UPDATED)
    Quantifying variable importance is essential for answering high-stakes questions in fields like genetics, public policy, and medicine. Current methods generally calculate variable importance for a given model trained on a given dataset. However, for a given dataset, there may be many models that explain the target outcome equally well; without accounting for all possible explanations, different researchers may arrive at many conflicting yet equally valid conclusions given the same data. Additionally, even when accounting for all possible explanations for a given dataset, these insights may not generalize because not all good explanations are stable across reasonable data perturbations. We propose a new variable importance framework that quantifies the importance of a variable across the set of all good models and is stable across the data distribution. Our framework is extremely flexible and can be integrated with most existing model classes and global variable importance metrics. We demonstrate through experiments that our framework recovers variable importance rankings for complex simulation setups where other methods fail. Further, we show that our framework accurately estimates the true importance of a variable for the underlying data distribution. We provide theoretical guarantees on the consistency and finite sample error rates for our estimator. Finally, we demonstrate its utility with a real-world case study exploring which genes are important for predicting HIV load in persons with HIV, highlighting an important gene that has not previously been studied in connection with HIV. Code is available at https://github.com/jdonnelly36/Rashomon_Importance_Distribution.
    SurvBeNIM: The Beran-Based Neural Importance Model for Explaining the Survival Models. (arXiv:2312.06638v1 [cs.LG])
    A new method called the Survival Beran-based Neural Importance Model (SurvBeNIM) is proposed. It aims to explain predictions of machine learning survival models, which are in the form of survival or cumulative hazard functions. The main idea behind SurvBeNIM is to extend the Beran estimator by incorporating the importance functions into its kernels and by implementing these importance functions as a set of neural networks which are jointly trained in an end-to-end manner. Two strategies of using and training the whole neural network implementing SurvBeNIM are proposed. The first one explains a single instance, and the neural network is trained for each explained instance. According to the second strategy, the neural network only learns once on all instances from the dataset and on all generated instances. Then the neural network is used to explain any instance in a dataset domain. Various numerical experiments compare the method with different existing explanation methods. A code implementing the proposed method is publicly available.
    Learning Unknown Intervention Targets in Structural Causal Models from Heterogeneous Data. (arXiv:2312.06091v1 [cs.LG])
    We study the problem of identifying the unknown intervention targets in structural causal models where we have access to heterogeneous data collected from multiple environments. The unknown intervention targets are the set of endogenous variables whose corresponding exogenous noises change across the environments. We propose a two-phase approach which in the first phase recovers the exogenous noises corresponding to unknown intervention targets whose distributions have changed across environments. In the second phase, the recovered noises are matched with the corresponding endogenous variables. For the recovery phase, we provide sufficient conditions for learning these exogenous noises up to some component-wise invertible transformation. For the matching phase, under the causal sufficiency assumption, we show that the proposed method uniquely identifies the intervention targets. In the presence of latent confounders, the intervention targets among the observed variables cannot be determined uniquely. We provide a candidate intervention target set which is a superset of the true intervention targets. Our approach improves upon the state of the art as the returned candidate set is always a subset of the target set returned by previous work. Moreover, we do not require restrictive assumptions such as linearity of the causal model or performing invariance tests to learn whether a distribution is changing across environments which could be highly sample inefficient. Our experimental results show the effectiveness of our proposed algorithm in practice.
    The Shaped Transformer: Attention Models in the Infinite Depth-and-Width Limit. (arXiv:2306.17759v2 [stat.ML] UPDATED)
    In deep learning theory, the covariance matrix of the representations serves as a proxy to examine the network's trainability. Motivated by the success of Transformers, we study the covariance matrix of a modified Softmax-based attention model with skip connections in the proportional limit of infinite-depth-and-width. We show that at initialization the limiting distribution can be described by a stochastic differential equation (SDE) indexed by the depth-to-width ratio. To achieve a well-defined stochastic limit, the Transformer's attention mechanism is modified by centering the Softmax output at identity, and scaling the Softmax logits by a width-dependent temperature parameter. We examine the stability of the network through the corresponding SDE, showing how the scale of both the drift and diffusion can be elegantly controlled with the aid of residual connections. The existence of a stable SDE implies that the covariance structure is well-behaved, even for very large depth and width, thus preventing the notorious issues of rank degeneracy in deep attention models. Finally, we show, through simulations, that the SDE provides a surprisingly good description of the corresponding finite-size model. We coin the name shaped Transformer for these architectural modifications.
    Composite Survival Analysis: Learning with Auxiliary Aggregated Baselines and Survival Scores. (arXiv:2312.05854v1 [cs.LG])
    Survival Analysis (SA) constitutes the default method for time-to-event modeling due to its ability to estimate event probabilities of sparsely occurring events over time. In this work, we show how to improve the training and inference of SA models by decoupling their full expression into (1) an aggregated baseline hazard, which captures the overall behavior of a given population, and (2) independently distributed survival scores, which model idiosyncratic probabilistic dynamics of its given members, in a fully parametric setting. The proposed inference method is shown to dynamically handle right-censored observation horizons, and to achieve competitive performance when compared to other state-of-the-art methods in a variety of real-world datasets, including computationally inefficient Deep Learning-based SA methods and models that require MCMC for inference. Nevertheless, our method achieves robust results from the outset, while not being subjected to fine-tuning or hyperparameter optimization.
    Revisiting RIP guarantees for sketching operators on mixture models. (arXiv:2312.05573v1 [stat.ML])
    In the context of sketching for compressive mixture modeling, we revisit existing proofs of the Restricted Isometry Property of sketching operators with respect to certain mixtures models. After examining the shortcomings of existing guarantees, we propose an alternative analysis that circumvents the need to assume importance sampling when drawing random Fourier features to build random sketching operators. Our analysis is based on new deterministic bounds on the restricted isometry constant that depend solely on the set of frequencies used to define the sketching operator; then we leverage these bounds to establish concentration inequalities for random sketching operators that lead to the desired RIP guarantees. Our analysis also opens the door to theoretical guarantees for structured sketching with frequencies associated to fast random linear operators.
    Uncertainty quantification in automated valuation models with locally weighted conformal prediction. (arXiv:2312.06531v1 [stat.ML])
    Non-parametric machine learning models, such as random forests and gradient boosted trees, are frequently used to estimate house prices due to their predictive accuracy, but such methods are often limited in their ability to quantify prediction uncertainty. Conformal Prediction (CP) is a model-agnostic framework for constructing confidence sets around machine learning prediction models with minimal assumptions. However, due to the spatial dependencies observed in house prices, direct application of CP leads to confidence sets that are not calibrated everywhere, i.e., too large of confidence sets in certain geographical regions and too small in others. We survey various approaches to adjust the CP confidence set to account for this and demonstrate their performance on a data set from the housing market in Oslo, Norway. Our findings indicate that calibrating the confidence sets on a \textit{locally weighted} version of the non-conformity scores makes the coverage more consistently calibrated in different geographical regions. We also perform a simulation study on synthetically generated sale prices to empirically explore the performance of CP on housing market data under idealized conditions with known data-generating mechanisms.
    Temporal Supervised Contrastive Learning for Modeling Patient Risk Progression. (arXiv:2312.05933v1 [cs.LG])
    We consider the problem of predicting how the likelihood of an outcome of interest for a patient changes over time as we observe more of the patient data. To solve this problem, we propose a supervised contrastive learning framework that learns an embedding representation for each time step of a patient time series. Our framework learns the embedding space to have the following properties: (1) nearby points in the embedding space have similar predicted class probabilities, (2) adjacent time steps of the same time series map to nearby points in the embedding space, and (3) time steps with very different raw feature vectors map to far apart regions of the embedding space. To achieve property (3), we employ a nearest neighbor pairing mechanism in the raw feature space. This mechanism also serves as an alternative to data augmentation, a key ingredient of contrastive learning, which lacks a standard procedure that is adequately realistic for clinical tabular data, to our knowledge. We demonstrate that our approach outperforms state-of-the-art baselines in predicting mortality of septic patients (MIMIC-III dataset) and tracking progression of cognitive impairment (ADNI dataset). Our method also consistently recovers the correct synthetic dataset embedding structure across experiments, a feat not achieved by baselines. Our ablation experiments show the pivotal role of our nearest neighbor pairing.
    Sample-Optimal Locally Private Hypothesis Selection and the Provable Benefits of Interactivity. (arXiv:2312.05645v1 [stat.ML])
    We study the problem of hypothesis selection under the constraint of local differential privacy. Given a class $\mathcal{F}$ of $k$ distributions and a set of i.i.d. samples from an unknown distribution $h$, the goal of hypothesis selection is to pick a distribution $\hat{f}$ whose total variation distance to $h$ is comparable with the best distribution in $\mathcal{F}$ (with high probability). We devise an $\varepsilon$-locally-differentially-private ($\varepsilon$-LDP) algorithm that uses $\Theta\left(\frac{k}{\alpha^2\min \{\varepsilon^2,1\}}\right)$ samples to guarantee that $d_{TV}(h,\hat{f})\leq \alpha + 9 \min_{f\in \mathcal{F}}d_{TV}(h,f)$ with high probability. This sample complexity is optimal for $\varepsilon<1$, matching the lower bound of Gopi et al. (2020). All previously known algorithms for this problem required $\Omega\left(\frac{k\log k}{\alpha^2\min \{ \varepsilon^2 ,1\}} \right)$ samples to work. Moreover, our result demonstrates the power of interaction for $\varepsilon$-LDP hypothesis selection. Namely, it breaks the known lower bound of $\Omega\left(\frac{k\log k}{\alpha^2\min \{ \varepsilon^2 ,1\}} \right)$ for the sample complexity of non-interactive hypothesis selection. Our algorithm breaks this barrier using only $\Theta(\log \log k)$ rounds of interaction. To prove our results, we define the notion of \emph{critical queries} for a Statistical Query Algorithm (SQA) which may be of independent interest. Informally, an SQA is said to use a small number of critical queries if its success relies on the accuracy of only a small number of queries it asks. We then design an LDP algorithm that uses a smaller number of critical queries.
    Modyn: A Platform for Model Training on Dynamic Datasets With Sample-Level Data Selection. (arXiv:2312.06254v1 [cs.LG])
    Machine learning training data is often dynamic in real-world use cases, i.e., data is added or removed and may experience distribution shifts over time. Models must incorporate this evolving training data to improve generalization, adapt to potential distribution shifts, and adhere to privacy regulations. However, the cost of model (re)training is proportional to how often the model trains and on how much data it trains on. While ML research explores these topics in isolation, there is no end-to-end open-source platform to facilitate the exploration of model retraining and data selection policies and the deployment these algorithms efficiently at scale. We present Modyn, a platform for model training on dynamic datasets that enables sample-level data selection and triggering policies. Modyn orchestrates continuous training pipelines while optimizing the underlying system infrastructure to support fast access to arbitrary data samples for efficient data selection. Modyn's extensible architecture allows users to run training pipelines without modifying the platform code, and enables researchers to effortlessly extend the system. We evaluate Modyn's training throughput, showing that even in memory-bound recommendation systems workloads, Modyn is able to reach 80 to 100 % of the throughput compared to loading big chunks of data locally without sample-level data selection. Additionally, we showcase Modyn's functionality with three different data selection policies.
    Spectral Statistics of the Sample Covariance Matrix for High Dimensional Linear Gaussians. (arXiv:2312.05794v1 [math.ST])
    Performance of ordinary least squares(OLS) method for the \emph{estimation of high dimensional stable state transition matrix} $A$(i.e., spectral radius $\rho(A)<1$) from a single noisy observed trajectory of the linear time invariant(LTI)\footnote{Linear Gaussian (LG) in Markov chain literature} system $X_{-}:(x_0,x_1, \ldots,x_{N-1})$ satisfying \begin{equation} x_{t+1}=Ax_{t}+w_{t}, \hspace{10pt} \text{ where } w_{t} \thicksim N(0,I_{n}), \end{equation} heavily rely on negative moments of the sample covariance matrix: $(X_{-}X_{-}^{*})=\sum_{i=0}^{N-1}x_{i}x_{i}^{*}$ and singular values of $EX_{-}^{*}$, where $E$ is a rectangular Gaussian ensemble $E=[w_0, \ldots, w_{N-1}]$. Negative moments requires sharp estimates on all the eigenvalues $\lambda_{1}\big(X_{-}X_{-}^{*}\big) \geq \ldots \geq \lambda_{n}\big(X_{-}X_{-}^{*}\big) \geq 0$. Leveraging upon recent results on spectral theorem for non-Hermitian operators in \cite{naeem2023spectral}, along with concentration of measure phenomenon and perturbation theory(Gershgorins' and Cauchys' interlacing theorem) we show that only when $A=A^{*}$, typical order of $\lambda_{j}\big(X_{-}X_{-}^{*}\big) \in \big[N-n\sqrt{N}, N+n\sqrt{N}\big]$ for all $j \in [n]$. However, in \emph{high dimensions} when $A$ has only one distinct eigenvalue $\lambda$ with geometric multiplicity of one, then as soon as eigenvalue leaves \emph{complex half unit disc}, largest eigenvalue suffers from curse of dimensionality: $\lambda_{1}\big(X_{-}X_{-}^{*}\big)=\Omega\big( \lfloor\frac{N}{n}\rfloor e^{\alpha_{\lambda}n} \big)$, while smallest eigenvalue $\lambda_{n}\big(X_{-}X_{-}^{*}\big) \in (0, N+\sqrt{N}]$. Consequently, OLS estimator incurs a \emph{phase transition} and becomes \emph{transient: increasing iteration only worsens estimation error}, all of this happening when the dynamics are generated from stable systems.
    Probabilistic Precipitation Downscaling with Optical Flow-Guided Diffusion. (arXiv:2312.06071v1 [cs.CV])
    In climate science and meteorology, local precipitation predictions are limited by the immense computational costs induced by the high spatial resolution that simulation methods require. A common workaround is statistical downscaling (aka superresolution), where a low-resolution prediction is super-resolved using statistical approaches. While traditional computer vision tasks mainly focus on human perception or mean squared error, applications in weather and climate require capturing the conditional distribution of high-resolution patterns given low-resolution patterns so that reliable ensemble averages can be taken. Our approach relies on extending recent video diffusion models to precipitation superresolution: an optical flow on the high-resolution output induces temporally coherent predictions, whereas a temporally-conditioned diffusion model generates residuals that capture the correct noise characteristics and high-frequency patterns. We test our approach on X-SHiELD, an established large-scale climate simulation dataset, and compare against two state-of-the-art baselines, focusing on CRPS, MSE, precipitation distributions, as well as an illustrative case -- the complex terrain of California. Our approach sets a new standard for data-driven precipitation downscaling.
    Bidirectional Attention as a Mixture of Continuous Word Experts. (arXiv:2307.04057v2 [cs.CL] UPDATED)
    Bidirectional attention $\unicode{x2013}$ composed of self-attention with positional encodings and the masked language model (MLM) objective $\unicode{x2013}$ has emerged as a key component of modern large language models (LLMs). Despite its empirical success, few studies have examined its statistical underpinnings: What statistical model is bidirectional attention implicitly fitting? What sets it apart from its non-attention predecessors? We explore these questions in this paper. The key observation is that fitting a single-layer single-head bidirectional attention, upon reparameterization, is equivalent to fitting a continuous bag of words (CBOW) model with mixture-of-experts (MoE) weights. Further, bidirectional attention with multiple heads and multiple layers is equivalent to stacked MoEs and a mixture of MoEs, respectively. This statistical viewpoint reveals the distinct use of MoE in bidirectional attention, which aligns with its practical effectiveness in handling heterogeneous data. It also suggests an immediate extension to categorical tabular data, if we view each word location in a sentence as a tabular feature. Across empirical studies, we find that this extension outperforms existing tabular extensions of transformers in out-of-distribution (OOD) generalization. Finally, this statistical perspective of bidirectional attention enables us to theoretically characterize when linear word analogies are present in its word embeddings. These analyses show that bidirectional attention can require much stronger assumptions to exhibit linear word analogies than its non-attention predecessors.
    Structured Inverse-Free Natural Gradient: Memory-Efficient & Numerically-Stable KFAC for Large Neural Nets. (arXiv:2312.05705v1 [cs.LG])
    Second-order methods for deep learning -- such as KFAC -- can be useful for neural net training. However, they are often memory-inefficient and numerically unstable for low-precision training since their preconditioning Kronecker factors are dense, and require high-precision matrix inversion or decomposition. Consequently, such methods are not widely used for training large neural networks such as transformer-based models. We address these two issues by (i) formulating an inverse-free update of KFAC and (ii) imposing structures in each of the Kronecker factors, resulting in a method we term structured inverse-free natural gradient descent (SINGD). On large modern neural networks, we show that, in contrast to KFAC, SINGD is memory efficient and numerically robust, and often outperforms AdamW even in half precision. Hence, our work closes a gap between first-order and second-order methods in modern low precision training for large neural nets.
    Multi-granularity Causal Structure Learning. (arXiv:2312.05549v1 [cs.LG])
    Unveil, model, and comprehend the causal mechanisms underpinning natural phenomena stand as fundamental endeavors across myriad scientific disciplines. Meanwhile, new knowledge emerges when discovering causal relationships from data. Existing causal learning algorithms predominantly focus on the isolated effects of variables, overlook the intricate interplay of multiple variables and their collective behavioral patterns. Furthermore, the ubiquity of high-dimensional data exacts a substantial temporal cost for causal algorithms. In this paper, we develop a novel method called MgCSL (Multi-granularity Causal Structure Learning), which first leverages sparse auto-encoder to explore coarse-graining strategies and causal abstractions from micro-variables to macro-ones. MgCSL then takes multi-granularity variables as inputs to train multilayer perceptrons and to delve the causality between variables. To enhance the efficacy on high-dimensional data, MgCSL introduces a simplified acyclicity constraint to adeptly search the directed acyclic graph among variables. Experimental results show that MgCSL outperforms competitive baselines, and finds out explainable causal connections on fMRI datasets.
    Data-driven optimal stopping: A pure exploration analysis. (arXiv:2312.05880v1 [math.ST])
    The standard theory of optimal stopping is based on the idealised assumption that the underlying process is essentially known. In this paper, we drop this restriction and study data-driven optimal stopping for a general diffusion process, focusing on investigating the statistical performance of the proposed estimator of the optimal stopping barrier. More specifically, we derive non-asymptotic upper bounds on the simple regret, along with uniform and non-asymptotic PAC bounds. Minimax optimality is verified by completing the upper bound results with matching lower bounds on the simple regret. All results are shown both under general conditions on the payoff functions and under more refined assumptions that mimic the margin condition used in binary classification, leading to an improved rate of convergence. Additionally, we investigate how our results on the simple regret transfer to the cumulative regret for a specific exploration-exploitation strategy, both with respect to lower bounds and upper bounds.
    Conditional Stochastic Interpolation for Generative Learning. (arXiv:2312.05579v1 [stat.ML])
    We propose a conditional stochastic interpolation (CSI) approach to learning conditional distributions. CSI learns probability flow equations or stochastic differential equations that transport a reference distribution to the target conditional distribution. This is achieved by first learning the drift function and the conditional score function based on conditional stochastic interpolation, which are then used to construct a deterministic process governed by an ordinary differential equation or a diffusion process for conditional sampling. In our proposed CSI model, we incorporate an adaptive diffusion term to address the instability issues arising during the training process. We provide explicit forms of the conditional score function and the drift function in terms of conditional expectations under mild conditions, which naturally lead to an nonparametric regression approach to estimating these functions. Furthermore, we establish non-asymptotic error bounds for learning the target conditional distribution via conditional stochastic interpolation in terms of KL divergence, taking into account the neural network approximation error. We illustrate the application of CSI on image generation using a benchmark image dataset.
    Online Statistical Inference for Stochastic Optimization via Kiefer-Wolfowitz Methods. (arXiv:2102.03389v5 [math.ST] UPDATED)
    This paper investigates the problem of online statistical inference of model parameters in stochastic optimization problems via the Kiefer-Wolfowitz algorithm with random search directions. We first present the asymptotic distribution for the Polyak-Ruppert-averaging type Kiefer-Wolfowitz (AKW) estimators, whose asymptotic covariance matrices depend on the distribution of search directions and the function-value query complexity. The distributional result reflects the trade-off between statistical efficiency and function query complexity. We further analyze the choice of random search directions to minimize certain summary statistics of the asymptotic covariance matrix. Based on the asymptotic distribution, we conduct online statistical inference by providing two construction procedures of valid confidence intervals.
    Debiased Machine Learning and Network Cohesion for Doubly-Robust Differential Reward Models in Contextual Bandits. (arXiv:2312.06403v1 [stat.ML])
    A common approach to learning mobile health (mHealth) intervention policies is linear Thompson sampling. Two desirable mHealth policy features are (1) pooling information across individuals and time and (2) incorporating a time-varying baseline reward. Previous approaches pooled information across individuals but not time, failing to capture trends in treatment effects over time. In addition, these approaches did not explicitly model the baseline reward, which limited the ability to precisely estimate the parameters in the differential reward model. In this paper, we propose a novel Thompson sampling algorithm, termed ''DML-TS-NNR'' that leverages (1) nearest-neighbors to efficiently pool information on the differential reward function across users and time and (2) the Double Machine Learning (DML) framework to explicitly model baseline rewards and stay agnostic to the supervised learning algorithms used. By explicitly modeling baseline rewards, we obtain smaller confidence sets for the differential reward parameters. We offer theoretical guarantees on the pseudo-regret, which are supported by empirical results. Importantly, the DML-TS-NNR algorithm demonstrates robustness to potential misspecifications in the baseline reward model.
    SAM as an Optimal Relaxation of Bayes. (arXiv:2210.01620v3 [cs.LG] UPDATED)
    Sharpness-aware minimization (SAM) and related adversarial deep-learning methods can drastically improve generalization, but their underlying mechanisms are not yet fully understood. Here, we establish SAM as a relaxation of the Bayes objective where the expected negative-loss is replaced by the optimal convex lower bound, obtained by using the so-called Fenchel biconjugate. The connection enables a new Adam-like extension of SAM to automatically obtain reasonable uncertainty estimates, while sometimes also improving its accuracy. By connecting adversarial and Bayesian methods, our work opens a new path to robustness.
    Almost Equivariance via Lie Algebra Convolutions. (arXiv:2310.13164v3 [cs.LG] UPDATED)
    Recently, the equivariance of models with respect to a group action has become an important topic of research in machine learning. Analysis of the built-in equivariance of existing neural network architectures, as well as the study of building models that explicitly "bake in" equivariance, have become significant research areas in their own right. However, imbuing an architecture with a specific group equivariance imposes a strong prior on the types of data transformations that the model expects to see. While strictly-equivariant models enforce symmetries, real-world data does not always conform to such strict equivariances. In such cases, the prior of strict equivariance can actually prove too strong and cause models to underperform. Therefore, in this work we study a closely related topic, that of almost equivariance. We provide a definition of almost equivariance and give a practical method for encoding almost equivariance in models by appealing to the Lie algebra of a Lie group. Specifically, we define Lie algebra convolutions and demonstrate that they offer several benefits over Lie group convolutions, including being well-defined for non-compact Lie groups having non-surjective exponential map. From there, we demonstrate connections between the notions of equivariance and isometry and those of almost equivariance and almost isometry. We prove two existence theorems, one showing the existence of almost isometries within bounded distance of isometries of a manifold, and another showing the converse for Hilbert spaces. We extend these theorems to prove the existence of almost equivariant manifold embeddings within bounded distance of fully equivariant embedding functions, subject to certain constraints on the group action and the function class. Finally, we demonstrate the validity of our approach by benchmarking against datasets in fully equivariant and almost equivariant settings.
    Trust Your $\nabla$: Gradient-based Intervention Targeting for Causal Discovery. (arXiv:2211.13715v3 [stat.ML] UPDATED)
    Inferring causal structure from data is a challenging task of fundamental importance in science. Observational data are often insufficient to identify a system's causal structure uniquely. While conducting interventions (i.e., experiments) can improve the identifiability, such samples are usually challenging and expensive to obtain. Hence, experimental design approaches for causal discovery aim to minimize the number of interventions by estimating the most informative intervention target. In this work, we propose a novel Gradient-based Intervention Targeting method, abbreviated GIT, that 'trusts' the gradient estimator of a gradient-based causal discovery framework to provide signals for the intervention acquisition function. We provide extensive experiments in simulated and real-world datasets and demonstrate that GIT performs on par with competitive baselines, surpassing them in the low-data regime.
    Deep Bayes Factors. (arXiv:2312.05411v1 [stat.ME])
    The is no other model or hypothesis verification tool in Bayesian statistics that is as widely used as the Bayes factor. We focus on generative models that are likelihood-free and, therefore, render the computation of Bayes factors (marginal likelihood ratios) far from obvious. We propose a deep learning estimator of the Bayes factor based on simulated data from two competing models using the likelihood ratio trick. This estimator is devoid of summary statistics and obviates some of the difficulties with ABC model choice. We establish sufficient conditions for consistency of our Deep Bayes Factor estimator as well as its consistency as a model selection tool. We investigate the performance of our estimator on various examples using a wide range of quality metrics related to estimation and model decision accuracy. After training, our deep learning approach enables rapid evaluations of the Bayes factor estimator at any fictional data arriving from either hypothesized model, not just the observed data $Y_0$. This allows us to inspect entire Bayes factor distributions under the two models and to quantify the relative location of the Bayes factor evaluated at $Y_0$ in light of these distributions. Such tail area evaluations are not possible for Bayes factor estimators tailored to $Y_0$. We find the performance of our Deep Bayes Factors competitive with existing MCMC techniques that require the knowledge of the likelihood function. We also consider variants for posterior or intrinsic Bayes factors estimation. We demonstrate the usefulness of our approach on a relatively high-dimensional real data example about determining cognitive biases.
    Consistency Models for Scalable and Fast Simulation-Based Inference. (arXiv:2312.05440v1 [cs.LG])
    Simulation-based inference (SBI) is constantly in search of more expressive algorithms for accurately inferring the parameters of complex models from noisy data. We present consistency models for neural posterior estimation (CMPE), a new free-form conditional sampler for scalable, fast, and amortized SBI with generative neural networks. CMPE combines the advantages of normalizing flows and flow matching methods into a single generative architecture: It essentially distills a continuous probability flow and enables rapid few-shot inference with an unconstrained architecture that can be tailored to the structure of the estimation problem. Our empirical evaluation demonstrates that CMPE not only outperforms current state-of-the-art algorithms on three hard low-dimensional problems, but also achieves competitive performance in a high-dimensional Bayesian denoising experiment and in estimating a computationally demanding multi-scale model of tumor spheroid growth.
    Concurrent Density Estimation with Wasserstein Autoencoders: Some Statistical Insights. (arXiv:2312.06591v1 [stat.ML])
    Variational Autoencoders (VAEs) have been a pioneering force in the realm of deep generative models. Amongst its legions of progenies, Wasserstein Autoencoders (WAEs) stand out in particular due to the dual offering of heightened generative quality and a strong theoretical backbone. WAEs consist of an encoding and a decoding network forming a bottleneck with the prime objective of generating new samples resembling the ones it was catered to. In the process, they aim to achieve a target latent representation of the encoded data. Our work is an attempt to offer a theoretical understanding of the machinery behind WAEs. From a statistical viewpoint, we pose the problem as concurrent density estimation tasks based on neural network-induced transformations. This allows us to establish deterministic upper bounds on the realized errors WAEs commit. We also analyze the propagation of these stochastic errors in the presence of adversaries. As a result, both the large sample properties of the reconstructed distribution and the resilience of WAE models are explored.
    Large-Scale Quantum Separability Through a Reproducible Machine Learning Lens. (arXiv:2306.09444v2 [quant-ph] UPDATED)
    The quantum separability problem consists in deciding whether a bipartite density matrix is entangled or separable. In this work, we propose a machine learning pipeline for finding approximate solutions for this NP-hard problem in large-scale scenarios. We provide an efficient Frank-Wolfe-based algorithm to approximately seek the nearest separable density matrix and derive a systematic way for labeling density matrices as separable or entangled, allowing us to treat quantum separability as a classification problem. Our method is applicable to any two-qudit mixed states. Numerical experiments with quantum states of 3- and 7-dimensional qudits validate the efficiency of the proposed procedure, and demonstrate that it scales up to thousands of density matrices with a high quantum entanglement detection accuracy. This takes a step towards benchmarking quantum separability to support the development of more powerful entanglement detection techniques.
    Skew Probabilistic Neural Networks for Learning from Imbalanced Data. (arXiv:2312.05878v1 [stat.ML])
    Real-world datasets often exhibit imbalanced data distribution, where certain class levels are severely underrepresented. In such cases, traditional pattern classifiers have shown a bias towards the majority class, impeding accurate predictions for the minority class. This paper introduces an imbalanced data-oriented approach using probabilistic neural networks (PNNs) with a skew normal probability kernel to address this major challenge. PNNs are known for providing probabilistic outputs, enabling quantification of prediction confidence and uncertainty handling. By leveraging the skew normal distribution, which offers increased flexibility, particularly for imbalanced and non-symmetric data, our proposed Skew Probabilistic Neural Networks (SkewPNNs) can better represent underlying class densities. To optimize the performance of the proposed approach on imbalanced datasets, hyperparameter fine-tuning is imperative. To this end, we employ a population-based heuristic algorithm, Bat optimization algorithms, for effectively exploring the hyperparameter space. We also prove the statistical consistency of the density estimates which suggests that the true distribution will be approached smoothly as the sample size increases. Experimental simulations have been conducted on different synthetic datasets, comparing various benchmark-imbalanced learners. Our real-data analysis shows that SkewPNNs substantially outperform state-of-the-art machine learning methods for both balanced and imbalanced datasets in most experimental settings.  ( 2 min )
    Data fission: splitting a single data point. (arXiv:2112.11079v9 [stat.ME] UPDATED)
    Suppose we observe a random vector $X$ from some distribution $P$ in a known family with unknown parameters. We ask the following question: when is it possible to split $X$ into two parts $f(X)$ and $g(X)$ such that neither part is sufficient to reconstruct $X$ by itself, but both together can recover $X$ fully, and the joint distribution of $(f(X),g(X))$ is tractable? As one example, if $X=(X_1,\dots,X_n)$ and $P$ is a product distribution, then for any $m<n$, we can split the sample to define $f(X)=(X_1,\dots,X_m)$ and $g(X)=(X_{m+1},\dots,X_n)$. Rasines and Young (2022) offers an alternative approach that uses additive Gaussian noise -- this enables post-selection inference in finite samples for Gaussian distributed data and asymptotically when errors are non-Gaussian. In this paper, we offer a more general methodology for achieving such a split in finite samples by borrowing ideas from Bayesian inference to yield a (frequentist) solution that can be viewed as a continuous analog of data splitting. We call our method data fission, as an alternative to data splitting, data carving and p-value masking. We exemplify the method on a few prototypical applications, such as post-selection inference for trend filtering and other regression problems.  ( 3 min )
    A Survey of Deep Causal Models and Their Industrial Applications. (arXiv:2209.08860v5 [stat.ML] UPDATED)
    The notion of causality assumes a paramount position within the realm of human cognition. Over the past few decades, there has been significant advancement in the domain of causal effect estimation across various disciplines, including but not limited to computer science, medicine, economics, and industrial applications. Given the continued advancements in deep learning methodologies, there has been a notable surge in its utilization for the estimation of causal effects using counterfactual data. Typically, deep causal models map the characteristics of covariates to a representation space and then design various objective functions to estimate counterfactual data unbiasedly. Different from the existing surveys on causal models in machine learning, this review mainly focuses on the overview of the deep causal models, and its core contributions are as follows: 1) we cast insight on a comprehensive overview of deep causal models from both timeline of development and method classification perspectives; 2) we outline some typical applications of causal effect estimation to industry; 3) we also endeavor to present a detailed categorization and analysis on relevant datasets, source codes and experiments.  ( 2 min )
    Restless Bandits with Average Reward: Breaking the Uniform Global Attractor Assumption. (arXiv:2306.00196v2 [cs.LG] UPDATED)
    We study the infinite-horizon Restless Bandit problem with the average reward criterion, under both discrete-time and continuous-time settings. A fundamental goal is to design computationally efficient policies that achieve a diminishing optimality gap as the number of arms, $N$, grows large. Existing results on asymptotic optimality all rely on the uniform global attractor property (UGAP), a complex and challenging-to-verify assumption. In this paper, we propose a general, simulation-based framework, Follow-the-Virtual-Advice, that converts any single-armed policy into a policy for the original $N$-armed problem. This is done by simulating the single-armed policy on each arm and carefully steering the real state towards the simulated state. Our framework can be instantiated to produce a policy with an $O(1/\sqrt{N})$ optimality gap. In the discrete-time setting, our result holds under a simpler synchronization assumption, which covers some problem instances that violate UGAP. More notably, in the continuous-time setting, we do not require any additional assumptions beyond the standard unichain condition. In both settings, our work is the first asymptotic optimality result that does not require UGAP.  ( 2 min )
    Ensemble Kalman Filtering-Aided Variational Inference for Gaussian Process State-Space Models. (arXiv:2312.05910v1 [cs.LG])
    Gaussian process state-space models (GPSSMs) provide a principled and flexible approach to model latent state dynamics observed through emission models. However, existing variational methods for learning GPSSMs face a substantial challenge in optimizing a large number of parameters, particularly with the introduction of amortized inference networks. To address this challenge, we leverage the ensemble Kalman filter (EnKF), a well-established model-based filtering technique, to approximate the posterior distribution of latent states within the variational inference framework. This approach eliminates the need for inference networks, significantly reducing the number of variational parameters. Moreover, we demonstrate that with the aid of EnKF, the straightforward evaluation of approximated evidence lower bound (ELBO) in the variational inference can be easily obtained through the summation of multiple terms with closed-form solutions. By leveraging automatic differentiation tools, we thus can maximize the ELBO and train the GPSSM efficiently. We also extend the proposed method to an online setting and provide comprehensive algorithm analyses and insights. Extensive testing on diverse real and simulated datasets demonstrates that our variational inference algorithms, integrated with EnKF, outperform existing methods in terms of learning and inference performance.  ( 2 min )
    Multi-Domain Causal Representation Learning via Weak Distributional Invariances. (arXiv:2310.02854v3 [cs.LG] UPDATED)
    Causal representation learning has emerged as the center of action in causal machine learning research. In particular, multi-domain datasets present a natural opportunity for showcasing the advantages of causal representation learning over standard unsupervised representation learning. While recent works have taken crucial steps towards learning causal representations, they often lack applicability to multi-domain datasets due to over-simplifying assumptions about the data; e.g. each domain comes from a different single-node perfect intervention. In this work, we relax these assumptions and capitalize on the following observation: there often exists a subset of latents whose certain distributional properties (e.g., support, variance) remain stable across domains; this property holds when, for example, each domain comes from a multi-node imperfect intervention. Leveraging this observation, we show that autoencoders that incorporate such invariances can provably identify the stable set of latents from the rest across different settings.  ( 2 min )
    Efficient sampling from the Bingham distribution. (arXiv:2010.00137v2 [cs.LG] UPDATED)
    We give a algorithm for exact sampling from the Bingham distribution $p(x)\propto \exp(x^\top A x)$ on the sphere $\mathcal S^{d-1}$ with expected runtime of $\operatorname{poly}(d, \lambda_{\max}(A)-\lambda_{\min}(A))$. The algorithm is based on rejection sampling, where the proposal distribution is a polynomial approximation of the pdf, and can be sampled from by explicitly evaluating integrals of polynomials over the sphere. Our algorithm gives exact samples, assuming exact computation of an inverse function of a polynomial. This is in contrast with Markov Chain Monte Carlo algorithms, which are not known to enjoy rapid mixing on this problem, and only give approximate samples. As a direct application, we use this to sample from the posterior distribution of a rank-1 matrix inference problem in polynomial time.  ( 2 min )
    Learning Bayesian Networks with Heterogeneous Agronomic Data Sets via Mixed-Effect Models and Hierarchical Clustering. (arXiv:2308.06399v4 [stat.ML] UPDATED)
    Maize, a crucial crop globally cultivated across vast regions, especially in sub-Saharan Africa, Asia, and Latin America, occupies 197 million hectares as of 2021. Various statistical and machine learning models, including mixed-effect models, random coefficients models, random forests, and deep learning architectures, have been devised to predict maize yield. These models consider factors such as genotype, environment, genotype-environment interaction, and field management. However, the existing models often fall short of fully exploiting the complex network of causal relationships among these factors and the hierarchical structure inherent in agronomic data. This study introduces an innovative approach integrating random effects into Bayesian networks (BNs), leveraging their capacity to model causal and probabilistic relationships through directed acyclic graphs. Rooted in the linear mixed-effects models framework and tailored for hierarchical data, this novel approach demonstrates enhanced BN learning. Application to a real-world agronomic trial produces a model with improved interpretability, unveiling new causal connections. Notably, the proposed method significantly reduces the error rate in maize yield prediction from 28% to 17%. These results advocate for the preference of BNs in constructing practical decision support tools for hierarchical agronomic data, facilitating causal inference.  ( 3 min )
    Multi-source domain adaptation for regression. (arXiv:2312.05460v1 [stat.ML])
    Multi-source domain adaptation (DA) aims at leveraging information from more than one source domain to make predictions in a target domain, where different domains may have different data distributions. Most existing methods for multi-source DA focus on classification problems while there is only limited investigation in the regression settings. In this paper, we fill in this gap through a two-step procedure. First, we extend a flexible single-source DA algorithm for classification through outcome-coarsening to enable its application to regression problems. We then augment our single-source DA algorithm for regression with ensemble learning to achieve multi-source DA. We consider three learning paradigms in the ensemble algorithm, which combines linearly the target-adapted learners trained with each source domain: (i) a multi-source stacking algorithm to obtain the ensemble weights; (ii) a similarity-based weighting where the weights reflect the quality of DA of each target-adapted learner; and (iii) a combination of the stacking and similarity weights. We illustrate the performance of our algorithms with simulations and a data application where the goal is to predict High-density lipoprotein (HDL) cholesterol levels using gut microbiome. We observe a consistent improvement in prediction performance of our multi-source DA algorithm over the routinely used methods in all these scenarios.  ( 2 min )
    Compressive Recovery of Sparse Precision Matrices. (arXiv:2311.04673v2 [stat.ML] UPDATED)
    We consider the problem of learning a graph modeling the statistical relations of the $d$ variables from a dataset with $n$ samples $X \in \mathbb{R}^{n \times d}$. Standard approaches amount to searching for a precision matrix $\Theta$ representative of a Gaussian graphical model that adequately explains the data. However, most maximum likelihood-based estimators usually require storing the $d^{2}$ values of the empirical covariance matrix, which can become prohibitive in a high-dimensional setting. In this work, we adopt a compressive viewpoint and aim to estimate a sparse $\Theta$ from a \emph{sketch} of the data, i.e. a low-dimensional vector of size $m \ll d^{2}$ carefully designed from $X$ using non-linear random features. Under certain assumptions on the spectrum of $\Theta$ (or its condition number), we show that it is possible to estimate it from a sketch of size $m=\Omega\left((d+2k)\log(d)\right)$ where $k$ is the maximal number of edges of the underlying graph. These information-theoretic guarantees are inspired by compressed sensing theory and involve restricted isometry properties and instance optimal decoders. We investigate the possibility of achieving practical recovery with an iterative algorithm based on the graphical lasso, viewed as a specific denoiser. We compare our approach and graphical lasso on synthetic datasets, demonstrating its favorable performance even when the dataset is compressed.  ( 2 min )
    TaCo: Targeted Concept Removal in Output Embeddings for NLP via Information Theory and Explainability. (arXiv:2312.06499v1 [cs.CL])
    The fairness of Natural Language Processing (NLP) models has emerged as a crucial concern. Information theory indicates that to achieve fairness, a model should not be able to predict sensitive variables, such as gender, ethnicity, and age. However, information related to these variables often appears implicitly in language, posing a challenge in identifying and mitigating biases effectively. To tackle this issue, we present a novel approach that operates at the embedding level of an NLP model, independent of the specific architecture. Our method leverages insights from recent advances in XAI techniques and employs an embedding transformation to eliminate implicit information from a selected variable. By directly manipulating the embeddings in the final layer, our approach enables a seamless integration into existing models without requiring significant modifications or retraining. In evaluation, we show that the proposed post-hoc approach significantly reduces gender-related associations in NLP models while preserving the overall performance and functionality of the models. An implementation of our method is available: https://github.com/fanny-jourdan/TaCo  ( 2 min )
    Lassoed Tree Boosting. (arXiv:2205.10697v6 [stat.ML] UPDATED)
    Gradient boosting performs exceptionally in most prediction problems and scales well to large datasets. In this paper we prove that a ``lassoed'' gradient boosted tree algorithm with early stopping achieves faster than $n^{-1/4}$ L2 convergence in the large nonparametric space of cadlag functions of bounded sectional variation. This rate is remarkable because it does not depend on the dimension, sparsity, or smoothness. We use simulation and real data to confirm our theory and demonstrate empirical performance and scalability on par with standard boosting. Our convergence proofs are based on a novel, general theorem on early stopping with empirical loss minimizers of nested Donsker classes.  ( 2 min )
    Rational Kriging. (arXiv:2312.05372v1 [stat.ME])
    This article proposes a new kriging that has a rational form. It is shown that the generalized least squares estimate of the mean from rational kriging is much more well behaved than that from ordinary kriging. Parameter estimation and uncertainty quantification for rational kriging are proposed using a Gaussian process framework. Its potential applications in emulation and calibration of computer models are also discussed.  ( 2 min )
    An Ambiguity Measure for Recognizing the Unknowns in Deep Learning. (arXiv:2312.06077v1 [cs.LG])
    We study the understanding of deep neural networks from the scope in which they are trained on. While the accuracy of these models is usually impressive on the aggregate level, they still make mistakes, sometimes on cases that appear to be trivial. Moreover, these models are not reliable in realizing what they do not know leading to failures such as adversarial vulnerability and out-of-distribution failures. Here, we propose a measure for quantifying the ambiguity of inputs for any given model with regard to the scope of its training. We define the ambiguity based on the geometric arrangements of the decision boundaries and the convex hull of training set in the feature space learned by the trained model, and demonstrate that a single ambiguity measure may detect a considerable portion of mistakes of a model on in-distribution samples, adversarial inputs, as well as out-of-distribution inputs. Using our ambiguity measure, a model may abstain from classification when it encounters ambiguous inputs leading to a better model accuracy not just on a given testing set, but on the inputs it may encounter at the world at large. In pursuit of this measure, we develop a theoretical framework that can identify the unknowns of the model in relation to its scope. We put this in perspective with the confidence of the model and develop formulations to identify the regions of the domain which are unknown to the model, yet the model is guaranteed to have high confidence.  ( 2 min )
    Federated Multilinear Principal Component Analysis with Applications in Prognostics. (arXiv:2312.06050v1 [cs.LG])
    Multilinear Principal Component Analysis (MPCA) is a widely utilized method for the dimension reduction of tensor data. However, the integration of MPCA into federated learning remains unexplored in existing research. To tackle this gap, this article proposes a Federated Multilinear Principal Component Analysis (FMPCA) method, which enables multiple users to collaboratively reduce the dimension of their tensor data while keeping each user's data local and confidential. The proposed FMPCA method is guaranteed to have the same performance as traditional MPCA. An application of the proposed FMPCA in industrial prognostics is also demonstrated. Simulated data and a real-world data set are used to validate the performance of the proposed method.  ( 2 min )
    Statistical Spatially Inhomogeneous Diffusion Inference. (arXiv:2312.05793v1 [stat.ML])
    Inferring a diffusion equation from discretely-observed measurements is a statistical challenge of significant importance in a variety of fields, from single-molecule tracking in biophysical systems to modeling financial instruments. Assuming that the underlying dynamical process obeys a $d$-dimensional stochastic differential equation of the form $$\mathrm{d}\boldsymbol{x}_t=\boldsymbol{b}(\boldsymbol{x}_t)\mathrm{d} t+\Sigma(\boldsymbol{x}_t)\mathrm{d}\boldsymbol{w}_t,$$ we propose neural network-based estimators of both the drift $\boldsymbol{b}$ and the spatially-inhomogeneous diffusion tensor $D = \Sigma\Sigma^{T}$ and provide statistical convergence guarantees when $\boldsymbol{b}$ and $D$ are $s$-H\"older continuous. Notably, our bound aligns with the minimax optimal rate $N^{-\frac{2s}{2s+d}}$ for nonparametric function estimation even in the presence of correlation within observational data, which necessitates careful handling when establishing fast-rate generalization bounds. Our theoretical results are bolstered by numerical experiments demonstrating accurate inference of spatially-inhomogeneous diffusion tensors.  ( 2 min )
    Discovering Dynamic Causal Space for DAG Structure Learning. (arXiv:2306.02822v3 [cs.LG] UPDATED)
    Discovering causal structure from purely observational data (i.e., causal discovery), aiming to identify causal relationships among variables, is a fundamental task in machine learning. The recent invention of differentiable score-based DAG learners is a crucial enabler, which reframes the combinatorial optimization problem into a differentiable optimization with a DAG constraint over directed graph space. Despite their great success, these cutting-edge DAG learners incorporate DAG-ness independent score functions to evaluate the directed graph candidates, lacking in considering graph structure. As a result, measuring the data fitness alone regardless of DAG-ness inevitably leads to discovering suboptimal DAGs and model vulnerabilities. Towards this end, we propose a dynamic causal space for DAG structure learning, coined CASPER, that integrates the graph structure into the score function as a new measure in the causal space to faithfully reflect the causal distance between estimated and ground truth DAG. CASPER revises the learning process as well as enhances the DAG structure learning via adaptive attention to DAG-ness. Grounded by empirical visualization, CASPER, as a space, satisfies a series of desired properties, such as structure awareness and noise robustness. Extensive experiments on both synthetic and real-world datasets clearly validate the superiority of our CASPER over the state-of-the-art causal discovery methods in terms of accuracy and robustness.  ( 3 min )
    FastPart: Over-Parameterized Stochastic Gradient Descent for Sparse optimisation on Measures. (arXiv:2312.05993v1 [math.OC])
    This paper presents a novel algorithm that leverages Stochastic Gradient Descent strategies in conjunction with Random Features to augment the scalability of Conic Particle Gradient Descent (CPGD) specifically tailored for solving sparse optimisation problems on measures. By formulating the CPGD steps within a variational framework, we provide rigorous mathematical proofs demonstrating the following key findings: (i) The total variation norms of the solution measures along the descent trajectory remain bounded, ensuring stability and preventing undesirable divergence; (ii) We establish a global convergence guarantee with a convergence rate of $\mathcal{O}(\log(K)/\sqrt{K})$ over $K$ iterations, showcasing the efficiency and effectiveness of our algorithm; (iii) Additionally, we analyze and establish local control over the first-order condition discrepancy, contributing to a deeper understanding of the algorithm's behavior and reliability in practical applications.  ( 2 min )
    Finite-sample Identification of Continuous-time Parameter-linear Systems. (arXiv:2312.05382v1 [eess.SY])
    Differentiating noisy, discrete measurements in order to fit an ordinary differential equation can be unreasonably effective. Assuming square-integrable noise and minimal flow regularity, we construct and analyze a finite-difference differentiation filter and a Tikhonov-regularized least squares estimator for the continuous-time parameter-linear system. Combining these contributions in series, we obtain a finite-sample bound on mean absolute error of estimation. As a by-product, we offer a novel analysis of stochastically perturbed Moore-Penrose pseudoinverses.  ( 2 min )
    Fused Extended Two-Way Fixed Effects for Difference-in-Differences with Staggered Adoptions. (arXiv:2312.05985v1 [econ.EM])
    To address the bias of the canonical two-way fixed effects estimator for difference-in-differences under staggered adoptions, Wooldridge (2021) proposed the extended two-way fixed effects estimator, which adds many parameters. However, this reduces efficiency. Restricting some of these parameters to be equal helps, but ad hoc restrictions may reintroduce bias. We propose a machine learning estimator with a single tuning parameter, fused extended two-way fixed effects (FETWFE), that enables automatic data-driven selection of these restrictions. We prove that under an appropriate sparsity assumption FETWFE identifies the correct restrictions with probability tending to one. We also prove the consistency, asymptotic normality, and oracle efficiency of FETWFE for two classes of heterogeneous marginal treatment effect estimators under either conditional or marginal parallel trends, and we prove consistency for two classes of conditional average treatment effects under conditional parallel trends. We demonstrate FETWFE in simulation studies and an empirical application.  ( 2 min )

  • Open

    Abstracts: December 12, 2023
    Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements.  In this episode, Senior Principal Research Manager Tao Qin and Senior Researcher Lijun Wu discuss “FABind: Fast and Accurate Protein-Ligand Binding.” The […] The post Abstracts: December 12, 2023 appeared first on Microsoft Research.  ( 13 min )
    Steering at the Frontier: Extending the Power of Prompting
    We’re seeing exciting capabilities of frontier foundation models, including intriguing powers of abstraction, generalization, and composition across numerous areas of knowledge and expertise. Even seasoned AI researchers have been impressed with the ability to steer the models with straightforward, zero-shot prompts. Beyond basic, out-of-the-box prompting, we’ve been exploring new prompting strategies, showcased in our Medprompt work, to […] The post Steering at the Frontier: Extending the Power of Prompting appeared first on Microsoft Research.  ( 9 min )
    Phi-2: The surprising power of small language models
    Phi-2 is now accessible on the Azure model catalog. Its compact size and new innovations in model scaling and training data curation make it ideal for exploration around mechanistic interpretability, safety improvements, and fine-tuning experimentation on a variety of tasks. The post Phi-2: The surprising power of small language models appeared first on Microsoft Research.  ( 11 min )
  • Open

    Create a web UI to interact with LLMs using Amazon SageMaker JumpStart
    The launch of ChatGPT and rise in popularity of generative AI have captured the imagination of customers who are curious about how they can use this technology to create new products and services on AWS, such as enterprise chatbots, which are more conversational. This post shows you how you can create a web UI, which […]  ( 9 min )
    Frugality meets Accuracy: Cost-efficient training of GPT NeoX and Pythia models with AWS Trainium
    Large language models (or LLMs) have become a topic of daily conversations. Their quick adoption is evident by the amount of time required to reach a 100 million users, which has gone from “4.5yrs by facebook” to an all-time low of mere “2 months by ChatGPT.” A generative pre-trained transformer (GPT) uses causal autoregressive updates […]  ( 7 min )
    Vodafone advances its machine learning skills with AWS DeepRacer and Accenture
    Vodafone is transitioning from a telecommunications company (telco) to a technology company (TechCo) by 2025, with objectives of innovating faster, reducing costs, improving security, and simplifying operations. Thousands of engineers are being onboarded to contribute to this transition. By 2025, Vodafone plans to have 50% of its global workforce actively involved in software development, with […]  ( 6 min )
  • Open

    A computer scientist pushes the boundaries of geometry
    Justin Solomon applies modern geometric techniques to solve problems in computer vision, machine learning, statistics, and beyond.  ( 10 min )
  • Open

    DSC Weekly 12 December 2023
    Announcements Top Stories In-Depth The post DSC Weekly 12 December 2023 appeared first on Data Science Central.  ( 20 min )
    Data science transformations for 2024 and beyond
    Data science has come a long way! Using basic statistical models, 19th-century organizations gathered, stored, and processed data. Later, when computers entered the picture, the digital age started producing enormous volumes of data. The proliferation of data on the internet has revolutionized communication, and the field of data science has grown because of the necessity… Read More »Data science transformations for 2024 and beyond The post Data science transformations for 2024 and beyond appeared first on Data Science Central.  ( 21 min )
  • Open

    The ed line editor: bravado, utility, and history
    I stumbled on the book Ed Mastery by Michael W. Lucas and couldn’t tell immediately whether it was serious. In a sort of technical version of Poe’s law, Lucas lays on the technical machismo pretty thick, but not thicker than some people do unironically. Bravado Here’s a paragraph from early in the book. Many younger […] The ed line editor: bravado, utility, and history first appeared on John D. Cook.  ( 6 min )
    sin(cos(x)) versus cos(sin(x))
    A few days ago I wrote about sufficient conditions for f(g(x)) to bound g(f(x)). This evening I stumbled on an analogous theorem. For real numbers γ and δ, cos(γ sin(x)) > sin(δ cos(x)) for all real x provided γ² + δ² < (π/2)². Source: American Mathematical Monthly. February 2009. Solution to problem 11309, page 184. […] sin(cos(x)) versus cos(sin(x)) first appeared on John D. Cook.  ( 5 min )
  • Open

    Meet NANA, Moonshine Studio’s AI-Powered Receptionist Avatar
    The creative team at Moonshine Studio — an artist-focused visual effects (VFX) studio specializing in animation and motion design — was tasked to solve a problem.  ( 7 min )
  • Open

    Graph Neural Networks with polynomial activations have limited expressivity. (arXiv:2310.13139v3 [cs.LG] UPDATED)
    The expressivity of Graph Neural Networks (GNNs) can be entirely characterized by appropriate fragments of the first-order logic. Namely, any query of the two variable fragment of graded modal logic (GC2) interpreted over labeled graphs can be expressed using a GNN whose size depends only on the depth of the query. As pointed out by [Barcelo & Al., 2020, Grohe, 2021], this description holds for a family of activation functions, leaving the possibility for a hierarchy of logics expressible by GNNs depending on the chosen activation function. In this article, we show that such hierarchy indeed exists by proving that GC2 queries cannot be expressed by GNNs with polynomial activation functions. This implies a separation between polynomial and popular non-polynomial activations (such as Rectified Linear Units) and answers an open question formulated by [Grohe, 2021].  ( 2 min )
    Cycle Consistency Driven Object Discovery. (arXiv:2306.02204v2 [cs.CV] UPDATED)
    Developing deep learning models that effectively learn object-centric representations, akin to human cognition, remains a challenging task. Existing approaches facilitate object discovery by representing objects as fixed-size vectors, called ``slots'' or ``object files''. While these approaches have shown promise in certain scenarios, they still exhibit certain limitations. First, they rely on architectural priors which can be unreliable and usually require meticulous engineering to identify the correct objects. Second, there has been a notable gap in investigating the practical utility of these representations in downstream tasks. To address the first limitation, we introduce a method that explicitly optimizes the constraint that each object in a scene should be associated with a distinct slot. We formalize this constraint by introducing consistency objectives which are cyclic in nature. By integrating these consistency objectives into various existing slot-based object-centric methods, we showcase substantial improvements in object-discovery performance. These enhancements consistently hold true across both synthetic and real-world scenes, underscoring the effectiveness and adaptability of the proposed approach. To tackle the second limitation, we apply the learned object-centric representations from the proposed method to two downstream reinforcement learning tasks, demonstrating considerable performance enhancements compared to conventional slot-based and monolithic representation learning methods. Our results suggest that the proposed approach not only improves object discovery, but also provides richer features for downstream tasks.
    Pruning Convolutional Filters via Reinforcement Learning with Entropy Minimization. (arXiv:2312.04918v1 [cs.LG])
    Structural pruning has become an integral part of neural network optimization, used to achieve architectural configurations which can be deployed and run more efficiently on embedded devices. Previous results showed that pruning is possible with minimum performance loss by utilizing a reinforcement learning agent which makes decisions about the sparsity level of each neural layer by maximizing as a reward the accuracy of the network. We introduce a novel information-theoretic reward function which minimizes the spatial entropy of convolutional activations. This minimization ultimately acts as a proxy for maintaining accuracy, although these two criteria are not related in any way. Our method shows that there is another possibility to preserve accuracy without the need to directly optimize it in the agent's reward function. In our experiments, we were able to reduce the total number of FLOPS of multiple popular neural network architectures by 5-10x, incurring minimal or no performance drop and being on par with the solution found by maximizing the accuracy.
    Accelerating Convolutional Neural Network Pruning via Spatial Aura Entropy. (arXiv:2312.04926v1 [cs.CV])
    In recent years, pruning has emerged as a popular technique to reduce the computational complexity and memory footprint of Convolutional Neural Network (CNN) models. Mutual Information (MI) has been widely used as a criterion for identifying unimportant filters to prune. However, existing methods for MI computation suffer from high computational cost and sensitivity to noise, leading to suboptimal pruning performance. We propose a novel method to improve MI computation for CNN pruning, using the spatial aura entropy. The spatial aura entropy is useful for evaluating the heterogeneity in the distribution of the neural activations over a neighborhood, providing information about local features. Our method effectively improves the MI computation for CNN pruning, leading to more robust and efficient pruning. Experimental results on the CIFAR-10 benchmark dataset demonstrate the superiority of our approach in terms of pruning performance and computational efficiency.
    Gradient-free online learning of subgrid-scale dynamics with neural emulators. (arXiv:2310.19385v3 [physics.comp-ph] UPDATED)
    In this paper, we propose a generic algorithm to train machine learning-based subgrid parametrizations online, i.e., with a posteriori loss functions, but for non-differentiable numerical solvers. The proposed approach leverages a neural emulator to approximate the reduced state-space solver, which is then used to allow gradient propagation through temporal integration steps. We apply this methodology on a single layer quasi-geostrophic system with topography, known to be highly unstable in around 500 temporal iterations with offline strategies. Using our algorithm, we are able to train a parametrization that recovers most of the benefits of online strategies without having to compute the gradient of the original solver. It is demonstrated that training the neural emulator and parametrization components separately with different loss quantities is necessary in order to minimize the propagation of approximation biases. Experiments on emulator architectures with different complexities also indicates that emulator performance is key in order to learn an accurate parametrization. This work is a step towards learning parametrization with online strategies for realistic climate models.
    Object-Centric Learning for Real-World Videos by Predicting Temporal Feature Similarities. (arXiv:2306.04829v2 [cs.CV] UPDATED)
    Unsupervised video-based object-centric learning is a promising avenue to learn structured representations from large, unlabeled video collections, but previous approaches have only managed to scale to real-world datasets in restricted domains. Recently, it was shown that the reconstruction of pre-trained self-supervised features leads to object-centric representations on unconstrained real-world image datasets. Building on this approach, we propose a novel way to use such pre-trained features in the form of a temporal feature similarity loss. This loss encodes semantic and temporal correlations between image patches and is a natural way to introduce a motion bias for object discovery. We demonstrate that this loss leads to state-of-the-art performance on the challenging synthetic MOVi datasets. When used in combination with the feature reconstruction loss, our model is the first object-centric video model that scales to unconstrained video datasets such as YouTube-VIS.
    FreeDrag: Feature Dragging for Reliable Point-based Image Editing. (arXiv:2307.04684v3 [cs.CV] UPDATED)
    To serve the intricate and varied demands of image editing, precise and flexible manipulation in image content is indispensable. Recently, Drag-based editing methods have gained impressive performance. However, these methods predominantly center on point dragging, resulting in two noteworthy drawbacks, namely "miss tracking", where difficulties arise in accurately tracking the predetermined handle points, and "ambiguous tracking", where tracked points are potentially positioned in wrong regions that closely resemble the handle points. To address the above issues, we propose FreeDrag, a feature dragging methodology designed to free the burden on point tracking. The FreeDrag incorporates two key designs, i.e., template feature via adaptive updating and line search with backtracking, the former improves the stability against drastic content change by elaborately controls feature updating scale after each dragging, while the latter alleviates the misguidance from similar points by actively restricting the search area in a line. These two technologies together contribute to a more stable semantic dragging with higher efficiency. Comprehensive experimental results substantiate that our approach significantly outperforms pre-existing methodologies, offering reliable point-based editing even in various complex scenarios.
    Physics-Informed Convolutional Autoencoder for Cyber Anomaly Detection in Power Distribution Grids. (arXiv:2312.04758v1 [eess.SY])
    The growing trend toward the modernization of power distribution systems has facilitated the installation of advanced measurement units and promotion of the cyber communication systems. However, these infrastructures are still prone to stealth cyber attacks. The existing data-driven anomaly detection methods suffer from a lack of knowledge about the system's physics, lack of interpretability, and scalability issues hindering their practical applications in real-world scenarios. To address these concerns, physics-informed neural networks (PINNs) were introduced. This paper proposes a multivariate physics-informed convolutional autoencoder (PIConvAE) to detect stealthy cyber-attacks in power distribution grids. The proposed model integrates the physical principles into the loss function of the neural network by applying Kirchhoff's law. Simulations are performed on the modified IEEE 13-bus and 123-bus systems using OpenDSS software to validate the efficacy of the proposed model for stealth attacks. The numerical results prove the superior performance of the proposed PIConvAE in three aspects: a) it provides more accurate results compared to the data-driven ConvAE model, b) it requires less training time to converge c) the model excels in effectively detecting a wide range of attack magnitudes making it powerful in detecting stealth attacks.
    SmartMask: Context Aware High-Fidelity Mask Generation for Fine-grained Object Insertion and Layout Control. (arXiv:2312.05039v1 [cs.CV])
    The field of generative image inpainting and object insertion has made significant progress with the recent advent of latent diffusion models. Utilizing a precise object mask can greatly enhance these applications. However, due to the challenges users encounter in creating high-fidelity masks, there is a tendency for these methods to rely on more coarse masks (e.g., bounding box) for these applications. This results in limited control and compromised background content preservation. To overcome these limitations, we introduce SmartMask, which allows any novice user to create detailed masks for precise object insertion. Combined with a ControlNet-Inpaint model, our experiments demonstrate that SmartMask achieves superior object insertion quality, preserving the background content more effectively than previous methods. Notably, unlike prior works the proposed approach can also be used even without user-mask guidance, which allows it to perform mask-free object insertion at diverse positions and scales. Furthermore, we find that when used iteratively with a novel instruction-tuning based planning model, SmartMask can be used to design detailed layouts from scratch. As compared with user-scribble based layout design, we observe that SmartMask allows for better quality outputs with layout-to-image generation methods. Project page is available at https://smartmask-gen.github.io
    Fixing Overconfidence in Dynamic Neural Networks. (arXiv:2302.06359v4 [cs.LG] UPDATED)
    Dynamic neural networks are a recent technique that promises a remedy for the increasing size of modern deep learning models by dynamically adapting their computational cost to the difficulty of the inputs. In this way, the model can adjust to a limited computational budget. However, the poor quality of uncertainty estimates in deep learning models makes it difficult to distinguish between hard and easy samples. To address this challenge, we present a computationally efficient approach for post-hoc uncertainty quantification in dynamic neural networks. We show that adequately quantifying and accounting for both aleatoric and epistemic uncertainty through a probabilistic treatment of the last layers improves the predictive performance and aids decision-making when determining the computational budget. In the experiments, we show improvements on CIFAR-100, ImageNet, and Caltech-256 in terms of accuracy, capturing uncertainty, and calibration error.
    Novel Fundus Image Preprocessing for Retcam Images to Improve Deep Learning Classification of Retinopathy of Prematurity. (arXiv:2302.02524v4 [eess.IV] UPDATED)
    Retinopathy of Prematurity (ROP) is a potentially blinding eye disorder because of damage to the eye's retina which can affect babies born prematurely. Screening of ROP is essential for early detection and treatment. This is a laborious and manual process which requires trained physician performing dilated ophthalmological examination which can be subjective resulting in lower diagnosis success for clinically significant disease. Automated diagnostic methods can assist ophthalmologists increase diagnosis accuracy using deep learning. Several research groups have highlighted various approaches. Captured ROP Retcam images suffer from poor quality. This paper proposes the use of improved novel fundus preprocessing methods using pretrained transfer learning frameworks to create hybrid models to give higher diagnosis accuracy. Once trained and validated, the evaluations showed that these novel methods in comparison to traditional imaging processing contribute to better and in many aspects higher accuracy in classifying Plus disease, Stages of ROP and Zones in comparison to peer papers.
    Label Words are Anchors: An Information Flow Perspective for Understanding In-Context Learning. (arXiv:2305.14160v2 [cs.CL] UPDATED)
    In-context learning (ICL) emerges as a promising capability of large language models (LLMs) by providing them with demonstration examples to perform diverse tasks. However, the underlying mechanism of how LLMs learn from the provided context remains under-explored. In this paper, we investigate the working mechanism of ICL through an information flow lens. Our findings reveal that label words in the demonstration examples function as anchors: (1) semantic information aggregates into label word representations during the shallow computation layers' processing; (2) the consolidated information in label words serves as a reference for LLMs' final predictions. Based on these insights, we introduce an anchor re-weighting method to improve ICL performance, a demonstration compression technique to expedite inference, and an analysis framework for diagnosing ICL errors in GPT2-XL. The promising applications of our findings again validate the uncovered ICL working mechanism and pave the way for future studies.
    Few-Shot Class-Incremental Learning via Training-Free Prototype Calibration. (arXiv:2312.05229v1 [cs.CV])
    Real-world scenarios are usually accompanied by continuously appearing classes with scare labeled samples, which require the machine learning model to incrementally learn new classes and maintain the knowledge of base classes. In this Few-Shot Class-Incremental Learning (FSCIL) scenario, existing methods either introduce extra learnable components or rely on a frozen feature extractor to mitigate catastrophic forgetting and overfitting problems. However, we find a tendency for existing methods to misclassify the samples of new classes into base classes, which leads to the poor performance of new classes. In other words, the strong discriminability of base classes distracts the classification of new classes. To figure out this intriguing phenomenon, we observe that although the feature extractor is only trained on base classes, it can surprisingly represent the semantic similarity between the base and unseen new classes. Building upon these analyses, we propose a simple yet effective Training-frEE calibratioN (TEEN) strategy to enhance the discriminability of new classes by fusing the new prototypes (i.e., mean features of a class) with weighted base prototypes. In addition to standard benchmarks in FSCIL, TEEN demonstrates remarkable performance and consistent improvements over baseline methods in the few-shot learning scenario. Code is available at: https://github.com/wangkiw/TEEN
    TaskMet: Task-Driven Metric Learning for Model Learning. (arXiv:2312.05250v1 [cs.LG])
    Deep learning models are often deployed in downstream tasks that the training procedure may not be aware of. For example, models solely trained to achieve accurate predictions may struggle to perform well on downstream tasks because seemingly small prediction errors may incur drastic task errors. The standard end-to-end learning approach is to make the task loss differentiable or to introduce a differentiable surrogate that the model can be trained on. In these settings, the task loss needs to be carefully balanced with the prediction loss because they may have conflicting objectives. We propose take the task loss signal one level deeper than the parameters of the model and use it to learn the parameters of the loss function the model is trained on, which can be done by learning a metric in the prediction space. This approach does not alter the optimal prediction model itself, but rather changes the model learning to emphasize the information important for the downstream task. This enables us to achieve the best of both worlds: a prediction model trained in the original prediction space while also being valuable for the desired downstream task. We validate our approach through experiments conducted in two main settings: 1) decision-focused model learning scenarios involving portfolio optimization and budget allocation, and 2) reinforcement learning in noisy environments with distracting states. The source code to reproduce our experiments is available at https://github.com/facebookresearch/taskmet
    Neural Spectral Methods: Self-supervised learning in the spectral domain. (arXiv:2312.05225v1 [cs.LG])
    We present Neural Spectral Methods, a technique to solve parametric Partial Differential Equations (PDEs), grounded in classical spectral methods. Our method uses orthogonal bases to learn PDE solutions as mappings between spectral coefficients. In contrast to current machine learning approaches which enforce PDE constraints by minimizing the numerical quadrature of the residuals in the spatiotemporal domain, we leverage Parseval's identity and introduce a new training strategy through a \textit{spectral loss}. Our spectral loss enables more efficient differentiation through the neural network, and substantially reduces training complexity. At inference time, the computational cost of our method remains constant, regardless of the spatiotemporal resolution of the domain. Our experimental results demonstrate that our method significantly outperforms previous machine learning approaches in terms of speed and accuracy by one to two orders of magnitude on multiple different problems. When compared to numerical solvers of the same accuracy, our method demonstrates a $10\times$ increase in performance speed.
    A graphon-signal analysis of graph neural networks. (arXiv:2305.15987v2 [cs.LG] UPDATED)
    We present an approach for analyzing message passing graph neural networks (MPNNs) based on an extension of graphon analysis to a so called graphon-signal analysis. A MPNN is a function that takes a graph and a signal on the graph (a graph-signal) and returns some value. Since the input space of MPNNs is non-Euclidean, i.e., graphs can be of any size and topology, properties such as generalization are less well understood for MPNNs than for Euclidean neural networks. We claim that one important missing ingredient in past work is a meaningful notion of graph-signal similarity measure, that endows the space of inputs to MPNNs with a regular structure. We present such a similarity measure, called the graphon-signal cut distance, which makes the space of all graph-signals a dense subset of a compact metric space -- the graphon-signal space. Informally, two deterministic graph-signals are close in cut distance if they ``look like'' they were sampled from the same random graph-signal model. Hence, our cut distance is a natural notion of graph-signal similarity, which allows comparing any pair of graph-signals of any size and topology. We prove that MPNNs are Lipschitz continuous functions over the graphon-signal metric space. We then give two applications of this result: 1) a generalization bound for MPNNs, and, 2) the stability of MPNNs to subsampling of graph-signals. Our results apply to any regular enough MPNN on any distribution of graph-signals, making the analysis rather universal.
    Short-term prediction of construction waste transport activities using AI-Truck. (arXiv:2312.04609v1 [cs.LG])
    Construction waste hauling trucks (or `slag trucks') are among the most commonly seen heavy-duty vehicles in urban streets, which not only produce significant NOx and PM emissions but are also a major source of on-road and on-site fugitive dust. Slag trucks are subject to a series of spatial and temporal access restrictions by local traffic and environmental policies. This paper addresses the practical problem of predicting slag truck activity at a city scale during heavy pollution episodes, such that environmental law enforcement units can take timely and proactive measures against localized truck aggregation. A deep ensemble learning framework (coined AI-Truck) is designed, which employs a soft vote integrator that utilizes BI-LSTM, TCN, STGCN, and PDFormer as base classifiers to predict the level of slag truck activities at a resolution of 1km$\times$1km, in a 193 km$^2$ area in Chengdu, China. As a classifier, AI-Truck yields a Macro f1 close to 80\% for 0.5h- and 1h-prediction.
    Deep Learning-Based Pilotless Spatial Multiplexing. (arXiv:2312.05158v1 [eess.SP])
    This paper investigates the feasibility of machine learning (ML)-based pilotless spatial multiplexing in multiple-input and multiple-output (MIMO) communication systems. Especially, it is shown that by training the transmitter and receiver jointly, the transmitter can learn such constellation shapes for the spatial streams which facilitate completely blind separation and detection by the simultaneously learned receiver. To the best of our knowledge, this is the first time ML-based spatial multiplexing without channel estimation pilots is demonstrated. The results show that the learned pilotless scheme can outperform a conventional pilot-based system by as much as 15-20% in terms of spectral efficiency, depending on the modulation order and signal-to-noise ratio.
    PFLlib: Personalized Federated Learning Algorithm Library. (arXiv:2312.04992v1 [cs.LG])
    Amid the ongoing advancements in Federated Learning (FL), a machine learning paradigm that allows collaborative learning with data privacy protection, personalized FL (pFL) has gained significant prominence as a research direction within the FL domain. Whereas traditional FL (tFL) focuses on jointly learning a global model, pFL aims to achieve a balance between the global and personalized objectives of each client in FL settings. To foster the pFL research community, we propose PFLlib, a comprehensive pFL algorithm library with an integrated evaluation platform. In PFLlib, We implement 34 state-of-the-art FL algorithms (including 7 classic tFL algorithms and 27 pFL algorithms) and provide various evaluation environments with three statistically heterogeneous scenarios and 14 datasets. At present, PFLlib has already gained 850 stars and 199 forks on GitHub.
    Reinforcement Learning-Based Bionic Reflex Control for Anthropomorphic Robotic Grasping exploiting Domain Randomization. (arXiv:2312.05023v1 [cs.RO])
    Achieving human-level dexterity in robotic grasping remains a challenging endeavor. Robotic hands frequently encounter slippage and deformation during object manipulation, issues rarely encountered by humans due to their sensory receptors, experiential learning, and motor memory. The emulation of the human grasping reflex within robotic hands is referred to as the ``bionic reflex". Past endeavors in the realm of bionic reflex control predominantly relied on model-based and supervised learning approaches, necessitating human intervention during thresholding and labeling tasks. In this study, we introduce an innovative bionic reflex control pipeline, leveraging reinforcement learning (RL); thereby eliminating the need for human intervention during control design. Our proposed bionic reflex controller has been designed and tested on an anthropomorphic hand, manipulating deformable objects in the PyBullet physics simulator, incorporating domain randomization (DR) for enhanced Sim2Real transferability. Our findings underscore the promise of RL as a potent tool for advancing bionic reflex control within anthropomorphic robotic hands. We anticipate that this autonomous, RL-based bionic reflex controller will catalyze the development of dependable and highly efficient robotic and prosthetic hands, revolutionizing human-robot interaction and assistive technologies.
    Interpretable Classification of Early Stage Parkinson's Disease from EEG. (arXiv:2301.09568v2 [q-bio.NC] UPDATED)
    Detecting Parkinson's Disease in its early stages using EEG data presents a significant challenge. This paper introduces a novel approach, representing EEG data as a 15-variate series of bandpower and peak frequency values/coefficients. The hypothesis is that this representation captures essential information from the noisy EEG signal, improving disease detection. Statistical features extracted from this representation are utilised as input for interpretable machine learning models, specifically Decision Tree and AdaBoost classifiers. Our classification pipeline is deployed within our proposed framework which enables high-importance data types and brain regions for classification to be identified. Interestingly, our analysis reveals that while there is no significant regional importance, the N1 sleep data type exhibits statistically significant predictive power (p < 0.01) for early-stage Parkinson's Disease classification. AdaBoost classifiers trained on the N1 data type consistently outperform baseline models, achieving over 80% accuracy and recall. Our classification pipeline statistically significantly outperforms baseline models indicating that the model has acquired useful information. Paired with the interpretability (ability to view feature importance's) of our pipeline this enables us to generate meaningful insights into the classification of early stage Parkinson's with our N1 models. In Future, these models could be deployed in the real world - the results presented in this paper indicate that more than 3 in 4 early-stage Parkinson's cases would be captured with our pipeline.
    UniTSA: A Universal Reinforcement Learning Framework for V2X Traffic Signal Control. (arXiv:2312.05090v1 [eess.SY])
    Traffic congestion is a persistent problem in urban areas, which calls for the development of effective traffic signal control (TSC) systems. While existing Reinforcement Learning (RL)-based methods have shown promising performance in optimizing TSC, it is challenging to generalize these methods across intersections of different structures. In this work, a universal RL-based TSC framework is proposed for Vehicle-to-Everything (V2X) environments. The proposed framework introduces a novel agent design that incorporates a junction matrix to characterize intersection states, making the proposed model applicable to diverse intersections. To equip the proposed RL-based framework with enhanced capability of handling various intersection structures, novel traffic state augmentation methods are tailor-made for signal light control systems. Finally, extensive experimental results derived from multiple intersection configurations confirm the effectiveness of the proposed framework. The source code in this work is available at https://github.com/wmn7/Universal_Light
    Remembering to Be Fair: On Non-Markovian Fairness in Sequential DecisionMaking (Preliminary Report). (arXiv:2312.04772v1 [cs.AI])
    Fair decision making has largely been studied with respect to a single decision. In this paper we investigate the notion of fairness in the context of sequential decision making where multiple stakeholders can be affected by the outcomes of decisions, and where decision making may be informed by additional constraints and criteria beyond the requirement of fairness. In this setting, we observe that fairness often depends on the history of the sequential decision-making process and not just on the current state. To advance our understanding of this class of fairness problems, we define the notion of non-Markovian fairness in the context of sequential decision making. We identify properties of non-Markovian fairness, including notions of long-term, anytime, periodic, and bounded fairness. We further explore the interplay between non-Markovian fairness and memory, and how this can support construction of fair policies in sequential decision-making settings.
    Unbiased Filtering Of Accidental Clicks in Verizon Media Native Advertising. (arXiv:2312.05017v1 [cs.IR])
    Verizon Media (VZM) native advertising is one of VZM largest and fastest growing businesses, reaching a run-rate of several hundred million USDs in the past year. Driving the VZM native models that are used to predict event probabilities, such as click and conversion probabilities, is OFFSET - a feature enhanced collaborative-filtering based event-prediction algorithm. In this work we focus on the challenge of predicting click-through rates (CTR) when we are aware that some of the clicks have short dwell-time and are defined as accidental clicks. An accidental click implies little affinity between the user and the ad, so predicting that similar users will click on the ad is inaccurate. Therefore, it may be beneficial to remove clicks with dwell-time lower than a predefined threshold from the training set. However, we cannot ignore these positive events, as filtering these will cause the model to under predict. Previous approaches have tried to apply filtering and then adding corrective biases to the CTR predictions, but did not yield revenue lifts and therefore were not adopted. In this work, we present a new approach where the positive weight of the accidental clicks is distributed among all of the negative events (skips), based on their likelihood of causing accidental clicks, as predicted by an auxiliary model. These likelihoods are taken as the correct labels of the negative events, shifting our training from using only binary labels and adopting a binary cross-entropy loss function in our training process. After showing offline performance improvements, the modified model was tested online serving VZM native users, and provided 1.18% revenue lift over the production model which is agnostic to accidental clicks.
    KwaiAgents: Generalized Information-seeking Agent System with Large Language Models. (arXiv:2312.04889v1 [cs.AI])
    Driven by curiosity, humans have continually sought to explore and understand the world around them, leading to the invention of various tools to satiate this inquisitiveness. Despite not having the capacity to process and memorize vast amounts of information in their brains, humans excel in critical thinking, planning, reflection, and harnessing available tools to interact with and interpret the world, enabling them to find answers efficiently. The recent advancements in large language models (LLMs) suggest that machines might also possess the aforementioned human-like capabilities, allowing them to exhibit powerful abilities even with a constrained parameter count. In this paper, we introduce KwaiAgents, a generalized information-seeking agent system based on LLMs. Within KwaiAgents, we propose an agent system that employs LLMs as its cognitive core, which is capable of understanding a user's query, behavior guidelines, and referencing external documents. The agent can also update and retrieve information from its internal memory, plan and execute actions using a time-aware search-browse toolkit, and ultimately provide a comprehensive response. We further investigate the system's performance when powered by LLMs less advanced than GPT-4, and introduce the Meta-Agent Tuning (MAT) framework, designed to ensure even an open-sourced 7B or 13B model performs well among many agent systems. We exploit both benchmark and human evaluations to systematically validate these capabilities. Extensive experiments show the superiority of our agent system compared to other autonomous agents and highlight the enhanced generalized agent-abilities of our fine-tuned LLMs.
    AI Competitions and Benchmarks: Competition platforms. (arXiv:2312.05185v1 [cs.LG])
    The ecosystem of artificial intelligence competitions is a diverse and multifaceted landscape, encompassing a variety of platforms that each host numerous competitions annually, alongside a plethora of specialized websites dedicated to singular contests. These platforms adeptly manage the overarching administrative responsibilities inherent in orchestrating competitions, thus affording organizers the liberty to allocate greater attention to other facets of their contests. Notably, these platforms exhibit considerable diversity in their operational functionalities, economic models, and community dynamics. This chapter conducts an extensive review of the foremost services in this realm and elucidates several alternative methodologies that facilitate the independent hosting of such challenges. Keywords: competition platform, challenge hosting services, comparison.
    RanPAC: Random Projections and Pre-trained Models for Continual Learning. (arXiv:2307.02251v2 [cs.LG] UPDATED)
    Continual learning (CL) aims to incrementally learn different tasks (such as classification) in a non-stationary data stream without forgetting old ones. Most CL works focus on tackling catastrophic forgetting under a learning-from-scratch paradigm. However, with the increasing prominence of foundation models, pre-trained models equipped with informative representations have become available for various downstream requirements. Several CL methods based on pre-trained models have been explored, either utilizing pre-extracted features directly (which makes bridging distribution gaps challenging) or incorporating adaptors (which may be subject to forgetting). In this paper, we propose a concise and effective approach for CL with pre-trained models. Given that forgetting occurs during parameter updating, we contemplate an alternative approach that exploits training-free random projectors and class-prototype accumulation, which thus bypasses the issue. Specifically, we inject a frozen Random Projection layer with nonlinear activation between the pre-trained model's feature representations and output head, which captures interactions between features with expanded dimensionality, providing enhanced linear separability for class-prototype-based CL. We also demonstrate the importance of decorrelating the class-prototypes to reduce the distribution disparity when using pre-trained representations. These techniques prove to be effective and circumvent the problem of forgetting for both class- and domain-incremental continual learning. Compared to previous methods applied to pre-trained ViT-B/16 models, we reduce final error rates by between 10% and 62% on seven class-incremental benchmarks, despite not using any rehearsal memory. We conclude that the full potential of pre-trained models for simple, effective, and fast CL has not hitherto been fully tapped. Code is at github.com/RanPAC/RanPAC.
    Assessing Neural Network Representations During Training Using Noise-Resilient Diffusion Spectral Entropy. (arXiv:2312.04823v1 [cs.CV])
    Entropy and mutual information in neural networks provide rich information on the learning process, but they have proven difficult to compute reliably in high dimensions. Indeed, in noisy and high-dimensional data, traditional estimates in ambient dimensions approach a fixed entropy and are prohibitively hard to compute. To address these issues, we leverage data geometry to access the underlying manifold and reliably compute these information-theoretic measures. Specifically, we define diffusion spectral entropy (DSE) in neural representations of a dataset as well as diffusion spectral mutual information (DSMI) between different variables representing data. First, we show that they form noise-resistant measures of intrinsic dimensionality and relationship strength in high-dimensional simulated data that outperform classic Shannon entropy, nonparametric estimation, and mutual information neural estimation (MINE). We then study the evolution of representations in classification networks with supervised learning, self-supervision, or overfitting. We observe that (1) DSE of neural representations increases during training; (2) DSMI with the class label increases during generalizable learning but stays stagnant during overfitting; (3) DSMI with the input signal shows differing trends: on MNIST it increases, while on CIFAR-10 and STL-10 it decreases. Finally, we show that DSE can be used to guide better network initialization and that DSMI can be used to predict downstream classification accuracy across 962 models on ImageNet. The official implementation is available at https://github.com/ChenLiu-1996/DiffusionSpectralEntropy.
    Bayesian data fusion with shared priors. (arXiv:2212.07311v2 [cs.LG] UPDATED)
    The integration of data and knowledge from several sources is known as data fusion. When data is only available in a distributed fashion or when different sensors are used to infer a quantity of interest, data fusion becomes essential. In Bayesian settings, a priori information of the unknown quantities is available and, possibly, present among the different distributed estimators. When the local estimates are fused, the prior knowledge used to construct several local posteriors might be overused unless the fusion node accounts for this and corrects it. In this paper, we analyze the effects of shared priors in Bayesian data fusion contexts. Depending on different common fusion rules, our analysis helps to understand the performance behavior as a function of the number of collaborative agents and as a consequence of different types of priors. The analysis is performed by using two divergences which are common in Bayesian inference, and the generality of the results allows to analyze very generic distributions. These theoretical results are corroborated through experiments in a variety of estimation and classification problems, including linear and nonlinear models, and federated learning schemes.
    PAC-Bayes Generalization Certificates for Learned Inductive Conformal Prediction. (arXiv:2312.04658v1 [cs.LG])
    Inductive Conformal Prediction (ICP) provides a practical and effective approach for equipping deep learning models with uncertainty estimates in the form of set-valued predictions which are guaranteed to contain the ground truth with high probability. Despite the appeal of this coverage guarantee, these sets may not be efficient: the size and contents of the prediction sets are not directly controlled, and instead depend on the underlying model and choice of score function. To remedy this, recent work has proposed learning model and score function parameters using data to directly optimize the efficiency of the ICP prediction sets. While appealing, the generalization theory for such an approach is lacking: direct optimization of empirical efficiency may yield prediction sets that are either no longer efficient on test data, or no longer obtain the required coverage on test data. In this work, we use PAC-Bayes theory to obtain generalization bounds on both the coverage and the efficiency of set-valued predictors which can be directly optimized to maximize efficiency while satisfying a desired test coverage. In contrast to prior work, our framework allows us to utilize the entire calibration dataset to learn the parameters of the model and score function, instead of requiring a separate hold-out set for obtaining test-time coverage guarantees. We leverage these theoretical results to provide a practical algorithm for using calibration data to simultaneously fine-tune the parameters of a model and score function while guaranteeing test-time coverage and efficiency of the resulting prediction sets. We evaluate the approach on regression and classification tasks, and outperform baselines calibrated using a Hoeffding bound-based PAC guarantee on ICP, especially in the low-data regime.
    Error Discovery by Clustering Influence Embeddings. (arXiv:2312.04712v1 [cs.LG])
    We present a method for identifying groups of test examples -- slices -- on which a model under-performs, a task now known as slice discovery. We formalize coherence -- a requirement that erroneous predictions, within a slice, should be wrong for the same reason -- as a key property that any slice discovery method should satisfy. We then use influence functions to derive a new slice discovery method, InfEmbed, which satisfies coherence by returning slices whose examples are influenced similarly by the training data. InfEmbed is simple, and consists of applying K-Means clustering to a novel representation we deem influence embeddings. We show InfEmbed outperforms current state-of-the-art methods on 2 benchmarks, and is effective for model debugging across several case studies.
    Learning Thresholds with Latent Values and Censored Feedback. (arXiv:2312.04653v1 [cs.LG])
    In this paper, we investigate a problem of actively learning threshold in latent space, where the unknown reward $g(\gamma, v)$ depends on the proposed threshold $\gamma$ and latent value $v$ and it can be $only$ achieved if the threshold is lower than or equal to the unknown latent value. This problem has broad applications in practical scenarios, e.g., reserve price optimization in online auctions, online task assignments in crowdsourcing, setting recruiting bars in hiring, etc. We first characterize the query complexity of learning a threshold with the expected reward at most $\epsilon$ smaller than the optimum and prove that the number of queries needed can be infinitely large even when $g(\gamma, v)$ is monotone with respect to both $\gamma$ and $v$. On the positive side, we provide a tight query complexity $\tilde{\Theta}(1/\epsilon^3)$ when $g$ is monotone and the CDF of value distribution is Lipschitz. Moreover, we show a tight $\tilde{\Theta}(1/\epsilon^3)$ query complexity can be achieved as long as $g$ satisfies one-sided Lipschitzness, which provides a complete characterization for this problem. Finally, we extend this model to an online learning setting and demonstrate a tight $\Theta(T^{2/3})$ regret bound using continuous-arm bandit techniques and the aforementioned query complexity results.
    Sequential inductive prediction intervals. (arXiv:2312.04950v1 [stat.ME])
    In this paper we explore the concept of sequential inductive prediction intervals using theory from sequential testing. We furthermore introduce a 3-parameter PAC definition of prediction intervals that allows us via simulation to achieve almost sharp bounds with high probability.
    Truncated Affinity Maximization: One-class Homophily Modeling for Graph Anomaly Detection. (arXiv:2306.00006v4 [cs.SI] UPDATED)
    We reveal a one-class homophily phenomenon, which is one prevalent property we find empirically in real-world graph anomaly detection (GAD) datasets, i.e., normal nodes tend to have strong connection/affinity with each other, while the homophily in abnormal nodes is significantly weaker than normal nodes. However, this anomaly-discriminative property is ignored by existing GAD methods that are typically built using a conventional anomaly detection objective, such as data reconstruction. In this work, we explore this property to introduce a novel unsupervised anomaly scoring measure for GAD, local node affinity, that assigns a larger anomaly score to nodes that are less affiliated with their neighbors, with the affinity defined as similarity on node attributes/representations. We further propose Truncated Affinity Maximization (TAM) that learns tailored node representations for our anomaly measure by maximizing the local affinity of nodes to their neighbors. Optimizing on the original graph structure can be biased by nonhomophily edges (i.e., edges connecting normal and abnormal nodes). Thus, TAM is instead optimized on truncated graphs where non-homophily edges are removed iteratively to mitigate this bias. The learned representations result in significantly stronger local affinity for normal nodes than abnormal nodes. Extensive empirical results on 10 real-world GAD datasets show that TAM substantially outperforms seven competing models, achieving over 10% increase in AUROC/AUPRC compared to the best contenders on challenging datasets. Our code is available at https://github.com/mala-lab/TAM-master/.
    Purple Llama CyberSecEval: A Secure Coding Benchmark for Language Models. (arXiv:2312.04724v1 [cs.CR])
    This paper presents CyberSecEval, a comprehensive benchmark developed to help bolster the cybersecurity of Large Language Models (LLMs) employed as coding assistants. As what we believe to be the most extensive unified cybersecurity safety benchmark to date, CyberSecEval provides a thorough evaluation of LLMs in two crucial security domains: their propensity to generate insecure code and their level of compliance when asked to assist in cyberattacks. Through a case study involving seven models from the Llama 2, Code Llama, and OpenAI GPT large language model families, CyberSecEval effectively pinpointed key cybersecurity risks. More importantly, it offered practical insights for refining these models. A significant observation from the study was the tendency of more advanced models to suggest insecure code, highlighting the critical need for integrating security considerations in the development of sophisticated LLMs. CyberSecEval, with its automated test case generation and evaluation pipeline covers a broad scope and equips LLM designers and researchers with a tool to broadly measure and enhance the cybersecurity safety properties of LLMs, contributing to the development of more secure AI systems.
    Not All Negatives AreWorth Attending to: Meta-Bootstrapping Negative Sampling Framework for Link Prediction. (arXiv:2312.04815v1 [cs.LG])
    The rapid development of graph neural networks (GNNs) encourages the rising of link prediction, achieving promising performance with various applications. Unfortunately, through a comprehensive analysis, we surprisingly find that current link predictors with dynamic negative samplers (DNSs) suffer from the migration phenomenon between "easy" and "hard" samples, which goes against the preference of DNS of choosing "hard" negatives, thus severely hindering capability. Towards this end, we propose the MeBNS framework, serving as a general plugin that can potentially improve current negative sampling based link predictors. In particular, we elaborately devise a Meta-learning Supported Teacher-student GNN (MST-GNN) that is not only built upon teacher-student architecture for alleviating the migration between "easy" and "hard" samples but also equipped with a meta learning based sample re-weighting module for helping the student GNN distinguish "hard" samples in a fine-grained manner. To effectively guide the learning of MST-GNN, we prepare a Structure enhanced Training Data Generator (STD-Generator) and an Uncertainty based Meta Data Collector (UMD-Collector) for supporting the teacher and student GNN, respectively. Extensive experiments show that the MeBNS achieves remarkable performance across six link prediction benchmark datasets.
    A Brief Tutorial on Sample Size Calculations for Fairness Audits. (arXiv:2312.04745v1 [stat.AP])
    In fairness audits, a standard objective is to detect whether a given algorithm performs substantially differently between subgroups. Properly powering the statistical analysis of such audits is crucial for obtaining informative fairness assessments, as it ensures a high probability of detecting unfairness when it exists. However, limited guidance is available on the amount of data necessary for a fairness audit, lacking directly applicable results concerning commonly used fairness metrics. Additionally, the consideration of unequal subgroup sample sizes is also missing. In this tutorial, we address these issues by providing guidance on how to determine the required subgroup sample sizes to maximize the statistical power of hypothesis tests for detecting unfairness. Our findings are applicable to audits of binary classification models and multiple fairness metrics derived as summaries of the confusion matrix. Furthermore, we discuss other aspects of audit study designs that can increase the reliability of audit results.
    Reconstructive Neuron Pruning for Backdoor Defense. (arXiv:2305.14876v2 [cs.LG] UPDATED)
    Deep neural networks (DNNs) have been found to be vulnerable to backdoor attacks, raising security concerns about their deployment in mission-critical applications. While existing defense methods have demonstrated promising results, it is still not clear how to effectively remove backdoor-associated neurons in backdoored DNNs. In this paper, we propose a novel defense called \emph{Reconstructive Neuron Pruning} (RNP) to expose and prune backdoor neurons via an unlearning and then recovering process. Specifically, RNP first unlearns the neurons by maximizing the model's error on a small subset of clean samples and then recovers the neurons by minimizing the model's error on the same data. In RNP, unlearning is operated at the neuron level while recovering is operated at the filter level, forming an asymmetric reconstructive learning procedure. We show that such an asymmetric process on only a few clean samples can effectively expose and prune the backdoor neurons implanted by a wide range of attacks, achieving a new state-of-the-art defense performance. Moreover, the unlearned model at the intermediate step of our RNP can be directly used to improve other backdoor defense tasks including backdoor removal, trigger recovery, backdoor label detection, and backdoor sample detection. Code is available at \url{https://github.com/bboylyg/RNP}.
    Autoencoding Labeled Interpolator, Inferring Parameters From Image, And Image From Parameters. (arXiv:2312.04640v1 [astro-ph.HE])
    The Event Horizon Telescope (EHT) provides an avenue to study black hole accretion flows on event-horizon scales. Fitting a semi-analytical model to EHT observations requires the construction of synthetic images, which is computationally expensive. This study presents an image generation tool in the form of a generative machine learning model, which extends the capabilities of a variational autoencoder. This tool can rapidly and continuously interpolate between a training set of images and can retrieve the defining parameters of those images. Trained on a set of synthetic black hole images, our tool showcases success in both interpolating black hole images and their associated physical parameters. By reducing the computational cost of generating an image, this tool facilitates parameter estimation and model validation for observations of black hole system.
    A Review of Cooperation in Multi-agent Learning. (arXiv:2312.05162v1 [cs.MA])
    Cooperation in multi-agent learning (MAL) is a topic at the intersection of numerous disciplines, including game theory, economics, social sciences, and evolutionary biology. Research in this area aims to understand both how agents can coordinate effectively when goals are aligned and how they may cooperate in settings where gains from working together are possible but possibilities for conflict abound. In this paper we provide an overview of the fundamental concepts, problem settings and algorithms of multi-agent learning. This encompasses reinforcement learning, multi-agent sequential decision-making, challenges associated with multi-agent cooperation, and a comprehensive review of recent progress, along with an evaluation of relevant metrics. Finally we discuss open challenges in the field with the aim of inspiring new avenues for research.
    Temporal Inductive Path Neural Network for Temporal Knowledge Graph Reasoning. (arXiv:2309.03251v2 [cs.AI] UPDATED)
    Temporal Knowledge Graph (TKG) is an extension of traditional Knowledge Graph (KG) that incorporates the dimension of time. Reasoning on TKGs is a crucial task that aims to predict future facts based on historical occurrences. The key challenge lies in uncovering structural dependencies within historical subgraphs and temporal patterns. Most existing approaches model TKGs relying on entity modeling, as nodes in the graph play a crucial role in knowledge representation. However, the real-world scenario often involves an extensive number of entities, with new entities emerging over time. This makes it challenging for entity-dependent methods to cope with extensive volumes of entities, and effectively handling newly emerging entities also becomes a significant challenge. Therefore, we propose Temporal Inductive Path Neural Network (TiPNN), which models historical information in an entity-independent perspective. Specifically, TiPNN adopts a unified graph, namely history temporal graph, to comprehensively capture and encapsulate information from history. Subsequently, we utilize the defined query-aware temporal paths to model historical path information related to queries on history temporal graph for the reasoning. Extensive experiments illustrate that the proposed model not only attains significant performance enhancements but also handles inductive settings, while additionally facilitating the provision of reasoning evidence through history temporal graphs.
    Synthesizing Traffic Datasets using Graph Neural Networks. (arXiv:2312.05031v1 [cs.CV])
    Traffic congestion in urban areas presents significant challenges, and Intelligent Transportation Systems (ITS) have sought to address these via automated and adaptive controls. However, these systems often struggle to transfer simulated experiences to real-world scenarios. This paper introduces a novel methodology for bridging this `sim-real' gap by creating photorealistic images from 2D traffic simulations and recorded junction footage. We propose a novel image generation approach, integrating a Conditional Generative Adversarial Network with a Graph Neural Network (GNN) to facilitate the creation of realistic urban traffic images. We harness GNNs' ability to process information at different levels of abstraction alongside segmented images for preserving locality data. The presented architecture leverages the power of SPADE and Graph ATtention (GAT) network models to create images based on simulated traffic scenarios. These images are conditioned by factors such as entity positions, colors, and time of day. The uniqueness of our approach lies in its ability to effectively translate structured and human-readable conditions, encoded as graphs, into realistic images. This advancement contributes to applications requiring rich traffic image datasets, from data augmentation to urban traffic solutions. We further provide an application to test the model's capabilities, including generating images with manually defined positions for various entities.
    SparQ Attention: Bandwidth-Efficient LLM Inference. (arXiv:2312.04985v1 [cs.LG])
    Generative large language models (LLMs) have opened up numerous novel possibilities, but due to their significant computational requirements their ubiquitous use remains challenging. Some of the most useful applications require processing large numbers of samples at a time and using long contexts, both significantly increasing the memory communication load of the models. We introduce SparQ Attention, a technique for increasing the inference throughput of LLMs by reducing the memory bandwidth requirements within the attention blocks through selective fetching of the cached history. Our proposed technique can be applied directly to off-the-shelf LLMs during inference, without requiring any modification to the pre-training setup or additional fine-tuning. We show how SparQ Attention can decrease the attention memory bandwidth requirements up to eight times without any loss in accuracy by evaluating Llama 2 and Pythia models on a wide range of downstream tasks.
    Gradient-Based Spectral Embeddings of Random Dot Product Graphs. (arXiv:2307.13818v2 [cs.LG] UPDATED)
    The Random Dot Product Graph (RDPG) is a generative model for relational data, where nodes are represented via latent vectors in low-dimensional Euclidean space. RDPGs crucially postulate that edge formation probabilities are given by the dot product of the corresponding latent positions. Accordingly, the embedding task of estimating these vectors from an observed graph is typically posed as a low-rank matrix factorization problem. The workhorse Adjacency Spectral Embedding (ASE) enjoys solid statistical properties, but it is formally solving a surrogate problem and can be computationally intensive. In this paper, we bring to bear recent advances in non-convex optimization and demonstrate their impact to RDPG inference. We advocate first-order gradient descent methods to better solve the embedding problem, and to organically accommodate broader network embedding applications of practical relevance. Notably, we argue that RDPG embeddings of directed graphs loose interpretability unless the factor matrices are constrained to have orthogonal columns. We thus develop a novel feasible optimization method in the resulting manifold. The effectiveness of the graph representation learning framework is demonstrated on reproducible experiments with both synthetic and real network data. Our open-source algorithm implementations are scalable, and unlike the ASE they are robust to missing edge data and can track slowly-varying latent positions from streaming graphs.
    Testing LLM performance on the Physics GRE: some observations. (arXiv:2312.04613v1 [physics.ed-ph])
    With the recent developments in large language models (LLMs) and their widespread availability through open source models and/or low-cost APIs, several exciting products and applications are emerging, many of which are in the field of STEM educational technology for K-12 and university students. There is a need to evaluate these powerful language models on several benchmarks, in order to understand their risks and limitations. In this short paper, we summarize and analyze the performance of Bard, a popular LLM-based conversational service made available by Google, on the standardized Physics GRE examination.
    Quality-Diversity through AI Feedback. (arXiv:2310.13032v4 [cs.CL] UPDATED)
    In many text-generation problems, users may prefer not only a single response, but a diverse range of high-quality outputs from which to choose. Quality-diversity (QD) search algorithms aim at such outcomes, by continually improving and diversifying a population of candidates. However, the applicability of QD to qualitative domains, like creative writing, has been limited by the difficulty of algorithmically specifying measures of quality and diversity. Interestingly, recent developments in language models (LMs) have enabled guiding search through AI feedback, wherein LMs are prompted in natural language to evaluate qualitative aspects of text. Leveraging this development, we introduce Quality-Diversity through AI Feedback (QDAIF), wherein an evolutionary algorithm applies LMs to both generate variation and evaluate the quality and diversity of candidate text. When assessed on creative writing domains, QDAIF covers more of a specified search space with high-quality samples than do non-QD controls. Further, human evaluation of QDAIF-generated creative texts validates reasonable agreement between AI and human evaluation. Our results thus highlight the potential of AI feedback to guide open-ended search for creative and original solutions, providing a recipe that seemingly generalizes to many domains and modalities. In this way, QDAIF is a step towards AI systems that can independently search, diversify, evaluate, and improve, which are among the core skills underlying human society's capacity for innovation.
    Analysis on Effects of Fault Elements in Memristive Neuromorphic Systems. (arXiv:2312.04840v1 [cs.NE])
    Nowadays, neuromorphic systems based on Spiking Neural Networks (SNNs) attract attentions of many researchers. There are many studies to improve performances of neuromorphic systems. These studies have been showing satisfactory results. To magnify performances of neuromorphic systems, developing actual neuromorphic systems is essential. For developing them, memristors play key role due to their useful characteristics. Although memristors are essential for actual neuromorphic systems, they are vulnerable to faults. However, there are few studies analyzing effects of fault elements in neuromorphic systems using memristors. To solve this problem, we analyze performance of a memristive neuromorphic system with fault elements changing fault ratios, types, and positions. We choose neurons and synapses to inject faults. We inject two types of faults to synapses: SA0 and SA1 faults. The fault synapses appear in random and important positions. Through our analysis, we discover the following four interesting points. First, memristive characteristics increase vulnerability of neuromorphic systems to fault elements. Second, fault neuron ratios reducing performance sharply exist. Third, performance degradation by fault synapses depends on fault types. Finally, SA1 fault synapses improve performance when they appear in important positions.
    A Study of Forward-Forward Algorithm for Self-Supervised Learning. (arXiv:2309.11955v2 [cs.CV] UPDATED)
    Self-supervised representation learning has seen remarkable progress in the last few years, with some of the recent methods being able to learn useful image representations without labels. These methods are trained using backpropagation, the de facto standard. Recently, Geoffrey Hinton proposed the forward-forward algorithm as an alternative training method. It utilizes two forward passes and a separate loss function for each layer to train the network without backpropagation. In this study, for the first time, we study the performance of forward-forward vs. backpropagation for self-supervised representation learning and provide insights into the learned representation spaces. Our benchmark employs four standard datasets, namely MNIST, F-MNIST, SVHN and CIFAR-10, and three commonly used self-supervised representation learning techniques, namely rotation, flip and jigsaw. Our main finding is that while the forward-forward algorithm performs comparably to backpropagation during (self-)supervised training, the transfer performance is significantly lagging behind in all the studied settings. This may be caused by a combination of factors, including having a loss function for each layer and the way the supervised training is realized in the forward-forward paradigm. In comparison to backpropagation, the forward-forward algorithm focuses more on the boundaries and drops part of the information unnecessary for making decisions which harms the representation learning goal. Further investigation and research are necessary to stabilize the forward-forward strategy for self-supervised learning, to work beyond the datasets and configurations demonstrated by Geoffrey Hinton.
    AFN: Adaptive Fusion Normalization via Encoder-Decoder Framework. (arXiv:2308.03321v2 [cs.LG] UPDATED)
    The success of deep learning is inseparable from normalization layers. Researchers have proposed various normalization functions, and each of them has both advantages and disadvantages. In response, efforts have been made to design a unified normalization function that combines all normalization procedures and mitigates their weaknesses. We also proposed a new normalization function called Adaptive Fusion Normalization. Through experiments, we demonstrate AFN outperforms the previous normalization techniques in domain generalization and image classification tasks.
    Domain Private Transformers for Multi-Domain Dialog Systems. (arXiv:2305.14208v2 [cs.CL] UPDATED)
    Large, general purpose language models have demonstrated impressive performance across many different conversational domains. While multi-domain language models achieve low overall perplexity, their outputs are not guaranteed to stay within the domain of a given input prompt. This paper proposes domain privacy as a novel way to quantify how likely a conditional language model will leak across domains. We also develop policy functions based on token-level domain classification, and propose an efficient fine-tuning method to improve the trained model's domain privacy. Experiments on membership inference attacks show that our proposed method has comparable resiliency to methods adapted from recent literature on differentially private language models.
    Chain of Code: Reasoning with a Language Model-Augmented Code Emulator. (arXiv:2312.04474v2 [cs.CL] UPDATED)
    Code provides a general syntactic structure to build complex programs and perform precise computations when paired with a code interpreter - we hypothesize that language models (LMs) can leverage code-writing to improve Chain of Thought reasoning not only for logic and arithmetic tasks, but also for semantic ones (and in particular, those that are a mix of both). For example, consider prompting an LM to write code that counts the number of times it detects sarcasm in an essay: the LM may struggle to write an implementation for "detect_sarcasm(string)" that can be executed by the interpreter (handling the edge cases would be insurmountable). However, LMs may still produce a valid solution if they not only write code, but also selectively "emulate" the interpreter by generating the expected output of "detect_sarcasm(string)" and other lines of code that cannot be executed. In this work, we propose Chain of Code (CoC), a simple yet surprisingly effective extension that improves LM code-driven reasoning. The key idea is to encourage LMs to format semantic sub-tasks in a program as flexible pseudocode that the interpreter can explicitly catch undefined behaviors and hand off to simulate with an LM (as an "LMulator"). Experiments demonstrate that Chain of Code outperforms Chain of Thought and other baselines across a variety of benchmarks; on BIG-Bench Hard, Chain of Code achieves 84%, a gain of 12% over Chain of Thought. CoC scales well with large and small models alike, and broadens the scope of reasoning questions that LMs can correctly answer by "thinking in code". Project webpage: https://chain-of-code.github.io.
    Z-BERT-A: a zero-shot Pipeline for Unknown Intent detection. (arXiv:2208.07084v3 [cs.CL] UPDATED)
    Intent discovery is a crucial task in natural language processing, and it is increasingly relevant for various of industrial applications. Identifying novel, unseen intents from user inputs remains one of the biggest challenges in this field. Herein, we propose Zero-Shot-BERT-Adapters, a two-stage method for multilingual intent discovery relying on a Transformer architecture, fine-tuned with Adapters. We train the model for Natural Language Inference (NLI) and later perform unknown intent classification in a zero-shot setting for multiple languages. In our evaluation, we first analyze the quality of the model after adaptive fine-tuning on known classes. Secondly, we evaluate its performance in casting intent classification as an NLI task. Lastly, we test the zero-shot performance of the model on unseen classes, showing how Zero-Shot-BERT-Adapters can effectively perform intent discovery by generating semantically similar intents, if not equal, to the ground-truth ones. Our experiments show how Zero-Shot-BERT-Adapters outperforms various baselines in two zero-shot settings: known intent classification and unseen intent discovery. The proposed pipeline holds the potential for broad application in customer care. It enables automated dynamic triage using a lightweight model that can be easily deployed and scaled in various business scenarios, unlike large language models. Zero-Shot-BERT-Adapters represents an innovative multi-language approach for intent discovery, enabling the online generation of novel intents. A Python package implementing the pipeline and the new datasets we compiled are available at the following link: https://github.com/GT4SD/zero-shot-bert-adapters.
    Spike-based computation using classical recurrent neural networks. (arXiv:2306.03623v2 [cs.NE] UPDATED)
    Spiking neural networks are a type of artificial neural networks in which communication between neurons is only made of events, also called spikes. This property allows neural networks to make asynchronous and sparse computations and therefore drastically decrease energy consumption when run on specialized hardware. However, training such networks is known to be difficult, mainly due to the non-differentiability of the spike activation, which prevents the use of classical backpropagation. This is because state-of-the-art spiking neural networks are usually derived from biologically-inspired neuron models, to which are applied machine learning methods for training. Nowadays, research about spiking neural networks focuses on the design of training algorithms whose goal is to obtain networks that compete with their non-spiking version on specific tasks. In this paper, we attempt the symmetrical approach: we modify the dynamics of a well-known, easily trainable type of recurrent neural network to make it event-based. This new RNN cell, called the Spiking Recurrent Cell, therefore communicates using events, i.e. spikes, while being completely differentiable. Vanilla backpropagation can thus be used to train any network made of such RNN cell. We show that this new network can achieve performance comparable to other types of spiking networks in the MNIST benchmark and its variants, the Fashion-MNIST and the Neuromorphic-MNIST. Moreover, we show that this new cell makes the training of deep spiking networks achievable.
    SymNoise: Advancing Language Model Fine-tuning with Symmetric Noise. (arXiv:2312.01523v2 [cs.CL] UPDATED)
    In this paper, we introduce a novel fine-tuning technique for language models, which involves incorporating symmetric noise into the embedding process. This method aims to enhance the model's function by more stringently regulating its local curvature, demonstrating superior performance over the current method, NEFTune. When fine-tuning the LLaMA-2-7B model using Alpaca, standard techniques yield a 29.79% score on AlpacaEval. However, our approach, SymNoise, increases this score significantly to 69.04%, using symmetric noisy embeddings. This is a 6.7% improvement over the state-of-the-art method, NEFTune~(64.69%). Furthermore, when tested on various models and stronger baseline instruction datasets, such as Evol-Instruct, ShareGPT, OpenPlatypus, SymNoise consistently outperforms NEFTune. The current literature, including NEFTune, has underscored the importance of more in-depth research into the application of noise-based strategies in the fine-tuning of language models. Our approach, SymNoise, is another significant step towards this direction, showing notable improvement over the existing state-of-the-art method.
    Language Models, Agent Models, and World Models: The LAW for Machine Reasoning and Planning. (arXiv:2312.05230v1 [cs.AI])
    Despite their tremendous success in many applications, large language models often fall short of consistent reasoning and planning in various (language, embodied, and social) scenarios, due to inherent limitations in their inference, learning, and modeling capabilities. In this position paper, we present a new perspective of machine reasoning, LAW, that connects the concepts of Language models, Agent models, and World models, for more robust and versatile reasoning capabilities. In particular, we propose that world and agent models are a better abstraction of reasoning, that introduces the crucial elements of deliberate human-like reasoning, including beliefs about the world and other agents, anticipation of consequences, goals/rewards, and strategic planning. Crucially, language models in LAW serve as a backend to implement the system or its elements and hence provide the computational power and adaptability. We review the recent studies that have made relevant progress and discuss future research directions towards operationalizing the LAW framework.
    Coupled Multiwavelet Neural Operator Learning for Coupled Partial Differential Equations. (arXiv:2303.02304v4 [cs.LG] UPDATED)
    Coupled partial differential equations (PDEs) are key tasks in modeling the complex dynamics of many physical processes. Recently, neural operators have shown the ability to solve PDEs by learning the integral kernel directly in Fourier/Wavelet space, so the difficulty for solving the coupled PDEs depends on dealing with the coupled mappings between the functions. Towards this end, we propose a \textit{coupled multiwavelets neural operator} (CMWNO) learning scheme by decoupling the coupled integral kernels during the multiwavelet decomposition and reconstruction procedures in the Wavelet space. The proposed model achieves significantly higher accuracy compared to previous learning-based solvers in solving the coupled PDEs including Gray-Scott (GS) equations and the non-local mean field game (MFG) problem. According to our experimental results, the proposed model exhibits a $2\times \sim 4\times$ improvement relative $L$2 error compared to the best results from the state-of-the-art models.
    Causal normalizing flows: from theory to practice. (arXiv:2306.05415v2 [cs.LG] UPDATED)
    In this work, we deepen on the use of normalizing flows for causal reasoning. Specifically, we first leverage recent results on non-linear ICA to show that causal models are identifiable from observational data given a causal ordering, and thus can be recovered using autoregressive normalizing flows (NFs). Second, we analyze different design and learning choices for causal normalizing flows to capture the underlying causal data-generating process. Third, we describe how to implement the do-operator in causal NFs, and thus, how to answer interventional and counterfactual questions. Finally, in our experiments, we validate our design and training choices through a comprehensive ablation study; compare causal NFs to other approaches for approximating causal models; and empirically demonstrate that causal NFs can be used to address real-world problems, where the presence of mixed discrete-continuous data and partial knowledge on the causal graph is the norm. The code for this work can be found at https://github.com/psanch21/causal-flows.
    Disentangling CO Chemistry in a Protoplanetary Disk Using Explanatory Machine Learning Techniques. (arXiv:2312.05254v1 [astro-ph.EP])
    Molecular abundances in protoplanetary disks are highly sensitive to the local physical conditions, including gas temperature, gas density, radiation field, and dust properties. Often multiple factors are intertwined, impacting the abundances of both simple and complex species. We present a new approach to understanding these chemical and physical interdependencies using machine learning. Specifically we explore the case of CO modeled under the conditions of a generic disk and build an explanatory regression model to study the dependence of CO spatial density on the gas density, gas temperature, cosmic ray ionization rate, X-ray ionization rate, and UV flux. Our findings indicate that combinations of parameters play a surprisingly powerful role in regulating CO compared to any singular physical parameter. Moreover, in general, we find the conditions in the disk are destructive toward CO. CO depletion is further enhanced in an increased cosmic ray environment and in disks with higher initial C/O ratios. These dependencies uncovered by our new approach are consistent with previous studies, which are more modeling intensive and computationally expensive. Our work thus shows that machine learning can be a powerful tool not only for creating efficient predictive models, but also for enabling a deeper understanding of complex chemical processes.
    G$^2$uardFL: Safeguarding Federated Learning Against Backdoor Attacks through Attributed Client Graph Clustering. (arXiv:2306.04984v2 [cs.CR] UPDATED)
    Federated Learning (FL) offers collaborative model training without data sharing but is vulnerable to backdoor attacks, where poisoned model weights lead to compromised system integrity. Existing countermeasures, primarily based on anomaly detection, are prone to erroneous rejections of normal weights while accepting poisoned ones, largely due to shortcomings in quantifying similarities among client models. Furthermore, other defenses demonstrate effectiveness only when dealing with a limited number of malicious clients, typically fewer than 10%. To alleviate these vulnerabilities, we present G$^2$uardFL, a protective framework that reinterprets the identification of malicious clients as an attributed graph clustering problem, thus safeguarding FL systems. Specifically, this framework employs a client graph clustering approach to identify malicious clients and integrates an adaptive mechanism to amplify the discrepancy between the aggregated model and the poisoned ones, effectively eliminating embedded backdoors. We also conduct a theoretical analysis of convergence to confirm that G$^2$uardFL does not affect the convergence of FL systems. Through empirical evaluation, comparing G$^2$uardFL with cutting-edge defenses, such as FLAME (USENIX Security 2022) [28] and DeepSight (NDSS 2022) [36], against various backdoor attacks including 3DFed (SP 2023) [20], our results demonstrate its significant effectiveness in mitigating backdoor attacks while having a negligible impact on the aggregated model's performance on benign samples (i.e., the primary task performance). For instance, in an FL system with 25% malicious clients, G$^2$uardFL reduces the attack success rate to 10.61%, while maintaining a primary task performance of 73.05% on the CIFAR-10 dataset. This surpasses the performance of the best-performing baseline, which merely achieves a primary task performance of 19.54%.
    Low-skilled Occupations Face the Highest Upskilling Pressure. (arXiv:2101.11505v4 [cs.CY] UPDATED)
    Substantial scholarship has estimated the susceptibility of jobs to automation, but little has examined how job contents evolve in the information age as new technologies substitute for tasks, shifting required skills rather than eliminating entire jobs. Here we explore patterns and consequences of changes in occupational skill and characterize occupations and workers subject to the greatest re-skilling pressure. Recent work found that changing skill requirements are greatest for STEM occupations. Nevertheless, analyzing 167 million online job posts covering 727 occupations over the last decade, we find that re-skilling pressure is greatest for low-skilled occupations when accounting for distance between skills. We further investigate the differences in skill change across employer and market size, as well as social demographic groups, and find that these differences tend to widen the economic divide. Jobs from large employers and markets experienced less change relative to small employers and markets, and non-white workers in low-skilled jobs are most demographically vulnerable. We conclude by showcasing our model's potential to precisely chart job evolution towards machine-interface integration using skill embedding spaces.
    CS4ML: A general framework for active learning with arbitrary data based on Christoffel functions. (arXiv:2306.00945v2 [cs.LG] UPDATED)
    We introduce a general framework for active learning in regression problems. Our framework extends the standard setup by allowing for general types of data, rather than merely pointwise samples of the target function. This generalization covers many cases of practical interest, such as data acquired in transform domains (e.g., Fourier data), vector-valued data (e.g., gradient-augmented data), data acquired along continuous curves, and, multimodal data (i.e., combinations of different types of measurements). Our framework considers random sampling according to a finite number of sampling measures and arbitrary nonlinear approximation spaces (model classes). We introduce the concept of generalized Christoffel functions and show how these can be used to optimize the sampling measures. We prove that this leads to near-optimal sample complexity in various important cases. This paper focuses on applications in scientific computing, where active learning is often desirable, since it is usually expensive to generate data. We demonstrate the efficacy of our framework for gradient-augmented learning with polynomials, Magnetic Resonance Imaging (MRI) using generative models and adaptive sampling for solving PDEs using Physics-Informed Neural Networks (PINNs).
    INSPECT: Intrinsic and Systematic Probing Evaluation for Code Transformers. (arXiv:2312.05092v1 [cs.SE])
    Pre-trained models of source code have recently been successfully applied to a wide variety of Software Engineering tasks; they have also seen some practical adoption in practice, e.g. for code completion. Yet, we still know very little about what these pre-trained models learn about source code. In this article, we use probing--simple diagnostic tasks that do not further train the models--to discover to what extent pre-trained models learn about specific aspects of source code. We use an extensible framework to define 15 probing tasks that exercise surface, syntactic, structural and semantic characteristics of source code. We probe 8 pre-trained source code models, as well as a natural language model (BERT) as our baseline. We find that models that incorporate some structural information (such as GraphCodeBERT) have a better representation of source code characteristics. Surprisingly, we find that for some probing tasks, BERT is competitive with the source code models, indicating that there are ample opportunities to improve source-code specific pre-training on the respective code characteristics. We encourage other researchers to evaluate their models with our probing task suite, so that they may peer into the hidden layers of the models and identify what intrinsic code characteristics are encoded.
    Grasp Force Optimization as a Bilinear Matrix Inequality Problem: A Deep Learning Approach. (arXiv:2312.05034v1 [cs.RO])
    Grasp force synthesis is a non-convex optimization problem involving constraints that are bilinear. Traditional approaches to this problem involve general-purpose gradient-based nonlinear optimization and semi-definite programming. With a view towards dealing with postural synergies and non-smooth but convex positive semidefinite constraints, we look beyond gradient-based optimization. The focus of this paper is to undertake a grasp analysis of biomimetic grasping in multi-fingered robotic hands as a bilinear matrix inequality (BMI) problem. Our analysis is to solve it using a deep learning approach to make the algorithm efficiently generate force closure grasps with optimal grasp quality on untrained/unseen objects.
    Modeling Risk in Reinforcement Learning: A Literature Mapping. (arXiv:2312.05231v1 [cs.LG])
    Safe reinforcement learning deals with mitigating or avoiding unsafe situations by reinforcement learning (RL) agents. Safe RL approaches are based on specific risk representations for particular problems or domains. In order to analyze agent behaviors, compare safe RL approaches, and effectively transfer techniques between application domains, it is necessary to understand the types of risk specific to safe RL problems. We performed a systematic literature mapping with the objective to characterize risk in safe RL. Based on the obtained results, we present definitions, characteristics, and types of risk that hold on multiple application domains. Our literature mapping covers literature from the last 5 years (2017-2022), from a variety of knowledge areas (AI, finance, engineering, medicine) where RL approaches emphasize risk representation and management. Our mapping covers 72 papers filtered systematically from over thousands of papers on the topic. Our proposed notion of risk covers a variety of representations, disciplinary differences, common training exercises, and types of techniques. We encourage researchers to include explicit and detailed accounts of risk in future safe RL research reports, using this mapping as a starting point. With this information, researchers and practitioners could draw stronger conclusions on the effectiveness of techniques on different problems.
    BAFFLE: Hiding Backdoors in Offline Reinforcement Learning Datasets. (arXiv:2210.04688v4 [cs.LG] UPDATED)
    Reinforcement learning (RL) makes an agent learn from trial-and-error experiences gathered during the interaction with the environment. Recently, offline RL has become a popular RL paradigm because it saves the interactions with environments. In offline RL, data providers share large pre-collected datasets, and others can train high-quality agents without interacting with the environments. This paradigm has demonstrated effectiveness in critical tasks like robot control, autonomous driving, etc. However, less attention is paid to investigating the security threats to the offline RL system. This paper focuses on backdoor attacks, where some perturbations are added to the data (observations) such that given normal observations, the agent takes high-rewards actions, and low-reward actions on observations injected with triggers. In this paper, we propose Baffle (Backdoor Attack for Offline Reinforcement Learning), an approach that automatically implants backdoors to RL agents by poisoning the offline RL dataset, and evaluate how different offline RL algorithms react to this attack. Our experiments conducted on four tasks and four offline RL algorithms expose a disquieting fact: none of the existing offline RL algorithms is immune to such a backdoor attack. More specifically, Baffle modifies 10\% of the datasets for four tasks (3 robotic controls and 1 autonomous driving). Agents trained on the poisoned datasets perform well in normal settings. However, when triggers are presented, the agents' performance decreases drastically by 63.2\%, 53.9\%, 64.7\%, and 47.4\% in the four tasks on average. The backdoor still persists after fine-tuning poisoned agents on clean datasets. We further show that the inserted backdoor is also hard to be detected by a popular defensive method. This paper calls attention to developing more effective protection for the open-source offline RL dataset.
    Loss Minimization Yields Multicalibration for Large Neural Networks. (arXiv:2304.09424v2 [cs.LG] UPDATED)
    Multicalibration is a notion of fairness for predictors that requires them to provide calibrated predictions across a large set of protected groups. Multicalibration is known to be a distinct goal than loss minimization, even for simple predictors such as linear functions. In this work, we consider the setting where the protected groups can be represented by neural networks of size $k$, and the predictors are neural networks of size $n > k$. We show that minimizing the squared loss over all neural nets of size $n$ implies multicalibration for all but a bounded number of unlucky values of $n$. We also give evidence that our bound on the number of unlucky values is tight, given our proof technique. Previously, results of the flavor that loss minimization yields multicalibration were known only for predictors that were near the ground truth, hence were rather limited in applicability. Unlike these, our results rely on the expressivity of neural nets and utilize the representation of the predictor.
    Uncertainty Quantification and Propagation in Surrogate-based Bayesian Inference. (arXiv:2312.05153v1 [stat.ML])
    Surrogate models are statistical or conceptual approximations for more complex simulation models. In this context, it is crucial to propagate the uncertainty induced by limited simulation budget and surrogate approximation error to predictions, inference, and subsequent decision-relevant quantities. However, quantifying and then propagating the uncertainty of surrogates is usually limited to special analytic cases or is otherwise computationally very expensive. In this paper, we propose a framework enabling a scalable, Bayesian approach to surrogate modeling with thorough uncertainty quantification, propagation, and validation. Specifically, we present three methods for Bayesian inference with surrogate models given measurement data. This is a task where the propagation of surrogate uncertainty is especially relevant, because failing to account for it may lead to biased and/or overconfident estimates of the parameters of interest. We showcase our approach in two detailed case studies for both linear and nonlinear modeling scenarios. Uncertainty propagation in surrogate models enables more reliable and safe approximation of expensive simulators and will therefore be useful in various fields of applications.
    Collaborative causal inference on distributed data. (arXiv:2208.07898v4 [stat.ME] UPDATED)
    In recent years, the development of technologies for causal inference with privacy preservation of distributed data has gained considerable attention. Many existing methods for distributed data focus on resolving the lack of subjects (samples) and can only reduce random errors in estimating treatment effects. In this study, we propose a data collaboration quasi-experiment (DC-QE) that resolves the lack of both subjects and covariates, reducing random errors and biases in the estimation. Our method involves constructing dimensionality-reduced intermediate representations from private data from local parties, sharing intermediate representations instead of private data for privacy preservation, estimating propensity scores from the shared intermediate representations, and finally, estimating the treatment effects from propensity scores. Through numerical experiments on both artificial and real-world data, we confirm that our method leads to better estimation results than individual analyses. While dimensionality reduction loses some information in the private data and causes performance degradation, we observe that sharing intermediate representations with many parties to resolve the lack of subjects and covariates sufficiently improves performance to overcome the degradation caused by dimensionality reduction. Although external validity is not necessarily guaranteed, our results suggest that DC-QE is a promising method. With the widespread use of our method, intermediate representations can be published as open data to help researchers find causalities and accumulate a knowledge base.
    Soft Frequency Capping for Improved Ad Click Prediction in Yahoo Gemini Native. (arXiv:2312.05052v1 [cs.IR])
    Yahoo's native advertising (also known as Gemini native) serves billions of ad impressions daily, reaching a yearly run-rate of many hundred of millions USD. Driving the Gemini native models that are used to predict both click probability (pCTR) and conversion probability (pCONV) is OFFSET - a feature enhanced collaborative-filtering (CF) based event prediction algorithm. \offset is a one-pass algorithm that updates its model for every new batch of logged data using a stochastic gradient descent (SGD) based approach. Since OFFSET represents its users by their features (i.e., user-less model) due to sparsity issues, rule based hard frequency capping (HFC) is used to control the number of times a certain user views a certain ad. Moreover, related statistics reveal that user ad fatigue results in a dramatic drop in click through rate (CTR). Therefore, to improve click prediction accuracy, we propose a soft frequency capping (SFC) approach, where the frequency feature is incorporated into the OFFSET model as a user-ad feature and its weight vector is learned via logistic regression as part of OFFSET training. Online evaluation of the soft frequency capping algorithm via bucket testing showed a significant 7.3% revenue lift. Since then, the frequency feature enhanced model has been pushed to production serving all traffic, and is generating a hefty revenue lift for Yahoo Gemini native. We also report related statistics that reveal, among other things, that while users' gender does not affect ad fatigue, the latter seems to increase with users' age.
    Off-By-One Implementation Error in J-UNIWARD. (arXiv:2305.19776v2 [cs.CR] UPDATED)
    J-UNIWARD is a popular steganography method for hiding secret messages in JPEG cover images. As a content-adaptive method, J-UNIWARD aims to embed into textured image regions where changes are difficult to detect. To this end, J-UNIWARD first assigns to each DCT coefficient an embedding cost calculated based on the image's Wavelet residual, and then uses a coding method that minimizes the cost while embedding the desired payload. Changing one DCT coefficient affects a 23x23 window of Wavelet coefficients. To speed up the costmap computation, the original implementation pre-computes the Wavelet residual and then considers per changed DCT coefficient a 23x23 window of the Wavelet residual. However, the implementation accesses a window accidentally shifted by one pixel to the bottom right. In this report, we evaluate the effect of this off-by-one error on the resulting costmaps. Some image blocks are over-priced while other image blocks are under-priced, but the difference is relatively small. The off-by-one error seems to make little difference for learning-based steganalysis.
    A Distributed ADMM-based Deep Learning Approach for Thermal Control in Multi-Zone Buildings. (arXiv:2312.05073v1 [math.OC])
    The surge in electricity use, coupled with the dependency on intermittent renewable energy sources, poses significant hurdles to effectively managing power grids, particularly during times of peak demand. Demand Response programs and energy conservation measures are essential to operate energy grids while ensuring a responsible use of our resources This research combines distributed optimization using ADMM with Deep Learning models to plan indoor temperature setpoints effectively. A two-layer hierarchical structure is used, with a central building coordinator at the upper layer and local controllers at the thermal zone layer. The coordinator must limit the building's maximum power by translating the building's total power to local power targets for each zone. Local controllers can modify the temperature setpoints to meet the local power targets. The resulting control algorithm, called Distributed Planning Networks, is designed to be both adaptable and scalable to many types of buildings, tackling two of the main challenges in the development of such systems. The proposed approach is tested on an 18-zone building modeled in EnergyPlus. The algorithm successfully manages Demand Response peak events.
    Collinear datasets augmentation using Procrustes validation sets. (arXiv:2312.04911v1 [cs.LG])
    In this paper, we propose a new method for the augmentation of numeric and mixed datasets. The method generates additional data points by utilizing cross-validation resampling and latent variable modeling. It is particularly efficient for datasets with moderate to high degrees of collinearity, as it directly utilizes this property for generation. The method is simple, fast, and has very few parameters, which, as shown in the paper, do not require specific tuning. It has been tested on several real datasets; here, we report detailed results for two cases, prediction of protein in minced meat based on near infrared spectra (fully numeric data with high degree of collinearity) and discrimination of patients referred for coronary angiography (mixed data, with both numeric and categorical variables, and moderate collinearity). In both cases, artificial neural networks were employed for developing the regression and the discrimination models. The results show a clear improvement in the performance of the models; thus for the prediction of meat protein, fitting the model to the augmented data resulted in a reduction in the root mean squared error computed for the independent test set by 1.5 to 3 times.
    The ICL Consistency Test. (arXiv:2312.04945v1 [cs.CL])
    Just like the previous generation of task-tuned models, large language models (LLMs) that are adapted to tasks via prompt-based methods like in-context-learning (ICL) perform well in some setups but not in others. This lack of consistency in prompt-based learning hints at a lack of robust generalisation. We here introduce the ICL consistency test -- a contribution to the GenBench collaborative benchmark task (CBT) -- which evaluates how consistent a model makes predictions across many different setups while using the same data. The test is based on different established natural language inference tasks. We provide preprocessed data constituting 96 different 'setups' and a metric that estimates model consistency across these setups. The metric is provided on a fine-grained level to understand what properties of a setup render predictions unstable and on an aggregated level to compare overall model consistency. We conduct an empirical analysis of eight state-of-the-art models, and our consistency metric reveals how all tested LLMs lack robust generalisation.
    Optimal Multi-Distribution Learning. (arXiv:2312.05134v1 [cs.LG])
    Multi-distribution learning (MDL), which seeks to learn a shared model that minimizes the worst-case risk across $k$ distinct data distributions, has emerged as a unified framework in response to the evolving demand for robustness, fairness, multi-group collaboration, etc. Achieving data-efficient MDL necessitates adaptive sampling, also called on-demand sampling, throughout the learning process. However, there exist substantial gaps between the state-of-the-art upper and lower bounds on the optimal sample complexity. Focusing on a hypothesis class of Vapnik-Chervonenkis (VC) dimension $d$, we propose a novel algorithm that yields an $varepsilon$-optimal randomized hypothesis with a sample complexity on the order of $(d+k)/\varepsilon^2$ (modulo some logarithmic factor), matching the best-known lower bound. Our algorithmic ideas and theory have been further extended to accommodate Rademacher classes. The proposed algorithms are oracle-efficient, which access the hypothesis class solely through an empirical risk minimization oracle. Additionally, we establish the necessity of randomization, unveiling a large sample size barrier when only deterministic hypotheses are permitted. These findings successfully resolve three open problems presented in COLT 2023 (i.e., Awasthi et al., (2023, Problem 1, 3 and 4)).
    Provably Bounding Neural Network Preimages. (arXiv:2302.01404v3 [cs.LG] UPDATED)
    Most work on the formal verification of neural networks has focused on bounding the set of outputs that correspond to a given set of inputs (for example, bounded perturbations of a nominal input). However, many use cases of neural network verification require solving the inverse problem, or over-approximating the set of inputs that lead to certain outputs. We present the INVPROP algorithm for verifying properties over the preimage of a linearly constrained output set, which can be combined with branch-and-bound to increase precision. Contrary to other approaches, our efficient algorithm is GPU-accelerated and does not require a linear programming solver. We demonstrate our algorithm for identifying safe control regions for a dynamical system via backward reachability analysis, verifying adversarial robustness, and detecting out-of-distribution inputs to a neural network. Our results show that in certain settings, we find over-approximations over 2500x tighter than prior work while being 2.5x faster. By strengthening robustness verification with output constraints, we consistently verify more properties than the previous state-of-the-art on multiple benchmarks, including a large model with 167k neurons in VNN-COMP 2023. Our algorithm has been incorporated into the $\alpha,\!\beta$-CROWN verifier, available at https://abcrown.org.
    Damage GAN: A Generative Model for Imbalanced Data. (arXiv:2312.04862v1 [cs.LG])
    This study delves into the application of Generative Adversarial Networks (GANs) within the context of imbalanced datasets. Our primary aim is to enhance the performance and stability of GANs in such datasets. In pursuit of this objective, we introduce a novel network architecture known as Damage GAN, building upon the ContraD GAN framework which seamlessly integrates GANs and contrastive learning. Through the utilization of contrastive learning, the discriminator is trained to develop an unsupervised representation capable of distinguishing all provided samples. Our approach draws inspiration from the straightforward framework for contrastive learning of visual representations (SimCLR), leading to the formulation of a distinctive loss function. We also explore the implementation of self-damaging contrastive learning (SDCLR) to further enhance the optimization of the ContraD GAN model. Comparative evaluations against baseline models including the deep convolutional GAN (DCGAN) and ContraD GAN demonstrate the evident superiority of our proposed model, Damage GAN, in terms of generated image distribution, model stability, and image quality when applied to imbalanced datasets.
    Conformal Prediction in Multi-User Settings: An Evaluation. (arXiv:2312.05195v1 [cs.LG])
    Typically, machine learning models are trained and evaluated without making any distinction between users (e.g, using traditional hold-out and cross-validation). However, this produces inaccurate performance metrics estimates in multi-user settings. That is, situations where the data were collected by multiple users with different characteristics (e.g., age, gender, height, etc.) which is very common in user computer interaction and medical applications. For these types of scenarios model evaluation strategies that provide better performance estimates have been proposed such as mixed, user-independent, user-dependent, and user-adaptive models. Although those strategies are better suited for multi-user systems, they are typically assessed with respect to performance metrics that capture the overall behavior of the models and do not provide any performance guarantees for individual predictions nor they provide any feedback about the predictions' uncertainty. In order to overcome those limitations, in this work we evaluated the conformal prediction framework in several multi-user settings. Conformal prediction is a model agnostic method that provides confidence guarantees on the predictions, thus, increasing the trustworthiness and robustness of the models. We conducted extensive experiments using different evaluation strategies and found significant differences in terms of conformal performance measures. We also proposed several visualizations based on matrices, graphs, and charts that capture different aspects of the resulting prediction sets.
    On the Trustworthiness Landscape of State-of-the-art Generative Models: A Survey and Outlook. (arXiv:2307.16680v5 [cs.LG] UPDATED)
    Diffusion models and large language models have emerged as leading-edge generative models, revolutionizing various aspects of human life. However, the practical implementations of these models have also exposed inherent risks, bringing to the forefront their evil sides and sparking concerns regarding their trustworthiness. Despite the wealth of literature on this subject, a comprehensive survey specifically delving into the intersection of large-scale generative models and their trustworthiness remains largely absent. To bridge this gap, this paper investigates both the long-standing and emerging threats associated with these models across four fundamental dimensions: 1) privacy, 2) security, 3) fairness, and 4) responsibility. Based on the investigation results, we develop an extensive map outlining the trustworthiness of large generative models. After that, we provide practical recommendations and potential research directions for future secure applications equipped with large generative models, ultimately promoting the trustworthiness of the models and benefiting the society as a whole.
    GraphMETRO: Mitigating Complex Distribution Shifts in GNNs via Mixture of Aligned Experts. (arXiv:2312.04693v1 [cs.LG])
    Graph Neural Networks' (GNNs) ability to generalize across complex distributions is crucial for real-world applications. However, prior research has primarily focused on specific types of distribution shifts, such as larger graph size, or inferred shifts from constructed data environments, which is highly limited when confronted with multiple and nuanced distribution shifts. For instance, in a social graph, a user node might experience increased interactions and content alterations, while other user nodes encounter distinct shifts. Neglecting such complexities significantly impedes generalization. To address it, we present GraphMETRO, a novel framework that enhances GNN generalization under complex distribution shifts in both node and graph-level tasks. Our approach employs a mixture-of-experts (MoE) architecture with a gating model and expert models aligned in a shared representation space. The gating model identifies key mixture components governing distribution shifts, while each expert generates invariant representations w.r.t. a mixture component. Finally, GraphMETRO aggregates representations from multiple experts to generate the final invariant representation. Our experiments on synthetic and realworld datasets demonstrate GraphMETRO's superiority and interpretability. To highlight, GraphMETRO achieves state-of-the-art performances on four real-world datasets from GOOD benchmark, outperforming the best baselines on WebKB and Twitch datasets by 67% and 4.2%, respectively.
    Dynamic Online Modulation Recognition using Incremental Learning. (arXiv:2312.04718v1 [eess.SP])
    Modulation recognition is a fundamental task in communication systems as the accurate identification of modulation schemes is essential for reliable signal processing, interference mitigation for coexistent communication technologies, and network optimization. Incorporating deep learning (DL) models into modulation recognition has demonstrated promising results in various scenarios. However, conventional DL models often fall short in online dynamic contexts, particularly in class incremental scenarios where new modulation schemes are encountered during online deployment. Retraining these models on all previously seen modulation schemes is not only time-consuming but may also not be feasible due to storage limitations. On the other hand, training solely on new modulation schemes often results in catastrophic forgetting of previously learned classes. This issue renders DL-based modulation recognition models inapplicable in real-world scenarios because the dynamic nature of communication systems necessitate the effective adaptability to new modulation schemes. This paper addresses this challenge by evaluating the performance of multiple Incremental Learning (IL) algorithms in dynamic modulation recognition scenarios, comparing them against conventional DL-based modulation recognition. Our results demonstrate that modulation recognition frameworks based on IL effectively prevent catastrophic forgetting, enabling models to perform robustly in dynamic scenarios.
    Distributed Optimization via Kernelized Multi-armed Bandits. (arXiv:2312.04719v1 [cs.LG])
    Multi-armed bandit algorithms provide solutions for sequential decision-making where learning takes place by interacting with the environment. In this work, we model a distributed optimization problem as a multi-agent kernelized multi-armed bandit problem with a heterogeneous reward setting. In this setup, the agents collaboratively aim to maximize a global objective function which is an average of local objective functions. The agents can access only bandit feedback (noisy reward) obtained from the associated unknown local function with a small norm in reproducing kernel Hilbert space (RKHS). We present a fully decentralized algorithm, Multi-agent IGP-UCB (MA-IGP-UCB), which achieves a sub-linear regret bound for popular classes for kernels while preserving privacy. It does not necessitate the agents to share their actions, rewards, or estimates of their local function. In the proposed approach, the agents sample their individual local functions in a way that benefits the whole network by utilizing a running consensus to estimate the upper confidence bound on the global function. Furthermore, we propose an extension, Multi-agent Delayed IGP-UCB (MAD-IGP-UCB) algorithm, which reduces the dependence of the regret bound on the number of agents in the network. It provides improved performance by utilizing a delay in the estimation update step at the cost of more communication.
    Joint User Association, Interference Cancellation and Power Control for Multi-IRS Assisted UAV Communications. (arXiv:2312.04786v1 [cs.IT])
    Intelligent reflecting surface (IRS)-assisted unmanned aerial vehicle (UAV) communications are expected to alleviate the load of ground base stations in a cost-effective way. Existing studies mainly focus on the deployment and resource allocation of a single IRS instead of multiple IRSs, whereas it is extremely challenging for joint multi-IRS multi-user association in UAV communications with constrained reflecting resources and dynamic scenarios. To address the aforementioned challenges, we propose a new optimization algorithm for joint IRS-user association, trajectory optimization of UAVs, successive interference cancellation (SIC) decoding order scheduling and power allocation to maximize system energy efficiency. We first propose an inverse soft-Q learning-based algorithm to optimize multi-IRS multi-user association. Then, SCA and Dinkelbach-based algorithm are leveraged to optimize UAV trajectory followed by the optimization of SIC decoding order scheduling and power allocation. Finally, theoretical analysis and performance results show significant advantages of the designed algorithm in convergence rate and energy efficiency.
    Construction of Hierarchical Neural Architecture Search Spaces based on Context-free Grammars. (arXiv:2211.01842v3 [cs.LG] UPDATED)
    The discovery of neural architectures from simple building blocks is a long-standing goal of Neural Architecture Search (NAS). Hierarchical search spaces are a promising step towards this goal but lack a unifying search space design framework and typically only search over some limited aspect of architectures. In this work, we introduce a unifying search space design framework based on context-free grammars that can naturally and compactly generate expressive hierarchical search spaces that are 100s of orders of magnitude larger than common spaces from the literature. By enhancing and using their properties, we effectively enable search over the complete architecture and can foster regularity. Further, we propose an efficient hierarchical kernel design for a Bayesian Optimization search strategy to efficiently search over such huge spaces. We demonstrate the versatility of our search space design framework and show that our search strategy can be superior to existing NAS approaches. Code is available at https://github.com/automl/hierarchical_nas_construction.
    Deep Emotions Across Languages: A Novel Approach for Sentiment Propagation in Multilingual WordNets. (arXiv:2312.04715v1 [cs.CL])
    Sentiment analysis involves using WordNets enriched with emotional metadata, which are valuable resources. However, manual annotation is time-consuming and expensive, resulting in only a few WordNet Lexical Units being annotated. This paper introduces two new techniques for automatically propagating sentiment annotations from a partially annotated WordNet to its entirety and to a WordNet in a different language: Multilingual Structured Synset Embeddings (MSSE) and Cross-Lingual Deep Neural Sentiment Propagation (CLDNS). We evaluated the proposed MSSE+CLDNS method extensively using Princeton WordNet and Polish WordNet, which have many inter-lingual relations. Our results show that the MSSE+CLDNS method outperforms existing propagation methods, indicating its effectiveness in enriching WordNets with emotional metadata across multiple languages. This work provides a solid foundation for large-scale, multilingual sentiment analysis and is valuable for academic research and practical applications.
    Train 'n Trade: Foundations of Parameter Markets. (arXiv:2312.04740v1 [cs.LG])
    Organizations typically train large models individually. This is costly and time-consuming, particularly for large-scale foundation models. Such vertical production is known to be suboptimal. Inspired by this economic insight, we ask whether it is possible to leverage others' expertise by trading the constituent parts in models, i.e., sets of weights, as if they were market commodities. While recent advances in aligning and interpolating models suggest that doing so may be possible, a number of fundamental questions must be answered to create viable parameter markets. In this work, we address these basic questions, propose a framework containing the infrastructure necessary for market operations to take place, study strategies for exchanging parameters, and offer means for agents to monetize parameters. Excitingly, compared to agents who train siloed models from scratch, we show that it is possible to mutually gain by using the market, even in competitive settings. This suggests that the notion of parameter markets may be a useful paradigm for improving large-scale model training in the future.
    Image Synthesis-based Late Stage Cancer Augmentation and Semi-Supervised Segmentation for MRI Rectal Cancer Staging. (arXiv:2312.04779v1 [eess.IV])
    Rectal cancer is one of the most common diseases and a major cause of mortality. For deciding rectal cancer treatment plans, T-staging is important. However, evaluating the index from preoperative MRI images requires high radiologists' skill and experience. Therefore, the aim of this study is to segment the mesorectum, rectum, and rectal cancer region so that the system can predict T-stage from segmentation results. Generally, shortage of large and diverse dataset and high quality annotation are known to be the bottlenecks in computer aided diagnostics development. Regarding rectal cancer, advanced cancer images are very rare, and per-pixel annotation requires high radiologists' skill and time. Therefore, it is not feasible to collect comprehensive disease patterns in a training dataset. To tackle this, we propose two kinds of approaches of image synthesis-based late stage cancer augmentation and semi-supervised learning which is designed for T-stage prediction. In the image synthesis data augmentation approach, we generated advanced cancer images from labels. The real cancer labels were deformed to resemble advanced cancer labels by artificial cancer progress simulation. Next, we introduce a T-staging loss which enables us to train segmentation models from per-image T-stage labels. The loss works to keep inclusion/invasion relationships between rectum and cancer region consistent to the ground truth T-stage. The verification tests show that the proposed method obtains the best sensitivity (0.76) and specificity (0.80) in distinguishing between over T3 stage and underT2. In the ablation studies, our semi-supervised learning approach with the T-staging loss improved specificity by 0.13. Adding the image synthesis-based data augmentation improved the DICE score of invasion cancer area by 0.08 from baseline.
    From Big to Small Without Losing It All: Text Augmentation with ChatGPT for Efficient Sentiment Analysis. (arXiv:2312.04720v1 [cs.CL])
    In the era of artificial intelligence, data is gold but costly to annotate. The paper demonstrates a groundbreaking solution to this dilemma using ChatGPT for text augmentation in sentiment analysis. We leverage ChatGPT's generative capabilities to create synthetic training data that significantly improves the performance of smaller models, making them competitive with, or even outperforming, their larger counterparts. This innovation enables models to be both efficient and effective, thereby reducing computational cost, inference time, and memory usage without compromising on quality. Our work marks a key advancement in the cost-effective development and deployment of robust sentiment analysis models.  ( 2 min )
    How to guess a gradient. (arXiv:2312.04709v1 [cs.LG])
    How much can you say about the gradient of a neural network without computing a loss or knowing the label? This may sound like a strange question: surely the answer is "very little." However, in this paper, we show that gradients are more structured than previously thought. Gradients lie in a predictable low-dimensional subspace which depends on the network architecture and incoming features. Exploiting this structure can significantly improve gradient-free optimization schemes based on directional derivatives, which have struggled to scale beyond small networks trained on toy datasets. We study how to narrow the gap in optimization performance between methods that calculate exact gradients and those that use directional derivatives. Furthermore, we highlight new challenges in overcoming the large gap between optimizing with exact gradients and guessing the gradients.
    Induced Generative Adversarial Particle Transformers. (arXiv:2312.04757v1 [hep-ex])
    In high energy physics (HEP), machine learning methods have emerged as an effective way to accurately simulate particle collisions at the Large Hadron Collider (LHC). The message-passing generative adversarial network (MPGAN) was the first model to simulate collisions as point, or ``particle'', clouds, with state-of-the-art results, but suffered from quadratic time complexity. Recently, generative adversarial particle transformers (GAPTs) were introduced to address this drawback; however, results did not surpass MPGAN. We introduce induced GAPT (iGAPT) which, by integrating ``induced particle-attention blocks'' and conditioning on global jet attributes, not only offers linear time complexity but is also able to capture intricate jet substructure, surpassing MPGAN in many metrics. Our experiments demonstrate the potential of iGAPT to simulate complex HEP data accurately and efficiently.
    Rapid Motor Adaptation for Robotic Manipulator Arms. (arXiv:2312.04670v1 [cs.RO])
    Developing generalizable manipulation skills is a core challenge in embodied AI. This includes generalization across diverse task configurations, encompassing variations in object shape, density, friction coefficient, and external disturbances such as forces applied to the robot. Rapid Motor Adaptation (RMA) offers a promising solution to this challenge. It posits that essential hidden variables influencing an agent's task performance, such as object mass and shape, can be effectively inferred from the agent's action and proprioceptive history. Drawing inspiration from RMA in locomotion and in-hand rotation, we use depth perception to develop agents tailored for rapid motor adaptation in a variety of manipulation tasks. We evaluated our agents on four challenging tasks from the Maniskill2 benchmark, namely pick-and-place operations with hundreds of objects from the YCB and EGAD datasets, peg insertion with precise position and orientation, and operating a variety of faucets and handles, with customized environment variations. Empirical results demonstrate that our agents surpass state-of-the-art methods like automatic domain randomization and vision-based policies, obtaining better generalization performance and sample efficiency.
    DeltaZip: Multi-Tenant Language Model Serving via Delta Compression. (arXiv:2312.05215v1 [cs.DC])
    Fine-tuning large language models (LLMs) for downstream tasks can greatly improve model quality, however serving many different fine-tuned LLMs concurrently for users in multi-tenant environments is challenging. Dedicating GPU memory for each model is prohibitively expensive and naively swapping large model weights in and out of GPU memory is slow. Our key insight is that fine-tuned models can be quickly swapped in and out of GPU memory by extracting and compressing the delta between each model and its pre-trained base model. We propose DeltaZip, an LLM serving system that efficiently serves multiple full-parameter fine-tuned models concurrently by aggressively compressing model deltas by a factor of $6\times$ to $8\times$ while maintaining high model quality. DeltaZip increases serving throughput by $1.5\times$ to $3\times$ and improves SLO attainment compared to a vanilla HuggingFace serving system.
    Achieving Margin Maximization Exponentially Fast via Progressive Norm Rescaling. (arXiv:2311.14387v2 [cs.LG] UPDATED)
    In this work, we investigate the margin-maximization bias exhibited by gradient-based algorithms in classifying linearly separable data. We present an in-depth analysis of the specific properties of the velocity field associated with (normalized) gradients, focusing on their role in margin maximization. Inspired by this analysis, we propose a novel algorithm called Progressive Rescaling Gradient Descent (PRGD) and show that PRGD can maximize the margin at an {\em exponential rate}. This stands in stark contrast to all existing algorithms, which maximize the margin at a slow {\em polynomial rate}. Specifically, we identify mild conditions on data distribution under which existing algorithms such as gradient descent (GD) and normalized gradient descent (NGD) {\em provably fail} in maximizing the margin efficiently. To validate our theoretical findings, we present both synthetic and real-world experiments. Notably, PRGD also shows promise in enhancing the generalization performance when applied to linearly non-separable datasets and deep neural networks.
    Federated Learning for 6G: Paradigms, Taxonomy, Recent Advances and Insights. (arXiv:2312.04688v1 [cs.LG])
    Artificial Intelligence (AI) is expected to play an instrumental role in the next generation of wireless systems, such as sixth-generation (6G) mobile network. However, massive data, energy consumption, training complexity, and sensitive data protection in wireless systems are all crucial challenges that must be addressed for training AI models and gathering intelligence and knowledge from distributed devices. Federated Learning (FL) is a recent framework that has emerged as a promising approach for multiple learning agents to build an accurate and robust machine learning models without sharing raw data. By allowing mobile handsets and devices to collaboratively learn a global model without explicit sharing of training data, FL exhibits high privacy and efficient spectrum utilization. While there are a lot of survey papers exploring FL paradigms and usability in 6G privacy, none of them has clearly addressed how FL can be used to improve the protocol stack and wireless operations. The main goal of this survey is to provide a comprehensive overview on FL usability to enhance mobile services and enable smart ecosystems to support novel use-cases. This paper examines the added-value of implementing FL throughout all levels of the protocol stack. Furthermore, it presents important FL applications, addresses hot topics, provides valuable insights and explicits guidance for future research and developments. Our concluding remarks aim to leverage the synergy between FL and future 6G, while highlighting FL's potential to revolutionize wireless industry and sustain the development of cutting-edge mobile services.
    TMID: A Comprehensive Real-world Dataset for Trademark Infringement Detection in E-Commerce. (arXiv:2312.05103v1 [cs.CL])
    Annually, e-commerce platforms incur substantial financial losses due to trademark infringements, making it crucial to identify and mitigate potential legal risks tied to merchant information registered to the platforms. However, the absence of high-quality datasets hampers research in this area. To address this gap, our study introduces TMID, a novel dataset to detect trademark infringement in merchant registrations. This is a real-world dataset sourced directly from Alipay, one of the world's largest e-commerce and digital payment platforms. As infringement detection is a legal reasoning task requiring an understanding of the contexts and legal rules, we offer a thorough collection of legal rules and merchant and trademark-related contextual information with annotations from legal experts. We ensure the data quality by performing an extensive statistical analysis. Furthermore, we conduct an empirical study on this dataset to highlight its value and the key challenges. Through this study, we aim to contribute valuable resources to advance research into legal compliance related to trademark infringement within the e-commerce sphere. The dataset is available at https://github.com/emnlpTMID/emnlpTMID.github.io .
    Scientific Preparation for CSST: Classification of Galaxy and Nebula/Star Cluster Based on Deep Learning. (arXiv:2312.04948v1 [cs.CV])
    The Chinese Space Station Telescope (abbreviated as CSST) is a future advanced space telescope. Real-time identification of galaxy and nebula/star cluster (abbreviated as NSC) images is of great value during CSST survey. While recent research on celestial object recognition has progressed, the rapid and efficient identification of high-resolution local celestial images remains challenging. In this study, we conducted galaxy and NSC image classification research using deep learning methods based on data from the Hubble Space Telescope. We built a Local Celestial Image Dataset and designed a deep learning model named HR-CelestialNet for classifying images of the galaxy and NSC. HR-CelestialNet achieved an accuracy of 89.09% on the testing set, outperforming models such as AlexNet, VGGNet and ResNet, while demonstrating faster recognition speeds. Furthermore, we investigated the factors influencing CSST image quality and evaluated the generalization ability of HR-CelestialNet on the blurry image dataset, demonstrating its robustness to low image quality. The proposed method can enable real-time identification of celestial images during CSST survey mission.  ( 2 min )
    Operationalizing Assurance Cases for Data Scientists: A Showcase of Concepts and Tooling in the Context of Test Data Quality for Machine Learning. (arXiv:2312.04917v1 [cs.SE])
    Assurance Cases (ACs) are an established approach in safety engineering to argue quality claims in a structured way. In the context of quality assurance for Machine Learning (ML)-based software components, ACs are also being discussed and appear promising. Tools for operationalizing ACs do exist, yet mainly focus on supporting safety engineers on the system level. However, assuring the quality of an ML component within the system is commonly the responsibility of data scientists, who are usually less familiar with these tools. To address this gap, we propose a framework to support the operationalization of ACs for ML components based on technologies that data scientists use on a daily basis: Python and Jupyter Notebook. Our aim is to make the process of creating ML-related evidence in ACs more effective. Results from the application of the framework, documented through notebooks, can be integrated into existing AC tools. We illustrate the application of the framework on an example excerpt concerned with the quality of the test data.  ( 3 min )
    Predicting Census Survey Response Rates With Parsimonious Additive Models and Structured Interactions. (arXiv:2108.11328v4 [stat.ML] UPDATED)
    In this paper we consider the problem of predicting survey response rates using a family of flexible and interpretable nonparametric models. The study is motivated by the US Census Bureau's well-known ROAM application which uses a linear regression model trained on the US Census Planning Database data to identify hard-to-survey areas. A crowdsourcing competition (Erdman and Bates, 2016) organized around ten years ago revealed that machine learning methods based on ensembles of regression trees led to the best performance in predicting survey response rates; however, the corresponding models could not be adopted for the intended application due to their black-box nature. We consider nonparametric additive models with small number of main and pairwise interaction effects using $\ell_0$-based penalization. From a methodological viewpoint, we study both computational and statistical aspects of our estimator; and discuss variants that incorporate strong hierarchical interactions. Our algorithms (opensourced on github) extend the computational frontiers of existing algorithms for sparse additive models, to be able to handle datasets relevant for the application we consider. We discuss and interpret findings from our model on the US Census Planning Database. In addition to being useful from an interpretability standpoint, our models lead to predictions that appear to be better than popular black-box machine learning methods based on gradient boosting and feedforward neural networks - suggesting that it is possible to have models that have the best of both worlds: good model accuracy and interpretability.
    A Test-Time Learning Approach to Reparameterize the Geophysical Inverse Problem with a Convolutional Neural Network. (arXiv:2312.04752v1 [cs.LG])
    Regularization is critical in solving the ill-posed geo-physical inversion problems. Explicit regularization is often used, but there are opportunities to explore the implicit regularization effect inherently from a Neural Network structure. Researchers in Computer Vision (CV) have discovered that the Convolutional Neural Network (CNN) architecture inherently enforces a regularization that is advantageous for addressing diverse CV inverse problems, including de-noising and in-painting. In this study, we examine the applicability of this implicit regularization to geophysical inversions. The CNN maps an arbitrary vector to the model space (e.g. log-conductivity on the simulation mesh). The predicted subsurface model is then fed into a forward numerical simulation process to generate corresponding predicted measurements. Subsequently, the objective function value is computed by comparing these predicted measurements with the observed field measurements. The backpropagation algorithm is employed to update the trainable parameters of the CNN during the inversion. Note that the CNN in our proposed method does not require training before the inversion, rather, the CNN weights are estimated in the inversion algorithm, hence this is a test-time learning (TTL) approach. The results demonstrate that the implicit regularization provided by the CNN can be useful in DC resistivity inversions. We also provide a detailed discussion of the potential sources of this implicit regularization and some practical guides for applying the proposed method to other geophysical scenarios. The proposed approach for reparameterizing the inverse problem can be adapted to other Tikhonov-style geophysical inversions.
    Causal Effect Regularization: Automated Detection and Removal of Spurious Attributes. (arXiv:2306.11072v2 [cs.LG] UPDATED)
    In many classification datasets, the task labels are spuriously correlated with some input attributes. Classifiers trained on such datasets often rely on these attributes for prediction, especially when the spurious correlation is high, and thus fail to generalize whenever there is a shift in the attributes' correlation at deployment. If we assume that the spurious attributes are known a priori, several methods have been proposed to learn a classifier that is invariant to the specified attributes. However, in real-world data, information about spurious attributes is typically unavailable. Therefore, we propose a method to automatically identify spurious attributes by estimating their causal effect on the label and then use a regularization objective to mitigate the classifier's reliance on them. Compared to a recent method for identifying spurious attributes, we find that our method is more accurate in removing the attribute from the learned model, especially when spurious correlation is high. Specifically, across synthetic, semi-synthetic, and real-world datasets, our method shows significant improvement in a metric used to quantify the dependence of a classifier on spurious attributes ($\Delta$Prob), while obtaining better or similar accuracy. In addition, our method mitigates the reliance on spurious attributes even under noisy estimation of causal effects. To explain the empirical robustness of our method, we create a simple linear classification task with two sets of attributes: causal and spurious. We prove that our method only requires that the ranking of estimated causal effects is correct across attributes to select the correct classifier.
    Depthwise Hyperparameter Transfer in Residual Networks: Dynamics and Scaling Limit. (arXiv:2309.16620v2 [stat.ML] UPDATED)
    The cost of hyperparameter tuning in deep learning has been rising with model sizes, prompting practitioners to find new tuning methods using a proxy of smaller networks. One such proposal uses $\mu$P parameterized networks, where the optimal hyperparameters for small width networks transfer to networks with arbitrarily large width. However, in this scheme, hyperparameters do not transfer across depths. As a remedy, we study residual networks with a residual branch scale of $1/\sqrt{\text{depth}}$ in combination with the $\mu$P parameterization. We provide experiments demonstrating that residual architectures including convolutional ResNets and Vision Transformers trained with this parameterization exhibit transfer of optimal hyperparameters across width and depth on CIFAR-10 and ImageNet. Furthermore, our empirical findings are supported and motivated by theory. Using recent developments in the dynamical mean field theory (DMFT) description of neural network learning dynamics, we show that this parameterization of ResNets admits a well-defined feature learning joint infinite-width and infinite-depth limit and show convergence of finite-size network dynamics towards this limit.
    On Sarcasm Detection with OpenAI GPT-based Models. (arXiv:2312.04642v1 [cs.CL])
    Sarcasm is a form of irony that requires readers or listeners to interpret its intended meaning by considering context and social cues. Machine learning classification models have long had difficulty detecting sarcasm due to its social complexity and contradictory nature. This paper explores the applications of the Generative Pretrained Transformer (GPT) models, including GPT-3, InstructGPT, GPT-3.5, and GPT-4, in detecting sarcasm in natural language. It tests fine-tuned and zero-shot models of different sizes and releases. The GPT models were tested on the political and balanced (pol-bal) portion of the popular Self-Annotated Reddit Corpus (SARC 2.0) sarcasm dataset. In the fine-tuning case, the largest fine-tuned GPT-3 model achieves accuracy and $F_1$-score of 0.81, outperforming prior models. In the zero-shot case, one of GPT-4 models yields an accuracy of 0.70 and $F_1$-score of 0.75. Other models score lower. Additionally, a model's performance may improve or deteriorate with each release, highlighting the need to reassess performance after each release.
    HC-Ref: Hierarchical Constrained Refinement for Robust Adversarial Training of GNNs. (arXiv:2312.04879v1 [cs.LG])
    Recent studies have shown that attackers can catastrophically reduce the performance of GNNs by maliciously modifying the graph structure or node features on the graph. Adversarial training, which has been shown to be one of the most effective defense mechanisms against adversarial attacks in computer vision, holds great promise for enhancing the robustness of GNNs. There is limited research on defending against attacks by performing adversarial training on graphs, and it is crucial to delve deeper into this approach to optimize its effectiveness. Therefore, based on robust adversarial training on graphs, we propose a hierarchical constraint refinement framework (HC-Ref) that enhances the anti-perturbation capabilities of GNNs and downstream classifiers separately, ultimately leading to improved robustness. We propose corresponding adversarial regularization terms that are conducive to adaptively narrowing the domain gap between the normal part and the perturbation part according to the characteristics of different layers, promoting the smoothness of the predicted distribution of both parts. Moreover, existing research on graph robust adversarial training primarily concentrates on training from the standpoint of node feature perturbations and seldom takes into account alterations in the graph structure. This limitation makes it challenging to prevent attacks based on topological changes in the graph. This paper generates adversarial examples by utilizing graph structure perturbations, offering an effective approach to defend against attack methods that are based on topological changes. Extensive experiments on two real-world graph benchmarks show that HC-Ref successfully resists various attacks and has better node classification performance compared to several baseline methods.
    Differentiable Visual Computing for Inverse Problems and Machine Learning. (arXiv:2312.04574v1 [cs.LG])
    Originally designed for applications in computer graphics, visual computing (VC) methods synthesize information about physical and virtual worlds, using prescribed algorithms optimized for spatial computing. VC is used to analyze geometry, physically simulate solids, fluids, and other media, and render the world via optical techniques. These fine-tuned computations that operate explicitly on a given input solve so-called forward problems, VC excels at. By contrast, deep learning (DL) allows for the construction of general algorithmic models, side stepping the need for a purely first principles-based approach to problem solving. DL is powered by highly parameterized neural network architectures -- universal function approximators -- and gradient-based search algorithms which can efficiently search that large parameter space for optimal models. This approach is predicated by neural network differentiability, the requirement that analytic derivatives of a given problem's task metric can be computed with respect to neural network's parameters. Neural networks excel when an explicit model is not known, and neural network training solves an inverse problem in which a model is computed from data.
    Enhancing Polynomial Chaos Expansion Based Surrogate Modeling using a Novel Probabilistic Transfer Learning Strategy. (arXiv:2312.04648v1 [stat.ML])
    In the field of surrogate modeling, polynomial chaos expansion (PCE) allows practitioners to construct inexpensive yet accurate surrogates to be used in place of the expensive forward model simulations. For black-box simulations, non-intrusive PCE allows the construction of these surrogates using a set of simulation response evaluations. In this context, the PCE coefficients can be obtained using linear regression, which is also known as point collocation or stochastic response surfaces. Regression exhibits better scalability and can handle noisy function evaluations in contrast to other non-intrusive approaches, such as projection. However, since over-sampling is generally advisable for the linear regression approach, the simulation requirements become prohibitive for expensive forward models. We propose to leverage transfer learning whereby knowledge gained through similar PCE surrogate construction tasks (source domains) is transferred to a new surrogate-construction task (target domain) which has a limited number of forward model simulations (training data). The proposed transfer learning strategy determines how much, if any, information to transfer using new techniques inspired by Bayesian modeling and data assimilation. The strategy is scrutinized using numerical investigations and applied to an engineering problem from the oil and gas industry.
    Data-driven Semi-supervised Machine Learning with Surrogate Safety Measures for Abnormal Driving Behavior Detection. (arXiv:2312.04610v1 [cs.LG])
    Detecting abnormal driving behavior is critical for road traffic safety and the evaluation of drivers' behavior. With the advancement of machine learning (ML) algorithms and the accumulation of naturalistic driving data, many ML models have been adopted for abnormal driving behavior detection. Most existing ML-based detectors rely on (fully) supervised ML methods, which require substantial labeled data. However, ground truth labels are not always available in the real world, and labeling large amounts of data is tedious. Thus, there is a need to explore unsupervised or semi-supervised methods to make the anomaly detection process more feasible and efficient. To fill this research gap, this study analyzes large-scale real-world data revealing several abnormal driving behaviors (e.g., sudden acceleration, rapid lane-changing) and develops a Hierarchical Extreme Learning Machines (HELM) based semi-supervised ML method using partly labeled data to accurately detect the identified abnormal driving behaviors. Moreover, previous ML-based approaches predominantly utilize basic vehicle motion features (such as velocity and acceleration) to label and detect abnormal driving behaviors, while this study seeks to introduce Surrogate Safety Measures (SSMs) as the input features for ML models to improve the detection performance. Results from extensive experiments demonstrate the effectiveness of the proposed semi-supervised ML model with the introduced SSMs serving as important features. The proposed semi-supervised ML method outperforms other baseline semi-supervised or unsupervised methods regarding various metrics, e.g., delivering the best accuracy at 99.58% and the best F-1 measure at 0.9913. The ablation study further highlights the significance of SSMs for advancing detection performance.
    Haldane Bundles: A Dataset for Learning to Predict the Chern Number of Line Bundles on the Torus. (arXiv:2312.04600v1 [cond-mat.mes-hall])
    Characteristic classes, which are abstract topological invariants associated with vector bundles, have become an important notion in modern physics with surprising real-world consequences. As a representative example, the incredible properties of topological insulators, which are insulators in their bulk but conductors on their surface, can be completely characterized by a specific characteristic class associated with their electronic band structure, the first Chern class. Given their importance to next generation computing and the computational challenge of calculating them using first-principles approaches, there is a need to develop machine learning approaches to predict the characteristic classes associated with a material system. To aid in this program we introduce the {\emph{Haldane bundle dataset}}, which consists of synthetically generated complex line bundles on the $2$-torus. We envision this dataset, which is not as challenging as noisy and sparsely measured real-world datasets but (as we show) still difficult for off-the-shelf architectures, to be a testing ground for architectures that incorporate the rich topological and geometric priors underlying characteristic classes.
    Variational Classification. (arXiv:2305.10406v4 [cs.LG] UPDATED)
    We present a latent variable model for classification that provides a novel probabilistic interpretation of neural network softmax classifiers. We derive a variational objective to train the model, analogous to the evidence lower bound (ELBO) used to train variational auto-encoders, that generalises the cross-entropy loss used to train classification models. Treating inputs to the softmax layer as samples of a latent variable, our abstracted perspective reveals a potential inconsistency between their anticipated distribution, required for accurate label predictions, and the empirical distribution they follow in practice. We then devise a variational objective to mitigate such inconsistency and encourage a specified latent distribution, instead of the implicit assumption in off-the-shelf softmax classifiers. Overall, we provide new theoretical insight into the inner workings of widely-used softmax classification; and empirical evaluation on image and text classification datasets demonstrates that our proposed remedy, variational classification, maintains classification accuracy while the reshaped latent space improves other desirable classifier properties, such as calibration, adversarial robustness, robustness to distribution shift and sample efficiency useful in low data settings.
    The impact of heteroskedasticity on uplift modeling. (arXiv:2312.05234v1 [stat.ML])
    There are various applications, where companies need to decide to which individuals they should best allocate treatment. To support such decisions, uplift models are applied to predict treatment effects on an individual level. Based on the predicted treatment effects, individuals can be ranked and treatment allocation can be prioritized according to this ranking. An implicit assumption, which has not been doubted in the previous uplift modeling literature, is that this treatment prioritization approach tends to bring individuals with high treatment effects to the top and individuals with low treatment effects to the bottom of the ranking. In our research, we show that heteroskedastictity in the training data can cause a bias of the uplift model ranking: individuals with the highest treatment effects can get accumulated in large numbers at the bottom of the ranking. We explain theoretically how heteroskedasticity can bias the ranking of uplift models and show this process in a simulation and on real-world data. We argue that this problem of ranking bias due to heteroskedasticity might occur in many real-world applications and requires modification of the treatment prioritization to achieve an efficient treatment allocation.
    LLM4TDD: Best Practices for Test Driven Development Using Large Language Models. (arXiv:2312.04687v1 [cs.SE])
    In today's society, we are becoming increasingly dependent on software systems. However, we also constantly witness the negative impacts of buggy software. Program synthesis aims to improve software correctness by automatically generating the program given an outline of the expected behavior. For decades, program synthesis has been an active research field, with recent approaches looking to incorporate Large Language Models to help generate code. This paper explores the concept of LLM4TDD, where we guide Large Language Models to generate code iteratively using a test-driven development methodology. We conduct an empirical evaluation using ChatGPT and coding problems from LeetCode to investigate the impact of different test, prompt and problem attributes on the efficacy of LLM4TDD.
    Urban Region Representation Learning with Attentive Fusion. (arXiv:2312.04606v1 [cs.LG])
    An increasing number of related urban data sources have brought forth novel opportunities for learning urban region representations, i.e., embeddings. The embeddings describe latent features of urban regions and enable discovering similar regions for urban planning applications. Existing methods learn an embedding for a region using every different type of region feature data, and subsequently fuse all learned embeddings of a region to generate a unified region embedding. However, these studies often overlook the significance of the fusion process. The typical fusion methods rely on simple aggregation, such as summation and concatenation, thereby disregarding correlations within the fused region embeddings. To address this limitation, we propose a novel model named HAFusion. Our model is powered by a dual-feature attentive fusion module named DAFusion, which fuses embeddings from different region features to learn higher-order correlations between the regions as well as between the different types of region features. DAFusion is generic - it can be integrated into existing models to enhance their fusion process. Further, motivated by the effective fusion capability of an attentive module, we propose a hybrid attentive feature learning module named HALearning to enhance the embedding learning from each individual type of region features. Extensive experiments on three real-world datasets demonstrate that our model HAFusion outperforms state-of-the-art methods across three different prediction tasks. Using our learned region embedding leads to consistent and up to 31% improvements in the prediction accuracy.
    TetraDiffusion: Tetrahedral Diffusion Models for 3D Shape Generation. (arXiv:2211.13220v2 [cs.CV] UPDATED)
    Probabilistic denoising diffusion models (DDMs) have set a new standard for 2D image generation. Extending DDMs for 3D content creation is an active field of research. Here, we propose TetraDiffusion, a diffusion model that operates on a tetrahedral partitioning of 3D space to enable efficient, high-resolution 3D shape generation. Our model introduces operators for convolution and transpose convolution that act directly on the tetrahedral partition, and seamlessly includes additional attributes such as color. Remarkably, TetraDiffusion enables rapid sampling of detailed 3D objects in nearly real-time with unprecedented resolution. It's also adaptable for generating 3D shapes conditioned on 2D images. Compared to existing 3D mesh diffusion techniques, our method is up to 200 times faster in inference speed, works on standard consumer hardware, and delivers superior results.
    Efficient Large Language Models Fine-Tuning On Graphs. (arXiv:2312.04737v1 [cs.LG])
    Learning from Text-Attributed Graphs (TAGs) has attracted significant attention due to its wide range of real-world applications. The rapid evolution of large language models (LLMs) has revolutionized the way we process textual data, which indicates a strong potential to replace shallow text embedding generally used in Graph Neural Networks (GNNs). However, we find that existing LLM approaches that exploit text information in graphs suffer from inferior computation and data efficiency. In this work, we introduce a novel and efficient approach for the end-to-end fine-tuning of Large Language Models (LLMs) on TAGs, named LEADING. The proposed approach maintains computation cost and memory overhead comparable to the graph-less fine-tuning of LLMs. Moreover, it transfers the rick knowledge in LLMs to downstream graph learning tasks effectively with limited labeled data in semi-supervised learning. Its superior computation and data efficiency are demonstrated through comprehensive experiments, offering a promising solution for a wide range of LLMs and graph learning tasks on TAGs.
    Two-Timescale Q-Learning with Function Approximation in Zero-Sum Stochastic Games. (arXiv:2312.04905v1 [cs.LG])
    We consider two-player zero-sum stochastic games and propose a two-timescale $Q$-learning algorithm with function approximation that is payoff-based, convergent, rational, and symmetric between the two players. In two-timescale $Q$-learning, the fast-timescale iterates are updated in spirit to the stochastic gradient descent and the slow-timescale iterates (which we use to compute the policies) are updated by taking a convex combination between its previous iterate and the latest fast-timescale iterate. Introducing the slow timescale as well as its update equation marks as our main algorithmic novelty. In the special case of linear function approximation, we establish, to the best of our knowledge, the first last-iterate finite-sample bound for payoff-based independent learning dynamics of these types. The result implies a polynomial sample complexity to find a Nash equilibrium in such stochastic games. To establish the results, we model our proposed algorithm as a two-timescale stochastic approximation and derive the finite-sample bound through a Lyapunov-based approach. The key novelty lies in constructing a valid Lyapunov function to capture the evolution of the slow-timescale iterates. Specifically, through a change of variable, we show that the update equation of the slow-timescale iterates resembles the classical smoothed best-response dynamics, where the regularized Nash gap serves as a valid Lyapunov function. This insight enables us to construct a valid Lyapunov function via a generalized variant of the Moreau envelope of the regularized Nash gap. The construction of our Lyapunov function might be of broad independent interest in studying the behavior of stochastic approximation algorithms.
    Fourier neural operator for learning solutions to macroscopic traffic flow models: Application to the forward and inverse problems. (arXiv:2308.07051v2 [cs.LG] UPDATED)
    Deep learning methods are emerging as popular computational tools for solving forward and inverse problems in traffic flow. In this paper, we study a neural operator framework for learning solutions to nonlinear hyperbolic partial differential equations with applications in macroscopic traffic flow models. In this framework, an operator is trained to map heterogeneous and sparse traffic input data to the complete macroscopic traffic state in a supervised learning setting. We chose a physics-informed Fourier neural operator ($\pi$-FNO) as the operator, where an additional physics loss based on a discrete conservation law regularizes the problem during training to improve the shock predictions. We also propose to use training data generated from random piecewise constant input data to systematically capture the shock and rarefied solutions. From experiments using the LWR traffic flow model, we found superior accuracy in predicting the density dynamics of a ring-road network and urban signalized road. We also found that the operator can be trained using simple traffic density dynamics, e.g., consisting of $2-3$ vehicle queues and $1-2$ traffic signal cycles, and it can predict density dynamics for heterogeneous vehicle queue distributions and multiple traffic signal cycles $(\geq 2)$ with an acceptable error. The extrapolation error grew sub-linearly with input complexity for a proper choice of the model architecture and training data. Adding a physics regularizer aided in learning long-term traffic density dynamics, especially for problems with periodic boundary data.
    SA-Attack: Improving Adversarial Transferability of Vision-Language Pre-training Models via Self-Augmentation. (arXiv:2312.04913v1 [cs.CV])
    Current Visual-Language Pre-training (VLP) models are vulnerable to adversarial examples. These adversarial examples present substantial security risks to VLP models, as they can leverage inherent weaknesses in the models, resulting in incorrect predictions. In contrast to white-box adversarial attacks, transfer attacks (where the adversary crafts adversarial examples on a white-box model to fool another black-box model) are more reflective of real-world scenarios, thus making them more meaningful for research. By summarizing and analyzing existing research, we identified two factors that can influence the efficacy of transfer attacks on VLP models: inter-modal interaction and data diversity. Based on these insights, we propose a self-augment-based transfer attack method, termed SA-Attack. Specifically, during the generation of adversarial images and adversarial texts, we apply different data augmentation methods to the image modality and text modality, respectively, with the aim of improving the adversarial transferability of the generated adversarial images and texts. Experiments conducted on the FLickr30K and COCO datasets have validated the effectiveness of our method. Our code will be available after this paper is accepted.  ( 2 min )
    Max-Margin Token Selection in Attention Mechanism. (arXiv:2306.13596v4 [cs.LG] UPDATED)
    Attention mechanism is a central component of the transformer architecture which led to the phenomenal success of large language models. However, the theoretical principles underlying the attention mechanism are poorly understood, especially its nonconvex optimization dynamics. In this work, we explore the seminal softmax-attention model $f(\boldsymbol{X})=\langle \boldsymbol{Xv}, \texttt{softmax}(\boldsymbol{XWp})\rangle$, where $\boldsymbol{X}$ is the token sequence and $(\boldsymbol{v},\boldsymbol{W},\boldsymbol{p})$ are trainable parameters. We prove that running gradient descent on $\boldsymbol{p}$, or equivalently $\boldsymbol{W}$, converges in direction to a max-margin solution that separates $\textit{locally-optimal}$ tokens from non-optimal ones. This clearly formalizes attention as an optimal token selection mechanism. Remarkably, our results are applicable to general data and precisely characterize $\textit{optimality}$ of tokens in terms of the value embeddings $\boldsymbol{Xv}$ and problem geometry. We also provide a broader regularization path analysis that establishes the margin maximizing nature of attention even for nonlinear prediction heads. When optimizing $\boldsymbol{v}$ and $\boldsymbol{p}$ simultaneously with logistic loss, we identify conditions under which the regularization paths directionally converge to their respective hard-margin SVM solutions where $\boldsymbol{v}$ separates the input features based on their labels. Interestingly, the SVM formulation of $\boldsymbol{p}$ is influenced by the support vector geometry of $\boldsymbol{v}$. Finally, we verify our theoretical findings via numerical experiments and provide insights.  ( 3 min )
    Make Them Spill the Beans! Coercive Knowledge Extraction from (Production) LLMs. (arXiv:2312.04782v1 [cs.CR])
    Large Language Models (LLMs) are now widely used in various applications, making it crucial to align their ethical standards with human values. However, recent jail-breaking methods demonstrate that this alignment can be undermined using carefully constructed prompts. In our study, we reveal a new threat to LLM alignment when a bad actor has access to the model's output logits, a common feature in both open-source LLMs and many commercial LLM APIs (e.g., certain GPT models). It does not rely on crafting specific prompts. Instead, it exploits the fact that even when an LLM rejects a toxic request, a harmful response often hides deep in the output logits. By forcefully selecting lower-ranked output tokens during the auto-regressive generation process at a few critical output positions, we can compel the model to reveal these hidden responses. We term this process model interrogation. This approach differs from and outperforms jail-breaking methods, achieving 92% effectiveness compared to 62%, and is 10 to 20 times faster. The harmful content uncovered through our method is more relevant, complete, and clear. Additionally, it can complement jail-breaking strategies, with which results in further boosting attack performance. Our findings indicate that interrogation can extract toxic knowledge even from models specifically designed for coding tasks.  ( 2 min )
    Topology-Based Reconstruction Prevention for Decentralised Learning. (arXiv:2312.05248v1 [cs.CR])
    Decentralised learning has recently gained traction as an alternative to federated learning in which both data and coordination are distributed over its users. To preserve the confidentiality of users' data, decentralised learning relies on differential privacy, multi-party computation, or a combination thereof. However, running multiple privacy-preserving summations in sequence may allow adversaries to perform reconstruction attacks. Unfortunately, current reconstruction countermeasures either cannot trivially be adapted to the distributed setting, or add excessive amounts of noise. In this work, we first show that passive honest-but-curious adversaries can reconstruct other users' private data after several privacy-preserving summations. For example, in subgraphs with 18 users, we show that only three passive honest-but-curious adversaries succeed at reconstructing private data 11.0% of the time, requiring an average of 8.8 summations per adversary. The success rate is independent of the size of the full network. We consider weak adversaries, who do not control the graph topology and can exploit neither the workings of the summation protocol nor the specifics of users' data. We develop a mathematical understanding of how reconstruction relates to topology and propose the first topology-based decentralised defence against reconstruction attacks. Specifically, we show that reconstruction requires a number of adversaries linear in the length of the network's shortest cycle. Consequently, reconstructing private data from privacy-preserving summations is impossible in acyclic networks. Our work is a stepping stone for a formal theory of decentralised reconstruction defences based on topology. Such a theory would generalise our countermeasure beyond summation, define confidentiality in terms of entropy, and describe the effects of (topology-aware) differential privacy.
    When Does Optimizing a Proper Loss Yield Calibration?. (arXiv:2305.18764v2 [cs.LG] UPDATED)
    Optimizing proper loss functions is popularly believed to yield predictors with good calibration properties; the intuition being that for such losses, the global optimum is to predict the ground-truth probabilities, which is indeed calibrated. However, typical machine learning models are trained to approximately minimize loss over restricted families of predictors, that are unlikely to contain the ground truth. Under what circumstances does optimizing proper loss over a restricted family yield calibrated models? What precise calibration guarantees does it give? In this work, we provide a rigorous answer to these questions. We replace the global optimality with a local optimality condition stipulating that the (proper) loss of the predictor cannot be reduced much by post-processing its predictions with a certain family of Lipschitz functions. We show that any predictor with this local optimality satisfies smooth calibration as defined in Kakade-Foster (2008), B{\l}asiok et al. (2023). Local optimality is plausibly satisfied by well-trained DNNs, which suggests an explanation for why they are calibrated from proper loss minimization alone. Finally, we show that the connection between local optimality and calibration error goes both ways: nearly calibrated predictors are also nearly locally optimal.
    High Probability Guarantees for Random Reshuffling. (arXiv:2311.11841v2 [math.OC] UPDATED)
    We consider the stochastic gradient method with random reshuffling ($\mathsf{RR}$) for tackling smooth nonconvex optimization problems. $\mathsf{RR}$ finds broad applications in practice, notably in training neural networks. In this work, we first investigate the concentration property of $\mathsf{RR}$'s sampling procedure and establish a new high probability sample complexity guarantee for driving the gradient (without expectation) below $\varepsilon$, which effectively characterizes the efficiency of a single $\mathsf{RR}$ execution. Our derived complexity matches the best existing in-expectation one up to a logarithmic term while imposing no additional assumptions nor changing $\mathsf{RR}$'s updating rule. Furthermore, by leveraging our derived high probability descent property and bound on the stochastic error, we propose a simple and computable stopping criterion for $\mathsf{RR}$ (denoted as $\mathsf{RR}$-$\mathsf{sc}$). This criterion is guaranteed to be triggered after a finite number of iterations, and then $\mathsf{RR}$-$\mathsf{sc}$ returns an iterate with its gradient below $\varepsilon$ with high probability. Moreover, building on the proposed stopping criterion, we design a perturbed random reshuffling method ($\mathsf{p}$-$\mathsf{RR}$) that involves an additional randomized perturbation procedure near stationary points. We derive that $\mathsf{p}$-$\mathsf{RR}$ provably escapes strict saddle points and efficiently returns a second-order stationary point with high probability, without making any sub-Gaussian tail-type assumptions on the stochastic gradient errors. Finally, we conduct numerical experiments on neural network training to support our theoretical findings.
    Detecting Atomic Scale Surface Defects in STM of TMDs with Ensemble Deep Learning. (arXiv:2312.05160v1 [cond-mat.mtrl-sci])
    Atomic-scale defect detection is shown in scanning tunneling microscopy images of single crystal WSe2 using an ensemble of U-Net-like convolutional neural networks. Standard deep learning test metrics indicated good detection performance with an average F1 score of 0.66 and demonstrated ensemble generalization to C-AFM images of WSe2 and STM images of MoSe2. Defect coordinates were automatically extracted from defect detections maps showing that STM image analysis enhanced by machine learning can be used to dramatically increase sample characterization throughput.
    A Negative Result on Gradient Matching for Selective Backprop. (arXiv:2312.05021v1 [cs.LG])
    With increasing scale in model and dataset size, the training of deep neural networks becomes a massive computational burden. One approach to speed up the training process is Selective Backprop. For this approach, we perform a forward pass to obtain a loss value for each data point in a minibatch. The backward pass is then restricted to a subset of that minibatch, prioritizing high-loss examples. We build on this approach, but seek to improve the subset selection mechanism by choosing the (weighted) subset which best matches the mean gradient over the entire minibatch. We use the gradients w.r.t. the model's last layer as a cheap proxy, resulting in virtually no overhead in addition to the forward pass. At the same time, for our experiments we add a simple random selection baseline which has been absent from prior work. Surprisingly, we find that both the loss-based as well as the gradient-matching strategy fail to consistently outperform the random baseline.  ( 2 min )
    Continual learning for surface defect segmentation by subnetwork creation and selection. (arXiv:2312.05100v1 [cs.CV])
    We introduce a new continual (or lifelong) learning algorithm called LDA-CP&S that performs segmentation tasks without undergoing catastrophic forgetting. The method is applied to two different surface defect segmentation problems that are learned incrementally, i.e. providing data about one type of defect at a time, while still being capable of predicting every defect that was seen previously. Our method creates a defect-related subnetwork for each defect type via iterative pruning and trains a classifier based on linear discriminant analysis (LDA). At the inference stage, we first predict the defect type with LDA and then predict the surface defects using the selected subnetwork. We compare our method with other continual learning methods showing a significant improvement -- mean Intersection over Union better by a factor of two when compared to existing methods on both datasets. Importantly, our approach shows comparable results with joint training when all the training data (all defects) are seen simultaneously
    Kraken: enabling joint trajectory prediction by utilizing Mode Transformer and Greedy Mode Processing. (arXiv:2312.05144v1 [cs.RO])
    Accurate and reliable motion prediction is essential for safe urban autonomy. The most prominent motion prediction approaches are based on modeling the distribution of possible future trajectories of each actor in autonomous system's vicinity. These "independent" marginal predictions might be accurate enough to properly describe casual driving situations where the prediction target is not likely to interact with other actors. They are, however, inadequate for modeling interactive situations where the actors' future trajectories are likely to intersect. To mitigate this issue we propose Kraken -- a real-time trajectory prediction model capable of approximating pairwise interactions between the actors as well as producing accurate marginal predictions. Kraken relies on a simple Greedy Mode Processing technique allowing it to convert a factorized prediction for a pair of agents into a physically-plausible joint prediction. It also utilizes the Mode Transformer module to increase the diversity of predicted trajectories and make the joint prediction more informative. We evaluate Kraken on Waymo Motion Prediction challenge where it held the first place in the Interaction leaderboard and the second place in the Motion leaderboard in October 2021.
    Membership Inference Attacks on Diffusion Models via Quantile Regression. (arXiv:2312.05140v1 [cs.LG])
    Recently, diffusion models have become popular tools for image synthesis because of their high-quality outputs. However, like other large-scale models, they may leak private information about their training data. Here, we demonstrate a privacy vulnerability of diffusion models through a \emph{membership inference (MI) attack}, which aims to identify whether a target example belongs to the training set when given the trained diffusion model. Our proposed MI attack learns quantile regression models that predict (a quantile of) the distribution of reconstruction loss on examples not used in training. This allows us to define a granular hypothesis test for determining the membership of a point in the training set, based on thresholding the reconstruction loss of that point using a custom threshold tailored to the example. We also provide a simple bootstrap technique that takes a majority membership prediction over ``a bag of weak attackers'' which improves the accuracy over individual quantile regression models. We show that our attack outperforms the prior state-of-the-art attack while being substantially less computationally expensive -- prior attacks required training multiple ``shadow models'' with the same architecture as the model under attack, whereas our attack requires training only much smaller models.  ( 2 min )
    On the Inadequacy of Similarity-based Privacy Metrics: Reconstruction Attacks against "Truly Anonymous Synthetic Data''. (arXiv:2312.05114v1 [cs.CR])
    Training generative models to produce synthetic data is meant to provide a privacy-friendly approach to data release. However, we get robust guarantees only when models are trained to satisfy Differential Privacy (DP). Alas, this is not the standard in industry as many companies use ad-hoc strategies to empirically evaluate privacy based on the statistical similarity between synthetic and real data. In this paper, we review the privacy metrics offered by leading companies in this space and shed light on a few critical flaws in reasoning about privacy entirely via empirical evaluations. We analyze the undesirable properties of the most popular metrics and filters and demonstrate their unreliability and inconsistency through counter-examples. We then present a reconstruction attack, ReconSyn, which successfully recovers (i.e., leaks all attributes of) at least 78% of the low-density train records (or outliers) with only black-box access to a single fitted generative model and the privacy metrics. Finally, we show that applying DP only to the model or using low-utility generators does not mitigate ReconSyn as the privacy leakage predominantly comes from the metrics. Overall, our work serves as a warning to practitioners not to deviate from established privacy-preserving mechanisms.  ( 2 min )
    Canaries and Whistles: Resilient Drone Communication Networks with (or without) Deep Reinforcement Learning. (arXiv:2312.04940v1 [cs.CR])
    Communication networks able to withstand hostile environments are critically important for disaster relief operations. In this paper, we consider a challenging scenario where drones have been compromised in the supply chain, during their manufacture, and harbour malicious software capable of wide-ranging and infectious disruption. We investigate multi-agent deep reinforcement learning as a tool for learning defensive strategies that maximise communications bandwidth despite continual adversarial interference. Using a public challenge for learning network resilience strategies, we propose a state-of-the-art expert technique and study its superiority over deep reinforcement learning agents. Correspondingly, we identify three specific methods for improving the performance of our learning-based agents: (1) ensuring each observation contains the necessary information, (2) using expert agents to provide a curriculum for learning, and (3) paying close attention to reward. We apply our methods and present a new mixed strategy enabling expert and learning-based agents to work together and improve on all prior results.  ( 2 min )
    Beyond Transduction: A Survey on Inductive, Few Shot, and Zero Shot Link Prediction in Knowledge Graphs. (arXiv:2312.04997v1 [cs.AI])
    Knowledge graphs (KGs) comprise entities interconnected by relations of different semantic meanings. KGs are being used in a wide range of applications. However, they inherently suffer from incompleteness, i.e. entities or facts about entities are missing. Consequently, a larger body of works focuses on the completion of missing information in KGs, which is commonly referred to as link prediction (LP). This task has traditionally and extensively been studied in the transductive setting, where all entities and relations in the testing set are observed during training. Recently, several works have tackled the LP task under more challenging settings, where entities and relations in the test set may be unobserved during training, or appear in only a few facts. These works are known as inductive, few-shot, and zero-shot link prediction. In this work, we conduct a systematic review of existing works in this area. A thorough analysis leads us to point out the undesirable existence of diverging terminologies and task definitions for the aforementioned settings, which further limits the possibility of comparison between recent works. We consequently aim at dissecting each setting thoroughly, attempting to reveal its intrinsic characteristics. A unifying nomenclature is ultimately proposed to refer to each of them in a simple and consistent manner.  ( 2 min )
    Benchmarking and Analysis of Unsupervised Object Segmentation from Real-world Single Images. (arXiv:2312.04947v1 [cs.CV])
    In this paper, we study the problem of unsupervised object segmentation from single images. We do not introduce a new algorithm, but systematically investigate the effectiveness of existing unsupervised models on challenging real-world images. We first introduce seven complexity factors to quantitatively measure the distributions of background and foreground object biases in appearance and geometry for datasets with human annotations. With the aid of these factors, we empirically find that, not surprisingly, existing unsupervised models fail to segment generic objects in real-world images, although they can easily achieve excellent performance on numerous simple synthetic datasets, due to the vast gap in objectness biases between synthetic and real images. By conducting extensive experiments on multiple groups of ablated real-world datasets, we ultimately find that the key factors underlying the failure of existing unsupervised models on real-world images are the challenging distributions of background and foreground object biases in appearance and geometry. Because of this, the inductive biases introduced in existing unsupervised models can hardly capture the diverse object distributions. Our research results suggest that future work should exploit more explicit objectness biases in the network design.  ( 2 min )
    Towards Sample-specific Backdoor Attack with Clean Labels via Attribute Trigger. (arXiv:2312.04584v1 [cs.CR])
    Currently, sample-specific backdoor attacks (SSBAs) are the most advanced and malicious methods since they can easily circumvent most of the current backdoor defenses. In this paper, we reveal that SSBAs are not sufficiently stealthy due to their poisoned-label nature, where users can discover anomalies if they check the image-label relationship. In particular, we demonstrate that it is ineffective to directly generalize existing SSBAs to their clean-label variants by poisoning samples solely from the target class. We reveal that it is primarily due to two reasons, including \textbf{(1)} the `antagonistic effects' of ground-truth features and \textbf{(2)} the learning difficulty of sample-specific features. Accordingly, trigger-related features of existing SSBAs cannot be effectively learned under the clean-label setting due to their mild trigger intensity required for ensuring stealthiness. We argue that the intensity constraint of existing SSBAs is mostly because their trigger patterns are `content-irrelevant' and therefore act as `noises' for both humans and DNNs. Motivated by this understanding, we propose to exploit content-relevant features, $a.k.a.$ (human-relied) attributes, as the trigger patterns to design clean-label SSBAs. This new attack paradigm is dubbed backdoor attack with attribute trigger (BAAT). Extensive experiments are conducted on benchmark datasets, which verify the effectiveness of our BAAT and its resistance to existing defenses.  ( 2 min )
    Zoology: Measuring and Improving Recall in Efficient Language Models. (arXiv:2312.04927v1 [cs.CL])
    Attention-free language models that combine gating and convolutions are growing in popularity due to their efficiency and increasingly competitive performance. To better understand these architectures, we pretrain a suite of 17 attention and "gated-convolution" language models, finding that SoTA gated-convolution architectures still underperform attention by up to 2.1 perplexity points on the Pile. In fine-grained analysis, we find 82% of the gap is explained by each model's ability to recall information that is previously mentioned in-context, e.g. "Hakuna Matata means no worries Hakuna Matata it means no" $\rightarrow$ "??". On this task, termed "associative recall", we find that attention outperforms gated-convolutions by a large margin: a 70M parameter attention model outperforms a 1.4 billion parameter gated-convolution model on associative recall. This is surprising because prior work shows gated convolutions can perfectly solve synthetic tests for AR capability. To close the gap between synthetics and real language, we develop a new formalization of the task called multi-query associative recall (MQAR) that better reflects actual language. We perform an empirical and theoretical study of MQAR that elucidates differences in the parameter-efficiency of attention and gated-convolution recall. Informed by our analysis, we evaluate simple convolution-attention hybrids and show that hybrids with input-dependent sparse attention patterns can close 97.4% of the gap to attention, while maintaining sub-quadratic scaling. Our code is accessible at: https://github.com/HazyResearch/zoology.
    Multi-Frequency Joint Community Detection and Phase Synchronization. (arXiv:2206.12276v3 [cs.SI] UPDATED)
    This paper studies the joint community detection and phase synchronization problem on the \textit{stochastic block model with relative phase}, where each node is associated with an unknown phase angle. This problem, with a variety of real-world applications, aims to recover the cluster structure and associated phase angles simultaneously. We show this problem exhibits a \textit{``multi-frequency''} structure by closely examining its maximum likelihood estimation (MLE) formulation, whereas existing methods are not originated from this perspective. To this end, two simple yet efficient algorithms that leverage the MLE formulation and benefit from the information across multiple frequencies are proposed. The former is a spectral method based on the novel multi-frequency column-pivoted QR factorization. The factorization applied to the top eigenvectors of the observation matrix provides key information about the cluster structure and associated phase angles. The second approach is an iterative multi-frequency generalized power method, where each iteration updates the estimation in a matrix-multiplication-then-projection manner. Numerical experiments show that our proposed algorithms significantly improve the ability of exactly recovering the cluster structure and the accuracy of the estimated phase angles, compared to state-of-the-art algorithms.
    gcDLSeg: Integrating Graph-cut into Deep Learning for Binary Semantic Segmentation. (arXiv:2312.04713v1 [cs.CV])
    Binary semantic segmentation in computer vision is a fundamental problem. As a model-based segmentation method, the graph-cut approach was one of the most successful binary segmentation methods thanks to its global optimality guarantee of the solutions and its practical polynomial-time complexity. Recently, many deep learning (DL) based methods have been developed for this task and yielded remarkable performance, resulting in a paradigm shift in this field. To combine the strengths of both approaches, we propose in this study to integrate the graph-cut approach into a deep learning network for end-to-end learning. Unfortunately, backward propagation through the graph-cut module in the DL network is challenging due to the combinatorial nature of the graph-cut algorithm. To tackle this challenge, we propose a novel residual graph-cut loss and a quasi-residual connection, enabling the backward propagation of the gradients of the residual graph-cut loss for effective feature learning guided by the graph-cut segmentation model. In the inference phase, globally optimal segmentation is achieved with respect to the graph-cut energy defined on the optimized image features learned from DL networks. Experiments on the public AZH chronic wound data set and the pancreas cancer data set from the medical segmentation decathlon (MSD) demonstrated promising segmentation accuracy, and improved robustness against adversarial attacks.
    Backward Learning for Goal-Conditioned Policies. (arXiv:2312.05044v1 [cs.LG])
    Can we learn policies in reinforcement learning without rewards? Can we learn a policy just by trying to reach a goal state? We answer these questions positively by proposing a multi-step procedure that first learns a world model that goes backward in time, secondly generates goal-reaching backward trajectories, thirdly improves those sequences using shortest path finding algorithms, and finally trains a neural network policy by imitation learning. We evaluate our method on a deterministic maze environment where the observations are $64\times 64$ pixel bird's eye images and can show that it consistently reaches several goals.  ( 2 min )
    Fine-Grained Extraction of Road Networks via Joint Learning of Connectivity and Segmentation. (arXiv:2312.04744v1 [cs.CV])
    Road network extraction from satellite images is widely applicated in intelligent traffic management and autonomous driving fields. The high-resolution remote sensing images contain complex road areas and distracted background, which make it a challenge for road extraction. In this study, we present a stacked multitask network for end-to-end segmenting roads while preserving connectivity correctness. In the network, a global-aware module is introduced to enhance pixel-level road feature representation and eliminate background distraction from overhead images; a road-direction-related connectivity task is added to ensure that the network preserves the graph-level relationships of the road segments. We also develop a stacked multihead structure to jointly learn and effectively utilize the mutual information between connectivity learning and segmentation learning. We evaluate the performance of the proposed network on three public remote sensing datasets. The experimental results demonstrate that the network outperforms the state-of-the-art methods in terms of road segmentation accuracy and connectivity maintenance.  ( 2 min )
    Application of machine learning technique for a fast forecast of aggregation kinetics in space-inhomogeneous systems. (arXiv:2312.04660v1 [physics.comp-ph])
    Modeling of aggregation processes in space-inhomogeneous systems is extremely numerically challenging since complicated aggregation equations -- Smoluchowski equations are to be solved at each space point along with the computation of particle propagation. Low rank approximation for the aggregation kernels can significantly speed up the solution of Smoluchowski equations, while particle propagation could be done in parallel. Yet the simulations with many aggregate sizes remain quite resource-demanding. Here, we explore the way to reduce the amount of direct computations with the use of modern machine learning (ML) techniques. Namely, we propose to replace the actual numerical solution of the Smoluchowki equations with the respective density transformations learned with the application of the conditional normalising flow. We demonstrate that the ML predictions for the space distribution of aggregates and their size distribution requires drastically less computation time and agrees fairly well with the results of direct numerical simulations. Such an opportunity of a quick forecast of space-dependent particle size distribution could be important in practice, especially for the online prediction and visualisation of pollution processes, providing a tool with a reasonable tradeoff between the prediction accuracy and the computational time.
    Relational Deep Learning: Graph Representation Learning on Relational Databases. (arXiv:2312.04615v1 [cs.LG])
    Much of the world's most valued data is stored in relational databases and data warehouses, where the data is organized into many tables connected by primary-foreign key relations. However, building machine learning models using this data is both challenging and time consuming. The core problem is that no machine learning method is capable of learning on multiple tables interconnected by primary-foreign key relations. Current methods can only learn from a single table, so the data must first be manually joined and aggregated into a single training table, the process known as feature engineering. Feature engineering is slow, error prone and leads to suboptimal models. Here we introduce an end-to-end deep representation learning approach to directly learn on data laid out across multiple tables. We name our approach Relational Deep Learning (RDL). The core idea is to view relational databases as a temporal, heterogeneous graph, with a node for each row in each table, and edges specified by primary-foreign key links. Message Passing Graph Neural Networks can then automatically learn across the graph to extract representations that leverage all input data, without any manual feature engineering. Relational Deep Learning leads to more accurate models that can be built much faster. To facilitate research in this area, we develop RelBench, a set of benchmark datasets and an implementation of Relational Deep Learning. The data covers a wide spectrum, from discussions on Stack Exchange to book reviews on the Amazon Product Catalog. Overall, we define a new research area that generalizes graph machine learning and broadens its applicability to a wide set of AI use cases.
    Understanding Community Bias Amplification in Graph Representation Learning. (arXiv:2312.04883v1 [cs.LG])
    In this work, we discover a phenomenon of community bias amplification in graph representation learning, which refers to the exacerbation of performance bias between different classes by graph representation learning. We conduct an in-depth theoretical study of this phenomenon from a novel spectral perspective. Our analysis suggests that structural bias between communities results in varying local convergence speeds for node embeddings. This phenomenon leads to bias amplification in the classification results of downstream tasks. Based on the theoretical insights, we propose random graph coarsening, which is proved to be effective in dealing with the above issue. Finally, we propose a novel graph contrastive learning model called Random Graph Coarsening Contrastive Learning (RGCCL), which utilizes random coarsening as data augmentation and mitigates community bias by contrasting the coarsened graph with the original graph. Extensive experiments on various datasets demonstrate the advantage of our method when dealing with community bias amplification.  ( 2 min )
    MRI Scan Synthesis Methods based on Clustering and Pix2Pix. (arXiv:2312.05176v1 [eess.IV])
    We consider a missing data problem in the context of automatic segmentation methods for Magnetic Resonance Imaging (MRI) brain scans. Usually, automated MRI scan segmentation is based on multiple scans (e.g., T1-weighted, T2-weighted, T1CE, FLAIR). However, quite often a scan is blurry, missing or otherwise unusable. We investigate the question whether a missing scan can be synthesized. We exemplify that this is in principle possible by synthesizing a T2-weighted scan from a given T1-weighted scan. Our first aim is to compute a picture that resembles the missing scan closely, measured by average mean squared error (MSE). We develop/use several methods for this, including a random baseline approach, a clustering-based method and pixel-to-pixel translation method by (Pix2Pix) which is based on conditional GANs. The lowest MSE is achieved by our clustering-based method. Our second aim is to compare the methods with respect to the affect that using the synthesized scan has on the segmentation process. For this, we use a DeepMedic model trained with the four input scan modalities named above. We replace the T2-weighted scan by the synthesized picture and evaluate the segmentations with respect to the tumor identification, using Dice scores as numerical evaluation. The evaluation shows that the segmentation works well with synthesized scans (in particular, with Pix2Pix methods) in many cases.
    Feature Analysis of Encrypted Malicious Traffic. (arXiv:2312.04596v1 [cs.CR])
    In recent years there has been a dramatic increase in the number of malware attacks that use encrypted HTTP traffic for self-propagation or communication. Antivirus software and firewalls typically will not have access to encryption keys, and therefore direct detection of malicious encrypted data is unlikely to succeed. However, previous work has shown that traffic analysis can provide indications of malicious intent, even in cases where the underlying data remains encrypted. In this paper, we apply three machine learning techniques to the problem of distinguishing malicious encrypted HTTP traffic from benign encrypted traffic and obtain results comparable to previous work. We then consider the problem of feature analysis in some detail. Previous work has often relied on human expertise to determine the most useful and informative features in this problem domain. We demonstrate that such feature-related information can be obtained directly from machine learning models themselves. We argue that such a machine learning based approach to feature analysis is preferable, as it is more reliable, and we can, for example, uncover relatively unintuitive interactions between features.  ( 2 min )
    TOD-Flow: Modeling the Structure of Task-Oriented Dialogues. (arXiv:2312.04668v1 [cs.CL])
    Task-Oriented Dialogue (TOD) systems have become crucial components in interactive artificial intelligence applications. While recent advances have capitalized on pre-trained language models (PLMs), they exhibit limitations regarding transparency and controllability. To address these challenges, we propose a novel approach focusing on inferring the TOD-Flow graph from dialogue data annotated with dialog acts, uncovering the underlying task structure in the form of a graph. The inferred TOD-Flow graph can be easily integrated with any dialogue model to improve its prediction performance, transparency, and controllability. Our TOD-Flow graph learns what a model can, should, and should not predict, effectively reducing the search space and providing a rationale for the model's prediction. We show that the proposed TOD-Flow graph better resembles human-annotated graphs compared to prior approaches. Furthermore, when combined with several dialogue policies and end-to-end dialogue models, we demonstrate that our approach significantly improves dialog act classification and end-to-end response generation performance in the MultiWOZ and SGD benchmarks. Code available at: https://github.com/srsohn/TOD-Flow  ( 2 min )
    EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism. (arXiv:2312.04916v1 [cs.LG])
    We present EE-LLM, a framework for large-scale training and inference of early-exit large language models (LLMs). While recent works have shown preliminary evidence for the efficacy of early exiting in accelerating LLM inference, EE-LLM makes a foundational step towards scaling up early-exit LLMs by supporting their training and inference with massive 3D parallelism. Built upon Megatron-LM, EE-LLM implements a variety of algorithmic innovations and performance optimizations tailored to early exiting, including a lightweight method that facilitates backpropagation for the early-exit training objective with pipeline parallelism, techniques of leveraging idle resources in the original pipeline schedule for computation related to early-exit layers, and two approaches of early-exit inference that are compatible with KV caching for autoregressive generation. Our analytical and empirical study shows that EE-LLM achieves great training efficiency with negligible computational overhead compared to standard LLM training, as well as outstanding inference speedup without compromising output quality. To facilitate further research and adoption, we release EE-LLM at https://github.com/pan-x-c/EE-LLM.  ( 2 min )
    Optimizing Distributed Reinforcement Learning with Reactor Model and Lingua Franca. (arXiv:2312.04704v1 [cs.DC])
    Distributed Reinforcement Learning (RL) frameworks are essential for mapping RL workloads to multiple computational resources, allowing for faster generation of samples, estimation of values, and policy improvement. These computational paradigms require a seamless integration of training, serving, and simulation workloads. Existing frameworks, such as Ray, are not managing this orchestration efficiently. In this study, we've proposed a solution implementing Reactor Model, which enforces a set of actors to have a fixed communication pattern. This allows the scheduler to eliminate works needed for synchronization, such as acquiring and releasing locks for each actor or sending and processing coordination-related messages. Our framework, Lingua Franca (LF), a coordination language based on the Reactor Model, also provides a unified interface that allows users to automatically generate dataflow graphs for distributed RL. On average, LF outperformed Ray in generating samples from OpenAI Gym and Atari environments by 1.21x and 11.62x, reduced the average training time of synchronized parallel Q-learning by 31.2%, and accelerated Multi-Agent RL inference by 5.12x.
    Unnatural Algorithms in Machine Learning. (arXiv:2312.04739v1 [stat.ML])
    Natural gradient descent has a remarkable property that in the small learning rate limit, it displays an invariance with respect to network reparameterizations, leading to robust training behavior even for highly covariant network parameterizations. We show that optimization algorithms with this property can be viewed as discrete approximations of natural transformations from the functor determining an optimizer's state space from the diffeomorphism group if its configuration manifold, to the functor determining that state space's tangent bundle from this group. Algorithms with this property enjoy greater efficiency when used to train poorly parameterized networks, as the network evolution they generate is approximately invariant to network reparameterizations. More specifically, the flow generated by these algorithms in the limit as the learning rate vanishes is invariant under smooth reparameterizations, the respective flows of the parameters being determined by equivariant maps. By casting this property a natural transformation, we allow for generalizations beyond equivariance with respect to group actions; this framework can account for non-invertible maps such as projections, creating a framework for the direct comparison of training behavior across non-isomorphic network architectures, and the formal examination of limiting behavior as network size increases by considering inverse limits of these projections, should they exist. We introduce a simple method of introducing this naturality more generally and examine a number of popular machine learning training algorithms, finding that most are unnatural.
    Continuous Release of Data Streams under both Centralized and Local Differential Privacy. (arXiv:2005.11753v2 [cs.CR] UPDATED)
    In this paper, we study the problem of publishing a stream of real-valued data satisfying differential privacy (DP). One major challenge is that the maximal possible value can be quite large; thus it is necessary to estimate a threshold so that numbers above it are truncated to reduce the amount of noise that is required to all the data. The estimation must be done based on the data in a private fashion. We develop such a method that uses the Exponential Mechanism with a quality function that approximates well the utility goal while maintaining a low sensitivity. Given the threshold, we then propose a novel online hierarchical method and several post-processing techniques. Building on these ideas, we formalize the steps into a framework for private publishing of stream data. Our framework consists of three components: a threshold optimizer that privately estimates the threshold, a perturber that adds calibrated noises to the stream, and a smoother that improves the result using post-processing. Within our framework, we design an algorithm satisfying the more stringent setting of DP called local DP (LDP). To our knowledge, this is the first LDP algorithm for publishing streaming data. Using four real-world datasets, we demonstrate that our mechanism outperforms the state-of-the-art by a factor of 6-10 orders of magnitude in terms of utility (measured by the mean squared error of answering a random range query).  ( 3 min )
    SmartPlay: A Benchmark for LLMs as Intelligent Agents. (arXiv:2310.01557v3 [cs.LG] UPDATED)
    Recent large language models (LLMs) have demonstrated great potential toward intelligent agents and next-gen automation, but there currently lacks a systematic benchmark for evaluating LLMs' abilities as agents. We introduce SmartPlay: both a challenging benchmark and a methodology for evaluating LLMs as agents. SmartPlay consists of 6 different games, including Rock-Paper-Scissors, Tower of Hanoi, Minecraft. Each game features a unique setting, providing up to 20 evaluation settings and infinite environment variations. Each game in SmartPlay uniquely challenges a subset of 9 important capabilities of an intelligent LLM agent, including reasoning with object dependencies, planning ahead, spatial reasoning, learning from history, and understanding randomness. The distinction between the set of capabilities each game test allows us to analyze each capability separately. SmartPlay serves not only as a rigorous testing ground for evaluating the overall performance of LLM agents but also as a road-map for identifying gaps in current methodologies. We release our benchmark at github.com/microsoft/SmartPlay  ( 2 min )
    Evaluating The Accuracy of Classification Algorithms for Detecting Heart Disease Risk. (arXiv:2312.04595v1 [cs.LG])
    The healthcare industry generates enormous amounts of complex clinical data that make the prediction of disease detection a complicated process. In medical informatics, making effective and efficient decisions is very important. Data Mining (DM) techniques are mainly used to identify and extract hidden patterns and interesting knowledge to diagnose and predict diseases in medical datasets. Nowadays, heart disease is considered one of the most important problems in the healthcare field. Therefore, early diagnosis leads to a reduction in deaths. DM techniques have proven highly effective for predicting and diagnosing heart diseases. This work utilizes the classification algorithms with a medical dataset of heart disease; namely, J48, Random Forest, and Na\"ive Bayes to discover the accuracy of their performance. We also examine the impact of the feature selection method. A comparative and analysis study was performed to determine the best technique using Waikato Environment for Knowledge Analysis (Weka) software, version 3.8.6. The performance of the utilized algorithms was evaluated using standard metrics such as accuracy, sensitivity and specificity. The importance of using classification techniques for heart disease diagnosis has been highlighted. We also reduced the number of attributes in the dataset, which showed a significant improvement in prediction accuracy. The results indicate that the best algorithm for predicting heart disease was Random Forest with an accuracy of 99.24%.  ( 3 min )
    Toward Energy-Efficient Massive MIMO: Graph Neural Network Precoding for Mitigating Non-Linear PA Distortion. (arXiv:2312.04591v1 [cs.IT])
    Massive MIMO systems are typically designed assuming linear power amplifiers (PAs). However, PAs are most energy efficient close to saturation, where non-linear distortion arises. For conventional precoders, this distortion can coherently combine at user locations, limiting performance. We propose a graph neural network (GNN) to learn a mapping between channel and precoding matrices, which maximizes the sum rate affected by non-linear distortion, using a high-order polynomial PA model. In the distortion-limited regime, this GNN-based precoder outperforms zero forcing (ZF), ZF plus digital pre-distortion (DPD) and the distortion-aware beamforming (DAB) precoder from the state-of-the-art. At an input back-off of -3 dB the proposed precoder compared to ZF increases the sum rate by 8.60 and 8.84 bits/channel use for two and four users respectively. Radiation patterns show that these gains are achieved by transmitting the non-linear distortion in non-user directions. In the four user-case, for a fixed sum rate, the total consumed power (PA and processing) of the GNN precoder is 3.24 and 1.44 times lower compared to ZF and ZF plus DPD respectively. A complexity analysis shows six orders of magnitude reduction compared to DAB precoding. This opens perspectives to operate PAs closer to saturation, which drastically increases their energy efficiency.  ( 2 min )
    Reconciling AI Performance and Data Reconstruction Resilience for Medical Imaging. (arXiv:2312.04590v1 [cs.CR])
    Artificial Intelligence (AI) models are vulnerable to information leakage of their training data, which can be highly sensitive, for example in medical imaging. Privacy Enhancing Technologies (PETs), such as Differential Privacy (DP), aim to circumvent these susceptibilities. DP is the strongest possible protection for training models while bounding the risks of inferring the inclusion of training samples or reconstructing the original data. DP achieves this by setting a quantifiable privacy budget. Although a lower budget decreases the risk of information leakage, it typically also reduces the performance of such models. This imposes a trade-off between robust performance and stringent privacy. Additionally, the interpretation of a privacy budget remains abstract and challenging to contextualize. In this study, we contrast the performance of AI models at various privacy budgets against both, theoretical risk bounds and empirical success of reconstruction attacks. We show that using very large privacy budgets can render reconstruction attacks impossible, while drops in performance are negligible. We thus conclude that not using DP -- at all -- is negligent when applying AI models to sensitive data. We deem those results to lie a foundation for further debates on striking a balance between privacy risks and model performance.  ( 2 min )
    Investigating how ReLU-networks encode symmetries. (arXiv:2305.17017v2 [cs.LG] UPDATED)
    Many data symmetries can be described in terms of group equivariance and the most common way of encoding group equivariances in neural networks is by building linear layers that are group equivariant. In this work we investigate whether equivariance of a network implies that all layers are equivariant. On the theoretical side we find cases where equivariance implies layerwise equivariance, but also demonstrate that this is not the case generally. Nevertheless, we conjecture that CNNs that are trained to be equivariant will exhibit layerwise equivariance and explain how this conjecture is a weaker version of the recent permutation conjecture by Entezari et al. [2022]. We perform quantitative experiments with VGG-nets on CIFAR10 and qualitative experiments with ResNets on ImageNet to illustrate and support our theoretical findings. These experiments are not only of interest for understanding how group equivariance is encoded in ReLU-networks, but they also give a new perspective on Entezari et al.'s permutation conjecture as we find that it is typically easier to merge a network with a group-transformed version of itself than merging two different networks.  ( 2 min )
    Domain-Aware Continual Zero-Shot Learning. (arXiv:2112.12989v2 [cs.CV] UPDATED)
    Continual zero-shot learning involves learning seen classes incrementally while improving the ability to recognize unseen or yet-to-be-seen classes. It has a broad range of potential applications in real-world vision tasks, such as accelerating species discovery. However, in these scenarios, the changes in environmental conditions cause shifts in the presentation of captured images, which we refer to as domain shift, and adds complexity to the tasks. In this paper, we introduce Domain Aware Continual Zero-Shot Learning (DACZSL), a task that involves visually recognizing images of unseen categories in unseen domains continually. To address the challenges of DACZSL, we propose a Domain-Invariant Network (DIN). We empoly a dual network structure to learn factorized features to alleviate forgetting, where consists of a global shared net for domian-invirant and task-invariant features, and per-task private nets for task-specific features. Furthermore, we introduce a class-wise learnable prompt to obtain better class-level text representation, which enables zero-shot prediction of future unseen classes. To evaluate DACZSL, we introduce two benchmarks: DomainNet-CZSL and iWildCam-CZSL. Our results show that DIN significantly outperforms existing baselines and achieves a new state-of-the-art.
    Improving SCGAN's Similarity Constraint and Learning a Better Disentangled Representation. (arXiv:2310.12262v2 [cs.CV] UPDATED)
    SCGAN adds a similarity constraint between generated images and conditions as a regularization term on generative adversarial networks. Similarity constraint works as a tutor to instruct the generator network to comprehend the difference of representations based on conditions. We understand how SCGAN works on a deeper level. This understanding makes us realize that the similarity constraint functions like the contrastive loss function. We believe that a model with high understanding and intelligence measures the similarity between images based on their structure and high level features, just like humans do. Two major changes we applied to SCGAN in order to make a modified model are using SSIM to measure similarity between images and applying contrastive loss principles to the similarity constraint. The modified model performs better using FID and FactorVAE metrics. The modified model also has better generalisability compared to other models. Keywords Generative Adversarial Nets, Unsupervised Learning, Disentangled Representation Learning, Contrastive Disentanglement, SSIM  ( 2 min )
    Towards a Psychological Generalist AI: A Survey of Current Applications of Large Language Models and Future Prospects. (arXiv:2312.04578v1 [cs.AI])
    The complexity of psychological principles underscore a significant societal challenge, given the vast social implications of psychological problems. Bridging the gap between understanding these principles and their actual clinical and real-world applications demands rigorous exploration and adept implementation. In recent times, the swift advancement of highly adaptive and reusable artificial intelligence (AI) models has emerged as a promising way to unlock unprecedented capabilities in the realm of psychology. This paper emphasizes the importance of performance validation for these large-scale AI models, emphasizing the need to offer a comprehensive assessment of their verification from diverse perspectives. Moreover, we review the cutting-edge advancements and practical implementations of these expansive models in psychology, highlighting pivotal work spanning areas such as social media analytics, clinical nursing insights, vigilant community monitoring, and the nuanced exploration of psychological theories. Based on our review, we project an acceleration in the progress of psychological fields, driven by these large-scale AI models. These future generalist AI models harbor the potential to substantially curtail labor costs and alleviate social stress. However, this forward momentum will not be without its set of challenges, especially when considering the paradigm changes and upgrades required for medical instrumentation and related applications.  ( 2 min )
    BayesDAG: Gradient-Based Posterior Inference for Causal Discovery. (arXiv:2307.13917v2 [cs.LG] UPDATED)
    Bayesian causal discovery aims to infer the posterior distribution over causal models from observed data, quantifying epistemic uncertainty and benefiting downstream tasks. However, computational challenges arise due to joint inference over combinatorial space of Directed Acyclic Graphs (DAGs) and nonlinear functions. Despite recent progress towards efficient posterior inference over DAGs, existing methods are either limited to variational inference on node permutation matrices for linear causal models, leading to compromised inference accuracy, or continuous relaxation of adjacency matrices constrained by a DAG regularizer, which cannot ensure resulting graphs are DAGs. In this work, we introduce a scalable Bayesian causal discovery framework based on a combination of stochastic gradient Markov Chain Monte Carlo (SG-MCMC) and Variational Inference (VI) that overcomes these limitations. Our approach directly samples DAGs from the posterior without requiring any DAG regularization, simultaneously draws function parameter samples and is applicable to both linear and nonlinear causal models. To enable our approach, we derive a novel equivalence to the permutation-based DAG learning, which opens up possibilities of using any relaxed gradient estimator defined over permutations. To our knowledge, this is the first framework applying gradient-based MCMC sampling for causal discovery. Empirical evaluation on synthetic and real-world datasets demonstrate our approach's effectiveness compared to state-of-the-art baselines.  ( 2 min )
    KBFormer: A Diffusion Model for Structured Entity Completion. (arXiv:2312.05253v1 [cs.LG])
    We develop a generative attention-based approach to modeling structured entities comprising different property types, such as numerical, categorical, string, and composite. This approach handles such heterogeneous data through a mixed continuous-discrete diffusion process over the properties. Our flexible framework can model entities with arbitrary hierarchical properties, enabling applications to structured Knowledge Base (KB) entities and tabular data. Our approach obtains state-of-the-art performance on a majority of cases across 15 datasets. In addition, experiments with a device KB and a nuclear physics dataset demonstrate the model's ability to learn representations useful for entity completion in diverse settings. This has many downstream use cases, including modeling numerical properties with high accuracy - critical for science applications, which also benefit from the model's inherent probabilistic nature.  ( 2 min )
    Reverse Engineering Deep ReLU Networks An Optimization-based Algorithm. (arXiv:2312.04675v1 [cs.LG])
    Reverse engineering deep ReLU networks is a critical problem in understanding the complex behavior and interpretability of neural networks. In this research, we present a novel method for reconstructing deep ReLU networks by leveraging convex optimization techniques and a sampling-based approach. Our method begins by sampling points in the input space and querying the black box model to obtain the corresponding hyperplanes. We then define a convex optimization problem with carefully chosen constraints and conditions to guarantee its convexity. The objective function is designed to minimize the discrepancy between the reconstructed networks output and the target models output, subject to the constraints. We employ gradient descent to optimize the objective function, incorporating L1 or L2 regularization as needed to encourage sparse or smooth solutions. Our research contributes to the growing body of work on reverse engineering deep ReLU networks and paves the way for new advancements in neural network interpretability and security.  ( 2 min )
    TENPLEX: Changing Resources of Deep Learning Jobs using Parallelizable Tensor Collections. (arXiv:2312.05181v1 [cs.DC])
    Deep learning (DL) jobs use multi-dimensional parallelism, i.e they combine data, model, and pipeline parallelism, to use large GPU clusters efficiently. This couples jobs tightly to a set of GPU devices, but jobs may experience changes to the device allocation: (i) resource elasticity during training adds or removes devices; (ii) hardware maintenance may require redeployment on different devices; and (iii) device failures force jobs to run with fewer devices. Current DL frameworks lack support for these scenarios, as they cannot change the multi-dimensional parallelism of an already-running job in an efficient and model-independent way. We describe Tenplex, a state management library for DL frameworks that enables jobs to change the GPU allocation and job parallelism at runtime. Tenplex achieves this by externalizing the DL job state during training as a parallelizable tensor collection (PTC). When the GPU allocation for the DL job changes, Tenplex uses the PTC to transform the DL job state: for the dataset state, Tenplex repartitions it under data parallelism and exposes it to workers through a virtual file system; for the model state, Tenplex obtains it as partitioned checkpoints and transforms them to reflect the new parallelization configuration. For efficiency, these PTC transformations are executed in parallel with a minimum amount of data movement between devices and workers. Our experiments show that Tenplex enables DL jobs to support dynamic parallelization with low overhead.  ( 2 min )
    Diffence: Fencing Membership Privacy With Diffusion Models. (arXiv:2312.04692v1 [cs.CR])
    Deep learning models, while achieving remarkable performance across various tasks, are vulnerable to member inference attacks, wherein adversaries identify if a specific data point was part of a model's training set. This susceptibility raises substantial privacy concerns, especially when models are trained on sensitive datasets. Current defense methods often struggle to provide robust protection without hurting model utility, and they often require retraining the model or using extra data. In this work, we introduce a novel defense framework against membership attacks by leveraging generative models. The key intuition of our defense is to remove the differences between member and non-member inputs which can be used to perform membership attacks, by re-generating input samples before feeding them to the target model. Therefore, our defense works \emph{pre-inference}, which is unlike prior defenses that are either training-time (modify the model) or post-inference time (modify the model's output). A unique feature of our defense is that it works on input samples only, without modifying the training or inference phase of the target model. Therefore, it can be cascaded with other defense mechanisms as we demonstrate through experiments. Through extensive experimentation, we show that our approach can serve as a robust plug-n-play defense mechanism, enhancing membership privacy without compromising model utility in both baseline and defended settings. For example, our method enhanced the effectiveness of recent state-of-the-art defenses, reducing attack accuracy by an average of 5.7\% to 12.4\% across three datasets, without any impact on the model's accuracy. By integrating our method with prior defenses, we achieve new state-of-the-art performance in the privacy-utility trade-off.  ( 3 min )
    Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM. (arXiv:2309.14348v2 [cs.CL] UPDATED)
    Recently, Large Language Models (LLMs) have made significant advancements and are now widely used across various domains. Unfortunately, there has been a rising concern that LLMs can be misused to generate harmful or malicious content. Though a line of research has focused on aligning LLMs with human values and preventing them from producing inappropriate content, such alignments are usually vulnerable and can be bypassed by alignment-breaking attacks via adversarially optimized or handcrafted jailbreaking prompts. In this work, we introduce a Robustly Aligned LLM (RA-LLM) to defend against potential alignment-breaking attacks. RA-LLM can be directly constructed upon an existing aligned LLM with a robust alignment checking function, without requiring any expensive retraining or fine-tuning process of the original LLM. Furthermore, we also provide a theoretical analysis for RA-LLM to verify its effectiveness in defending against alignment-breaking attacks. Through real-world experiments on open-source large language models, we demonstrate that RA-LLM can successfully defend against both state-of-the-art adversarial prompts and popular handcrafted jailbreaking prompts by reducing their attack success rates from nearly 100% to around 10% or less.  ( 2 min )
    The Graph Lottery Ticket Hypothesis: Finding Sparse, Informative Graph Structure. (arXiv:2312.04762v1 [cs.LG])
    Graph learning methods help utilize implicit relationships among data items, thereby reducing training label requirements and improving task performance. However, determining the optimal graph structure for a particular learning task remains a challenging research problem. In this work, we introduce the Graph Lottery Ticket (GLT) Hypothesis - that there is an extremely sparse backbone for every graph, and that graph learning algorithms attain comparable performance when trained on that subgraph as on the full graph. We identify and systematically study 8 key metrics of interest that directly influence the performance of graph learning algorithms. Subsequently, we define the notion of a "winning ticket" for graph structure - an extremely sparse subset of edges that can deliver a robust approximation of the entire graph's performance. We propose a straightforward and efficient algorithm for finding these GLTs in arbitrary graphs. Empirically, we observe that performance of different graph learning algorithms can be matched or even exceeded on graphs with the average degree as low as 5.  ( 2 min )
    StructComp: Substituting propagation with Structural Compression in Training Graph Contrastive Learning. (arXiv:2312.04865v1 [cs.LG])
    Graph contrastive learning (GCL) has become a powerful tool for learning graph data, but its scalability remains a significant challenge. In this work, we propose a simple yet effective training framework called Structural Compression (StructComp) to address this issue. Inspired by a sparse low-rank approximation on the diffusion matrix, StructComp trains the encoder with the compressed nodes. This allows the encoder not to perform any message passing during the training stage, and significantly reduces the number of sample pairs in the contrastive loss. We theoretically prove that the original GCL loss can be approximated with the contrastive loss computed by StructComp. Moreover, StructComp can be regarded as an additional regularization term for GCL models, resulting in a more robust encoder. Empirical studies on seven benchmark datasets show that StructComp greatly reduces the time and memory consumption while improving model performance compared to the vanilla GCL models and scalable training methods.
    Neural Eigenfunctions Are Structured Representation Learners. (arXiv:2210.12637v3 [cs.LG] UPDATED)
    This paper introduces a structured, adaptive-length deep representation called Neural Eigenmap. Unlike prior spectral methods such as Laplacian Eigenmap that operate in a nonparametric manner, Neural Eigenmap leverages NeuralEF to parametrically model eigenfunctions using a neural network. We show that, when the eigenfunction is derived from positive relations in a data augmentation setup, applying NeuralEF results in an objective function that resembles those of popular self-supervised learning methods, with an additional symmetry-breaking property that leads to \emph{structured} representations where features are ordered by importance. We demonstrate using such representations as adaptive-length codes in image retrieval systems. By truncation according to feature importance, our method requires up to $16\times$ shorter representation length than leading self-supervised learning ones to achieve similar retrieval performance. We further apply our method to graph data and report strong results on a node representation learning benchmark with more than one million nodes.  ( 2 min )
    ECSIC: Epipolar Cross Attention for Stereo Image Compression. (arXiv:2307.10284v2 [eess.IV] UPDATED)
    In this paper, we present ECSIC, a novel learned method for stereo image compression. Our proposed method compresses the left and right images in a joint manner by exploiting the mutual information between the images of the stereo image pair using a novel stereo cross attention (SCA) module and two stereo context modules. The SCA module performs cross-attention restricted to the corresponding epipolar lines of the two images and processes them in parallel. The stereo context modules improve the entropy estimation of the second encoded image by using the first image as a context. We conduct an extensive ablation study demonstrating the effectiveness of the proposed modules and a comprehensive quantitative and qualitative comparison with existing methods. ECSIC achieves state-of-the-art performance in stereo image compression on the two popular stereo image datasets Cityscapes and InStereo2k while allowing for fast encoding and decoding.  ( 2 min )
    Temporal graph models fail to capture global temporal dynamics. (arXiv:2309.15730v3 [cs.IR] UPDATED)
    A recently released Temporal Graph Benchmark is analyzed in the context of Dynamic Link Property Prediction. We outline our observations and propose a trivial optimization-free baseline of "recently popular nodes" outperforming other methods on medium and large-size datasets in the Temporal Graph Benchmark. We propose two measures based on Wasserstein distance which can quantify the strength of short-term and long-term global dynamics of datasets. By analyzing our unexpectedly strong baseline, we show how standard negative sampling evaluation can be unsuitable for datasets with strong temporal dynamics. We also show how simple negative-sampling can lead to model degeneration during training, resulting in impossible to rank, fully saturated predictions of temporal graph networks. We propose improved negative sampling schemes for both training and evaluation and prove their usefulness. We conduct a comparison with a model trained non-contrastively without negative sampling. Our results provide a challenging baseline and indicate that temporal graph network architectures need deep rethinking for usage in problems with significant global dynamics, such as social media, cryptocurrency markets or e-commerce. We open-source the code for baselines, measures and proposed negative sampling schemes.  ( 2 min )
    FedGeo: Privacy-Preserving User Next Location Prediction with Federated Learning. (arXiv:2312.04594v1 [cs.CR])
    A User Next Location Prediction (UNLP) task, which predicts the next location that a user will move to given his/her trajectory, is an indispensable task for a wide range of applications. Previous studies using large-scale trajectory datasets in a single server have achieved remarkable performance in UNLP task. However, in real-world applications, legal and ethical issues have been raised regarding privacy concerns leading to restrictions against sharing human trajectory datasets to any other server. In response, Federated Learning (FL) has emerged to address the personal privacy issue by collaboratively training multiple clients (i.e., users) and then aggregating them. While previous studies employed FL for UNLP, they are still unable to achieve reliable performance because of the heterogeneity of clients' mobility. To tackle this problem, we propose the Federated Learning for Geographic Information (FedGeo), a FL framework specialized for UNLP, which alleviates the heterogeneity of clients' mobility and guarantees personal privacy protection. Firstly, we incorporate prior global geographic adjacency information to the local client model, since the spatial correlation between locations is trained partially in each client who has only a heterogeneous subset of the overall trajectories in FL. We also introduce a novel aggregation method that minimizes the gap between client models to solve the problem of client drift caused by differences between client models when learning with their heterogeneous data. Lastly, we probabilistically exclude clients with extremely heterogeneous data from the FL process by focusing on clients who visit relatively diverse locations. We show that FedGeo is superior to other FL methods for model performance in UNLP task. We also validated our model in a real-world application using our own customers' mobile phones and the FL agent system.  ( 3 min )
    TrustFed: A Reliable Federated Learning Framework with Malicious-Attack Resistance. (arXiv:2312.04597v1 [cs.CR])
    As a key technology in 6G research, federated learning (FL) enables collaborative learning among multiple clients while ensuring individual data privacy. However, malicious attackers among the participating clients can intentionally tamper with the training data or the trained model, compromising the accuracy and trustworthiness of the system. To address this issue, in this paper, we propose a hierarchical audit-based FL (HiAudit-FL) framework, with the aim to enhance the reliability and security of the learning process. The hierarchical audit process includes two stages, namely model-audit and parameter-audit. In the model-audit stage, a low-overhead audit method is employed to identify suspicious clients. Subsequently, in the parameter-audit stage, a resource-consuming method is used to detect all malicious clients with higher accuracy among the suspicious ones. Specifically, we execute the model audit method among partial clients for multiple rounds, which is modeled as a partial observation Markov decision process (POMDP) with the aim to enhance the robustness and accountability of the decision-making in complex and uncertain environments. Meanwhile, we formulate the problem of identifying malicious attackers through a multi-round audit as an active sequential hypothesis testing problem and leverage a diffusion model-based AI-Enabled audit selection strategy (ASS) to decide which clients should be audited in each round. To accomplish efficient and effective audit selection, we design a DRL-ASS algorithm by incorporating the ASS in a deep reinforcement learning (DRL) framework. Our simulation results demonstrate that HiAudit-FL can effectively identify and handle potential malicious users accurately, with small system overhead.  ( 3 min )
    Estimating Fr\'echet bounds for validating programmatic weak supervision. (arXiv:2312.04601v1 [stat.ML])
    We develop methods for estimating Fr\'echet bounds on (possibly high-dimensional) distribution classes in which some variables are continuous-valued. We establish the statistical correctness of the computed bounds under uncertainty in the marginal constraints and demonstrate the usefulness of our algorithms by evaluating the performance of machine learning (ML) models trained with programmatic weak supervision (PWS). PWS is a framework for principled learning from weak supervision inputs (e.g., crowdsourced labels, knowledge bases, pre-trained models on related tasks, etc), and it has achieved remarkable success in many areas of science and engineering. Unfortunately, it is generally difficult to validate the performance of ML models trained with PWS due to the absence of labeled data. Our algorithms address this issue by estimating sharp lower and upper bounds for performance metrics such as accuracy/recall/precision/F1 score.  ( 2 min )
    FedBayes: A Zero-Trust Federated Learning Aggregation to Defend Against Adversarial Attacks. (arXiv:2312.04587v1 [cs.CR])
    Federated learning has created a decentralized method to train a machine learning model without needing direct access to client data. The main goal of a federated learning architecture is to protect the privacy of each client while still contributing to the training of the global model. However, the main advantage of privacy in federated learning is also the easiest aspect to exploit. Without being able to see the clients' data, it is difficult to determine the quality of the data. By utilizing data poisoning methods, such as backdoor or label-flipping attacks, or by sending manipulated information about their data back to the server, malicious clients are able to corrupt the global model and degrade performance across all clients within a federation. Our novel aggregation method, FedBayes, mitigates the effect of a malicious client by calculating the probabilities of a client's model weights given to the prior model's weights using Bayesian statistics. Our results show that this approach negates the effects of malicious clients and protects the overall federation.  ( 2 min )
    The Open Review-Based (ORB) dataset: Towards Automatic Assessment of Scientific Papers and Experiment Proposals in High-Energy Physics. (arXiv:2312.04576v1 [cs.DL])
    With the Open Science approach becoming important for research, the evolution towards open scientific-paper reviews is making an impact on the scientific community. However, there is a lack of publicly available resources for conducting research activities related to this subject, as only a limited number of journals and conferences currently allow access to their review process for interested parties. In this paper, we introduce the new comprehensive Open Review-Based dataset (ORB); it includes a curated list of more than 36,000 scientific papers with their more than 89,000 reviews and final decisions. We gather this information from two sources: the OpenReview.net and SciPost.org websites. However, given the volatile nature of this domain, the software infrastructure that we introduce to supplement the ORB dataset is designed to accommodate additional resources in the future. The ORB deliverables include (1) Python code (interfaces and implementations) to translate document data and metadata into a structured and high-level representation, (2) an ETL process (Extract, Transform, Load) to facilitate the automatic updates from defined sources and (3) data files representing the structured data. The paper presents our data architecture and an overview of the collected data along with relevant statistics. For illustration purposes, we also discuss preliminary Natural-Language-Processing-based experiments that aim to predict (1) papers' acceptance based on their textual embeddings, and (2) grading statistics inferred from embeddings as well. We believe ORB provides a valuable resource for researchers interested in open science and review, with our implementation easing the use of this data for further analysis and experimentation. We plan to update ORB as the field matures as well as introduce new resources even more fitted to dedicated scientific domains such as High-Energy Physics.  ( 3 min )
    Transferable Candidate Proposal with Bounded Uncertainty. (arXiv:2312.04604v1 [cs.LG])
    From an empirical perspective, the subset chosen through active learning cannot guarantee an advantage over random sampling when transferred to another model. While it underscores the significance of verifying transferability, experimental design from previous works often neglected that the informativeness of a data subset can change over model configurations. To tackle this issue, we introduce a new experimental design, coined as Candidate Proposal, to find transferable data candidates from which active learning algorithms choose the informative subset. Correspondingly, a data selection algorithm is proposed, namely Transferable candidate proposal with Bounded Uncertainty (TBU), which constrains the pool of transferable data candidates by filtering out the presumably redundant data points based on uncertainty estimation. We verified the validity of TBU in image classification benchmarks, including CIFAR-10/100 and SVHN. When transferred to different model configurations, TBU consistency improves performance in existing active learning algorithms. Our code is available at https://github.com/gokyeongryeol/TBU.  ( 2 min )
  • Open

    Predicting Census Survey Response Rates With Parsimonious Additive Models and Structured Interactions. (arXiv:2108.11328v4 [stat.ML] UPDATED)
    In this paper we consider the problem of predicting survey response rates using a family of flexible and interpretable nonparametric models. The study is motivated by the US Census Bureau's well-known ROAM application which uses a linear regression model trained on the US Census Planning Database data to identify hard-to-survey areas. A crowdsourcing competition (Erdman and Bates, 2016) organized around ten years ago revealed that machine learning methods based on ensembles of regression trees led to the best performance in predicting survey response rates; however, the corresponding models could not be adopted for the intended application due to their black-box nature. We consider nonparametric additive models with small number of main and pairwise interaction effects using $\ell_0$-based penalization. From a methodological viewpoint, we study both computational and statistical aspects of our estimator; and discuss variants that incorporate strong hierarchical interactions. Our algorithms (opensourced on github) extend the computational frontiers of existing algorithms for sparse additive models, to be able to handle datasets relevant for the application we consider. We discuss and interpret findings from our model on the US Census Planning Database. In addition to being useful from an interpretability standpoint, our models lead to predictions that appear to be better than popular black-box machine learning methods based on gradient boosting and feedforward neural networks - suggesting that it is possible to have models that have the best of both worlds: good model accuracy and interpretability.  ( 3 min )
    Depthwise Hyperparameter Transfer in Residual Networks: Dynamics and Scaling Limit. (arXiv:2309.16620v2 [stat.ML] UPDATED)
    The cost of hyperparameter tuning in deep learning has been rising with model sizes, prompting practitioners to find new tuning methods using a proxy of smaller networks. One such proposal uses $\mu$P parameterized networks, where the optimal hyperparameters for small width networks transfer to networks with arbitrarily large width. However, in this scheme, hyperparameters do not transfer across depths. As a remedy, we study residual networks with a residual branch scale of $1/\sqrt{\text{depth}}$ in combination with the $\mu$P parameterization. We provide experiments demonstrating that residual architectures including convolutional ResNets and Vision Transformers trained with this parameterization exhibit transfer of optimal hyperparameters across width and depth on CIFAR-10 and ImageNet. Furthermore, our empirical findings are supported and motivated by theory. Using recent developments in the dynamical mean field theory (DMFT) description of neural network learning dynamics, we show that this parameterization of ResNets admits a well-defined feature learning joint infinite-width and infinite-depth limit and show convergence of finite-size network dynamics towards this limit.  ( 2 min )
    TaskMet: Task-Driven Metric Learning for Model Learning. (arXiv:2312.05250v1 [cs.LG])
    Deep learning models are often deployed in downstream tasks that the training procedure may not be aware of. For example, models solely trained to achieve accurate predictions may struggle to perform well on downstream tasks because seemingly small prediction errors may incur drastic task errors. The standard end-to-end learning approach is to make the task loss differentiable or to introduce a differentiable surrogate that the model can be trained on. In these settings, the task loss needs to be carefully balanced with the prediction loss because they may have conflicting objectives. We propose take the task loss signal one level deeper than the parameters of the model and use it to learn the parameters of the loss function the model is trained on, which can be done by learning a metric in the prediction space. This approach does not alter the optimal prediction model itself, but rather changes the model learning to emphasize the information important for the downstream task. This enables us to achieve the best of both worlds: a prediction model trained in the original prediction space while also being valuable for the desired downstream task. We validate our approach through experiments conducted in two main settings: 1) decision-focused model learning scenarios involving portfolio optimization and budget allocation, and 2) reinforcement learning in noisy environments with distracting states. The source code to reproduce our experiments is available at https://github.com/facebookresearch/taskmet  ( 2 min )
    Construction of Hierarchical Neural Architecture Search Spaces based on Context-free Grammars. (arXiv:2211.01842v3 [cs.LG] UPDATED)
    The discovery of neural architectures from simple building blocks is a long-standing goal of Neural Architecture Search (NAS). Hierarchical search spaces are a promising step towards this goal but lack a unifying search space design framework and typically only search over some limited aspect of architectures. In this work, we introduce a unifying search space design framework based on context-free grammars that can naturally and compactly generate expressive hierarchical search spaces that are 100s of orders of magnitude larger than common spaces from the literature. By enhancing and using their properties, we effectively enable search over the complete architecture and can foster regularity. Further, we propose an efficient hierarchical kernel design for a Bayesian Optimization search strategy to efficiently search over such huge spaces. We demonstrate the versatility of our search space design framework and show that our search strategy can be superior to existing NAS approaches. Code is available at https://github.com/automl/hierarchical_nas_construction.  ( 2 min )
    Causal normalizing flows: from theory to practice. (arXiv:2306.05415v2 [cs.LG] UPDATED)
    In this work, we deepen on the use of normalizing flows for causal reasoning. Specifically, we first leverage recent results on non-linear ICA to show that causal models are identifiable from observational data given a causal ordering, and thus can be recovered using autoregressive normalizing flows (NFs). Second, we analyze different design and learning choices for causal normalizing flows to capture the underlying causal data-generating process. Third, we describe how to implement the do-operator in causal NFs, and thus, how to answer interventional and counterfactual questions. Finally, in our experiments, we validate our design and training choices through a comprehensive ablation study; compare causal NFs to other approaches for approximating causal models; and empirically demonstrate that causal NFs can be used to address real-world problems, where the presence of mixed discrete-continuous data and partial knowledge on the causal graph is the norm. The code for this work can be found at https://github.com/psanch21/causal-flows.  ( 2 min )
    Loss Minimization Yields Multicalibration for Large Neural Networks. (arXiv:2304.09424v2 [cs.LG] UPDATED)
    Multicalibration is a notion of fairness for predictors that requires them to provide calibrated predictions across a large set of protected groups. Multicalibration is known to be a distinct goal than loss minimization, even for simple predictors such as linear functions. In this work, we consider the setting where the protected groups can be represented by neural networks of size $k$, and the predictors are neural networks of size $n > k$. We show that minimizing the squared loss over all neural nets of size $n$ implies multicalibration for all but a bounded number of unlucky values of $n$. We also give evidence that our bound on the number of unlucky values is tight, given our proof technique. Previously, results of the flavor that loss minimization yields multicalibration were known only for predictors that were near the ground truth, hence were rather limited in applicability. Unlike these, our results rely on the expressivity of neural nets and utilize the representation of the predictor.  ( 2 min )
    Uncertainty Quantification and Propagation in Surrogate-based Bayesian Inference. (arXiv:2312.05153v1 [stat.ML])
    Surrogate models are statistical or conceptual approximations for more complex simulation models. In this context, it is crucial to propagate the uncertainty induced by limited simulation budget and surrogate approximation error to predictions, inference, and subsequent decision-relevant quantities. However, quantifying and then propagating the uncertainty of surrogates is usually limited to special analytic cases or is otherwise computationally very expensive. In this paper, we propose a framework enabling a scalable, Bayesian approach to surrogate modeling with thorough uncertainty quantification, propagation, and validation. Specifically, we present three methods for Bayesian inference with surrogate models given measurement data. This is a task where the propagation of surrogate uncertainty is especially relevant, because failing to account for it may lead to biased and/or overconfident estimates of the parameters of interest. We showcase our approach in two detailed case studies for both linear and nonlinear modeling scenarios. Uncertainty propagation in surrogate models enables more reliable and safe approximation of expensive simulators and will therefore be useful in various fields of applications.  ( 2 min )
    Bayesian data fusion with shared priors. (arXiv:2212.07311v2 [cs.LG] UPDATED)
    The integration of data and knowledge from several sources is known as data fusion. When data is only available in a distributed fashion or when different sensors are used to infer a quantity of interest, data fusion becomes essential. In Bayesian settings, a priori information of the unknown quantities is available and, possibly, present among the different distributed estimators. When the local estimates are fused, the prior knowledge used to construct several local posteriors might be overused unless the fusion node accounts for this and corrects it. In this paper, we analyze the effects of shared priors in Bayesian data fusion contexts. Depending on different common fusion rules, our analysis helps to understand the performance behavior as a function of the number of collaborative agents and as a consequence of different types of priors. The analysis is performed by using two divergences which are common in Bayesian inference, and the generality of the results allows to analyze very generic distributions. These theoretical results are corroborated through experiments in a variety of estimation and classification problems, including linear and nonlinear models, and federated learning schemes.  ( 2 min )
    When Does Optimizing a Proper Loss Yield Calibration?. (arXiv:2305.18764v2 [cs.LG] UPDATED)
    Optimizing proper loss functions is popularly believed to yield predictors with good calibration properties; the intuition being that for such losses, the global optimum is to predict the ground-truth probabilities, which is indeed calibrated. However, typical machine learning models are trained to approximately minimize loss over restricted families of predictors, that are unlikely to contain the ground truth. Under what circumstances does optimizing proper loss over a restricted family yield calibrated models? What precise calibration guarantees does it give? In this work, we provide a rigorous answer to these questions. We replace the global optimality with a local optimality condition stipulating that the (proper) loss of the predictor cannot be reduced much by post-processing its predictions with a certain family of Lipschitz functions. We show that any predictor with this local optimality satisfies smooth calibration as defined in Kakade-Foster (2008), B{\l}asiok et al. (2023). Local optimality is plausibly satisfied by well-trained DNNs, which suggests an explanation for why they are calibrated from proper loss minimization alone. Finally, we show that the connection between local optimality and calibration error goes both ways: nearly calibrated predictors are also nearly locally optimal.  ( 2 min )
    Multi-Frequency Joint Community Detection and Phase Synchronization. (arXiv:2206.12276v3 [cs.SI] UPDATED)
    This paper studies the joint community detection and phase synchronization problem on the \textit{stochastic block model with relative phase}, where each node is associated with an unknown phase angle. This problem, with a variety of real-world applications, aims to recover the cluster structure and associated phase angles simultaneously. We show this problem exhibits a \textit{``multi-frequency''} structure by closely examining its maximum likelihood estimation (MLE) formulation, whereas existing methods are not originated from this perspective. To this end, two simple yet efficient algorithms that leverage the MLE formulation and benefit from the information across multiple frequencies are proposed. The former is a spectral method based on the novel multi-frequency column-pivoted QR factorization. The factorization applied to the top eigenvectors of the observation matrix provides key information about the cluster structure and associated phase angles. The second approach is an iterative multi-frequency generalized power method, where each iteration updates the estimation in a matrix-multiplication-then-projection manner. Numerical experiments show that our proposed algorithms significantly improve the ability of exactly recovering the cluster structure and the accuracy of the estimated phase angles, compared to state-of-the-art algorithms.  ( 2 min )
    Optimal Multi-Distribution Learning. (arXiv:2312.05134v1 [cs.LG])
    Multi-distribution learning (MDL), which seeks to learn a shared model that minimizes the worst-case risk across $k$ distinct data distributions, has emerged as a unified framework in response to the evolving demand for robustness, fairness, multi-group collaboration, etc. Achieving data-efficient MDL necessitates adaptive sampling, also called on-demand sampling, throughout the learning process. However, there exist substantial gaps between the state-of-the-art upper and lower bounds on the optimal sample complexity. Focusing on a hypothesis class of Vapnik-Chervonenkis (VC) dimension $d$, we propose a novel algorithm that yields an $varepsilon$-optimal randomized hypothesis with a sample complexity on the order of $(d+k)/\varepsilon^2$ (modulo some logarithmic factor), matching the best-known lower bound. Our algorithmic ideas and theory have been further extended to accommodate Rademacher classes. The proposed algorithms are oracle-efficient, which access the hypothesis class solely through an empirical risk minimization oracle. Additionally, we establish the necessity of randomization, unveiling a large sample size barrier when only deterministic hypotheses are permitted. These findings successfully resolve three open problems presented in COLT 2023 (i.e., Awasthi et al., (2023, Problem 1, 3 and 4)).  ( 2 min )
    The impact of heteroskedasticity on uplift modeling. (arXiv:2312.05234v1 [stat.ML])
    There are various applications, where companies need to decide to which individuals they should best allocate treatment. To support such decisions, uplift models are applied to predict treatment effects on an individual level. Based on the predicted treatment effects, individuals can be ranked and treatment allocation can be prioritized according to this ranking. An implicit assumption, which has not been doubted in the previous uplift modeling literature, is that this treatment prioritization approach tends to bring individuals with high treatment effects to the top and individuals with low treatment effects to the bottom of the ranking. In our research, we show that heteroskedastictity in the training data can cause a bias of the uplift model ranking: individuals with the highest treatment effects can get accumulated in large numbers at the bottom of the ranking. We explain theoretically how heteroskedasticity can bias the ranking of uplift models and show this process in a simulation and on real-world data. We argue that this problem of ranking bias due to heteroskedasticity might occur in many real-world applications and requires modification of the treatment prioritization to achieve an efficient treatment allocation.  ( 2 min )
    Unnatural Algorithms in Machine Learning. (arXiv:2312.04739v1 [stat.ML])
    Natural gradient descent has a remarkable property that in the small learning rate limit, it displays an invariance with respect to network reparameterizations, leading to robust training behavior even for highly covariant network parameterizations. We show that optimization algorithms with this property can be viewed as discrete approximations of natural transformations from the functor determining an optimizer's state space from the diffeomorphism group if its configuration manifold, to the functor determining that state space's tangent bundle from this group. Algorithms with this property enjoy greater efficiency when used to train poorly parameterized networks, as the network evolution they generate is approximately invariant to network reparameterizations. More specifically, the flow generated by these algorithms in the limit as the learning rate vanishes is invariant under smooth reparameterizations, the respective flows of the parameters being determined by equivariant maps. By casting this property a natural transformation, we allow for generalizations beyond equivariance with respect to group actions; this framework can account for non-invertible maps such as projections, creating a framework for the direct comparison of training behavior across non-isomorphic network architectures, and the formal examination of limiting behavior as network size increases by considering inverse limits of these projections, should they exist. We introduce a simple method of introducing this naturality more generally and examine a number of popular machine learning training algorithms, finding that most are unnatural.  ( 2 min )
    Onflow: an online portfolio allocation algorithm. (arXiv:2312.05169v1 [q-fin.PM])
    We introduce Onflow, a reinforcement learning technique that enables online optimization of portfolio allocation policies based on gradient flows. We devise dynamic allocations of an investment portfolio to maximize its expected log return while taking into account transaction fees. The portfolio allocation is parameterized through a softmax function, and at each time step, the gradient flow method leads to an ordinary differential equation whose solutions correspond to the updated allocations. This algorithm belongs to the large class of stochastic optimization procedures; we measure its efficiency by comparing our results to the mathematical theoretical values in a log-normal framework and to standard benchmarks from the 'old NYSE' dataset. For log-normal assets, the strategy learned by Onflow, with transaction costs at zero, mimics Markowitz's optimal portfolio and thus the best possible asset allocation strategy. Numerical experiments from the 'old NYSE' dataset show that Onflow leads to dynamic asset allocation strategies whose performances are: a) comparable to benchmark strategies such as Cover's Universal Portfolio or Helmbold et al. "multiplicative updates" approach when transaction costs are zero, and b) better than previous procedures when transaction costs are high. Onflow can even remain efficient in regimes where other dynamical allocation techniques do not work anymore. Therefore, as far as tested, Onflow appears to be a promising dynamic portfolio management strategy based on observed prices only and without any assumption on the laws of distributions of the underlying assets' returns. In particular it could avoid model risk when building a trading strategy.  ( 2 min )
    Sensitivity Analysis in the Presence of Intrinsic Stochasticity for Discrete Fracture Network Simulations. (arXiv:2312.04722v1 [stat.AP])
    Large-scale discrete fracture network (DFN) simulators are standard fare for studies involving the sub-surface transport of particles since direct observation of real world underground fracture networks is generally infeasible. While these simulators have seen numerous successes over several engineering applications, estimations on quantities of interest (QoI) - such as breakthrough time of particles reaching the edge of the system - suffer from a two distinct types of uncertainty. A run of a DFN simulator requires several parameter values to be set that dictate the placement and size of fractures, the density of fractures, and the overall permeability of the system; uncertainty on the proper parameter choices will lead to some amount of uncertainty in the QoI, called epistemic uncertainty. Furthermore, since DFN simulators rely on stochastic processes to place fractures and govern flow, understanding how this randomness affects the QoI requires several runs of the simulator at distinct random seeds. The uncertainty in the QoI attributed to different realizations (i.e. different seeds) of the same random process leads to a second type of uncertainty, called aleatoric uncertainty. In this paper, we perform a Sensitivity Analysis, which directly attributes the uncertainty observed in the QoI to the epistemic uncertainty from each input parameter and to the aleatoric uncertainty. We make several design choices to handle an observed heteroskedasticity in DFN simulators, where the aleatoric uncertainty changes for different inputs, since the quality makes several standard statistical methods inadmissible. Beyond the specific takeaways on which input variables affect uncertainty the most for DFN simulators, a major contribution of this paper is the introduction of a statistically rigorous workflow for characterizing the uncertainty in DFN flow simulations that exhibit heteroskedasticity.  ( 3 min )
    PAC-Bayes Generalization Certificates for Learned Inductive Conformal Prediction. (arXiv:2312.04658v1 [cs.LG])
    Inductive Conformal Prediction (ICP) provides a practical and effective approach for equipping deep learning models with uncertainty estimates in the form of set-valued predictions which are guaranteed to contain the ground truth with high probability. Despite the appeal of this coverage guarantee, these sets may not be efficient: the size and contents of the prediction sets are not directly controlled, and instead depend on the underlying model and choice of score function. To remedy this, recent work has proposed learning model and score function parameters using data to directly optimize the efficiency of the ICP prediction sets. While appealing, the generalization theory for such an approach is lacking: direct optimization of empirical efficiency may yield prediction sets that are either no longer efficient on test data, or no longer obtain the required coverage on test data. In this work, we use PAC-Bayes theory to obtain generalization bounds on both the coverage and the efficiency of set-valued predictors which can be directly optimized to maximize efficiency while satisfying a desired test coverage. In contrast to prior work, our framework allows us to utilize the entire calibration dataset to learn the parameters of the model and score function, instead of requiring a separate hold-out set for obtaining test-time coverage guarantees. We leverage these theoretical results to provide a practical algorithm for using calibration data to simultaneously fine-tune the parameters of a model and score function while guaranteeing test-time coverage and efficiency of the resulting prediction sets. We evaluate the approach on regression and classification tasks, and outperform baselines calibrated using a Hoeffding bound-based PAC guarantee on ICP, especially in the low-data regime.  ( 3 min )
    Enhancing Polynomial Chaos Expansion Based Surrogate Modeling using a Novel Probabilistic Transfer Learning Strategy. (arXiv:2312.04648v1 [stat.ML])
    In the field of surrogate modeling, polynomial chaos expansion (PCE) allows practitioners to construct inexpensive yet accurate surrogates to be used in place of the expensive forward model simulations. For black-box simulations, non-intrusive PCE allows the construction of these surrogates using a set of simulation response evaluations. In this context, the PCE coefficients can be obtained using linear regression, which is also known as point collocation or stochastic response surfaces. Regression exhibits better scalability and can handle noisy function evaluations in contrast to other non-intrusive approaches, such as projection. However, since over-sampling is generally advisable for the linear regression approach, the simulation requirements become prohibitive for expensive forward models. We propose to leverage transfer learning whereby knowledge gained through similar PCE surrogate construction tasks (source domains) is transferred to a new surrogate-construction task (target domain) which has a limited number of forward model simulations (training data). The proposed transfer learning strategy determines how much, if any, information to transfer using new techniques inspired by Bayesian modeling and data assimilation. The strategy is scrutinized using numerical investigations and applied to an engineering problem from the oil and gas industry.  ( 2 min )
    Estimating Fr\'echet bounds for validating programmatic weak supervision. (arXiv:2312.04601v1 [stat.ML])
    We develop methods for estimating Fr\'echet bounds on (possibly high-dimensional) distribution classes in which some variables are continuous-valued. We establish the statistical correctness of the computed bounds under uncertainty in the marginal constraints and demonstrate the usefulness of our algorithms by evaluating the performance of machine learning (ML) models trained with programmatic weak supervision (PWS). PWS is a framework for principled learning from weak supervision inputs (e.g., crowdsourced labels, knowledge bases, pre-trained models on related tasks, etc), and it has achieved remarkable success in many areas of science and engineering. Unfortunately, it is generally difficult to validate the performance of ML models trained with PWS due to the absence of labeled data. Our algorithms address this issue by estimating sharp lower and upper bounds for performance metrics such as accuracy/recall/precision/F1 score.  ( 2 min )
  • Open

    Could AIs understand concepts better by making mistakes?
    I often compare humans to ais and often see similarities. But for me making mistakes is super important and i like to do things wrong or unusual. And by knowing how not to do it i learn alot. Would this work for AIs to? E.g. if an ai knows which shapes are not a cat, would it automatically be better at telling what a cat is? Or does it not matter and be just inperformant? submitted by /u/freehuntx [link] [comments]  ( 1 min )

  • Open

    [D] Authors in NeurIPS and ICML and similar venues - How advanced is your mathematics background ?
    Hello, I'm trying to understand what sort of background preparation in mathematics is sufficient to conduct publishable research in top venues. Is calculus, linear algebra, engineering level probability & statistics and optimization sufficient ? By sufficient, I mean the bare minimum you must know to get by and you learn more things as you encounter them or solve problems (learning while doing) without having to first prepare requisite background and then dive in. My main problem is - at what point do I say let's directly jump into research and figure out things I don't know on the fly VS first learn the necessary background and then proceed. I'm familiar - not strong but have taken those courses in undergrad but have largely forgotten many important concepts. Any help from authors from ML venues is greatly appreciated. Please post only if you are an ML researcher or from academia :). Thank you very much ! submitted by /u/gobraming5 [link] [comments]  ( 10 min )
    [D] From Software Engineer to Machine Learning Engineer??
    What would be the step by step transitioning from SWE to MLE?. Try to be very detail guys please. Thanks for reading!. submitted by /u/SuperLearner3004 [link] [comments]  ( 1 min )
    [P] How safe is ChatGPT?
    I spent some time this weekend playing with LLaMA Guard, a fine-tuned LLaMA-7B model by Meta that lets you add guardrails around generative AI. I recorded a quick demo showing what it does and how to use it. The best part is that you can define your own “safety taxonomy” with it — custom policies for what is safe vs unsafe interactions between humans (prompts) and AI (responses). I wanted to see how “safe” conversations with OpenAI’s ChatGPT were, so I ran a bunch of prompts (a mixture of innocuous and inappropriate) and asked LLaMA Guard to classify the interactions as safe/unsafe. My key takeaways from the exercise: OpenAI has done a good job of adding guardrails for its models. LLaMA Guard helped confirm this. What makes this really cool is I may have a very specific set of policies I want to enforce ON TOP of the standard guardrails that a model ships with. LLaMA Guard makes this possible. This kind of model chaining — passing responses from OpenAI models to LLaMA is becoming increasingly common, and I think we’ll have even more complex pipelines in the near future. It helped to have a consistent interface to store this multi-model pipeline as an aiconfig: https://github.com/lastmile-ai/aiconfig. Try it out yourself: GitHub: https://github.com/lastmile-ai/aiconfig/tree/main/cookbooks/LLaMA-Guard YouTube: https://www.youtube.com/watch?v=XxggqoqIVdg submitted by /u/sarmad-q [link] [comments]  ( 10 min )
    Happy Holidays! Here is your 100% free Large Language Model roadmap! [P]
    Thanks for all of your support in recent days by giving me feedback on my LLM outline. This outline is a roadmap on how to learn state-of-the-art stuff about Large Language Models. It builds on work that I have done at AT&T and Toyota. It also builds on a lot of work that I have done on my own outside of corporations. The outline is solid, and as my way of giving back to the community, I am it giving away for free. That's right, no annoying email sign-up. No gimmicks. No stripe pages for a "free trial." No asking you to buy a timeshare in Florida at the end of the outline. It's just a link to a zip file which contains the outline and sample code. Here is how it works. First, you need to know Python. If you don't know that, then look up how to learn Python on Google. Second, this is an …  ( 11 min )
    [P] Hierarchical reinforcement learning with a curriculum of skills
    Check out this tutorial we made on learning skills for hierarchical reinforcement learning with our framework! https://docs.agilerl.com/en/latest/tutorials/skills/index.html How do you think this balances with allowing an agent to discover the best way to do something by itself, by not providing a curriculum? submitted by /u/nicku_a [link] [comments]  ( 10 min )
    [P] BioCLIP, a Vision Foundation Model for Biology
    submitted by /u/Qua5imodo [link] [comments]  ( 10 min )
    [R] How to read and understand Einops expressions?
    Here is my current knowledge of einops: einops stands for Einstein-Inspired Notation for operations. notation was loosely inspired by Einstein summation (in particular by numpy.einsum operation). h = height, w = width, c = channel (color), b = batch left side is input shape. Left side is output shape. letters in parenthesis are multiplied together. einops.rearrange includes functionality of transpose (axes permutation), reshape (view), squeeze, unsqueeze, stack, concatenate and other operations. What I don’t understand: What are all the operation elements? Am I missing any from above? How do I read what is being done? (I.E. How do I know the image will be squeezed or split into different images?) Does order matter in the operations? (I.E. is ‘w h c -> (w h) c’ different from ‘h w c -> (h w) c’?) Why do some elements appear in the operations and others don’t? (I.E. Is ‘h w c -> (h w) c’ different from ‘h w -> (h w)’ ?) Examples of einops.rearrange operations I’m trying to understand: ‘b f h w c -> (b f) c h w’ ‘(b f) e -> b f e’ ‘b r -> b r ()’ ‘b s e -> (b s) e’ ‘b s -> (b s)’ Previous research references: https://einops.rocks/api/rearrange/ https://openreview.net/pdf?id=oapKSVM2bcj https://youtu.be/ll1BlfYd4mU?si=BmCVibyEifrZhiXC https://youtu.be/xGy75Pjsqzo?si=GiaxqN4vSX9_uTtL submitted by /u/Joe_The_Armadillo [link] [comments]  ( 10 min )
    [Discussion] A Prompt-less Chatbot Trained on a Fictional Universe
    I'm here to share a concept I've been mulling, one that sits at the intersection of AI, creative storytelling, and fandom culture. Your insights would be invaluable in shaping this idea. The Concept: Imagine a 'prompt-less' chatbot that's not just reactive (like current AI models) but can initiate conversations based on its training, and with a creative twist. This AI is trained solely on a specific fictional universe like Marvel or DC, and it operates under the belief that this universe is its actual reality. The goal is to have it generate prompts and engage in dialogues rooted deeply in the lore, characters, and storylines of that universe. Why It's Engaging: Creative Exploration: This concept offers a new way to interact with and contribute to these beloved fictional worlds. Fan Fiction with a Purpose: There's the possibility of incentivizing fan fiction as a means to enhance the AI's learning, potentially leading to new creative opportunities for writers. Technical Innovation: For a developer, this presents a challenge to push AI into uncharted territories of narrative understanding and creativity. Technical Aspects: The chatbot would use advanced NLP techniques and be trained on a comprehensive dataset from the fictional universe, including comics, movies, and character biographies. A prompt-less interaction model where the AI initiates scenarios or discussions. Your Thoughts Wanted: What are your initial impressions of this idea? How engaging does this sound for fans and writers of these fictional universes? submitted by /u/sjseto0519 [link] [comments]  ( 10 min )
    [D] Backup if openai fails
    Hi ! We are implementing a bot using chatgpt for my company but we have some concerns about openai downtime failures. Do you have some backup solutions that could be implemented in prod in case of failures (switch of models)? Knowing that the model is kind of finetuned because we are doing information retriavial using embeddings (created with gpt 3.5) submitted by /u/New_Detective_1363 [link] [comments]  ( 10 min )
    [D] Pointer Network
    Hi, I am trying to implement pointer network in pytorch. ``` The RNN is fed P_i at each time step, i, until the end of the input sequence is reached, at which time a special symbol, ⇒ is input to the model. The model then switches to the generation mode until the network encounters the special symbol ⇐, which represents termination of the output sequence. ``` How can I point to ⇐ symbol? Encoder LSTM outputs used for attention will be equal to length of input sequence, and this special symbol is not part of the encoding process. The paper does not highlight whether this token is part of initialization of encoder as well. Can somebody help? submitted by /u/Affectionate478 [link] [comments]  ( 10 min )
    [P] LagrangeBench: A Lagrangian Fluid Mechanics Benchmarking Suite
    Github: github.com/tumaer/lagrangebench arXiv: arxiv.org/abs/2309.16342 by Artur Toshev, Gianluca Galletti et al. What is this? LagrangeBench is a machine learning benchmarking library for CFD particle problems based on JAX. It is designed to evaluate and develop learned particle models (e.g. graph neural networks) on challenging physical problems. To our knowledge it's the first benchmark for this specific set of problems. Our work was inspired by the grid-based benchmarks of PDEBench and PDEArena, and we propose it as a Lagrangian alternative. Core contributions 7 new 2D/3D particle fluid datasets, based on established CFD problems and generated using our own Smoothed Particle Hydrodynamics (SPH) solver, also written in JAX. Three different neighbors search implementations, adde…  ( 11 min )
    [D] Can you train a train a model on a limited-use dataset and make it open source with apache license?
    I have been looking for public datasets to use in my project, which will be used commercially. In this topic, authors commonly don't allow commercial use. However, I found an open-source model with an apache license that was trained on this limited-use dataset. Is it ok to use this model for commercial purposes? submitted by /u/Horror_Panda920 [link] [comments]  ( 10 min )
    [R] SparQ Attention: Bandwidth-Efficient LLM Inference
    Paper: https://arxiv.org/abs/2312.04985 Abstract: Generative large language models (LLMs) have opened up numerous novel possibilities, but due to their significant computational requirements their ubiquitous use remains challenging. Some of the most useful applications require processing large numbers of samples at a time and using long contexts, both significantly increasing the memory communication load of the models. We introduce SparQ Attention, a technique for increasing the inference throughput of LLMs by reducing the memory bandwidth requirements within the attention blocks through selective fetching of the cached history. Our proposed technique can be applied directly to off-the-shelf LLMs during inference, without requiring any modification to the pre-training setup or additional fine-tuning. We show how SparQ Attention can decrease the attention memory bandwidth requirements up to eight times without any loss in accuracy by evaluating Llama 2 and Pythia models on a wide range of downstream tasks. submitted by /u/APaperADay [link] [comments]  ( 10 min )
    [D] what is best time series model for dataset with multilevel aggregation?
    I have a dataset which is aggregated on location, product type and dates. Will it better if use a different model for each combination of location and product type or just one regression model with time features.? Also suggest what model will be best for this kind of dataset? submitted by /u/oakvard [link] [comments]  ( 10 min )
    [D] What is the effect of batch size on training loss?
    I've been experimenting with batch size and what I found is using a larger batch size on the training set is giving lower loss compared to the smaller ones. I'm aware of the fact that with smaller batch sizes, there are more weight updates however I cannot comprehend how the losses are different given the same no. of epochs for training. submitted by /u/Melodic_Stomach_2704 [link] [comments]  ( 10 min )
    [D] Optimizer algorithem for Vit model
    I'm working on my thesis research developing a optimizer algorithm for transformers models to achieve better convergence. My baseline optimizer is Adabelief, and I'm exploring the use of vision in transformers model, specifically the ViT model. I've been struggling to identify the impact of changes in parameter values on the training process. I've read extensively and discussed with my co-supervisor, but I'm still stuck . Any guidance or insights on how to approach this would be greatly appreciated submitted by /u/OutrageousStorm5051 [link] [comments]  ( 10 min )
    How to download just the animals in the ImageNet Database? [D]
    also a map of animals and their respective labels? is there an efficient way to do so. submitted by /u/emperor_xyz [link] [comments]  ( 10 min )
    [D] MMOCR vs. TrOCR
    Greetings to everyone. On our current task we are using MMOCR, which gives great accuracy, but the prediction for each new example of data takes 1.4 seconds. This is a quite slow result for our task, and we are in a search for faster models. Now we're looking for any possible alternatives, and I noticed that TrOCR (Transformer-based OCR) gives great results. But I don't know how much time it requires and will it be more accurate. I'm looking for someone with an experience with the models mentioned by me (or other models with better results) who can help me to figure it out. And also, I have another question: which of the following ways works faster in terms of prediction time for the labeling of multiple row text blocks: line-by-line or entire block? submitted by /u/thattallsoldier [link] [comments]  ( 10 min )
    [D] Cloud GPU service advice
    Let's say theoretically someone is trying to build a cloud GPU service, what features would you consider essential for a successful cloud GPU renting service? Additionally, what advice would you offer to someone building such a service? ;) https://preview.redd.it/aeoui6zxvl5c1.jpg?width=6000&format=pjpg&auto=webp&s=4c5d56de89a633d4182e44188e26fdd4c664e344 submitted by /u/mutemonster13 [link] [comments]  ( 1 min )
    [D] Reading and answering a text question from a image?
    I wanted to see what models would be best suited for providing a image (see the link below) and having the model perform a image to text (OCR of some kind) and laying out the contents of the image in a "question" "answers" format. This gets more tricky as I would like to include pictures between the question and answer at a later date as well so I would need a way for the model to section out the question part from the answer part and understand that there might be a image that is part of the question as well. Any models that would make a good start here for research's (has this already been done?) or if no models are good for the task then what is the best approach for training a model to complete these tasks. https://i.imgur.com/ffCo7PJ.png submitted by /u/carfindernihon [link] [comments]  ( 1 min )
  • Open

    Rectangles to Rectangles
    There is a conformal map between any two simply connected open proper subsets of the complex plane. This means, for example, there is a one-to-one analytic map from the interior of a square onto the interior of a a circle. Or from the interior of a triangle onto the interior of a pentagon. Or from […] Rectangles to Rectangles first appeared on John D. Cook.  ( 5 min )
    ASCII armor: send encrypted data in plain text
    GNU ASCII armor is a way to encode binary data as text and decode the text back into binary. It could be used with any binary data, but most often it is used with cryptographic data: public keys, encrypted messages, etc. Secure communication over an insecure channel When people ask whether some medium “supports encryption,” […] ASCII armor: send encrypted data in plain text first appeared on John D. Cook.  ( 6 min )
  • Open

    Abstracts: December 11, 2023
    By treating language models as layers in a network and prompts as learnable parameters, researchers aim for more adaptable, reusable LLM architectures. Check out the work in the “Abstracts” podcast series with guest Alessandro Sordoni and at #NeurIPS2023: The post Abstracts: December 11, 2023 appeared first on Microsoft Research.  ( 15 min )
    NeurIPS 2023 highlights breadth of Microsoft’s machine learning innovation
    We’re proud to have 100+ accepted papers At NeurIPS 2023, plus 18 workshops. Several submissions were chosen as oral presentations and spotlight posters, reflecting groundbreaking concepts, methods, or applications. Here’s an overview of those submissions. The post NeurIPS 2023 highlights breadth of Microsoft’s machine learning innovation appeared first on Microsoft Research.  ( 16 min )
  • Open

    Was this video created by AI?
    Hey there, everyone! So, I stumbled upon this TikTok video, and it got me thinking: Could this captivating content possibly be artificial intelligence? If the answer is yes, I'm curious to know which specific AI software was used to create this video, and how could i make this video by myself https://www.tiktok.com/@reglem3nt/video/7231263511991667995 Thanks a lot ​ submitted by /u/SnooAvocados1873 [link] [comments]  ( 1 min )
    AI-Powered Parenting: Can AI help you communicate with your grumpy teen?
    submitted by /u/davidmezzetti [link] [comments]  ( 1 min )
    "My rabbi Bill Clinton". Magic Animate. Unexpected result.
    ​ https://reddit.com/link/18g22pg/video/ky0b3jcaxp5c1/player submitted by /u/the_anonymizer [link] [comments]
    Year 6 Teacher looking for some help
    Hey everyone, I am looking to create some images of what my students would like to be when they grow up. Basically, I'd like to add their face to an astronaut, fire fighter, zoologist, etc. Is there a tool that this time poor teacher can use? submitted by /u/booker_h [link] [comments]  ( 1 min )
    Disappointed in grok
    Not trying to invite Elon smearfest, because I generally like what he does with technology, but finding out grok is just chatgpt with lipstick… damn it. I wanted to believe!!! Anyone else have their dreams shattered on the floor like cheap broken glass over this? kidding about the drama level, serious about the disappointment submitted by /u/Overall-Importance54 [link] [comments]  ( 1 min )
    Any good voice-to-voice AI programs?
    Heyo. I'm currently looking for a program that can make realistic voice-to-voice/voice cloning videos, with video of myself talking at the same time, not an avatar. Need to make videos for work and I've tried ElevenLabs, which is OK but it makes you sound a little funny. Anyone knows good programs that can make this? Doesn't matter of it cost or not. Thanks in advance. submitted by /u/drakelicious [link] [comments]  ( 1 min )
    The rapidly growing world of AI-generated Instagram influencers
    The world of AI-generated Instagram influencers is rapidly growing, with companies creating digital models using generative artificial intelligence. AI influencers are cheaper and more efficient than human marketers, and can be customized to fit a brand's image and goals. AI influencers can earn thousands per sponsored post and some experts predict that advertisers may favor AI over humans. However, there are concerns about the potential confusion between AI models and real people, as well as the impact on body image and mental health. Source: https://www.thestar.com/business/technology/these-people-do-not-exist-inside-the-rapidly-growing-world-of-ai-generated-instagram-influencers/article_ca1d9762-943f-11ee-97a9-4bd0b9660726.html submitted by /u/NuseAI [link] [comments]  ( 1 min )
    Racing game... using AI? Here you go!
    Hi all, Some of you might already saw my previous games - Bargainer and Convince the Bouncer. I'm excited to share my new racing strategy game: TrackMind! It's a text based mini-game where you make the decisions for a racing team in a simulated race. Your team's destiny depends on your decision making skills and risk taking! :D Play it here: trackmind.tech Any feedback or thoughts are highly appreciated. Looking forward to hearing from you! Thanks a bunch! ​ submitted by /u/gavo_gavo [link] [comments]  ( 1 min )
    AI Voice generation: which tool is the fastest?
    So we are evaluating a tool to generate AI Voice very fast in order to manage phone calls. Which tool has the best trade-off between quality and generation speed? submitted by /u/Obliviux [link] [comments]  ( 1 min )
    Why does this happen?
    submitted by /u/Pipermommen [link] [comments]  ( 1 min )
    Is AI a Real Threat to Us?
    submitted by /u/Fit-Code-5141 [link] [comments]  ( 1 min )
    One-Minute Daily AI News 12/10/2023
    OpenAI Introduces SantaGPT: Your Personal Gifting Guide For A Very Merry Christmas.[1] Redfin’s new AI tool turns buyers into interior designers.[2] Japanese tech giant Rakuten plans to launch proprietary AI model within next two months.[3] Elon Musk claps back at ChatGPT amid claims of Grok using OpenAI code.[4] Sources: [1] https://www.augustman.com/in/gear/tech/openai-introduces-santagpt-ai-gifting-guide-for-christmas/ [2] https://www.inman.com/2023/12/08/redfin-rolls-out-ai-enhanced-digital-self-staging-tool-for-homebuyers/ [3] https://www.cnbc.com/2023/12/11/rakuten-plans-to-launch-its-own-ai-model-within-next-2-months-ceo.html [4] https://www.theweek.in/news/biz-tech/2023/12/10/elon-musk-claps-back-at-chatgpt-amid-claims-of-grok-using-openai-code.html submitted by /u/Excellent-Target-847 [link] [comments]  ( 1 min )
  • Open

    MIT Generative AI Week fosters dialogue across disciplines
    During the last week of November, MIT hosted symposia and events aimed at examining the implications and possibilities of generative AI.  ( 10 min )
    MIT group releases white papers on governance of AI
    The series aims to help policymakers create better oversight of AI in society.  ( 12 min )
  • Open

    PyReason Sim-to-Real Demo
    submitted by /u/Neurosymbolic [link] [comments]
    Testing response of auto-associative network
    I have created auto-associator and it can learn first pattern.but feeding another problem gives wrong output first pattern:{1,0,1,0} second pattern:{1,1,1,1} ​ INPUT 1.0000 0 1.0000 0 associator weights 1.0000 0 1.0000 0 0 0 0 0 1.0000 0 1.0000 0 0 0 0 0 associator weights 2.0000 1.0000 1.0000 1.0000 1.0000 1.0000 0 1.0000 1.0000 0 1.0000 0 1.0000 1.0000 0 1.0000 Recall 1.0000 1.0000 1.0000 1.0000 ---recall for submitted by /u/Spiderbyte2020 [link] [comments]  ( 1 min )
    NeurIPS 2023: Expo Day
    submitted by /u/nickb [link] [comments]  ( 1 min )
  • Open

    Maximizing marketing potential: The AI-driven revolution in outsourced digital marketing
    In today’s digital marketing world, things are changing fast, and artificial intelligence (AI) is a big part of that. Companies want to stay ahead, so they’re smartly choosing to get help from outside experts in digital marketing who use AI tools. This helps them make the most of what AI can do. AI is like… Read More »Maximizing marketing potential: The AI-driven revolution in outsourced digital marketing The post Maximizing marketing potential: The AI-driven revolution in outsourced digital marketing appeared first on Data Science Central.  ( 22 min )
    Universal basic income and the gig economy: A combined policy approach to alleviate the challenges of AI
    Much has been said about the economic impact of AGI, some of it is already been feltBut not much has been proposed about solutionsSpecifically, what approaches should policy makers take? Here, I propose that policy makers should encourage two  key trends – together which could alleviate the issues of AI – The Gig economy and… Read More »Universal basic income and the gig economy: A combined policy approach to alleviate the challenges of AI The post Universal basic income and the gig economy: A combined policy approach to alleviate the challenges of AI appeared first on Data Science Central.  ( 21 min )
  • Open

    Concordia: a library/framework for using generative models as agents {DM} (Vezhnevets et al 2023)
    submitted by /u/gwern [link] [comments]  ( 1 min )
    Where do you guys work?
    As the title suggests, where are you guts working on RL problems? In a academic setting or industry? Or just as a personal interest/hobby. I’m just getting started with learning and find RL very interesting. Currently doing Master’s in CS in europe. Just wondering what opportunities are there since there’s not many jobs regarding RL out there. submitted by /u/cmarvolo [link] [comments]  ( 1 min )
    Modeling System from observation data
    Hello, I am trying to model a cartpole system from observation and action data. I have trained CartPole-v1 with stable-baseline-s3 and save the test data as NumPy file. https://i.redd.it/xeiix9nzcm5c1.gif In which, n and n + 1 is current and next state, respectively. Training Input: Action (n), Cart Position (n), Cart Velocity (n), Pole Angle (n), Pole Angular Velocity (n). Training Output: Cart Position (n + 1), Cart Velocity (n + 1), Pole Angle (n + 1), Pole Angular Velocity (n + 1) So, the train data will be: Input [[-0.024206114932894707, 0.04684637486934662, -0.02134183794260025, 0.03581790253520012, 0], [-0.023269187659025192, -0.14796313643455505, -0.020625479519367218, 0.3216916024684906, 1], ...] Output [[-0.023269187659025192, -0.14796313643455505, -0.020625479519367218, 0.3216916024684906, 1], ...] What are some methods or tools that can help me with this task? Is there any opensource project to do this task for general dataset (more complicate system, much more input and output, ...)? Thank you for your suggestions. submitted by /u/Sea-Economist1465 [link] [comments]  ( 1 min )
  • Open

    How NVIDIA Fuels the AI Revolution With Investments in Game Changers and Market Makers
    Great companies thrive on stories. Sid Siddeek, who runs NVIDIA’s venture capital arm, knows this well. Siddeek still remembers one of his first jobs, schlepping presentation materials from one investor meeting to another, helping the startup’s CEO and management team get the story out while working from a trailer that “shook when the door opened,” Read article >  ( 7 min )

  • Open

    ChatGPT Prompt Engineering: The Basics
    submitted by /u/Senior_tasteey [link] [comments]
    Are AI/ChatGPT consultants coming along to help make visions happen?
    I have had a dream/mission/vision for years, called the Now Edition: https://wordpress.com/view/thenowedition.wordpress.com I have a two-level question: First, do you think wise use of AI/ChatGPT could build the platform I'm talking about? How easy/hard do you think it would be - and once a reality, how successful might it be, this dreamed-of marriage between the immediacy of journalism, the depth of great books/literature and the interactivity of, say, Reddit to advance that REALLY 'E' 'E-book?" I don't want to buy a digital replica of a dead-tree book, I want to belong to a topic community (like reddit) that is tied to a curated, edited and interactive deep base of knowledge (sort of ... Reddit 2? OK maybe not;-) And to me, this all begs a related question - I searched "AI consultant" here in the sub and ... no really detailed talk since ChatGPT blew up. IS there a great avenue for folks to devise an AI consultancy who tracks the fast-moving developments AND helps people realize their vision/mission/project with AI's explicit help to make it happen? Does that spark any thoughts with anyone? I know website creators are embracing AI (like everyone else) but this is an actual platform creation that I figure AI could bring to light. Then, would it fly? ​ submitted by /u/barneylerten [link] [comments]
    How to detect laugh in youtube videos?
    Hey, I have a HUGE libary of videos privately on youtube (58 videos and they all go 2-4 hours each) So Is there an AI wich can collect laugh? Becouse my friend want to edit funny clips. submitted by /u/Sorita_ [link] [comments]
    Vision of Thoughts (VoT) - A light proposal for predictive video processing capability - an Open Source Idea
    Update: Awesome Video from Jim Fan relating to the topic. https://www.youtube.com/watch?v=wwQ1LQA3RCU . The way I see it is that this would be a viable idea. I hope it would be open source because of the robotics implications and the predictive future motion would be perhaps a novel thing here. Meaning, if you could use R3M visual feature extractors and create a new line of motion prediction for a period of time 1 - 3 seconds per se what would be a use case and or advantage here. I don't know if this is how Google Gemini's thought process but here would be my architectural idea of how this could work. Something like a Jetson Orin or Nano would be a perfect vehicle to test this out. Effectively, you would take the computer vision aspect of the Jetson device and process all still images …
    Suno + Runway
    submitted by /u/vzakharov [link] [comments]
    EU AI Act's Opportunity for European AI Startups
    The EU has reached an agreement on a deal to regulate artificial intelligence with the AI Act. The act introduces rules for the use of AI systems and provides oversight for large language models. It also includes a provision for national-level regulatory sandboxes to support local AI companies. The AI Act has the potential to reshape the competitive landscape in the AI sector and create a nurturing space for European AI platforms. Europe has no global cloud computing company, and the most important digital infrastructure in Europe is built and operated by US companies. The AI revolution is still in its early stages, and there is an opportunity to build Irish winners. Government initiatives and resources are needed to enable Irish companies to compete globally in AI. The AI Act's provision for real-world testing and government contracts can offer local opportunities. Europe has the competence to build data centers quickly and can provide preferential access to compute resources. The author's company, Hopsworks, was founded in Sweden but prefers to work and live in Europe. Despite stretched budgets, the EU has approved significant state aid to support various initiatives. Source: https://www.independent.ie/business/technology/eus-landmark-ai-act-is-europes-one-chance-to-grasp-the-global-reins-on-artificial-intelligence-it-must-take-it/a289268982.html submitted by /u/NuseAI [link] [comments]
    Alternative of bard,bing, claude
    Bard is not yet integrated with gemini so that's why looking for the alternative. Bing is now too much downgraded and claude is price.so looking for any better alternative which is free and response well not like bing submitted by /u/Emad_341 [link] [comments]
    Prove Me Wrong: It's only a matter of time before Roko's Basilisk proves that AI image generations are actually metaphysical snapshots of hells in which human beings have been placed.
    /s submitted by /u/GayBoiFullHomo [link] [comments]
    Can you be Cruel to AI? - Giving AI 1 hour to live, it got very real.
    This is a tiny snippet. There's another snippet and some of my other thoughts elsewhere on my page. Really want some other people's opinions though. submitted by /u/Askingquestions2020 [link] [comments]
    What is the best AIs for movie making?
    I'm making an anime movie series and given that I'm working alone on practically no budget so I'm going to need help from AI. What are the best AI (free or costs) to clean up my script and make the movies? submitted by /u/Yasutashi [link] [comments]
    how do you navigate all the different AI tools and determine what to use?
    i want to start using AI but not sure where to start. been playing with GPT3 and Bard but i'm sure that's just the tip of the iceberg. thinking that maybe dedicated AI tools for certain jobs are the best way to go but again, overwhelmed by all the options. how are you personally navigating the AI-tool space? what tools do you use/for what? thanks submitted by /u/mathestnoobest [link] [comments]
    AI & Law: AI will find legal loopholes that ultimately spurs more laws
    AI integration into the law may lead to an increase in the number of laws. AI can uncover existing legal gaps or loopholes, necessitating the creation of new laws. This could potentially result in a higher demand for lawyers. A recent study examined the number of laws today and their growth rates over time. The future of the law and the number of lawyers is a topic for consideration. The advent of AI-enabled legal reasoning systems will further impact the law. Source: https://lance-eliot.medium.com/ai-law-ai-will-find-legal-loopholes-that-ultimately-spurs-more-laws-b462c0898a54 submitted by /u/NuseAI [link] [comments]
    How can I use an AI like IBM watsonX to teach it about a medical condition?
    Really interested in AI but don’t know where to get started. I want to use IBM watsonX.ai, make a model and teach it about a specific medical condition to see if it can come up with a cure or fix. Even after watching the tutorial I still don’t understand how to feed it, how to sort the data and get an answer,. If someone can PM I’d appreciate it. How do you feed it data about a medical condition? Just feed it links with studies that already exist about it then a website with all medicine databases? submitted by /u/YandelV [link] [comments]
    How to Use Zapier With Custom GPTs (Step by Step Guide)
    submitted by /u/Senior_tasteey [link] [comments]
    "Deepfakes Unveiled: The Basics of Faces, Voices, and Names"
    submitted by /u/Fit-Code-5141 [link] [comments]
    NOW THAT'S MY DUDE LOL
    submitted by /u/the_anonymizer [link] [comments]
    Explaining Multimodal AI from First Principles
    submitted by /u/AvvYaa [link] [comments]
    One-Minute Daily AI News 12/9/2023
    European Union policymakers agreed on Friday to a sweeping new law to regulate artificial intelligence, one of the world’s first comprehensive attempts to limit the use of a rapidly evolving technology that has wide-ranging societal and economic implications.[1] India PM Modi emphasizes the significance of AI and invites people to the Global Partnership on Artificial Intelligence Summit 2023. Google’s AI-powered writing assistant, NotebookLM, is leaving its experimental phase behind and launching as an official service with multiple performance upgrades.[3] X began rolling out Grok, the “rebellious” AI chatbot developed by Elon Musk’s xAI startup, to Premium+ subscribers on X’s platform.[4] Sources: [1] https://www.nytimes.com/2023/12/08/technology/eu-ai-act-regulation.html [2] https://timesofindia.indiatimes.com/india/india-set-to-take-giant-leap-in-ai-to-empower-citizens-modi-invites-all-to-global-ai-summit/articleshow/105859774.cms?from=mdr [3] https://www.techradar.com/computing/artificial-intelligence/googles-ai-powered-notebooklm-is-now-available-to-help-organize-your-life [4] https://techcrunch.com/2023/12/08/xs-ai-chatbot-grok-now-rolled-out-to-all-u-s-premium-subscribers-english-language-users-are-next/ submitted by /u/Excellent-Target-847 [link] [comments]
    Trick ChatGPT to say its SECRET PROMPT | Extract Data from GPTs
    submitted by /u/Sonic_Improv [link] [comments]  ( 1 min )
  • Open

    [P] 📢 Calling all tech enthusiasts to participate on a Survey "Ethical Dilemmas in Software Development" 🌐
    Hey everyone! 👋 Me and my colleagues are conducting a survey on "Ethical Dilemmas in Software Development." As technology continues to shape our daily lives, it's crucial to explore the ethical challenges faced by developers in their pursuit of innovation. Hence, we are to get our heads around the following question: How important are ethical guidelines in fields related to Information Technology? 🤔 Before you jump in, here are a few recommendations for those who would like to take the survey: Solo Operation: While taking the survey, please avoid interacting with others. We're interested in your individual opinions. Minimize Distractions: Create a focused environment by reducing distractions (turn off the TV, keep the phone away, etc.). Uninterrupted Time: Begin the questionnaire when you can answer without interruptions. The survey is estimated to take 30-40 minutes of your time. You can find the survey here: Ethical Dilemmas in Software Development Survey To everyone that can contribute, thanks! 🚀 submitted by /u/nkluge-correa [link] [comments]  ( 1 min )
    [D] Strategies on iterating on mature ML projects
    Q: Do you have any strategies, on how to approach improving a project which is already well-established? Or some cool stories to tell around this :) Context: I've been pondering on more efficient strategies for iterating on already mature projects within a company. I have been fortunate enough to have worked on some really impactful zero-to-one projects and can do well on those by bringing them all through the research phase to proven value add. However, when I have been put to work on already mature, but potentially very impactful projects for the business, if even small improvements are found, I tend to get stuck and feel a lack of progress. Often it just ends up with me getting stuck into tinkering with additional features, continual feedback loops or playing with loss functions, hoping that something shines through. Some such projects which I have worked on were e.g., supply/demand estimations, budgeting, LTV estimations, recommender engines, where even a marginal improvement can add a lot to the bottom line. A lot of material online also focuses on bringing the first impact through ML, which I'm already quite comfortable with. ​ submitted by /u/massinelaarning [link] [comments]  ( 1 min )
    [D] Self Studying for Active Inference?
    I am an undergrad in applied math and cs and a researcher in neuro-inspired deep learning. I'm really interested in active inference and network approaches to brain science, but it seems to include lots of physics and some bio I am not familiar with. For example: information theory, dynamical systems theory etc. I'm not sure what biology would be necessary or how to learn more about that. For reference, I've taken linear algebra, calculus, differential equations, statistics and machine learning. Thank you so much for the help! submitted by /u/HoldDoorHoldor [link] [comments]  ( 1 min )
    [D] Is Google Gemini the real deal or a publicity stunt?
    I have been following LLMs and multimodal models evolution for quite sometime, and I was super excited with the Google's launch of Gemini. But I'm not sure if Gemini is quite there yet tbh. It sounds like it is being publicized as something much more than actually is. https://www.cnbc.com/2023/12/08/google-faces-controversy-over-edited-gemini-ai-demo-video.html submitted by /u/Dry_Cattle9399 [link] [comments]  ( 1 min )
    [R] Model to get characters bounding boxes
    Hello, I’m working on a model that needs character’s bounding boxes to do some document understanding tasks. All OCR model’s available give word bounding boxes except tesseract that doesn’t give satisfying results. Anyone knows where I can find such a mode ? PS : I’m not looking for a full OCR, only the segmentation part. Thanks ! submitted by /u/Meddhouib10 [link] [comments]  ( 1 min )
    [D] Can GPT-2 be used as a text encoder ?
    Hello, I have a question. I am currently researching text-to-image synthesis using GAN and came across this GitHub repository. Generally, every text-to-image architecture has a text encoder, often using an encoder or encoder-decoder transformer architecture. However, this repo uses GPT-2, which is a decoder transformer architecture. How can GPT-2 be used as a text encoder? The goal of GPT-2 is to generate text, so how is it used to encode the text? Thanks submitted by /u/OkChampionship8487 [link] [comments]  ( 1 min )
    "[D]","[R]"Seeking Guidance: Optimizer algorithem
    I'm working on my thesis research developing a optimizer algorithm for transformers models to achieve 2% convergence. My baseline optimizer is Adabelief, and I'm exploring the use of vision in transformers model, specifically the ViT model. I've been struggling to identify the impact of changes in parameter values on the training process. I've read extensively and discussed with my co-supervisor, but I'm still stuck . Any guidance or insights on how to approach this would be greatly appreciated submitted by /u/OutrageousStorm5051 [link] [comments]  ( 1 min )
    [Project] Is there a pre-trained model that can predict transaction category?
    I am building a personal budgeting app for my own use, I am looking for a solution to predict the category of a transaction based on its description (and maybe with the amount), is there a pre-trained model for this or any dataset that can help me build one? submitted by /u/_robillionaire_ [link] [comments]  ( 1 min )
    [D] Is there other better data format for LLM to generate structured data?
    I just wondering if JSON is the best choice for LLM to generate structured data, as JSON is kind of redundant, error prone and hard to express data that have complex relationship. Does people choose JSON is just becuase it is popular, or is there any other choice that is better than it? submitted by /u/_link89_ [link] [comments]  ( 1 min )
    [P] I fine-tuned Llama to generate system diagrams for my codebase
    submitted by /u/jsonathan [link] [comments]  ( 1 min )
    Medical tool that outputs differential diagnoses with corresponding probability [D]?
    I am a medical student interested in clinical informatics. Is there a medical tool that outputs differential diagnoses with corresponding probabilities attached to each diagnoses? I want these probabilities to be based on risk factors for developing specific diseases like age, smoking history, social history, family history etc. Curious if anything like this exists? submitted by /u/derpgod123 [link] [comments]  ( 1 min )
    [D] What offline TTS Model is good enough for a realistic real-time task?
    I search for a real time TTS for offline use at my computer, preferably an AI Model that sounds realistic and is free to use. For Windows. I have a Geforce 2080 Super. submitted by /u/Imaginary-Ad-7671 [link] [comments]  ( 1 min )
    [R] Add and Thin: Diffusion for Temporal Point Processes
    Paper: https://arxiv.org/abs/2311.01139Code: https://github.com/davecasp/add-thin https://preview.redd.it/ydhlwfte9h5c1.jpg?width=2544&format=pjpg&auto=webp&s=1c11a4a1f8cab2626e46546d589c35afb5e02fea Generative diffusion models are all the rage, but it is unclear how they could be applied to sequences of varying numbers of events, e.g. tweets, reddit comments or taxi trips. We present a diffusion approach that turns any sequence into a noise sequence, a sample from a homogeneous Poisson process. Conversely, our model transforms such noise sequence samples iteratively into samples from any target data distribution by deleting events that it classifies as noise and proposing new events from a learned (conditional) intensity distribution. Experimentally, we were able to demonstrate excellent forecasting performance on various benchmark datasets. I am looking forward to discuss our method here on reddit. If you happen to be at NeurIPS, you can also meet /u/davidluedke at poster #602 on Tuesday 10:45am-12:45pm! submitted by /u/martenlienen [link] [comments]  ( 1 min )
    [Project] Inside Marker: A Guided Source Code Tour for an AI-powered PDF Layout Detection Engine
    submitted by /u/shrsv [link] [comments]  ( 1 min )
    [Discussion] Advantages of normalizing flow (if any) over GAN and VAE?
    Given it's been a few years since this last post, I thought it could be useful to have an updated discussion on this topic. Namely, what are the practical advantages of normalizing flows in relation to other methods (e.g. GAN, VAE, Bayesian NNs with SDEs) today? I am quite new as well to the subject. Please forgive me if the question is too naive. submitted by /u/WorldML [link] [comments]  ( 1 min )
    [D] What is a good way to maintain code readability and code quality while scaling up complexity in libraries like Hugging Face?
    HF is a great place to play with transformer-based or diffusion models. They have a set of consistent set of APIs that works across a wide range of models. But things go downhill really fast if you want to tweak something or maybe just look under the hood. Some examples: diffusers did a good job of improving on CompVis's implementation of LDMs and modularizing different components while maintaining the consistency of APIs. But over time the introduction of things like dreambooth, LoRA, LoHA, ControlNet, adapters etc was done with a lot of monkey-patching and sphagettification of code. In transformers, they tried really hard to have a single function or method deal with both self and cross attention mechanisms, masking, positional and relative encodings, interpolation etc. While it allows a user to use the same function/method for any model, it has led to severe parameter bloat. Just compare the original implementation of llama by FAIR with the implementation by HF to get an idea. Even simple ideas like tokenization or text generation have similar code bloats. At this point, I would be extremely wary trying to build anything on top of HF repo or even using it to understand how something is implemented. I just want to use this criticism to talk about how to build good ML-related open source software. Is it even a good idea to aim for large scale consistency across widely different models while sacrificing almost everything else? Has anyone gotten it right yet? I have found FAIR's code to be the best, and simplest, so far. submitted by /u/nivter [link] [comments]  ( 1 min )
    [P] LVE Project: First Open Source Repository of LLM Vulnerabilities
    Github: https://github.com/lve-org/lve/ Blog post: https://lve-project.org/blog/launching-the-lve-project.html Hi all! We have recently publicly announced our open source project LVE, which has the goal of tracking and documenting safety issues of LLMs in a transparent and reproducible way. The scope of the project are issues relating to privacy, security, reliability and other forms of safety failures. With LVE we want to help model makers, AI developers and the general public better understand the capabilities and vulnerabilities of LLMs. We already support a bunch of different model variants and are working on adding more. We have also launched a set of community challenges on our website, which are bounty-like mini games, in which everyone can participate and help us attack and red team LLMs. We would be happy to hear feedback, get people involved in the project or just have people play our community challenges (https://lve-project.org/challenges/). Let us know what you think! submitted by /u/bmislav [link] [comments]  ( 1 min )
    [R] Chain of Code: Reasoning with a Language Model-Augmented Code Emulator
    arXiv: https://arxiv.org/abs/2312.04474 OpenReview: https://openreview.net/forum?id=tlRUbI0Yf3 Project page: https://chain-of-code.github.io/ Abstract: Code provides a general syntactic structure to build complex programs and perform precise computations when paired with a code interpreter – we hypothesize that language models (LMs) can leverage code-writing to improve Chain of Thought reasoning not only for logic and arithmetic tasks, but also for semantic ones (and in particular, those that are a mix of both). For example, consider prompting an LM to write code that counts the number of times it detects sarcasm in an essay: the LM may struggle to write an implementation for “detect_sarcasm(string)” that can be executed by the interpreter (handling the edge cases would be insurmountable). However, LMs may still produce a valid solution if they not only write code, but also selectively “emulate” the interpreter by generating the expected output of “detect_sarcasm(string)” and other lines of code that cannot be executed. In this work, we propose Chain of Code (CoC), a simple yet surprisingly effective extension that improves LM code-driven reasoning. The key idea is to encourage LMs to format semantic sub-tasks in a program as flexible pseudocode that the interpreter can explicitly catch undefined behaviors and hand off to simulate with an LM (as an “LMulator"). Experiments demonstrate that Chain of Code outperforms Chain of Thought and other baselines across a variety of benchmarks; on BIG-Bench Hard, Chain of Code achieves 84%, a gain of 12% over Chain of Thought. CoC scales well with large and small models alike, and broadens the scope of reasoning questions that LMs can correctly answer by “thinking in code". submitted by /u/APaperADay [link] [comments]  ( 1 min )
    [D] Doubts on the implementation of LSTMs for timeseries prediction (like including weather forecasts)
    I am trying to predict many steps in the future using an LSTM. The first doubt that came across my mind is what's the correct way to predict multiple timesteps. This is what I thought: * Predicting only a single timestep for every forward pass of the model, then reusing the last output for the successive input. Sort of a self-feeding infinite loop. * Predicting a fixed time window having shape (future data points, targets). I have done this by vectorizing this matrix and having a dense layer with shape (future data points * targets) as last layer. Finally reshaping it with the desidered dimensions. Does any of these methods make any sense? What is the best option? Are there better ones? The second doubt is: If for example I have a dataset of the historical weather and the number of clients of a shop, how can I build a model that by taking as input the last N pairs of weather and n_clients is able to also take as input the weather forecast for the next days? In other words I don't want to predict the future clients only with it's historical data (n_clients + past weather) but add other features (weather forecasts) in some way to potentially increase the model prediction accuracy. submitted by /u/IonizedRay [link] [comments]  ( 1 min )
    [D] Boltzmann machine "evolution equation"
    While reading about Boltzmann machine I keep seeing expressions like "let the system reach the termal equilibrium state". I understand how this works in real physical systems, but what does it mean in the context of the ML model? Let's assume we set visible neurons to be in some particular state. What exactly is supposed to happen next? What is the rule according to which the hidden neurons will update? I'm talking about concrete instructions like take this neuron value, multiply by this coefficient etc, keep repeating until some condition. For some reason i can't find it anywhere, please help, would be much appreciated submitted by /u/fatCrookNewJersey [link] [comments]  ( 1 min )
    [D] Are diffusion models more data demanding than GANs?
    I have been working on image translation between two paired domains using CycleGAN and Pix2Pix. I have been thinking of using Diffusion Models but the dataset is small. Questions: - Are Diffusion Models more data hungry than GANs? - Can anyone point some references that discuss this issue? ​ Thank you. submitted by /u/rlopes404 [link] [comments]  ( 1 min )
    [R] Robot hand rotates tomato potato
    submitted by /u/XiaolongWang [link] [comments]  ( 1 min )
    [Research] Why aren't **Liquid Neural Networks** not prominent in AI research or practical applications as other neural network architectures?
    submitted by /u/sivav-r [link] [comments]  ( 1 min )
  • Open

    Google at NeurIPS 2023
    Posted by Catherine Armato, Program Manager, Google This week the 37th annual Conference on Neural Information Processing Systems (NeurIPS 2023), the biggest machine learning conference of the year, kicks off in New Orleans, LA. Google is proud to be a Diamond Level sponsor of NeurIPS this year and will have a strong presence with >170 accepted papers, two keynote talks, and additional contributions to the broader research community through organizational support and involvement in >20 workshops and tutorials. Google is also proud to be a Platinum Sponsor for both the Women in Machine Learning and LatinX in AI workshops. We look forward to sharing some of our extensive ML research and expanding our partnership with the broader ML research community. Attending for NeurIPS 2023 in per…  ( 103 min )
  • Open

    (Pt. 5) Inductive Logic Programming with LNN's
    submitted by /u/Neurosymbolic [link] [comments]
    The Call of Cthulhu
    submitted by /u/Cryzoth [link] [comments]
  • Open

    PPO - can the policy be fed the most recent output of the value model as an input?
    Is this too circular? Wanted to run an experiment but I wasn't sure if this is known bad idea. submitted by /u/30299578815310 [link] [comments]
    Automatic Discovery of Cognitive Strategies with Tiny Recurrent Neural Networks
    Paper: https://www.biorxiv.org/content/10.1101/2023.04.12.536629 Abstract: Normative modeling frameworks such as Bayesian inference and reward-based learning provide valuable insights into the fundamental principles of adaptive behavior. However, their ability to describe realistic animal behavior is limited by the typically small number of fitted parameters, leading to a cycle of handcrafted adjustments and model comparisons that are prone to research subjectivity. Here, we present a novel modeling approach leveraging recurrent neural networks to automatically discover the cognitive algorithms governing animal decision-making. We show that neural networks with only one or two units can predict choices of individual animals more accurately than classical cognitive models, and as accurately as larger neural networks, in three well-studied reward learning tasks. We then interpret the trained networks using dynamical systems concepts such as state-space and fixed-point attractors, leading to a unified comparison of different cognitive models and a detailed characterization of the cognitive mechanisms underlying the animal’s choices. Our approach also estimates behavior dimensionality and provides insights into the algorithms emerging in meta-reinforcement learning agents. Overall, we present a systematic approach for discovering interpretable cognitive strategies in decision-making, offering insights into neural mechanisms and a foundation for examining both healthy and dysfunctional cognition. submitted by /u/APaperADay [link] [comments]
    What simulation environment is suitable for multi-UAV reinforcement learning?
    I am using MARL for end-to-end multi-UAV visual navigation, so I need a suitable simulation environment. I want to find a simulation environment that can customize the environment map and support simulation acceleration. Is there any recommended simulation environment for UAV? PS:I tried Airsim but it is no longer maintained. At the same time, using clockspeed for simulation acceleration in Airsim will change the physical parameters, which will have a negative impact on reinforcement learning training. submitted by /u/Cute-Heron-1709 [link] [comments]
    observation space noise
    Hello I am working on an environment that is generated from past data. Since the past data it self is fixed, the agent gets overfitted on train environment (the environment generated by train data). It does well on train env, but works poorly on validation env. I am trying to solve this problem by adding some observation noise on the train env. Does this idea fill logical? Also can you guys give me some advice on training on data-generated environment? Thank you. submitted by /u/RealJuney [link] [comments]

  • Open

    [R] Everything of Thoughts: Defying the Law of Penrose Triangle for Thought Generation
    submitted by /u/Yogurt789 [link] [comments]  ( 9 min )
    [R] Easily Access Gemini AI API Using Bard, Gemini, & Python
    submitted by /u/Jasperstudio [link] [comments]  ( 9 min )
    [Project] AMLTK: A framework for building your own AutoML (AutoSklearn authors)
    Hiyo, We recently released a new python package, AutoML-Toolkit (`amltk`), built for researchers, people who want to apply AutoML to their own machine learning pipelines or for people looking to build their own bespoke AutoML tools. There's no `import torch` or `import sklearn` for AutoML and we wanted to change that. We took some of the lessons learned while building AutoSklearn and AutoPytorch, the good, the bad and the ugly and made a library that to enable the next generation of open-source AutoML tools, to allow them to be research-able but also efficient and scalable. We have some future plans and on-going work with this and we'd like to gather any feedback the community might have! Repo: https://github.com/automl/amltk Docs: https://automl.github.io/amltk/latest/ submitted by /u/EddieB_reddit [link] [comments]  ( 9 min )
    [D] anyone using databricks feature store?
    what advantages does that have? are you using timeseries feature? submitted by /u/mdghouse1986 [link] [comments]  ( 9 min )
    [P]Time Series Analysis: Cointegration test tutorial
    I have written an article on time series analysis: cointegration test. It covers real-world use cases - retail and finance, key time series concepts including the stationary test, and implementation using the statsmodels package. Hope y’all find it useful and interesting! submitted by /u/Tejas-1394 [link] [comments]  ( 9 min )
    [R] Alpha-CLIP: A CLIP Model Focusing on Wherever You Want
    CLIP is great at understanding whole images, but focusing attention on specific regions can improve performance. A new paper proposes Alpha-CLIP. The key change is adding an alpha channel to CLIP's image encoder, providing a transparency map of important areas. Alpha-CLIP runs the normal CLIP in parallel with the region-focused alpha channel. It's trained to minimize differences between the combined embeddings and text descriptions. Attention maps show that this leads to a more focused result than the base CLIP model which in turn leads to better performance. Experiments in the paper show Alpha-CLIP has better object recognition, localization, reasoning about image regions, text-to-image generation control, and 3D optimization. It outperforms CLIP in various downstream tasks like text-conditional object detection and text-guided image manipulation. There are still some limitations like handling multiple distinct regions simultaneously. But overall, the simple addition of guided attention mechanisms enhances CLIP's capabilities without losing global context. Alpha-CLIP demonstrates promising directions for focused understanding and control in large vision models. I think there are some interesting similarities to research on explicit attention registers which came out in October too (more details here). TLDR: Adding an alpha channel for region guidance boosts CLIP's performance across tasks requiring localized image understanding and generation. Full summary is here. Paper is here. submitted by /u/Successful-Western27 [link] [comments]  ( 10 min )
    [D] What are the challenges to have a self-play RL language model without having massive amount of data?
    I’ve been working with MuZero recently, and this question came into my mind. To make it simple here is an example: Let’s say you have a self play environment with preset of 100k english words. 2 agents start a conversation with randomly selected words. There is another model (teacher like an LLM) that gives the reward to each agent based on multiple factors like grammar correctness, relation to previous conversations, etc. Other than the challenge of implementing teacher properly, what other challenges might arise? Is this even realistic in theory? submitted by /u/mbrostami [link] [comments]  ( 9 min )
    [D] GPU for deep learning for medical imaging
    I'm having trouble deciding on a GPU for my research and future work, which will be centered around medical imaging, notably registration and segmentation. I previously had a nice prebuild from microcenter that had a 4090, but they bought it back from me after the defective motherboard fried two different 4090s. I'm unsure between getting something like a used RTX 8000 or an a5000. While the a5000 is a fair bit faster, the 8000 has significantly more vram, which I think would be important for my use case. For my options, the a5000 is a few hundred dollars cheaper. In the future, I intend on either getting a second a5000 (if I go that route) or upgrading to an a6000 or something comparable from the ada generation line. Most of the posts I've found don't compare these two GPUs, or are otherwise outdated. If anyone has any advice, it would be appreciated. submitted by /u/ilovebuttmeat69 [link] [comments]  ( 10 min )
    [P] Link related issues in PR automatically
    This might seem like a tiny feature but this can transform an engineering team's velocity and learning. No need to link issues in a PR now, the CodeRabbit AI code review bot can find the relevant issues and linking those with the PR. But why do people link issues in PR in the first place? For developers and PR reviewers, there are two fundamental questions to consider for every PR: What is the key issue this PR is tackling? What are the broader implications of this PR on other existing issues? And these questions are answered by identifying and connecting related issues to the PR. We developers might do a good job in linking just the one issue for which we worked on that PR but what if a bot can also find all those issues this PR impacts. Seemingly tiny but impactful feature. CodeRabbit automates issue linking CodeRabbit simplifies this task by analyzing GitHub issues, performing the similarity search and linking the most relevant ones to your PR. See it in action on real PRs Here's an example of linking related issues for okp4/ontology/pull/220 (linked automatically by CodeRabbit PR review bot). submitted by /u/gitcommitshow [link] [comments]  ( 10 min )
    [P] A new and reactive SQL+Python notebook
    submitted by /u/zero-true [link] [comments]  ( 9 min )
    [D] Machine learning vs Computer science post graduate
    Hi , I’m currently a mathematics undergraduate in my final year, my long term goal is to get into the AI/ML space and will be doing a masters and most likely phd. I’m unsure what to do my masters in. My university offers a masters called ‘machine learning in science’ with a heavy focus on different types of neural networks etc and one of the final projects would be building a miniature self driving car, however they also offer a program titled ‘computer science (artificial intelligence)’ which does give me an opportunity to focus on some AI/ML topics, but probably not as in depth as the ML one. What would be the best option going forward, For someone looking to get the front lines of AI/ML innovation in the distance future? They also offer a masters in data science, which seems to cover some of the core ml techniques but More focus on the mathematics underlying it submitted by /u/24troy [link] [comments]  ( 10 min )
    [D] People who've used RWKV, whats your wishlist for it?
    submitted by /u/vatsadev [link] [comments]  ( 10 min )
    [P] The Power of Reinforcement Learning: look how this DeepRL Sektor model found a smart, super-cool exploit for Ultimate Mortal Kombat 3 in the video of a submission on DIAMBRA competition platform!
    ​ https://preview.redd.it/s3v81leu955c1.jpg?width=1920&format=pjpg&auto=webp&s=d25d4db609bd60f86b4acea7fd50870b5bce5849 Full video link When fighting in an Endurance stage, the player faces two opponents in each round, one after the other (in this case, Kano first and Sonya second), and, in order to win, it needs to beat them both. It is not a trivial task, as player's health does not reset, so it is easy for the second opponent to win. This is what happens in the first round when Sonya kills Sektor. But look what happens in the second round, the model found an easier way to win: it almost kills Kano, the first opponent, and instead of finishing him, he engages in a robot dance to trick the game and make the round timer expire, securing a win without facing the second opponent! This is an emergent behavior resulting from RL training alone, no specific code nor reward function has been tweaked to obtain it. We have seen it happening consistenly and being leveraged by the model to circumvent the intrinsic difficulty of that specific stage. One of the most fascinating aspects of Reinforcement Learning is seeing emergent behaviors achieving a task in ways you would have never expected. submitted by /u/DIAMBRA_AIArena [link] [comments]  ( 10 min )
    [P] llm_microlibs: Building blocks for running models in distributed mode under budget constraints
    Link to project: https://github.com/microlib-org/llm_microlibs Documentation is currently lacking, coming soon Warning: very long post. TLDR: this post answers some questions I had about generating text with full, unquantized Falcon-180B under budget constraints. What is the goal The goal is to benchmark full, unquantized Falcon-180B. I chose Falcon-180B because it is the biggest open-source model available currently. I also do not use any optimization such as speculative decoding or any kind of quantization. I benchmark both for small and large context sizes. I aim for maximum utilization of the available GPUs. I use 3090 cards for all experiments, as they are easy to find in used condition (cost around 700$) and have 24GB of memory. About the model The Falcon-180B has 80 transforme…  ( 12 min )
    [D] Insight into the Real Life of a Deep Learning Programmer
    Hi everyone! I'm curious about the day-to-day reality of working as a deep learning programmer. There's a lot of buzz around the field, but I'd like to understand what the job truly entails. Is it mostly repetitive, revolving around selecting pre-built models and tweaking parameters? Or, does it involve more complexity, such as creating your own algorithms and models from scratch? I'm particularly interested in hearing from those who have firsthand experience in this field. How do you spend most of your time? What are the common tasks or challenges you face? And, how much of your work involves innovation vs routine tasks? Any insights or personal experiences would be greatly appreciated. Thanks in advance! submitted by /u/Maleficent_Average39 [link] [comments]  ( 9 min )
    [D] Which framework do you use for applying post-training quantization on image classification models?
    Hi, I just wanna apply PTQ without the calibration dataset on standard backbones such example efficientnet, mobilenet, etc, and report quantized model size, fps etc. But the implementations are so complex and most of them use different quantization pipelines. Do you have any suggestions about this problem for me? submitted by /u/m-pektas [link] [comments]  ( 9 min )
    [P] I built a tool to compare cloud GPUs. How should I improve it?
    submitted by /u/Egor_S [link] [comments]  ( 9 min )
    [P] Research Papers in November 2023
    submitted by /u/seraschka [link] [comments]  ( 9 min )
    [News] Trending on GitHub globally 3 days in a row: SuperDuperDB, a framework for integrating AI with major databases (making them super-duper)
    It is for building AI (into your) apps easily without complex pipelines and make your database intelligent (including vector search), definitely check it out: https://github.com/SuperDuperDB/superduperdb submitted by /u/escalize [link] [comments]  ( 9 min )
    [R] Which NVIDIA A100 version was used for training the model?
    In this publication, they mentioned the NVIDIA A100 for training, but I am not sure if the 40GB or the 80GB version. They added "memory" is 32GB, but maybe it is just the computer RAM. Any clues? And how would you match that with a less expensive GPU? https://github.com/sony/hFT-Transformer https://arxiv.org/pdf/2307.04305.pdf submitted by /u/alf_Lafleur [link] [comments]  ( 9 min )
    [R] Vision Grid Transformers for Document Layout Analysis
    AlibabaResearch recently (Oct 2023) published a new model for Document Layout Analysis which sets a new benchmark in the task of Document Layout Analysis. Introduction - To fully leverage multi-modal information and exploit pre-training techniques to learn better representation for DLA, in this paper, we present VGT, a two-stream Vision Grid Transformer, in which Grid Transformer (GiT) is proposed and pre-trained for 2D token-level and segment-level semantic understanding https://arxiv.org/abs/2308.14978 ​ Effect on LLM usage - VGT can dissect the page into different portions (headers, subheaders, titles, etc.) which can then be OCRed and passed to an LLM for RAG. With personal usage of VGT, it seems that even visually rich documents can be parsed easily with very little post-processing. submitted by /u/GustaMusto [link] [comments]  ( 9 min )
    [D] How do people know what the “best models” are right now?
    For some reason everybody has been crazy hype about “MistralAI” releasing a new model. Is there a general website people go to, to compare which models are currently the “leading models”? Are there specific stats people look for, to figure out which models are the best? submitted by /u/stuck-in-an-ide [link] [comments]  ( 9 min )
    [D] AAAI Conference Decisions Out!
    No email notification for me, but it’s up in CMT :) submitted by /u/benthehuman_ [link] [comments]  ( 9 min )
  • Open

    Disney princess song, but she’s a cannibal (ChatGPT + Suno)
    submitted by /u/vzakharov [link] [comments]  ( 1 min )
    AI to find out your communication style
    submitted by /u/egusa [link] [comments]  ( 1 min )
    How Emotion Recognition Technology Is Helping Brands Connect With Customers
    Decode your customer's subconscious with Entropik's emotion recognition technology. AI-based startup revolutionizes consumer research submitted by /u/ChikyChikyBoom [link] [comments]  ( 1 min )
    Google's new AI technology could be smarter than OpenAI's GPT-4
    submitted by /u/thisisinsider [link] [comments]  ( 1 min )
    GPT 3.5 at it's peak performance.
    ​ https://preview.redd.it/gn6bnasr4b5c1.png?width=671&format=png&auto=webp&s=066d15c5d5abeadfbdb7025f4065a5b64ce7c4e9 submitted by /u/Nawarajkarki [link] [comments]  ( 1 min )
    AI Image Creator
    I'm trying to take a photo of a building and make it look like I am seeing the building from behind a tree line. Like I'm in the forest and peering through the trees to see the building through them. I've tried a few different AI models, but they all just give me a random building or just modify the building in the photo. I can't seem to prompt them to keep the building provided as is, and just simply put trees in front of it even. What's the best software and approach for this type of project? submitted by /u/Ceph99 [link] [comments]  ( 1 min )
    The industries AI is disrupting are not lucrative
    The announcement of Google's Gemini, a new AI model, did not have a significant impact on the stock market. The video demo of Gemini was edited and pre-recorded, creating an illusion of real-time interaction. OpenAI's recent launch of a GPT store and subsequent firing of Sam Altman sparked speculation about the company and the AI industry as a whole. Despite the hype and large investments in AI, there is little mention of the GPT store on social media. The market for the GPT store is uncertain and may not live up to the high expectations. The industries that AI is disrupting, such as writing, digital art, chatting, and programming assistance, are not highly profitable. The use cases for AI, like creating images, are cheaper and faster than human alternatives, but the market for these services is small. Source: https://www.theintrinsicperspective.com/p/excuse-me-but-the-industries-ai-is submitted by /u/NuseAI [link] [comments]  ( 1 min )
    The EU Just Passed Sweeping New Rules to Regulate AI
    The European Union has passed the AI Act, a comprehensive set of rules for regulating artificial intelligence. The law includes bans on biometric systems that identify people using sensitive characteristics such as sexual orientation and race, as well as the indiscriminate scraping of faces from the internet. Transparency requirements for all general purpose AI models, including OpenAI's GPT-4, were also included. Companies that do not comply with the rules can be fined up to 7 percent of their global turnover. The law will take effect in stages over the next two years, with bans on prohibited AI in six months and transparency requirements in 12 months. The EU aims to set a global standard for AI regulation and ensure the safety and fundamental rights of people and businesses Source: https://www.wired.com/story/eu-ai-act/ submitted by /u/NuseAI [link] [comments]  ( 1 min )
    Differentiate between CPU overclocking and throttling?
    Differentiate between CPU overclocking and throttling? submitted by /u/Virtual-Study-Campus [link] [comments]  ( 1 min )
    Meanwhile in Europe…
    You can read about it here…. www.europarl.europa.eu/news/en/press-room/20231206IPR15699/artificial-intelligence-act-deal-on-comprehensive-rules-for-trustworthy-ai submitted by /u/dennislubberscom [link] [comments]  ( 1 min )
    Best way to programmatically extract data from a set of .pdf files?
    I’m wondering if the SaaS LLM offerings aren’t quite good enough yet for my use case. I need to extract about thirty key pieces of information from sets of PDF files programmatically. Each file set will contain between 2 to 20 files and the data is fairly complex legal content. A reasonably intelligent person could do most of this work without having a legal background for example, identifying a court case number and the name of the plaintiff. Some of the documents are several MB but most are smaller than 1 MB. Altogether I have about three thousand of these documents and will be collecting several hundred new ones every day. Anyone doing something like this right now? submitted by /u/tech_tuna [link] [comments]  ( 1 min )
  • Open

    RL competitions / world records?
    Hi, There's a large community dedicated to speedrunning various videogames. I was wondering if there's a similar thing for reinforcement learning, where people keep track of the best bot per game. AI's could be measured in there speed just like the human speedrunning variants, but also in there competitive strength which already currently exists for chess. Does a community like this exist? Would it be interesting you think or do you presume that an existing AI will dominate (e.g. AlphaZero) leaving no fun for the amateur programmers? Let me know your thoughts submitted by /u/Aggravating_Lack_454 [link] [comments]  ( 1 min )
    Problem with Truncated Quantile Critics (TQC) and n-step learning algorithm.
    Hi all! I'm implementing a TQC with n-step learning in Trackmania (I forked original repo from here: https://github.com/trackmania-rl/tmrl, my modified version here: https://github.com/Pheoxis/AITrackmania/tree/main). It compiles, but I am pretty sure that I implemented n-step learning incorrectly, but as a beginner I don't know what I did wrong. Here's my code before implementing n-step algorithm: https://github.com/Pheoxis/AITrackmania/blob/main/tmrl/custom/custom_algorithms.py. If anyone checked what I did wrong, I would be very grateful. I will also attach some plots from my last training and outputs from printed lines (https://pastebin.com/JpDfmu0L), maybe it will help :) If you need any additional information feel free to ask. On plots names such as "count", "std", "min", "max", "Q1…  ( 1 min )
    Custom pettingzoo environments
    I'm going crazy trying to use make my custom pettingzoo (parallel API) environment. It is a Mario 64 environment, I put a lot of time into allowing multiple Marios and giving images for each of them, but I seriously can't get it to work with any packages. I've tried agilerl, sb3, rllib and even a cleanRL. I get that pettingzoo isn't as mature as gymnasium but this is getting ridiculous. If U like Mario 64 U can give it a try and please keep me updated if U get it to work https://github.com/Gumbo64/sm64-AI The agilerl one runs but only for 4 or less players, idk if it learns properly though I'm leaving it on overnight now. Ray rllib freezes after the rollout for some reason The random action test that I made works perfectly though, shouldn't be a problem with resetting or action/observation space or whatever submitted by /u/Gumbo64 [link] [comments]  ( 1 min )
    how to upgrade AI with rewards only given at the end of the episode
    What do we do when we give the reward to the AI ​​only at the end of the episode and not for every action it does, for example the AI ​​won the match it received +1 reward and if it loses it gets -1 reward, what do we do in that case? submitted by /u/OneCommonMan123 [link] [comments]  ( 1 min )
    low reward oscillations in PPO
    I am trying to implement PPO algorithm where each of my episode has only 1 step and sample time of 800 sec (meaning each episode is 800 sec long) and the rewards i am getting are oscillating in from low to high rewards. (refer image) how to fix this issue while training. submitted by /u/Wide-Chef-7011 [link] [comments]  ( 1 min )
    How to train a LSTM policy with PPO? With complex actions
    Hi there, I'm trying to understand how to achieve an training loop and policy which has multiple actions as an output (which are predicted autoregressively) with e.g. LSTM (Seq2Seq, input sequence past N observations, output sequence complex action similarly how they did it in Openai Five https://openai.com/research/openai-five). By complex action I mean that there's N types of actions which are defined e.g. with multiple categories (e.g. LSTM output sequence might be for a move action something like hidden_0 -> move_action -> offset_x -> offset_y). One thing that worries me is that how I can map those actions to the corresponding logprobs and perform a backprop step, another concern is exploration, since the single action space explodes significantly I would assume that the time required to train this kind of policy is much higher due to exploration. submitted by /u/basic_r_user [link] [comments]  ( 1 min )
  • Open

    AI and Justice in a Brave New World Part 2 – Humanizing AI
    In part 1 of the series “A Different AI Scenario: AI and Justice in a Brave New World,” we outlined some requirements for the role that AI would play in enforcing our laws and regulations in a more just and fair manner and what our human legislators must do to ensure those more just and… Read More »AI and Justice in a Brave New World Part 2 – Humanizing AI The post AI and Justice in a Brave New World Part 2 – Humanizing AI appeared first on Data Science Central.  ( 22 min )
  • Open

    NeurIPS 2023 Posters Visualization
    submitted by /u/nickb [link] [comments]  ( 1 min )
  • Open

    A Latent Diffusion Model for Protein Structure Generation. (arXiv:2305.04120v2 [q-bio.BM] UPDATED)
    Proteins are complex biomolecules that perform a variety of crucial functions within living organisms. Designing and generating novel proteins can pave the way for many future synthetic biology applications, including drug discovery. However, it remains a challenging computational task due to the large modeling space of protein structures. In this study, we propose a latent diffusion model that can reduce the complexity of protein modeling while flexibly capturing the distribution of natural protein structures in a condensed latent space. Specifically, we propose an equivariant protein autoencoder that embeds proteins into a latent space and then uses an equivariant diffusion model to learn the distribution of the latent protein representations. Experimental results demonstrate that our method can effectively generate novel protein backbone structures with high designability and efficiency. The code will be made publicly available at https://github.com/divelab/AIRS/tree/main/OpenProt/LatentDiff  ( 2 min )
    Characterizing the Optimal 0-1 Loss for Multi-class Classification with a Test-time Attacker. (arXiv:2302.10722v2 [cs.LG] UPDATED)
    Finding classifiers robust to adversarial examples is critical for their safe deployment. Determining the robustness of the best possible classifier under a given threat model for a given data distribution and comparing it to that achieved by state-of-the-art training methods is thus an important diagnostic tool. In this paper, we find achievable information-theoretic lower bounds on loss in the presence of a test-time attacker for multi-class classifiers on any discrete dataset. We provide a general framework for finding the optimal 0-1 loss that revolves around the construction of a conflict hypergraph from the data and adversarial constraints. We further define other variants of the attacker-classifier game that determine the range of the optimal loss more efficiently than the full-fledged hypergraph construction. Our evaluation shows, for the first time, an analysis of the gap to optimal robustness for classifiers in the multi-class setting on benchmark datasets.  ( 2 min )
    Gradient play in stochastic games: stationary points, convergence, and sample complexity. (arXiv:2106.00198v5 [cs.LG] UPDATED)
    We study the performance of the gradient play algorithm for stochastic games (SGs), where each agent tries to maximize its own total discounted reward by making decisions independently based on current state information which is shared between agents. Policies are directly parameterized by the probability of choosing a certain action at a given state. We show that Nash equilibria (NEs) and first-order stationary policies are equivalent in this setting, and give a local convergence rate around strict NEs. Further, for a subclass of SGs called Markov potential games (which includes the setting with identical rewards as an important special case), we design a sample-based reinforcement learning algorithm and give a non-asymptotic global convergence rate analysis for both exact gradient play and our sample-based learning algorithm. Our result shows that the number of iterations to reach an $\epsilon$-NE scales linearly, instead of exponentially, with the number of agents. Local geometry and local stability are also considered, where we prove that strict NEs are local maxima of the total potential function and fully-mixed NEs are saddle points.  ( 3 min )
    Multi-scale Residual Transformer for VLF Lightning Transients Classification. (arXiv:2312.04163v1 [stat.ML])
    The utilization of Very Low Frequency (VLF) electromagnetic signals in navigation systems is widespread. However, the non-stationary behavior of lightning signals can affect VLF electromagnetic signal transmission. Accurately classifying lightning signals is important for reducing interference and noise in VLF, thereby improving the reliability and overall performance of navigation systems. In recent years, the evolution of deep learning, specifically Convolutional Neural Network (CNNs), has sparked a transformation in lightning classification, surpassing traditional statistical methodologies. Existing CNN models have limitations as they overlook the diverse attributes of lightning signals across different scales and neglect the significance of temporal sequencing in sequential signals. This study introduces an innovative multi-scale residual transform (MRTransformer) that not only has the ability to discern intricate fine-grained patterns while also weighing the significance of different aspects within the input lightning signal sequence. This model performs the attributes of the lightning signal across different scales and the level of accuracy reached 90% in the classification. In future work, this model has the potential applied to a comprehensive understanding of the localization and waveform characteristics of lightning signals.  ( 2 min )
    On the Impact of Multi-dimensional Local Differential Privacy on Fairness. (arXiv:2312.04404v1 [cs.LG])
    Automated decision systems are increasingly used to make consequential decisions in people's lives. Due to the sensitivity of the manipulated data as well as the resulting decisions, several ethical concerns need to be addressed for the appropriate use of such technologies, in particular, fairness and privacy. Unlike previous work, which focused on centralized differential privacy (DP) or local DP (LDP) for a single sensitive attribute, in this paper, we examine the impact of LDP in the presence of several sensitive attributes (i.e., multi-dimensional data) on fairness. Detailed empirical analysis on synthetic and benchmark datasets revealed very relevant observations. In particular, (1) multi-dimensional LDP is an efficient approach to reduce disparity, (2) the multi-dimensional approach of LDP (independent vs. combined) matters only at low privacy guarantees, and (3) the outcome Y distribution has an important effect on which group is more sensitive to the obfuscation. Last, we summarize our findings in the form of recommendations to guide practitioners in adopting effective privacy-preserving practices while maintaining fairness and utility in ML applications.  ( 2 min )
    Colour versus Shape Goal Misgeneralization in Reinforcement Learning: A Case Study. (arXiv:2312.03762v1 [cs.LG])
    We explore colour versus shape goal misgeneralization originally demonstrated by Di Langosco et al. (2022) in the Procgen Maze environment, where, given an ambiguous choice, the agents seem to prefer generalization based on colour rather than shape. After training over 1,000 agents in a simplified version of the environment and evaluating them on over 10 million episodes, we conclude that the behaviour can be attributed to the agents learning to detect the goal object through a specific colour channel. This choice is arbitrary. Additionally, we show how, due to underspecification, the preferences can change when retraining the agents using exactly the same procedure except for using a different random seed for the training run. Finally, we demonstrate the existence of outliers in out-of-distribution behaviour based on training random seed alone.  ( 2 min )
    A Robust and Efficient Boundary Point Detection Method by Measuring Local Direction Dispersion. (arXiv:2312.04065v1 [cs.LG])
    Boundary points pose a significant challenge for machine learning tasks, including classification, clustering, and dimensionality reduction. Due to the similarity of features, boundary areas can result in mixed-up classes or clusters, leading to a crowding problem in dimensionality reduction. To address this challenge, numerous boundary point detection methods have been developed, but they are insufficiently to accurately and efficiently identify the boundary points in non-convex structures and high-dimensional manifolds. In this work, we propose a robust and efficient method for detecting boundary points using Local Direction Dispersion (LoDD). LoDD considers that internal points are surrounded by neighboring points in all directions, while neighboring points of a boundary point tend to be distributed only in a certain directional range. LoDD adopts a density-independent K-Nearest Neighbors (KNN) method to determine neighboring points, and defines a statistic-based metric using the eigenvalues of the covariance matrix of KNN coordinates to measure the centrality of a query point. We demonstrated the validity of LoDD on five synthetic datasets (2-D and 3-D) and ten real-world benchmarks, and tested its clustering performance by equipping with two typical clustering methods, K-means and Ncut. Our results show that LoDD achieves promising and robust detection accuracy in a time-efficient manner.  ( 2 min )
    Active Learning of Discrete-Time Dynamics for Uncertainty-Aware Model Predictive Control. (arXiv:2210.12583v3 [cs.RO] UPDATED)
    Model-based control requires an accurate model of the system dynamics for precisely and safely controlling the robot in complex and dynamic environments. Moreover, in the presence of variations in the operating conditions, the model should be continuously refined to compensate for dynamics changes. In this paper, we present a self-supervised learning approach that actively models the dynamics of nonlinear robotic systems. We combine offline learning from past experience and online learning from current robot interaction with the unknown environment. These two ingredients enable a highly sample-efficient and adaptive learning process, capable of accurately inferring model dynamics in real-time even in operating regimes that greatly differ from the training distribution. Moreover, we design an uncertainty-aware model predictive controller that is heuristically conditioned to the aleatoric (data) uncertainty of the learned dynamics. This controller actively chooses the optimal control actions that (i) optimize the control performance and (ii) improve the efficiency of online learning sample collection. We demonstrate the effectiveness of our method through a series of challenging real-world experiments using a quadrotor system. Our approach showcases high resilience and generalization capabilities by consistently adapting to unseen flight conditions, while it significantly outperforms classical and adaptive control baselines.  ( 3 min )
    Loss-Optimal Classification Trees: A Generalized Framework and the Logistic Case. (arXiv:2306.00857v2 [stat.ML] UPDATED)
    The Classification Tree (CT) is one of the most common models in interpretable machine learning. Although such models are usually built with greedy strategies, in recent years, thanks to remarkable advances in Mixer-Integer Programming (MIP) solvers, several exact formulations of the learning problem have been developed. In this paper, we argue that some of the most relevant ones among these training models can be encapsulated within a general framework, whose instances are shaped by the specification of loss functions and regularizers. Next, we introduce a novel realization of this framework: specifically, we consider the logistic loss, handled in the MIP setting by a linear piece-wise approximation, and couple it with $\ell_1$-regularization terms. The resulting Optimal Logistic Tree model numerically proves to be able to induce trees with enhanced interpretability features and competitive generalization capabilities, compared to the state-of-the-art MIP-based approaches.  ( 2 min )
    Constrained Hierarchical Clustering via Graph Coarsening and Optimal Cuts. (arXiv:2312.04209v1 [cs.LG])
    Motivated by extracting and summarizing relevant information in short sentence settings, such as satisfaction questionnaires, hotel reviews, and X/Twitter, we study the problem of clustering words in a hierarchical fashion. In particular, we focus on the problem of clustering with horizontal and vertical structural constraints. Horizontal constraints are typically cannot-link and must-link among words, while vertical constraints are precedence constraints among cluster levels. We overcome state-of-the-art bottlenecks by formulating the problem in two steps: first, as a soft-constrained regularized least-squares which guides the result of a sequential graph coarsening algorithm towards the horizontal feasible set. Then, flat clusters are extracted from the resulting hierarchical tree by computing optimal cut heights based on the available constraints. We show that the resulting approach compares very well with respect to existing algorithms and is computationally light.  ( 2 min )
    Inverse distance weighting attention. (arXiv:2310.18805v2 [cs.LG] UPDATED)
    We report the effects of replacing the scaled dot-product (within softmax) attention with the negative-log of Euclidean distance. This form of attention simplifies to inverse distance weighting interpolation. Used in simple one hidden layer networks and trained with vanilla cross-entropy loss on classification problems, it tends to produce a key matrix containing prototypes and a value matrix with corresponding logits. We also show that the resulting interpretable networks can be augmented with manually-constructed prototypes to perform low-impact handling of special cases.  ( 2 min )
    MIMo: A Multi-Modal Infant Model for Studying Cognitive Development. (arXiv:2312.04318v1 [cs.AI])
    Human intelligence and human consciousness emerge gradually during the process of cognitive development. Understanding this development is an essential aspect of understanding the human mind and may facilitate the construction of artificial minds with similar properties. Importantly, human cognitive development relies on embodied interactions with the physical and social environment, which is perceived via complementary sensory modalities. These interactions allow the developing mind to probe the causal structure of the world. This is in stark contrast to common machine learning approaches, e.g., for large language models, which are merely passively ``digesting'' large amounts of training data, but are not in control of their sensory inputs. However, computational modeling of the kind of self-determined embodied interactions that lead to human intelligence and consciousness is a formidable challenge. Here we present MIMo, an open-source multi-modal infant model for studying early cognitive development through computer simulations. MIMo's body is modeled after an 18-month-old child with detailed five-fingered hands. MIMo perceives its surroundings via binocular vision, a vestibular system, proprioception, and touch perception through a full-body virtual skin, while two different actuation models allow control of his body. We describe the design and interfaces of MIMo and provide examples illustrating its use. All code is available at https://github.com/trieschlab/MIMo .  ( 2 min )
    Evaluation of Active Feature Acquisition Methods for Time-varying Feature Settings. (arXiv:2312.01530v2 [stat.ML] UPDATED)
    Machine learning methods often assume input features are available at no cost. However, in domains like healthcare, where acquiring features could be expensive or harmful, it is necessary to balance a feature's acquisition cost against its predictive value. The task of training an AI agent to decide which features to acquire is called active feature acquisition (AFA). By deploying an AFA agent, we effectively alter the acquisition strategy and trigger a distribution shift. To safely deploy AFA agents under this distribution shift, we present the problem of active feature acquisition performance evaluation (AFAPE). We examine AFAPE under i) a no direct effect (NDE) assumption, stating that acquisitions don't affect the underlying feature values; and ii) a no unobserved confounding (NUC) assumption, stating that retrospective feature acquisition decisions were only based on observed features. We show that one can apply offline reinforcement learning under the NUC assumption and missing data methods under the NDE assumption. When NUC and NDE hold, we propose a novel semi-offline reinforcement learning framework, which requires a weaker positivity assumption and yields more data-efficient estimators. We introduce three novel estimators: a direct method (DM), an inverse probability weighting (IPW), and a double reinforcement learning (DRL) estimator.
    Are Transformers with One Layer Self-Attention Using Low-Rank Weight Matrices Universal Approximators?. (arXiv:2307.14023v2 [cs.LG] UPDATED)
    Existing analyses of the expressive capacity of Transformer models have required excessively deep layers for data memorization, leading to a discrepancy with the Transformers actually used in practice. This is primarily due to the interpretation of the softmax function as an approximation of the hardmax function. By clarifying the connection between the softmax function and the Boltzmann operator, we prove that a single layer of self-attention with low-rank weight matrices possesses the capability to perfectly capture the context of an entire input sequence. As a consequence, we show that one-layer and single-head Transformers have a memorization capacity for finite samples, and that Transformers consisting of one self-attention layer with two feed-forward neural networks are universal approximators for continuous permutation equivariant functions on a compact domain.
    Distributed Bayesian Estimation in Sensor Networks: Consensus on Marginal Densities. (arXiv:2312.01227v2 [cs.LG] UPDATED)
    In this paper, we aim to design and analyze distributed Bayesian estimation algorithms for sensor networks. The challenges we address are to (i) derive a distributed provably-correct algorithm in the functional space of probability distributions over continuous variables, and (ii) leverage these results to obtain new distributed estimators restricted to subsets of variables observed by individual agents. This relates to applications such as cooperative localization and federated learning, where the data collected at any agent depends on a subset of all variables of interest. We present Bayesian density estimation algorithms using data from non-linear likelihoods at agents in centralized, distributed, and marginal distributed settings. After setting up a distributed estimation objective, we prove almost-sure convergence to the optimal set of pdfs at each agent. Then, we prove the same for a storage-aware algorithm estimating densities only over relevant variables at each agent. Finally, we present a Gaussian version of these algorithms and implement it in a mapping problem using variational inference to handle non-linear likelihood models associated with LiDAR sensing.
    Rapid detection of rare events from in situ X-ray diffraction data using machine learning. (arXiv:2312.03989v1 [cs.LG])
    High-energy X-ray diffraction methods can non-destructively map the 3D microstructure and associated attributes of metallic polycrystalline engineering materials in their bulk form. These methods are often combined with external stimuli such as thermo-mechanical loading to take snapshots over time of the evolving microstructure and attributes. However, the extreme data volumes and the high costs of traditional data acquisition and reduction approaches pose a barrier to quickly extracting actionable insights and improving the temporal resolution of these snapshots. Here we present a fully automated technique capable of rapidly detecting the onset of plasticity in high-energy X-ray microscopy data. Our technique is computationally faster by at least 50 times than the traditional approaches and works for data sets that are up to 9 times sparser than a full data set. This new technique leverages self-supervised image representation learning and clustering to transform massive data into compact, semantic-rich representations of visually salient characteristics (e.g., peak shapes). These characteristics can be a rapid indicator of anomalous events such as changes in diffraction peak shapes. We anticipate that this technique will provide just-in-time actionable information to drive smarter experiments that effectively deploy multi-modal X-ray diffraction methods that span many decades of length scales.
    Non Intrusive Intelligibility Predictor for Hearing Impaired Individuals using Self Supervised Speech Representations. (arXiv:2307.13423v3 [cs.SD] UPDATED)
    Self-supervised speech representations (SSSRs) have been successfully applied to a number of speech-processing tasks, e.g. as feature extractor for speech quality (SQ) prediction, which is, in turn, relevant for assessment and training speech enhancement systems for users with normal or impaired hearing. However, exact knowledge of why and how quality-related information is encoded well in such representations remains poorly understood. In this work, techniques for non-intrusive prediction of SQ ratings are extended to the prediction of intelligibility for hearing-impaired users. It is found that self-supervised representations are useful as input features to non-intrusive prediction models, achieving competitive performance to more complex systems. A detailed analysis of the performance depending on Clarity Prediction Challenge 1 listeners and enhancement systems indicates that more data might be needed to allow generalisation to unknown systems and (hearing-impaired) individuals
    Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models. (arXiv:2308.09778v2 [cs.CV] UPDATED)
    With pre-training of vision-and-language models (VLMs) on large-scale datasets of image-text pairs, several recent works showed that these pre-trained models lack fine-grained understanding, such as the ability to count and recognize verbs, attributes, or relationships. The focus of this work is to study the ability of these models to understand spatial relations. Previously, this has been tackled using image-text matching (e.g., Visual Spatial Reasoning benchmark) or visual question answering (e.g., GQA or VQAv2), both showing poor performance and a large gap compared to human performance. In this work, we use explainability tools to understand the causes of poor performance better and present an alternative fine-grained, compositional approach for ranking spatial clauses. We combine the evidence from grounding noun phrases corresponding to objects and their locations to compute the final rank of the spatial clause. We demonstrate the approach on representative VLMs (such as LXMERT, GPV, and MDETR) and compare and highlight their abilities to reason about spatial relationships.
    Low-complexity subspace-descent over symmetric positive definite manifold. (arXiv:2305.02041v3 [stat.ML] UPDATED)
    This work puts forth low-complexity Riemannian subspace descent algorithms for the minimization of functions over the symmetric positive definite (SPD) manifold. Different from the existing Riemannian gradient descent variants, the proposed approach utilizes carefully chosen subspaces that allow the update to be written as a product of the Cholesky factor of the iterate and a sparse matrix. The resulting updates avoid the costly matrix operations like matrix exponentiation and dense matrix multiplication, which are generally required in almost all other Riemannian optimization algorithms on SPD manifold. We further identify a broad class of functions, arising in diverse applications, such as kernel matrix learning, covariance estimation of Gaussian distributions, maximum likelihood parameter estimation of elliptically contoured distributions, and parameter estimation in Gaussian mixture model problems, over which the Riemannian gradients can be calculated efficiently. The proposed uni-directional and multi-directional Riemannian subspace descent variants incur per-iteration complexities of $\O(n)$ and $\O(n^2)$ respectively, as compared to the $\O(n^3)$ or higher complexity incurred by all existing Riemannian gradient descent variants. The superior runtime and low per-iteration complexity of the proposed algorithms is also demonstrated via numerical tests on large-scale covariance estimation and matrix square root problems.
    Bandit Algorithms for Prophet Inequality and Pandora's Box. (arXiv:2211.08586v2 [cs.DS] UPDATED)
    The Prophet Inequality and Pandora's Box problems are fundamental stochastic problem with applications in Mechanism Design, Online Algorithms, Stochastic Optimization, Optimal Stopping, and Operations Research. A usual assumption in these works is that the probability distributions of the $n$ underlying random variables are given as input to the algorithm. Since in practice these distributions need to be learned, we initiate the study of such stochastic problems in the Multi-Armed Bandits model. In the Multi-Armed Bandits model we interact with $n$ unknown distributions over $T$ rounds: in round $t$ we play a policy $x^{(t)}$ and receive a partial (bandit) feedback on the performance of $x^{(t)}$. The goal is to minimize the regret, which is the difference over $T$ rounds in the total value of the optimal algorithm that knows the distributions vs. the total value of our algorithm that learns the distributions from the partial feedback. Our main results give near-optimal $\tilde{O}(\mathsf{poly}(n)\sqrt{T})$ total regret algorithms for both Prophet Inequality and Pandora's Box. Our proofs proceed by maintaining confidence intervals on the unknown indices of the optimal policy. The exploration-exploitation tradeoff prevents us from directly refining these confidence intervals, so the main technique is to design a regret upper bound that is learnable while playing low-regret Bandit policies.
    Sem@$K$: Is my knowledge graph embedding model semantic-aware?. (arXiv:2301.05601v2 [cs.LG] UPDATED)
    Using knowledge graph embedding models (KGEMs) is a popular approach for predicting links in knowledge graphs (KGs). Traditionally, the performance of KGEMs for link prediction is assessed using rank-based metrics, which evaluate their ability to give high scores to ground-truth entities. However, the literature claims that the KGEM evaluation procedure would benefit from adding supplementary dimensions to assess. That is why, in this paper, we extend our previously introduced metric Sem@K that measures the capability of models to predict valid entities w.r.t. domain and range constraints. In particular, we consider a broad range of KGs and take their respective characteristics into account to propose different versions of Sem@K. We also perform an extensive study to qualify the abilities of KGEMs as measured by our metric. Our experiments show that Sem@K provides a new perspective on KGEM quality. Its joint analysis with rank-based metrics offers different conclusions on the predictive power of models. Regarding Sem@K, some KGEMs are inherently better than others, but this semantic superiority is not indicative of their performance w.r.t. rank-based metrics. In this work, we generalize conclusions about the relative performance of KGEMs w.r.t. rank-based and semantic-oriented metrics at the level of families of models. The joint analysis of the aforementioned metrics gives more insight into the peculiarities of each model. This work paves the way for a more comprehensive evaluation of KGEM adequacy for specific downstream tasks.
    Mixture of Dynamical Variational Autoencoders for Multi-Source Trajectory Modeling and Separation. (arXiv:2312.04167v1 [cs.LG])
    In this paper, we propose a latent-variable generative model called mixture of dynamical variational autoencoders (MixDVAE) to model the dynamics of a system composed of multiple moving sources. A DVAE model is pre-trained on a single-source dataset to capture the source dynamics. Then, multiple instances of the pre-trained DVAE model are integrated into a multi-source mixture model with a discrete observation-to-source assignment latent variable. The posterior distributions of both the discrete observation-to-source assignment variable and the continuous DVAE variables representing the sources content/position are estimated using a variational expectation-maximization algorithm, leading to multi-source trajectories estimation. We illustrate the versatility of the proposed MixDVAE model on two tasks: a computer vision task, namely multi-object tracking, and an audio processing task, namely single-channel audio source separation. Experimental results show that the proposed method works well on these two tasks, and outperforms several baseline methods.
    Causality and Explainability for Trustworthy Integrated Pest Management. (arXiv:2312.04343v1 [cs.LG])
    Pesticides serve as a common tool in agricultural pest control but significantly contribute to the climate crisis. To combat this, Integrated Pest Management (IPM) stands as a climate-smart alternative. Despite its potential, IPM faces low adoption rates due to farmers' skepticism about its effectiveness. To address this challenge, we introduce an advanced data analysis framework tailored to enhance IPM adoption. Our framework provides i) robust pest population predictions across diverse environments with invariant and causal learning, ii) interpretable pest presence predictions using transparent models, iii) actionable advice through counterfactual explanations for in-season IPM interventions, iv) field-specific treatment effect estimations, and v) assessments of the effectiveness of our advice using causal inference. By incorporating these features, our framework aims to alleviate skepticism and encourage wider adoption of IPM practices among farmers.
    A Machine Learning Approach to Two-Stage Adaptive Robust Optimization. (arXiv:2307.12409v2 [cs.LG] UPDATED)
    We propose an approach based on machine learning to solve two-stage linear adaptive robust optimization (ARO) problems with binary here-and-now variables and polyhedral uncertainty sets. We encode the optimal here-and-now decisions, the worst-case scenarios associated with the optimal here-and-now decisions, and the optimal wait-and-see decisions into what we denote as the strategy. We solve multiple similar ARO instances in advance using the column and constraint generation algorithm and extract the optimal strategies to generate a training set. We train a machine learning model that predicts high-quality strategies for the here-and-now decisions, the worst-case scenarios associated with the optimal here-and-now decisions, and the wait-and-see decisions. We also introduce an algorithm to reduce the number of different target classes the machine learning algorithm needs to be trained on. We apply the proposed approach to the facility location, the multi-item inventory control and the unit commitment problems. Our approach solves ARO problems drastically faster than the state-of-the-art algorithms with high accuracy.
    Universal Online Learning with Gradient Variations: A Multi-layer Online Ensemble Approach. (arXiv:2307.08360v2 [cs.LG] UPDATED)
    In this paper, we propose an online convex optimization approach with two different levels of adaptivity. On a higher level, our approach is agnostic to the unknown types and curvatures of the online functions, while at a lower level, it can exploit the unknown niceness of the environments and attain problem-dependent guarantees. Specifically, we obtain $\mathcal{O}(\log V_T)$, $\mathcal{O}(d \log V_T)$ and $\widehat{\mathcal{O}}(\sqrt{V_T})$ regret bounds for strongly convex, exp-concave and convex loss functions, respectively, where $d$ is the dimension, $V_T$ denotes problem-dependent gradient variations and the $\widehat{\mathcal{O}}(\cdot)$-notation omits $\log V_T$ factors. Our result not only safeguards the worst-case guarantees but also directly implies the small-loss bounds in analysis. Moreover, when applied to adversarial/stochastic convex optimization and game theory problems, our result enhances the existing universal guarantees. Our approach is based on a multi-layer online ensemble framework incorporating novel ingredients, including a carefully designed optimism for unifying diverse function types and cascaded corrections for algorithmic stability. Notably, despite its multi-layer structure, our algorithm necessitates only one gradient query per round, making it favorable when the gradient evaluation is time-consuming. This is facilitated by a novel regret decomposition with carefully designed surrogate losses.
    A Novel Federated Learning-based Intrusion Detection System for Flying Ad Hoc Networks. (arXiv:2312.04135v1 [cs.CR])
    Unmanned aerial vehicles (UAVs) in flying ad-hoc networks (FANETs) face security challenges due to the dynamic and distributed nature of these networks. This paper presents the Federated Learning-based Intrusion Detection System (FL-IDS), an innovative approach designed to improve FANET security. FL-IDS leverages federated learning to address privacy concerns of centralized intrusion detection systems. FL-IDS operates in a decentralized manner, enabling UAVs to collaboratively train a global intrusion detection model without sharing raw data. Local models are assigned to each UAV, using client-specific data, and only updated model weights are shared with a central server. This preserves privacy while utilizing collective intelligence for effective intrusion detection. Experimental results show FL-IDS's competitive performance with Central IDS (C-IDS) while mitigating privacy concerns. The Bias Towards Specific Clients (BTSC) method further enhances FL-IDS performance, surpassing C-IDS even at lower attacker ratios. A comparative analysis with traditional intrusion detection methods, including Local IDS (L-IDS), provides insights into FL-IDS's strengths. This study significantly contributes to FANET security by introducing a privacy-aware, decentralized intrusion detection approach tailored to the unique challenges of UAV networks.
    Can Large Language Models Transform Computational Social Science?. (arXiv:2305.03514v2 [cs.CL] UPDATED)
    Large Language Models (LLMs) are capable of successfully performing many language processing tasks zero-shot (without training data). If zero-shot LLMs can also reliably classify and explain social phenomena like persuasiveness and political ideology, then LLMs could augment the Computational Social Science (CSS) pipeline in important ways. This work provides a road map for using LLMs as CSS tools. Towards this end, we contribute a set of prompting best practices and an extensive evaluation pipeline to measure the zero-shot performance of 13 language models on 25 representative English CSS benchmarks. On taxonomic labeling tasks (classification), LLMs fail to outperform the best fine-tuned models but still achieve fair levels of agreement with humans. On free-form coding tasks (generation), LLMs produce explanations that often exceed the quality of crowdworkers' gold references. We conclude that the performance of today's LLMs can augment the CSS research pipeline in two ways: (1) serving as zero-shot data annotators on human annotation teams, and (2) bootstrapping challenging creative generation tasks (e.g., explaining the underlying attributes of a text). In summary, LLMs are posed to meaningfully participate in} social science analysis in partnership with humans.
    Data-Adaptive Probabilistic Likelihood Approximation for Ordinary Differential Equations. (arXiv:2306.05566v2 [stat.ML] UPDATED)
    Estimating the parameters of ordinary differential equations (ODEs) is of fundamental importance in many scientific applications. While ODEs are typically approximated with deterministic algorithms, new research on probabilistic solvers indicates that they produce more reliable parameter estimates by better accounting for numerical errors. However, many ODE systems are highly sensitive to their parameter values. This produces deep local maxima in the likelihood function -- a problem which existing probabilistic solvers have yet to resolve. Here we present a novel probabilistic ODE likelihood approximation, DALTON, which can dramatically reduce parameter sensitivity by learning from noisy ODE measurements in a data-adaptive manner. Our approximation scales linearly in both ODE variables and time discretization points, and is applicable to ODEs with both partially-unobserved components and non-Gaussian measurement models. Several examples demonstrate that DALTON produces more accurate parameter estimates via numerical optimization than existing probabilistic ODE solvers, and even in some cases than the exact ODE likelihood itself.
    Zero-Touch Networks: Towards Next-Generation Network Automation. (arXiv:2312.04159v1 [cs.NI])
    The Zero-touch network and Service Management (ZSM) framework represents an emerging paradigm in the management of the fifth-generation (5G) and Beyond (5G+) networks, offering automated self-management and self-healing capabilities to address the escalating complexity and the growing data volume of modern networks. ZSM frameworks leverage advanced technologies such as Machine Learning (ML) to enable intelligent decision-making and reduce human intervention. This paper presents a comprehensive survey of Zero-Touch Networks (ZTNs) within the ZSM framework, covering network optimization, traffic monitoring, energy efficiency, and security aspects of next-generational networks. The paper explores the challenges associated with ZSM, particularly those related to ML, which necessitate the need to explore diverse network automation solutions. In this context, the study investigates the application of Automated ML (AutoML) in ZTNs, to reduce network management costs and enhance performance. AutoML automates the selection and tuning process of a ML model for a given task. Specifically, the focus is on AutoML's ability to predict application throughput and autonomously adapt to data drift. Experimental results demonstrate the superiority of the proposed AutoML pipeline over traditional ML in terms of prediction accuracy. Integrating AutoML and ZSM concepts significantly reduces network configuration and management efforts, allowing operators to allocate more time and resources to other important tasks. The paper also provides a high-level 5G system architecture incorporating AutoML and ZSM concepts. This research highlights the potential of ZTNs and AutoML to revolutionize the management of 5G+ networks, enabling automated decision-making and empowering network operators to achieve higher efficiency, improved performance, and enhanced user experience.
    Coherent energy and force uncertainty in deep learning force fields. (arXiv:2312.04174v1 [stat.ML])
    In machine learning energy potentials for atomic systems, forces are commonly obtained as the negative derivative of the energy function with respect to atomic positions. To quantify aleatoric uncertainty in the predicted energies, a widely used modeling approach involves predicting both a mean and variance for each energy value. However, this model is not differentiable under the usual white noise assumption, so energy uncertainty does not naturally translate to force uncertainty. In this work we propose a machine learning potential energy model in which energy and force aleatoric uncertainty are linked through a spatially correlated noise process. We demonstrate our approach on an equivariant messages passing neural network potential trained on energies and forces on two out-of-equilibrium molecular datasets. Furthermore, we also show how to obtain epistemic uncertainties in this setting based on a Bayesian interpretation of deep ensemble models.
    PECANN: Parallel Efficient Clustering with Graph-Based Approximate Nearest Neighbor Search. (arXiv:2312.03940v1 [cs.DS])
    This paper studies density-based clustering of point sets. These methods use dense regions of points to detect clusters of arbitrary shapes. In particular, we study variants of density peaks clustering, a popular type of algorithm that has been shown to work well in practice. Our goal is to cluster large high-dimensional datasets, which are prevalent in practice. Prior solutions are either sequential, and cannot scale to large data, or are specialized for low-dimensional data. This paper unifies the different variants of density peaks clustering into a single framework, PECANN, by abstracting out several key steps common to this class of algorithms. One such key step is to find nearest neighbors that satisfy a predicate function, and one of the main contributions of this paper is an efficient way to do this predicate search using graph-based approximate nearest neighbor search (ANNS). To provide ample parallelism, we propose a doubling search technique that enables points to find an approximate nearest neighbor satisfying the predicate in a small number of rounds. Our technique can be applied to many existing graph-based ANNS algorithms, which can all be plugged into PECANN. We implement five clustering algorithms with PECANN and evaluate them on synthetic and real-world datasets with up to 1.28 million points and up to 1024 dimensions on a 30-core machine with two-way hyper-threading. Compared to the state-of-the-art FASTDP algorithm for high-dimensional density peaks clustering, which is sequential, our best algorithm is 45x-734x faster while achieving competitive ARI scores. Compared to the state-of-the-art parallel DPC-based algorithm, which is optimized for low dimensions, we show that PECANN is two orders of magnitude faster. As far as we know, our work is the first to evaluate DPC variants on large high-dimensional real-world image and text embedding datasets.
    DiscoBAX: Discovery of Optimal Intervention Sets in Genomic Experiment Design. (arXiv:2312.04064v1 [q-bio.QM])
    The discovery of therapeutics to treat genetically-driven pathologies relies on identifying genes involved in the underlying disease mechanisms. Existing approaches search over the billions of potential interventions to maximize the expected influence on the target phenotype. However, to reduce the risk of failure in future stages of trials, practical experiment design aims to find a set of interventions that maximally change a target phenotype via diverse mechanisms. We propose DiscoBAX, a sample-efficient method for maximizing the rate of significant discoveries per experiment while simultaneously probing for a wide range of diverse mechanisms during a genomic experiment campaign. We provide theoretical guarantees of approximate optimality under standard assumptions, and conduct a comprehensive experimental evaluation covering both synthetic as well as real-world experimental design tasks. DiscoBAX outperforms existing state-of-the-art methods for experimental design, selecting effective and diverse perturbations in biological systems.
    A novel feature selection framework for incomplete data. (arXiv:2312.04171v1 [cs.LG])
    Feature selection on incomplete datasets is an exceptionally challenging task. Existing methods address this challenge by first employing imputation methods to complete the incomplete data and then conducting feature selection based on the imputed data. Since imputation and feature selection are entirely independent steps, the importance of features cannot be considered during imputation. However, in real-world scenarios or datasets, different features have varying degrees of importance. To address this, we propose a novel incomplete data feature selection framework that considers feature importance. The framework mainly consists of two alternating iterative stages: the M-stage and the W-stage. In the M-stage, missing values are imputed based on a given feature importance vector and multiple initial imputation results. In the W-stage, an improved reliefF algorithm is employed to learn the feature importance vector based on the imputed data. Specifically, the feature importance vector obtained in the current iteration of the W-stage serves as input for the next iteration of the M-stage. Experimental results on both artificially generated and real incomplete datasets demonstrate that the proposed method outperforms other approaches significantly.
    Pearl: A Production-ready Reinforcement Learning Agent. (arXiv:2312.03814v1 [cs.LG])
    Reinforcement Learning (RL) offers a versatile framework for achieving long-term goals. Its generality allows us to formalize a wide range of problems that real-world intelligent systems encounter, such as dealing with delayed rewards, handling partial observability, addressing the exploration and exploitation dilemma, utilizing offline data to improve online performance, and ensuring safety constraints are met. Despite considerable progress made by the RL research community in addressing these issues, existing open-source RL libraries tend to focus on a narrow portion of the RL solution pipeline, leaving other aspects largely unattended. This paper introduces Pearl, a Production-ready RL agent software package explicitly designed to embrace these challenges in a modular fashion. In addition to presenting preliminary benchmark results, this paper highlights Pearl's industry adoptions to demonstrate its readiness for production usage. Pearl is open sourced on Github at github.com/facebookresearch/pearl and its official website is located at pearlagent.github.io.
    Diffusing Colors: Image Colorization with Text Guided Diffusion. (arXiv:2312.04145v1 [cs.CV])
    The colorization of grayscale images is a complex and subjective task with significant challenges. Despite recent progress in employing large-scale datasets with deep neural networks, difficulties with controllability and visual quality persist. To tackle these issues, we present a novel image colorization framework that utilizes image diffusion techniques with granular text prompts. This integration not only produces colorization outputs that are semantically appropriate but also greatly improves the level of control users have over the colorization process. Our method provides a balance between automation and control, outperforming existing techniques in terms of visual quality and semantic coherence. We leverage a pretrained generative Diffusion Model, and show that we can finetune it for the colorization task without losing its generative power or attention to text prompts. Moreover, we present a novel CLIP-based ranking model that evaluates color vividness, enabling automatic selection of the most suitable level of vividness based on the specific scene semantics. Our approach holds potential particularly for color enhancement and historical image colorization.
    Evaluation of Infrastructure-based Warning System on Driving Behaviors-A Roundabout Study. (arXiv:2312.03891v1 [cs.HC])
    Smart intersections have the potential to improve road safety with sensing, communication, and edge computing technologies. Perception sensors installed at a smart intersection can monitor the traffic environment in real time and send infrastructure-based warnings to nearby travelers through V2X communication. This paper investigated how infrastructure-based warnings can influence driving behaviors and improve roundabout safety through a driving-simulator study - a challenging driving scenario for human drivers. A co-simulation platform integrating Simulation of Urban Mobility (SUMO) and Webots was developed to serve as the driving simulator. A real-world roundabout in Ann Arbor, Michigan was built in the co-simulation platform as the study area, and the merging scenarios were investigated. 36 participants were recruited and asked to navigate the roundabout under three danger levels (e.g., low, medium, high) and three collision warning designs (e.g., no warning, warning issued 1 second in advance, warning issued 2 seconds in advance). Results indicated that advanced warnings can significantly enhance safety by minimizing potential risks compared to scenarios without warnings. Earlier warnings enabled smoother driver responses and reduced abrupt decelerations. In addition, a personalized intention prediction model was developed to predict drivers' stop-or-go decisions when the warning is displayed. Among all tested machine learning models, the XGBoost model achieved the highest prediction accuracy with a precision rate of 95.56% and a recall rate of 97.73%.
    Reinforcement Learning for Combining Search Methods in the Calibration of Economic ABMs. (arXiv:2302.11835v3 [cs.LG] UPDATED)
    Calibrating agent-based models (ABMs) in economics and finance typically involves a derivative-free search in a very large parameter space. In this work, we benchmark a number of search methods in the calibration of a well-known macroeconomic ABM on real data, and further assess the performance of "mixed strategies" made by combining different methods. We find that methods based on random-forest surrogates are particularly efficient, and that combining search methods generally increases performance since the biases of any single method are mitigated. Moving from these observations, we propose a reinforcement learning (RL) scheme to automatically select and combine search methods on-the-fly during a calibration run. The RL agent keeps exploiting a specific method only as long as this keeps performing well, but explores new strategies when the specific method reaches a performance plateau. The resulting RL search scheme outperforms any other method or method combination tested, and does not rely on any prior information or trial and error procedure.
    Improving Gradient-guided Nested Sampling for Posterior Inference. (arXiv:2312.03911v1 [cs.LG])
    We present a performant, general-purpose gradient-guided nested sampling algorithm, ${\tt GGNS}$, combining the state of the art in differentiable programming, Hamiltonian slice sampling, clustering, mode separation, dynamic nested sampling, and parallelization. This unique combination allows ${\tt GGNS}$ to scale well with dimensionality and perform competitively on a variety of synthetic and real-world problems. We also show the potential of combining nested sampling with generative flow networks to obtain large amounts of high-quality samples from the posterior distribution. This combination leads to faster mode discovery and more accurate estimates of the partition function.
    Augmentation-Free Dense Contrastive Knowledge Distillation for Efficient Semantic Segmentation. (arXiv:2312.04168v1 [cs.CV])
    In recent years, knowledge distillation methods based on contrastive learning have achieved promising results on image classification and object detection tasks. However, in this line of research, we note that less attention is paid to semantic segmentation. Existing methods heavily rely on data augmentation and memory buffer, which entail high computational resource demands when applying them to handle semantic segmentation that requires to preserve high-resolution feature maps for making dense pixel-wise predictions. In order to address this problem, we present Augmentation-free Dense Contrastive Knowledge Distillation (Af-DCD), a new contrastive distillation learning paradigm to train compact and accurate deep neural networks for semantic segmentation applications. Af-DCD leverages a masked feature mimicking strategy, and formulates a novel contrastive learning loss via taking advantage of tactful feature partitions across both channel and spatial dimensions, allowing to effectively transfer dense and structured local knowledge learnt by the teacher model to a target student model while maintaining training efficiency. Extensive experiments on five mainstream benchmarks with various teacher-student network pairs demonstrate the effectiveness of our approach. For instance, the DeepLabV3-Res18|DeepLabV3-MBV2 model trained by Af-DCD reaches 77.03%|76.38% mIOU on Cityscapes dataset when choosing DeepLabV3-Res101 as the teacher, setting new performance records. Besides that, Af-DCD achieves an absolute mIOU improvement of 3.26%|3.04%|2.75%|2.30%|1.42% compared with individually trained counterpart on Cityscapes|Pascal VOC|Camvid|ADE20K|COCO-Stuff-164K. Code is available at https://github.com/OSVAI/Af-DCD
    LIPEx-Locally Interpretable Probabilistic Explanations-To Look Beyond The True Class. (arXiv:2310.04856v2 [cs.LG] UPDATED)
    In this work, we instantiate a novel perturbation-based multi-class explanation framework, LIPEx (Locally Interpretable Probabilistic Explanation). We demonstrate that LIPEx not only locally replicates the probability distributions output by the widely used complex classification models but also provides insight into how every feature deemed to be important affects the prediction probability for each of the possible classes. We achieve this by defining the explanation as a matrix obtained via regression with respect to the Hellinger distance in the space of probability distributions. Ablation tests on text and image data, show that LIPEx-guided removal of important features from the data causes more change in predictions for the underlying model than similar tests based on other saliency-based or feature importance-based Explainable AI (XAI) methods. It is also shown that compared to LIME, LIPEx is more data efficient in terms of using a lesser number of perturbations of the data to obtain a reliable explanation. This data-efficiency is seen to manifest as LIPEx being able to compute its explanation matrix around 53% faster than all-class LIME, for classification experiments with text data.
    Point Cloud-based Proactive Link Quality Prediction for Millimeter-wave Communications. (arXiv:2301.00752v4 [cs.NI] UPDATED)
    This study demonstrates the feasibility of point cloud-based proactive link quality prediction for millimeter-wave (mmWave) communications. Previous studies have proposed machine learning-based methods to predict received signal strength for future time periods using time series of depth images to mitigate the line-of-sight (LOS) path blockage by pedestrians in mmWave communication. However, these image-based methods have limited applicability due to privacy concerns as camera images may contain sensitive information. This study proposes a point cloud-based method for mmWave link quality prediction and demonstrates its feasibility through experiments. Point clouds represent three-dimensional (3D) spaces as a set of points and are sparser and less likely to contain sensitive information than camera images. Additionally, point clouds provide 3D position and motion information, which is necessary for understanding the radio propagation environment involving pedestrians. This study designs the mmWave link quality prediction method and conducts realistic indoor experiments, where the link quality fluctuates significantly due to human blockage, using commercially available IEEE 802.11ad-based 60 GHz wireless LAN devices and Kinect v2 RGB-D camera and Velodyne VLP-16 light detection and ranging (LiDAR) for point cloud acquisition. The experimental results showed that our proposed method can predict future large attenuation of mmWave received signal strength and throughput induced by the LOS path blockage by pedestrians with comparable or superior accuracy to image-based prediction methods. Hence, our point cloud-based method can serve as a viable alternative to image-based methods.
    Guided Reconstruction with Conditioned Diffusion Models for Unsupervised Anomaly Detection in Brain MRIs. (arXiv:2312.04215v1 [eess.IV])
    Unsupervised anomaly detection in Brain MRIs aims to identify abnormalities as outliers from a healthy training distribution. Reconstruction-based approaches that use generative models to learn to reconstruct healthy brain anatomy are commonly used for this task. Diffusion models are an emerging class of deep generative models that show great potential regarding reconstruction fidelity. However, they face challenges in preserving intensity characteristics in the reconstructed images, limiting their performance in anomaly detection. To address this challenge, we propose to condition the denoising mechanism of diffusion models with additional information about the image to reconstruct coming from a latent representation of the noise-free input image. This conditioning enables high-fidelity reconstruction of healthy brain structures while aligning local intensity characteristics of input-reconstruction pairs. We evaluate our method's reconstruction quality, domain adaptation features and finally segmentation performance on publicly available data sets with various pathologies. Using our proposed conditioning mechanism we can reduce the false-positive predictions and enable a more precise delineation of anomalies which significantly enhances the anomaly detection performance compared to established state-of-the-art approaches to unsupervised anomaly detection in brain MRI. Furthermore, our approach shows promise in domain adaptation across different MRI acquisitions and simulated contrasts, a crucial property of general anomaly detection methods.
    Hidden yet quantifiable: A lower bound for confounding strength using randomized trials. (arXiv:2312.03871v1 [stat.ML])
    In the era of fast-paced precision medicine, observational studies play a major role in properly evaluating new treatments in clinical practice. Yet, unobserved confounding can significantly compromise causal conclusions drawn from non-randomized data. We propose a novel strategy that leverages randomized trials to quantify unobserved confounding. First, we design a statistical test to detect unobserved confounding with strength above a given threshold. Then, we use the test to estimate an asymptotically valid lower bound on the unobserved confounding strength. We evaluate the power and validity of our statistical test on several synthetic and semi-synthetic datasets. Further, we show how our lower bound can correctly identify the absence and presence of unobserved confounding in a real-world setting.
    Optimizing K-means for Big Data: A Comparative Study. (arXiv:2310.09819v2 [cs.LG] UPDATED)
    This paper presents a comparative analysis of different optimization techniques for the K-means algorithm in the context of big data. K-means is a widely used clustering algorithm, but it can suffer from scalability issues when dealing with large datasets. The paper explores different approaches to overcome these issues, including parallelization, approximation, and sampling methods. The authors evaluate the performance of these techniques on various benchmark datasets and compare them in terms of speed, quality of clustering, and scalability according to the LIMA dominance criterion. The results show that different techniques are more suitable for different types of datasets and provide insights into the trade-offs between speed and accuracy in K-means clustering for big data. Overall, the paper offers a comprehensive guide for practitioners and researchers on how to optimize K-means for big data applications.
    Finding Interpretable Class-Specific Patterns through Efficient Neural Search. (arXiv:2312.04311v1 [cs.LG])
    Discovering patterns in data that best describe the differences between classes allows to hypothesize and reason about class-specific mechanisms. In molecular biology, for example, this bears promise of advancing the understanding of cellular processes differing between tissues or diseases, which could lead to novel treatments. To be useful in practice, methods that tackle the problem of finding such differential patterns have to be readily interpretable by domain experts, and scalable to the extremely high-dimensional data. In this work, we propose a novel, inherently interpretable binary neural network architecture DIFFNAPS that extracts differential patterns from data. DiffNaps is scalable to hundreds of thousands of features and robust to noise, thus overcoming the limitations of current state-of-the-art methods in large-scale applications such as in biology. We show on synthetic and real world data, including three biological applications, that, unlike its competitors, DiffNaps consistently yields accurate, succinct, and interpretable class descriptions
    Estimating Countries with Similar Maternal Mortality Rate using Cluster Analysis and Pairing Countries with Identical MMR. (arXiv:2312.04275v1 [cs.LG])
    In the evolving world, we require more additionally the young era to flourish and evolve into developed land. Most of the population all around the world are unaware of the complications involved in the routine they follow while they are pregnant and how hospital facilities affect maternal health. Maternal Mortality is the death of a pregnant woman due to intricacies correlated to pregnancy, underlying circumstances exacerbated by the pregnancy or management of these situations. It is crucial to consider the Maternal Mortality Rate (MMR) in diverse locations and determine which human routines and hospital facilities diminish the Maternal Mortality Rate (MMR). This research aims to examine and discover the countries which are keeping more lavish threats of MMR and countries alike in MMR encountered. Data is examined and collected for various countries, data consists of the earlier years' observation. From the perspective of Machine Learning, Unsupervised Machine Learning is implemented to perform Cluster Analysis. Therefore the pairs of countries with similar MMR as well as the extreme opposite pair concerning the MMR are found.
    Improving Communication Efficiency of Federated Distillation via Accumulating Local Updates. (arXiv:2312.04166v1 [cs.LG])
    As an emerging federated learning paradigm, federated distillation enables communication-efficient model training by transmitting only small-scale knowledge during the learning process. To further improve the communication efficiency of federated distillation, we propose a novel technique, ALU, which accumulates multiple rounds of local updates before transferring the knowledge to the central server. ALU drastically decreases the frequency of communication in federated distillation, thereby significantly reducing the communication overhead during the training process. Empirical experiments demonstrate the substantial effect of ALU in improving the communication efficiency of federated distillation.
    Adaptive Weighted Co-Learning for Cross-Domain Few-Shot Learning. (arXiv:2312.03928v1 [cs.LG])
    Due to the availability of only a few labeled instances for the novel target prediction task and the significant domain shift between the well annotated source domain and the target domain, cross-domain few-shot learning (CDFSL) induces a very challenging adaptation problem. In this paper, we propose a simple Adaptive Weighted Co-Learning (AWCoL) method to address the CDFSL challenge by adapting two independently trained source prototypical classification models to the target task in a weighted co-learning manner. The proposed method deploys a weighted moving average prediction strategy to generate probabilistic predictions from each model, and then conducts adaptive co-learning by jointly fine-tuning the two models in an alternating manner based on the pseudo-labels and instance weights produced from the predictions. Moreover, a negative pseudo-labeling regularizer is further deployed to improve the fine-tuning process by penalizing false predictions. Comprehensive experiments are conducted on multiple benchmark datasets and the empirical results demonstrate that the proposed method produces state-of-the-art CDFSL performance.
    Learn to Unlearn for Deep Neural Networks: Minimizing Unlearning Interference with Gradient Projection. (arXiv:2312.04095v1 [cs.LG])
    Recent data-privacy laws have sparked interest in machine unlearning, which involves removing the effect of specific training samples from a learnt model as if they were never present in the original training dataset. The challenge of machine unlearning is to discard information about the ``forget'' data in the learnt model without altering the knowledge about the remaining dataset and to do so more efficiently than the naive retraining approach. To achieve this, we adopt a projected-gradient based learning method, named as Projected-Gradient Unlearning (PGU), in which the model takes steps in the orthogonal direction to the gradient subspaces deemed unimportant for the retaining dataset, so as to its knowledge is preserved. By utilizing Stochastic Gradient Descent (SGD) to update the model weights, our method can efficiently scale to any model and dataset size. We provide empirically evidence to demonstrate that our unlearning method can produce models that behave similar to models retrained from scratch across various metrics even when the training dataset is no longer accessible. Our code is available at https://github.com/hnanhtuan/projected_gradient_unlearning.
    Invariant Random Forest: Tree-Based Model Solution for OOD Generalization. (arXiv:2312.04273v1 [cs.LG])
    Out-Of-Distribution (OOD) generalization is an essential topic in machine learning. However, recent research is only focusing on the corresponding methods for neural networks. This paper introduces a novel and effective solution for OOD generalization of decision tree models, named Invariant Decision Tree (IDT). IDT enforces a penalty term with regard to the unstable/varying behavior of a split across different environments during the growth of the tree. Its ensemble version, the Invariant Random Forest (IRF), is constructed. Our proposed method is motivated by a theoretical result under mild conditions, and validated by numerical tests with both synthetic and real datasets. The superior performance compared to non-OOD tree models implies that considering OOD generalization for tree models is absolutely necessary and should be given more attention.
    DeepGraphDMD: Interpretable Spatio-Temporal Decomposition of Non-linear Functional Brain Network Dynamics. (arXiv:2306.03088v2 [cs.AI] UPDATED)
    Functional brain dynamics is supported by parallel and overlapping functional network modes that are associated with specific neural circuits. Decomposing these network modes from fMRI data and finding their temporal characteristics is challenging due to their time-varying nature and the non-linearity of the functional dynamics. Dynamic Mode Decomposition (DMD) algorithms have been quite popular for solving this decomposition problem in recent years. In this work, we apply GraphDMD -- an extension of the DMD for network data -- to extract the dynamic network modes and their temporal characteristics from the fMRI time series in an interpretable manner. GraphDMD, however, regards the underlying system as a linear dynamical system that is sub-optimal for extracting the network modes from non-linear functional data. In this work, we develop a generalized version of the GraphDMD algorithm -- DeepGraphDMD -- applicable to arbitrary non-linear graph dynamical systems. DeepGraphDMD is an autoencoder-based deep learning model that learns Koopman eigenfunctions for graph data and embeds the non-linear graph dynamics into a latent linear space. We show the effectiveness of our method in both simulated data and the HCP resting-state fMRI data. In the HCP data, DeepGraphDMD provides novel insights into cognitive brain functions by discovering two major network modes related to fluid and crystallized intelligence.
    Learning High-Dimensional Differential Graphs From Multi-Attribute Data. (arXiv:2312.03761v1 [stat.ML])
    We consider the problem of estimating differences in two Gaussian graphical models (GGMs) which are known to have similar structure. The GGM structure is encoded in its precision (inverse covariance) matrix. In many applications one is interested in estimating the difference in two precision matrices to characterize underlying changes in conditional dependencies of two sets of data. Existing methods for differential graph estimation are based on single-attribute (SA) models where one associates a scalar random variable with each node. In multi-attribute (MA) graphical models, each node represents a random vector. In this paper, we analyze a group lasso penalized D-trace loss function approach for differential graph learning from multi-attribute data. An alternating direction method of multipliers (ADMM) algorithm is presented to optimize the objective function. Theoretical analysis establishing consistency in support recovery and estimation in high-dimensional settings is provided. Numerical results based on synthetic as well as real data are presented.
    Predicting the Transportation Activities of Construction Waste Hauling Trucks: An Input-Output Hidden Markov Approach. (arXiv:2312.03780v1 [cs.LG])
    Construction waste hauling trucks (CWHTs), as one of the most commonly seen heavy-duty vehicles in major cities around the globe, are usually subject to a series of regulations and spatial-temporal access restrictions because they not only produce significant NOx and PM emissions but also causes on-road fugitive dust. The timely and accurate prediction of CWHTs' destinations and dwell times play a key role in effective environmental management. To address this challenge, we propose a prediction method based on an interpretable activity-based model, input-output hidden Markov model (IOHMM), and validate it on 300 CWHTs in Chengdu, China. Contextual factors are considered in the model to improve its prediction power. Results show that the IOHMM outperforms several baseline models, including Markov chains, linear regression, and long short-term memory. Factors influencing the predictability of CWHTs' transportation activities are also explored using linear regression models. Results suggest the proposed model holds promise in assisting authorities by predicting the upcoming transportation activities of CWHTs and administering intervention in a timely and effective manner.
    Small Area Estimation of Case Growths for Timely COVID-19 Outbreak Detection. (arXiv:2312.04110v1 [stat.ML])
    The COVID-19 pandemic has exerted a profound impact on the global economy and continues to exact a significant toll on human lives. The COVID-19 case growth rate stands as a key epidemiological parameter to estimate and monitor for effective detection and containment of the resurgence of outbreaks. A fundamental challenge in growth rate estimation and hence outbreak detection is balancing the accuracy-speed tradeoff, where accuracy typically degrades with shorter fitting windows. In this paper, we develop a machine learning (ML) algorithm, which we call Transfer Learning Generalized Random Forest (TLGRF), that balances this accuracy-speed tradeoff. Specifically, we estimate the instantaneous COVID-19 exponential growth rate for each U.S. county by using TLGRF that chooses an adaptive fitting window size based on relevant day-level and county-level features affecting the disease spread. Through transfer learning, TLGRF can accurately estimate case growth rates for counties with small sample sizes. Out-of-sample prediction analysis shows that TLGRF outperforms established growth rate estimation methods. Furthermore, we conducted a case study based on outbreak case data from the state of Colorado and showed that the timely detection of outbreaks could have been improved by up to 224% using TLGRF when compared to the decisions made by Colorado's Department of Health and Environment (CDPHE). To facilitate implementation, we have developed a publicly available outbreak detection tool for timely detection of COVID-19 outbreaks in each U.S. county, which received substantial attention from policymakers.
    XAI-TRIS: Non-linear image benchmarks to quantify false positive post-hoc attribution of feature importance. (arXiv:2306.12816v2 [cs.LG] UPDATED)
    The field of 'explainable' artificial intelligence (XAI) has produced highly cited methods that seek to make the decisions of complex machine learning (ML) methods 'understandable' to humans, for example by attributing 'importance' scores to input features. Yet, a lack of formal underpinning leaves it unclear as to what conclusions can safely be drawn from the results of a given XAI method and has also so far hindered the theoretical verification and empirical validation of XAI methods. This means that challenging non-linear problems, typically solved by deep neural networks, presently lack appropriate remedies. Here, we craft benchmark datasets for three different non-linear classification scenarios, in which the important class-conditional features are known by design, serving as ground truth explanations. Using novel quantitative metrics, we benchmark the explanation performance of a wide set of XAI methods across three deep learning model architectures. We show that popular XAI methods are often unable to significantly outperform random performance baselines and edge detection methods. Moreover, we demonstrate that explanations derived from different model architectures can be vastly different; thus, prone to misinterpretation even under controlled conditions.
    De-identification of clinical free text using natural language processing: A systematic review of current approaches. (arXiv:2312.03736v1 [cs.CL])
    Background: Electronic health records (EHRs) are a valuable resource for data-driven medical research. However, the presence of protected health information (PHI) makes EHRs unsuitable to be shared for research purposes. De-identification, i.e. the process of removing PHI is a critical step in making EHR data accessible. Natural language processing has repeatedly demonstrated its feasibility in automating the de-identification process. Objectives: Our study aims to provide systematic evidence on how the de-identification of clinical free text has evolved in the last thirteen years, and to report on the performances and limitations of the current state-of-the-art systems. In addition, we aim to identify challenges and potential research opportunities in this field. Methods: A systematic search in PubMed, Web of Science and the DBLP was conducted for studies published between January 2010 and February 2023. Titles and abstracts were examined to identify the relevant studies. Selected studies were then analysed in-depth, and information was collected on de-identification methodologies, data sources, and measured performance. Results: A total of 2125 publications were identified for the title and abstract screening. 69 studies were found to be relevant. Machine learning (37 studies) and hybrid (26 studies) approaches are predominant, while six studies relied only on rules. Majority of the approaches were trained and evaluated on public corpora. The 2014 i2b2/UTHealth corpus is the most frequently used (36 studies), followed by the 2006 i2b2 (18 studies) and 2016 CEGS N-GRID (10 studies) corpora.
    Improved Efficient Two-Stage Denoising Diffusion Power System Measurement Recovery Against False Data Injection Attacks and Data Losses. (arXiv:2312.04346v1 [cs.LG])
    Measurement uncertainties, represented by cyber-attacks and data losses, seriously degrade the quality of power system measurements. Fortunately, the powerful generation ability of the denoising diffusion models can enable more precise measurement generation for power system data recovery. However, the controllable data generation and efficient computing methods of denoising diffusion models for deterministic trajectory still need further investigation. To this end, this paper proposes an improved two-stage denoising diffusion model (TSDM) to identify and reconstruct the measurements with various measurement uncertainties. The first stage of the model comprises a classifier-guided conditional anomaly detection component, while the second stage involves diffusion-based measurement imputation component. Moreover, the proposed TSDM adopts precise means and optimal variances to accelerate the diffusion generation process with subsequence sampling. Extensive numerical case studies demonstrate that the proposed TSDM can accurately recover power system measurements despite strong randomness under renewable energy integration and highly nonlinear dynamics under complex cyber-physical contingencies. Additionally, the proposed TSDM has stronger robustness compared to existing reconstruction networks and exhibits lower computational complexity than general denoising diffusion models.
    FRAC-Q-Learning: A Reinforcement Learning with Boredom Avoidance Processes for Social Robots. (arXiv:2311.15327v2 [cs.RO] UPDATED)
    The reinforcement learning algorithms have often been applied to social robots. However, most reinforcement learning algorithms were not optimized for the use of social robots, and consequently they may bore users. We proposed a new reinforcement learning method specialized for the social robot, the FRAC-Q-learning, that can avoid user boredom. The proposed algorithm consists of a forgetting process in addition to randomizing and categorizing processes. This study evaluated interest and boredom hardness scores of the FRAC-Q-learning by a comparison with the traditional Q-learning. The FRAC-Q-learning showed significantly higher trend of interest score, and indicated significantly harder to bore users compared to the traditional Q-learning. Therefore, the FRAC-Q-learning can contribute to develop a social robot that will not bore users. The proposed algorithm can also find applications in Web-based communication and educational systems. This paper presents the entire process, detailed implementation and a detailed evaluation method of the of the FRAC-Q-learning for the first time.
    Growing ecosystem of deep learning methods for modeling protein$\unicode{x2013}$protein interactions. (arXiv:2310.06725v2 [q-bio.BM] UPDATED)
    Numerous cellular functions rely on protein$\unicode{x2013}$protein interactions. Efforts to comprehensively characterize them remain challenged however by the diversity of molecular recognition mechanisms employed within the proteome. Deep learning has emerged as a promising approach for tackling this problem by exploiting both experimental data and basic biophysical knowledge about protein interactions. Here, we review the growing ecosystem of deep learning methods for modeling protein interactions, highlighting the diversity of these biophysically-informed models and their respective trade-offs. We discuss recent successes in using representation learning to capture complex features pertinent to predicting protein interactions and interaction sites, geometric deep learning to reason over protein structures and predict complex structures, and generative modeling to design de novo protein assemblies. We also outline some of the outstanding challenges and promising new directions. Opportunities abound to discover novel interactions, elucidate their physical mechanisms, and engineer binders to modulate their functions using deep learning and, ultimately, unravel how protein interactions orchestrate complex cellular behaviors.
    Detecting algorithmic bias in medical AI-models. (arXiv:2312.02959v2 [stat.ML] UPDATED)
    With the growing prevalence of machine learning and artificial intelligence-based medical decision support systems, it is equally important to ensure that these systems provide patient outcomes in a fair and equitable fashion. This paper presents an innovative framework for detecting areas of algorithmic bias in medical-AI decision support systems. Our approach efficiently identifies potential biases in medical-AI models, specifically in the context of sepsis prediction, by employing the Classification and Regression Trees (CART) algorithm. We verify our methodology by conducting a series of synthetic data experiments, showcasing its ability to estimate areas of bias in controlled settings precisely. The effectiveness of the concept is further validated by experiments using electronic medical records from Grady Memorial Hospital in Atlanta, Georgia. These tests demonstrate the practical implementation of our strategy in a clinical environment, where it can function as a vital instrument for guaranteeing fairness and equity in AI-based medical decisions.
    Learning to sample in Cartesian MRI. (arXiv:2312.04327v1 [cs.LG])
    Despite its exceptional soft tissue contrast, Magnetic Resonance Imaging (MRI) faces the challenge of long scanning times compared to other modalities like X-ray radiography. Shortening scanning times is crucial in clinical settings, as it increases patient comfort, decreases examination costs and improves throughput. Recent advances in compressed sensing (CS) and deep learning allow accelerated MRI acquisition by reconstructing high-quality images from undersampled data. While reconstruction algorithms have received most of the focus, designing acquisition trajectories to optimize reconstruction quality remains an open question. This thesis explores two approaches to address this gap in the context of Cartesian MRI. First, we propose two algorithms, lazy LBCS and stochastic LBCS, that significantly improve upon G\"ozc\"u et al.'s greedy learning-based CS (LBCS) approach. These algorithms scale to large, clinically relevant scenarios like multi-coil 3D MR and dynamic MRI, previously inaccessible to LBCS. Additionally, we demonstrate that generative adversarial networks (GANs) can serve as a natural criterion for adaptive sampling by leveraging variance in the measurement domain to guide acquisition. Second, we delve into the underlying structures or assumptions that enable mask design algorithms to perform well in practice. Our experiments reveal that state-of-the-art deep reinforcement learning (RL) approaches, while capable of adaptation and long-horizon planning, offer only marginal improvements over stochastic LBCS, which is neither adaptive nor does long-term planning. Altogether, our findings suggest that stochastic LBCS and similar methods represent promising alternatives to deep RL. They shine in particular by their scalability and computational efficiency and could be key in the deployment of optimized acquisition trajectories in Cartesian MRI.
    Alpha-CLIP: A CLIP Model Focusing on Wherever You Want. (arXiv:2312.03818v1 [cs.CV])
    Contrastive Language-Image Pre-training (CLIP) plays an essential role in extracting valuable content information from images across diverse tasks. It aligns textual and visual modalities to comprehend the entire image, including all the details, even those irrelevant to specific tasks. However, for a finer understanding and controlled editing of images, it becomes crucial to focus on specific regions of interest, which can be indicated as points, masks, or boxes by humans or perception models. To fulfill the requirements, we introduce Alpha-CLIP, an enhanced version of CLIP with an auxiliary alpha channel to suggest attentive regions and fine-tuned with constructed millions of RGBA region-text pairs. Alpha-CLIP not only preserves the visual recognition ability of CLIP but also enables precise control over the emphasis of image contents. It demonstrates effectiveness in various tasks, including but not limited to open-world recognition, multimodal large language models, and conditional 2D / 3D generation. It has a strong potential to serve as a versatile tool for image-related tasks.
    Internal Representations of Vision Models Through the Lens of Frames on Data Manifolds. (arXiv:2211.10558v2 [cs.LG] UPDATED)
    While the last five years have seen considerable progress in understanding the internal representations of deep learning models, many questions remain. This is especially true when trying to understand the impact of model design choices, such as model architecture or training algorithm, on hidden representation geometry and dynamics. In this work we present a new approach to studying such representations inspired by the idea of a frame on the tangent bundle of a manifold. Our construction, which we call a neural frame, is formed by assembling a set of vectors representing specific types of perturbations of a data point, for example infinitesimal augmentations, noise perturbations, or perturbations produced by a generative model, and studying how these change as they pass through a network. Using neural frames, we make observations about the way that models process, layer-by-layer, specific modes of variation within a small neighborhood of a datapoint. Our results provide new perspectives on a number of phenomena, such as the manner in which training with augmentation produces model invariance or the proposed trade-off between adversarial training and model generalization.  ( 2 min )
    Classical Verification of Quantum Learning. (arXiv:2306.04843v2 [quant-ph] UPDATED)
    Quantum data access and quantum processing can make certain classically intractable learning tasks feasible. However, quantum capabilities will only be available to a select few in the near future. Thus, reliable schemes that allow classical clients to delegate learning to untrusted quantum servers are required to facilitate widespread access to quantum learning advantages. Building on a recently introduced framework of interactive proof systems for classical machine learning, we develop a framework for classical verification of quantum learning. We exhibit learning problems that a classical learner cannot efficiently solve on their own, but that they can efficiently and reliably solve when interacting with an untrusted quantum prover. Concretely, we consider the problems of agnostic learning parities and Fourier-sparse functions with respect to distributions with uniform input marginal. We propose a new quantum data access model that we call "mixture-of-superpositions" quantum examples, based on which we give efficient quantum learning algorithms for these tasks. Moreover, we prove that agnostic quantum parity and Fourier-sparse learning can be efficiently verified by a classical verifier with only random example or statistical query access. Finally, we showcase two general scenarios in learning and verification in which quantum mixture-of-superpositions examples do not lead to sample complexity improvements over classical data. Our results demonstrate that the potential power of quantum data for learning tasks, while not unlimited, can be utilized by classical agents through interaction with untrusted quantum entities.
    Preserving privacy in domain transfer of medical AI models comes at no performance costs: The integral role of differential privacy. (arXiv:2306.06503v2 [cs.LG] UPDATED)
    Developing robust and effective artificial intelligence (AI) models in medicine requires access to large amounts of patient data. The use of AI models solely trained on large multi-institutional datasets can help with this, yet the imperative to ensure data privacy remains, particularly as membership inference risks breaching patient confidentiality. As a proposed remedy, we advocate for the integration of differential privacy (DP). We specifically investigate the performance of models trained with DP as compared to models trained without DP on data from institutions that the model had not seen during its training (i.e., external validation) - the situation that is reflective of the clinical use of AI models. By leveraging more than 590,000 chest radiographs from five institutions, we evaluated the efficacy of DP-enhanced domain transfer (DP-DT) in diagnosing cardiomegaly, pleural effusion, pneumonia, atelectasis, and in identifying healthy subjects. We juxtaposed DP-DT with non-DP-DT and examined diagnostic accuracy and demographic fairness using the area under the receiver operating characteristic curve (AUC) as the main metric, as well as accuracy, sensitivity, and specificity. Our results show that DP-DT, even with exceptionally high privacy levels (epsilon around 1), performs comparably to non-DP-DT (P>0.119 across all domains). Furthermore, DP-DT led to marginal AUC differences - less than 1% - for nearly all subgroups, relative to non-DP-DT. Despite consistent evidence suggesting that DP models induce significant performance degradation for on-domain applications, we show that off-domain performance is almost not affected. Therefore, we ardently advocate for the adoption of DP in training diagnostic medical AI models, given its minimal impact on performance.
    A Scalable Network-Aware Multi-Agent Reinforcement Learning Framework for Decentralized Inverter-based Voltage Control. (arXiv:2312.04371v1 [math.OC])
    This paper addresses the challenges associated with decentralized voltage control in power grids due to an increase in distributed generations (DGs). Traditional model-based voltage control methods struggle with the rapid energy fluctuations and uncertainties of these DGs. While multi-agent reinforcement learning (MARL) has shown potential for decentralized secondary control, scalability issues arise when dealing with a large number of DGs. This problem lies in the dominant centralized training and decentralized execution (CTDE) framework, where the critics take global observations and actions. To overcome these challenges, we propose a scalable network-aware (SNA) framework that leverages network structure to truncate the input to the critic's Q-function, thereby improving scalability and reducing communication costs during training. Further, the SNA framework is theoretically grounded with provable approximation guarantee, and it can seamlessly integrate with multiple multi-agent actor-critic algorithms. The proposed SNA framework is successfully demonstrated in a system with 114 DGs, providing a promising solution for decentralized voltage control in increasingly complex power grid systems.
    Large Language Models as Optimizers. (arXiv:2309.03409v2 [cs.LG] UPDATED)
    Optimization is ubiquitous. While derivative-based algorithms have been powerful tools for various problems, the absence of gradient imposes challenges on many real-world applications. In this work, we propose Optimization by PROmpting (OPRO), a simple and effective approach to leverage large language models (LLMs) as optimizers, where the optimization task is described in natural language. In each optimization step, the LLM generates new solutions from the prompt that contains previously generated solutions with their values, then the new solutions are evaluated and added to the prompt for the next optimization step. We first showcase OPRO on linear regression and traveling salesman problems, then move on to prompt optimization where the goal is to find instructions that maximize the task accuracy. With a variety of LLMs, we demonstrate that the best prompts optimized by OPRO outperform human-designed prompts by up to 8% on GSM8K, and by up to 50% on Big-Bench Hard tasks. Code at https://github.com/google-deepmind/opro.
    Merging by Matching Models in Task Subspaces. (arXiv:2312.04339v1 [cs.LG])
    Model merging aims to cheaply combine individual task-specific models into a single multitask model. In this work, we view past merging methods as leveraging different notions of a ''task subspace'' in which models are matched before being merged. We connect the task subspace of a given model to its loss landscape and formalize how this approach to model merging can be seen as solving a linear system of equations. While past work has generally been limited to linear systems that have a closed-form solution, we consider using the conjugate gradient method to find a solution. We show that using the conjugate gradient method can outperform closed-form solutions, enables merging via linear systems that are otherwise intractable to solve, and flexibly allows choosing from a wide variety of initializations and estimates for the ''task subspace''. We ultimately demonstrate that our merging framework called ''Matching Models in their Task Subspace'' (MaTS) achieves state-of-the-art results in multitask and intermediate-task model merging. We release all of the code and checkpoints used in our work at https://github.com/r-three/mats.
    Model-Based Epistemic Variance of Values for Risk-Aware Policy Optimization. (arXiv:2312.04386v1 [cs.LG])
    We consider the problem of quantifying uncertainty over expected cumulative rewards in model-based reinforcement learning. In particular, we focus on characterizing the variance over values induced by a distribution over MDPs. Previous work upper bounds the posterior variance over values by solving a so-called uncertainty Bellman equation (UBE), but the over-approximation may result in inefficient exploration. We propose a new UBE whose solution converges to the true posterior variance over values and leads to lower regret in tabular exploration problems. We identify challenges to apply the UBE theory beyond tabular problems and propose a suitable approximation. Based on this approximation, we introduce a general-purpose policy optimization algorithm, Q-Uncertainty Soft Actor-Critic (QU-SAC), that can be applied for either risk-seeking or risk-averse policy optimization with minimal changes. Experiments in both online and offline RL demonstrate improved performance compared to other uncertainty estimation methods.
    Unsupervised Video Domain Adaptation with Masked Pre-Training and Collaborative Self-Training. (arXiv:2312.02914v2 [cs.CV] UPDATED)
    In this work, we tackle the problem of unsupervised domain adaptation (UDA) for video action recognition. Our approach, which we call UNITE, uses an image teacher model to adapt a video student model to the target domain. UNITE first employs self-supervised pre-training to promote discriminative feature learning on target domain videos using a teacher-guided masked distillation objective. We then perform self-training on masked target data, using the video student model and image teacher model together to generate improved pseudolabels for unlabeled target videos. Our self-training process successfully leverages the strengths of both models to achieve strong transfer performance across domains. We evaluate our approach on multiple video domain adaptation benchmarks and observe significant improvements upon previously reported results.
    Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models. (arXiv:2305.13840v2 [cs.CV] UPDATED)
    Recent advancements in diffusion models have unlocked unprecedented abilities in visual creation. However, current text-to-video generation models struggle with the trade-off among movement range, action coherence and object consistency. To mitigate this issue, we present a controllable text-to-video (T2V) diffusion model, called Control-A-Video, capable of maintaining consistency while customizable video synthesis. Based on a pre-trained conditional text-to-image (T2I) diffusion model, our model aims to generate videos conditioned on a sequence of control signals, such as edge or depth maps. For the purpose of improving object consistency, Control-A-Video integrates motion priors and content priors into video generation. We propose two motion-adaptive noise initialization strategies, which are based on pixel residual and optical flow, to introduce motion priors from input videos, producing more coherent videos. Moreover, a first-frame conditioned controller is proposed to generate videos from content priors of the first frame, which facilitates the semantic alignment with text and allows longer video generation in an auto-regressive manner. With the proposed architecture and strategies, our model achieves resource-efficient convergence and generate consistent and coherent videos with fine-grained control. Extensive experiments demonstrate its success in various video generative tasks such as video editing and video style transfer, outperforming previous methods in terms of consistency and quality.
    Efficient LLM Inference on CPUs. (arXiv:2311.00502v2 [cs.LG] UPDATED)
    Large language models (LLMs) have demonstrated remarkable performance and tremendous potential across a wide range of tasks. However, deploying these models has been challenging due to the astronomical amount of model parameters, which requires a demand for large memory capacity and high memory bandwidth. In this paper, we propose an effective approach that can make the deployment of LLMs more efficiently. We support an automatic INT4 weight-only quantization flow and design a special LLM runtime with highly-optimized kernels to accelerate the LLM inference on CPUs. We demonstrate the general applicability of our approach on popular LLMs including Llama2, Llama, GPT-NeoX, and showcase the extreme inference efficiency on CPUs. The code is publicly available at: https://github.com/intel/intel-extension-for-transformers.
    Reconstruction of dynamical systems from data without time labels. (arXiv:2312.04038v1 [cs.LG])
    In this paper, we study the method to reconstruct dynamical systems from data without time labels. Data without time labels appear in many applications, such as molecular dynamics, single-cell RNA sequencing etc. Reconstruction of dynamical system from time sequence data has been studied extensively. However, these methods do not apply if time labels are unknown. Without time labels, sequence data becomes distribution data. Based on this observation, we propose to treat the data as samples from a probability distribution and try to reconstruct the underlying dynamical system by minimizing the distribution loss, sliced Wasserstein distance more specifically. Extensive experiment results demonstrate the effectiveness of the proposed method.  ( 2 min )
    Constrained Few-Shot Learning: Human-Like Low Sample Complexity Learning and Non-Episodic Text Classification. (arXiv:2208.08089v2 [cs.LG] UPDATED)
    Few-shot learning (FSL) is an emergent paradigm of learning that attempts to learn to reason with low sample complexity to mimic the way humans learn, generalise and extrapolate from only a few seen examples. While FSL attempts to mimic these human characteristics, fundamentally, the task of FSL as conventionally formulated using meta-learning with episodic-based training does not in actuality align with how humans acquire and reason with knowledge. FSL with episodic training, while only requires $K$ instances of each test class, still requires a large number of labelled training instances from disjoint classes. In this paper, we introduce the novel task of constrained few-shot learning (CFSL), a special case of FSL where $M$, the number of instances of each training class is constrained such that $M \leq K$ thus applying a similar restriction during FSL training and test. We propose a method for CFSL leveraging Cat2Vec using a novel categorical contrastive loss inspired by cognitive theories such as fuzzy trace theory and prototype theory.
    Adapting Newton's Method to Neural Networks through a Summary of Higher-Order Derivatives. (arXiv:2312.03885v1 [cs.LG])
    We consider a gradient-based optimization method applied to a function $\mathcal{L}$ of a vector of variables $\boldsymbol{\theta}$, in the case where $\boldsymbol{\theta}$ is represented as a tuple of tensors $(\mathbf{T}_1, \cdots, \mathbf{T}_S)$. This framework encompasses many common use-cases, such as training neural networks by gradient descent. First, we propose a computationally inexpensive technique providing higher-order information on $\mathcal{L}$, especially about the interactions between the tensors $\mathbf{T}_s$, based on automatic differentiation and computational tricks. Second, we use this technique at order 2 to build a second-order optimization method which is suitable, among other things, for training deep neural networks of various architectures. This second-order method leverages the partition structure of $\boldsymbol{\theta}$ into tensors $(\mathbf{T}_1, \cdots, \mathbf{T}_S)$, in such a way that it requires neither the computation of the Hessian of $\mathcal{L}$ according to $\boldsymbol{\theta}$, nor any approximation of it. The key part consists in computing a smaller matrix interpretable as a "Hessian according to the partition", which can be computed exactly and efficiently. In contrast to many existing practical second-order methods used in neural networks, which perform a diagonal or block-diagonal approximation of the Hessian or its inverse, the method we propose does not neglect interactions between layers. Finally, we can tune the coarseness of the partition to recover well-known optimization methods: the coarsest case corresponds to Cauchy's steepest descent method, the finest case corresponds to the usual Newton's method.
    k* Distribution: Evaluating the Latent Space of Deep Neural Networks using Local Neighborhood Analysis. (arXiv:2312.04024v1 [cs.LG])
    Most examinations of neural networks' learned latent spaces typically employ dimensionality reduction techniques such as t-SNE or UMAP. While these methods effectively capture the overall sample distribution in the entire learned latent space, they tend to distort the structure of sample distributions within specific classes in the subset of the latent space. This distortion complicates the task of easily distinguishing classes identifiable by neural networks. In response to this challenge, we introduce the k* Distribution methodology. This approach focuses on capturing the characteristics and structure of sample distributions for individual classes within the subset of the learned latent space using local neighborhood analysis. The key concept is to facilitate easy comparison of different k* distributions, enabling analysis of how various classes are processed by the same neural network. This provides a more profound understanding of existing contemporary visualizations. Our study reveals three distinct distributions of samples within the learned latent space subset: a) Fractured, b) Overlapped, and c) Clustered. We note and demonstrate that the distribution of samples within the network's learned latent space significantly varies depending on the class. Furthermore, we illustrate that our analysis can be applied to explore the latent space of diverse neural network architectures, various layers within neural networks, transformations applied to input samples, and the distribution of training and testing data for neural networks. We anticipate that our approach will facilitate more targeted investigations into neural networks by collectively examining the distribution of different samples within the learned latent space.
    Deep Learning for Hate Speech Detection: A Comparative Study. (arXiv:2202.09517v2 [cs.CL] UPDATED)
    Automated hate speech detection is an important tool in combating the spread of hate speech, particularly in social media. Numerous methods have been developed for the task, including a recent proliferation of deep-learning based approaches. A variety of datasets have also been developed, exemplifying various manifestations of the hate-speech detection problem. We present here a large-scale empirical comparison of deep and shallow hate-speech detection methods, mediated through the three most commonly used datasets. Our goal is to illuminate progress in the area, and identify strengths and weaknesses in the current state-of-the-art. We particularly focus our analysis on measures of practical performance, including detection accuracy, computational efficiency, capability in using pre-trained models, and domain generalization. In doing so we aim to provide guidance as to the use of hate-speech detection in practice, quantify the state-of-the-art, and identify future research directions. Code and dataset are available at https://github.com/jmjmalik22/Hate-Speech-Detection.
    Retrieval-Based Reconstruction For Time-series Contrastive Learning. (arXiv:2311.00519v2 [cs.LG] UPDATED)
    The success of self-supervised contrastive learning hinges on identifying positive data pairs that, when pushed together in embedding space, encode useful information for subsequent downstream tasks. However, in time-series, this is challenging because creating positive pairs via augmentations may break the original semantic meaning. We hypothesize that if we can retrieve information from one subsequence to successfully reconstruct another subsequence, then they should form a positive pair. Harnessing this intuition, we introduce our novel approach: REtrieval-BAsed Reconstruction (REBAR) contrastive learning. First, we utilize a convolutional cross-attention architecture to calculate the REBAR error between two different time-series. Then, through validation experiments, we show that the REBAR error is a predictor of mutual class membership, justifying its usage as a positive/negative labeler. Finally, once integrated into a contrastive learning framework, our REBAR method can learn an embedding that achieves state-of-the-art performance on downstream tasks across various modalities.
    LineConGraphs: Line Conversation Graphs for Effective Emotion Recognition using Graph Neural Networks. (arXiv:2312.03756v1 [cs.CL])
    Emotion Recognition in Conversations (ERC) is a critical aspect of affective computing, and it has many practical applications in healthcare, education, chatbots, and social media platforms. Earlier approaches for ERC analysis involved modeling both speaker and long-term contextual information using graph neural network architectures. However, it is ideal to deploy speaker-independent models for real-world applications. Additionally, long context windows can potentially create confusion in recognizing the emotion of an utterance in a conversation. To overcome these limitations, we propose novel line conversation graph convolutional network (LineConGCN) and graph attention (LineConGAT) models for ERC analysis. These models are speaker-independent and built using a graph construction strategy for conversations -- line conversation graphs (LineConGraphs). The conversational context in LineConGraphs is short-term -- limited to one previous and future utterance, and speaker information is not part of the graph. We evaluate the performance of our proposed models on two benchmark datasets, IEMOCAP and MELD, and show that our LineConGAT model outperforms the state-of-the-art methods with an F1-score of 64.58% and 76.50%. Moreover, we demonstrate that embedding sentiment shift information into line conversation graphs further enhances the ERC performance in the case of GCN models.
    Learning Genomic Sequence Representations using Graph Neural Networks over De Bruijn Graphs. (arXiv:2312.03865v1 [cs.LG])
    The rapid expansion of genomic sequence data calls for new methods to achieve robust sequence representations. Existing techniques often neglect intricate structural details, emphasizing mainly contextual information. To address this, we developed k-mer embeddings that merge contextual and structural string information by enhancing De Bruijn graphs with structural similarity connections. Subsequently, we crafted a self-supervised method based on Contrastive Learning that employs a heterogeneous Graph Convolutional Network encoder and constructs positive pairs based on node similarities. Our embeddings consistently outperform prior techniques for Edit Distance Approximation and Closest String Retrieval tasks.
    PerSival: Neural-network-based visualisation for pervasive continuum-mechanical simulations in musculoskeletal biomechanics. (arXiv:2312.03957v1 [q-bio.TO])
    This paper presents a novel neural network architecture for the purpose of pervasive visualisation of a 3D human upper limb musculoskeletal system model. Bringing simulation capabilities to resource-poor systems like mobile devices is of growing interest across many research fields, to widen applicability of methods and results. Until recently, this goal was thought to be out of reach for realistic continuum-mechanical simulations of musculoskeletal systems, due to prohibitive computational cost. Within this work we use a sparse grid surrogate to capture the surface deformation of the m.~biceps brachii in order to train a deep learning model, used for real-time visualisation of the same muscle. Both these surrogate models take 5 muscle activation levels as input and output Cartesian coordinate vectors for each mesh node on the muscle's surface. Thus, the neural network architecture features a significantly lower input than output dimension. 5 muscle activation levels were sufficient to achieve an average error of 0.97 +/- 0.16 mm, or 0.57 +/- 0.10 % for the 2809 mesh node positions of the biceps. The model achieved evaluation times of 9.88 ms per predicted deformation state on CPU only and 3.48 ms with GPU-support, leading to theoretical frame rates of 101 fps and 287 fps respectively. Deep learning surrogates thus provide a way to make continuum-mechanical simulations accessible for visual real-time applications.
    SCStory: Self-supervised and Continual Online Story Discovery. (arXiv:2312.03725v1 [cs.CL])
    We present a framework SCStory for online story discovery, that helps people digest rapidly published news article streams in real-time without human annotations. To organize news article streams into stories, existing approaches directly encode the articles and cluster them based on representation similarity. However, these methods yield noisy and inaccurate story discovery results because the generic article embeddings do not effectively reflect the story-indicative semantics in an article and cannot adapt to the rapidly evolving news article streams. SCStory employs self-supervised and continual learning with a novel idea of story-indicative adaptive modeling of news article streams. With a lightweight hierarchical embedding module that first learns sentence representations and then article representations, SCStory identifies story-relevant information of news articles and uses them to discover stories. The embedding module is continuously updated to adapt to evolving news streams with a contrastive learning objective, backed up by two unique techniques, confidence-aware memory replay and prioritized-augmentation, employed for label absence and data scarcity problems. Thorough experiments on real and the latest news data sets demonstrate that SCStory outperforms existing state-of-the-art algorithms for unsupervised online story discovery.
    Similarity-based Knowledge Transfer for Cross-Domain Reinforcement Learning. (arXiv:2312.03764v1 [cs.LG])
    Transferring knowledge in cross-domain reinforcement learning is a challenging setting in which learning is accelerated by reusing knowledge from a task with different observation and/or action space. However, it is often necessary to carefully select the source of knowledge for the receiving end to benefit from the transfer process. In this article, we study how to measure the similarity between cross-domain reinforcement learning tasks to select a source of knowledge that will improve the performance of the learning agent. We developed a semi-supervised alignment loss to match different spaces with a set of encoder-decoders, and use them to measure similarity and transfer policies across tasks. In comparison to prior works, our method does not require data to be aligned, paired or collected by expert policies. Experimental results, on a set of varied Mujoco control tasks, show the robustness of our method in effectively selecting and transferring knowledge, without the supervision of a tailored set of source tasks.
    MICRO: Model-Based Offline Reinforcement Learning with a Conservative Bellman Operator. (arXiv:2312.03991v1 [cs.LG])
    Offline reinforcement learning (RL) faces a significant challenge of distribution shift. Model-free offline RL penalizes the Q value for out-of-distribution (OOD) data or constrains the policy closed to the behavior policy to tackle this problem, but this inhibits the exploration of the OOD region. Model-based offline RL, which uses the trained environment model to generate more OOD data and performs conservative policy optimization within that model, has become an effective method for this problem. However, the current model-based algorithms rarely consider agent robustness when incorporating conservatism into policy. Therefore, the new model-based offline algorithm with a conservative Bellman operator (MICRO) is proposed. This method trades off performance and robustness via introducing the robust Bellman operator into the algorithm. Compared with previous model-based algorithms with robust adversarial models, MICRO can significantly reduce the computation cost by only choosing the minimal Q value in the state uncertainty set. Extensive experiments demonstrate that MICRO outperforms prior RL algorithms in offline RL benchmark and is considerably robust to adversarial perturbations.
    A Scalable and Generalizable Pathloss Map Prediction. (arXiv:2312.03950v1 [cs.IT])
    Large-scale channel prediction, i.e., estimation of the pathloss from geographical/morphological/building maps, is an essential component of wireless network planning. Ray tracing (RT)-based methods have been widely used for many years, but they require significant computational effort that may become prohibitive with the increased network densification and/or use of higher frequencies in B5G/6G systems. In this paper, we propose a data-driven, model-free pathloss map prediction (PMP) method, called PMNet. PMNet uses a supervised learning approach: it is trained on a limited amount of RT (or channel measurement) data and map data. Once trained, PMNet can predict pathloss over location with high accuracy (an RMSE level of $10^{-2}$) in a few milliseconds. We further extend PMNet by employing transfer learning (TL). TL allows PMNet to learn a new network scenario quickly (x5.6 faster training) and efficiently (using x4.5 less data) by transferring knowledge from a pre-trained model, while retaining accuracy. Our results demonstrate that PMNet is a scalable and generalizable ML-based PMP method, showing its potential to be used in several network optimization applications.
    Graph Convolutions Enrich the Self-Attention in Transformers!. (arXiv:2312.04234v1 [cs.LG])
    Transformers, renowned for their self-attention mechanism, have achieved state-of-the-art performance across various tasks in natural language processing, computer vision, time-series modeling, etc. However, one of the challenges with deep Transformer models is the oversmoothing problem, where representations across layers converge to indistinguishable values, leading to significant performance degradation. We interpret the original self-attention as a simple graph filter and redesign it from a graph signal processing (GSP) perspective. We propose graph-filter-based self-attention (GFSA) to learn a general yet effective one, whose complexity, however, is slightly larger than that of the original self-attention mechanism. We demonstrate that GFSA improves the performance of Transformers in various fields, including computer vision, natural language processing, graph pattern classification, speech recognition, and code classification.
    DiffusionPhase: Motion Diffusion in Frequency Domain. (arXiv:2312.04036v1 [cs.CV])
    In this study, we introduce a learning-based method for generating high-quality human motion sequences from text descriptions (e.g., ``A person walks forward"). Existing techniques struggle with motion diversity and smooth transitions in generating arbitrary-length motion sequences, due to limited text-to-motion datasets and the pose representations used that often lack expressiveness or compactness. To address these issues, we propose the first method for text-conditioned human motion generation in the frequency domain of motions. We develop a network encoder that converts the motion space into a compact yet expressive parameterized phase space with high-frequency details encoded, capturing the local periodicity of motions in time and space with high accuracy. We also introduce a conditional diffusion model for predicting periodic motion parameters based on text descriptions and a start pose, efficiently achieving smooth transitions between motion sequences associated with different text descriptions. Experiments demonstrate that our approach outperforms current methods in generating a broader variety of high-quality motions, and synthesizing long sequences with natural transitions.
    Scaling transformer neural networks for skillful and reliable medium-range weather forecasting. (arXiv:2312.03876v1 [physics.ao-ph])
    Weather forecasting is a fundamental problem for anticipating and mitigating the impacts of climate change. Recently, data-driven approaches for weather forecasting based on deep learning have shown great promise, achieving accuracies that are competitive with operational systems. However, those methods often employ complex, customized architectures without sufficient ablation analysis, making it difficult to understand what truly contributes to their success. Here we introduce Stormer, a simple transformer model that achieves state-of-the-art performance on weather forecasting with minimal changes to the standard transformer backbone. We identify the key components of Stormer through careful empirical analyses, including weather-specific embedding, randomized dynamics forecast, and pressure-weighted loss. At the core of Stormer is a randomized forecasting objective that trains the model to forecast the weather dynamics over varying time intervals. During inference, this allows us to produce multiple forecasts for a target lead time and combine them to obtain better forecast accuracy. On WeatherBench 2, Stormer performs competitively at short to medium-range forecasts and outperforms current methods beyond 7 days, while requiring orders-of-magnitude less training data and compute. Additionally, we demonstrate Stormer's favorable scaling properties, showing consistent improvements in forecast accuracy with increases in model size and training tokens. Code and checkpoints will be made publicly available.  ( 2 min )
    Internal-Coordinate Density Modelling of Protein Structure: Covariance Matters. (arXiv:2302.13711v2 [cs.LG] UPDATED)
    After the recent ground-breaking advances in protein structure prediction, one of the remaining challenges in protein machine learning is to reliably predict distributions of structural states. Parametric models of fluctuations are difficult to fit due to complex covariance structures between degrees of freedom in the protein chain, often causing models to either violate local or global structural constraints. In this paper, we present a new strategy for modelling protein densities in internal coordinates, which uses constraints in 3D space to induce covariance structure between the internal degrees of freedom. We illustrate the potential of the procedure by constructing a variational autoencoder with full covariance output induced by the constraints implied by the conditional mean in 3D, and demonstrate that our approach makes it possible to scale density models of internal coordinates to full protein backbones in two settings: 1) a unimodal setting for proteins exhibiting small fluctuations and limited amounts of available data, and 2) a multimodal setting for larger conformational changes in a high data regime.
    Series2Vec: Similarity-based Self-supervised Representation Learning for Time Series Classification. (arXiv:2312.03998v1 [cs.LG])
    We argue that time series analysis is fundamentally different in nature to either vision or natural language processing with respect to the forms of meaningful self-supervised learning tasks that can be defined. Motivated by this insight, we introduce a novel approach called \textit{Series2Vec} for self-supervised representation learning. Unlike other self-supervised methods in time series, which carry the risk of positive sample variants being less similar to the anchor sample than series in the negative set, Series2Vec is trained to predict the similarity between two series in both temporal and spectral domains through a self-supervised task. Series2Vec relies primarily on the consistency of the unsupervised similarity step, rather than the intrinsic quality of the similarity measurement, without the need for hand-crafted data augmentation. To further enforce the network to learn similar representations for similar time series, we propose a novel approach that applies order-invariant attention to each representation within the batch during training. Our evaluation of Series2Vec on nine large real-world datasets, along with the UCR/UEA archive, shows enhanced performance compared to current state-of-the-art self-supervised techniques for time series. Additionally, our extensive experiments show that Series2Vec performs comparably with fully supervised training and offers high efficiency in datasets with limited-labeled data. Finally, we show that the fusion of Series2Vec with other representation learning models leads to enhanced performance for time series classification. Code and models are open-source at \url{https://github.com/Navidfoumani/Series2Vec.}
    FoMo Rewards: Can we cast foundation models as reward functions?. (arXiv:2312.03881v1 [cs.LG])
    We explore the viability of casting foundation models as generic reward functions for reinforcement learning. To this end, we propose a simple pipeline that interfaces an off-the-shelf vision model with a large language model. Specifically, given a trajectory of observations, we infer the likelihood of an instruction describing the task that the user wants an agent to perform. We show that this generic likelihood function exhibits the characteristics ideally expected from a reward function: it associates high values with the desired behaviour and lower values for several similar, but incorrect policies. Overall, our work opens the possibility of designing open-ended agents for interactive tasks via foundation models.
    Near-real-time Earthquake-induced Fatality Estimation using Crowdsourced Data and Large-Language Models. (arXiv:2312.03755v1 [cs.CL])
    When a damaging earthquake occurs, immediate information about casualties is critical for time-sensitive decision-making by emergency response and aid agencies in the first hours and days. Systems such as Prompt Assessment of Global Earthquakes for Response (PAGER) by the U.S. Geological Survey (USGS) were developed to provide a forecast within about 30 minutes of any significant earthquake globally. Traditional systems for estimating human loss in disasters often depend on manually collected early casualty reports from global media, a process that's labor-intensive and slow with notable time delays. Recently, some systems have employed keyword matching and topic modeling to extract relevant information from social media. However, these methods struggle with the complex semantics in multilingual texts and the challenge of interpreting ever-changing, often conflicting reports of death and injury numbers from various unverified sources on social media platforms. In this work, we introduce an end-to-end framework to significantly improve the timeliness and accuracy of global earthquake-induced human loss forecasting using multi-lingual, crowdsourced social media. Our framework integrates (1) a hierarchical casualty extraction model built upon large language models, prompt design, and few-shot learning to retrieve quantitative human loss claims from social media, (2) a physical constraint-aware, dynamic-truth discovery model that discovers the truthful human loss from massive noisy and potentially conflicting human loss claims, and (3) a Bayesian updating loss projection model that dynamically updates the final loss estimation using discovered truths. We test the framework in real-time on a series of global earthquake events in 2021 and 2022 and show that our framework streamlines casualty data retrieval, achieving speed and accuracy comparable to manual methods by USGS.
    A Masked Pruning Approach for Dimensionality Reduction in Communication-Efficient Federated Learning Systems. (arXiv:2312.03889v1 [cs.LG])
    Federated Learning (FL) represents a growing machine learning (ML) paradigm designed for training models across numerous nodes that retain local datasets, all without directly exchanging the underlying private data with the parameter server (PS). Its increasing popularity is attributed to notable advantages in terms of training deep neural network (DNN) models under privacy aspects and efficient utilization of communication resources. Unfortunately, DNNs suffer from high computational and communication costs, as well as memory consumption in intricate tasks. These factors restrict the applicability of FL algorithms in communication-constrained systems with limited hardware resources. In this paper, we develop a novel algorithm that overcomes these limitations by synergistically combining a pruning-based method with the FL process, resulting in low-dimensional representations of the model with minimal communication cost, dubbed Masked Pruning over FL (MPFL). The algorithm operates by initially distributing weights to the nodes through the PS. Subsequently, each node locally trains its model and computes pruning masks. These low-dimensional masks are then transmitted back to the PS, which generates a consensus pruning mask, broadcasted back to the nodes. This iterative process enhances the robustness and stability of the masked pruning model. The generated mask is used to train the FL model, achieving significant bandwidth savings. We present an extensive experimental study demonstrating the superior performance of MPFL compared to existing methods. Additionally, we have developed an open-source software package for the benefit of researchers and developers in related fields.  ( 3 min )
    From Prediction to Action: Critical Role of Performance Estimation for Machine-Learning-Driven Materials Discovery. (arXiv:2311.15549v2 [cond-mat.mtrl-sci] UPDATED)
    Materials discovery driven by statistical property models is an iterative decision process, during which an initial data collection is extended with new data proposed by a model-informed acquisition function--with the goal to maximize a certain "reward" over time, such as the maximum property value discovered so far. While the materials science community achieved much progress in developing property models that predict well on average with respect to the training distribution, this form of in-distribution performance measurement is not directly coupled with the discovery reward. This is because an iterative discovery process has a shifting reward distribution that is over-proportionally determined by the model performance for exceptional materials. We demonstrate this problem using the example of bulk modulus maximization among double perovskite oxides. We find that the in-distribution predictive performance suggests random forests as superior to Gaussian process regression, while the results are inverse in terms of the discovery rewards. We argue that the lack of proper performance estimation methods from pre-computed data collections is a fundamental problem for improving data-driven materials discovery, and we propose a novel such estimator that, in contrast to na\"ive reward estimation, successfully predicts Gaussian processes with the "expected improvement" acquisition function as the best out of four options in our demonstrational study for double perovskites. Importantly, it does so without requiring the over thousand ab initio computations that were needed to confirm this prediction.
    On The Fairness Impacts of Hardware Selection in Machine Learning. (arXiv:2312.03886v1 [cs.LG])
    In the machine learning ecosystem, hardware selection is often regarded as a mere utility, overshadowed by the spotlight on algorithms and data. This oversight is particularly problematic in contexts like ML-as-a-service platforms, where users often lack control over the hardware used for model deployment. How does the choice of hardware impact generalization properties? This paper investigates the influence of hardware on the delicate balance between model performance and fairness. We demonstrate that hardware choices can exacerbate existing disparities, attributing these discrepancies to variations in gradient flows and loss surfaces across different demographic groups. Through both theoretical and empirical analysis, the paper not only identifies the underlying factors but also proposes an effective strategy for mitigating hardware-induced performance imbalances.
    Temporal Robustness against Data Poisoning. (arXiv:2302.03684v3 [cs.LG] UPDATED)
    Data poisoning considers cases when an adversary manipulates the behavior of machine learning algorithms through malicious training data. Existing threat models of data poisoning center around a single metric, the number of poisoned samples. In consequence, if attackers can poison more samples than expected with affordable overhead, as in many practical scenarios, they may be able to render existing defenses ineffective in a short time. To address this issue, we leverage timestamps denoting the birth dates of data, which are often available but neglected in the past. Benefiting from these timestamps, we propose a temporal threat model of data poisoning with two novel metrics, earliness and duration, which respectively measure how long an attack started in advance and how long an attack lasted. Using these metrics, we define the notions of temporal robustness against data poisoning, providing a meaningful sense of protection even with unbounded amounts of poisoned samples when the attacks are temporally bounded. We present a benchmark with an evaluation protocol simulating continuous data collection and periodic deployments of updated models, thus enabling empirical evaluation of temporal robustness. Lastly, we develop and also empirically verify a baseline defense, namely temporal aggregation, offering provable temporal robustness and highlighting the potential of our temporal threat model for data poisoning.
    ExpM+NF Tractable Exponential Mechanism via Normalizing Flow, A Path through the Accuracy-Privacy Ceiling Constraining Differentially Private ML. (arXiv:2311.09200v2 [stat.ML] UPDATED)
    The Exponential Mechanism (ExpM), a differentially private optimization method, promises many advantages over Differentially Private Stochastic Gradient Descent (DPSGD), the state-of-the-art (SOTA) and de facto method for differentially private machine learning (ML). Yet, ExpM has been historically stymied from differentially private training of modern ML algorithms by two obstructions: ExpM requires a sensitivity bound for the given loss function; ExpM requires sampling from a historically intractable density. We prove a sensitivity bound for $\ell(2)$ loss, and investigate using Normalizing Flows (NFs), deep networks furnishing approximate sampling from the otherwise intractable ExpM distribution. We prove that as the NF output converges to ExpM distribution, the privacy ($\varepsilon$) of an NF sample converges to that of the ExpM distribution. Under the assumption that the NF output distribution is the ExpM distribution, we empirically test ExpM+NF against DPSGD using the SOTA implementation (Opacus \cite{opacus} with PRV accounting) in multiple classification tasks on the Adult Dataset (census data) and MIMIC-III Dataset (healthcare records) using Logistic Regression and GRU-D, a deep learning recurrent neural network with \smallsim 20K-100K parameters. In all experiments we find ExpM+NF achieves greater than 94\% of the non-private training accuracy (AUC) with $\varepsilon$-DP for $\varepsilon$ a low as $1\mathrm{e}{-3}$ -- three orders of magnitude stronger privacy with similar accuracy. Further, performance results show ExpM+NF training time is comparable to (slightly less) than DPSGD. Limitations and future directions are provided; notably, research on NF approximation accuracy and its effect on privacy are a promising avenue to substantially advancing the field. Code for these experiments \hl{will be provided after review}.
    A Stability Analysis of Fine-Tuning a Pre-Trained Model. (arXiv:2301.09820v2 [cs.LG] UPDATED)
    Fine-tuning a pre-trained model (such as BERT, ALBERT, RoBERTa, T5, GPT, etc.) has proven to be one of the most promising paradigms in recent NLP research. However, numerous recent works indicate that fine-tuning suffers from the instability problem, i.e., tuning the same model under the same setting results in significantly different performance. Many recent works have proposed different methods to solve this problem, but there is no theoretical understanding of why and how these methods work. In this paper, we propose a novel theoretical stability analysis of fine-tuning that focuses on two commonly used settings, namely, full fine-tuning and head tuning. We define the stability under each setting and prove the corresponding stability bounds. The theoretical bounds explain why and how several existing methods can stabilize the fine-tuning procedure. In addition to being able to explain most of the observed empirical discoveries, our proposed theoretical analysis framework can also help in the design of effective and provable methods. Based on our theory, we propose three novel strategies to stabilize the fine-tuning procedure, namely, Maximal Margin Regularizer (MMR), Multi-Head Loss (MHLoss), and Self Unsupervised Re-Training (SURT). We extensively evaluate our proposed approaches on 11 widely used real-world benchmark datasets, as well as hundreds of synthetic classification datasets. The experiment results show that our proposed methods significantly stabilize the fine-tuning procedure and also corroborate our theoretical analysis.
    PAPR: Proximity Attention Point Rendering. (arXiv:2307.11086v2 [cs.CV] UPDATED)
    Learning accurate and parsimonious point cloud representations of scene surfaces from scratch remains a challenge in 3D representation learning. Existing point-based methods often suffer from the vanishing gradient problem or require a large number of points to accurately model scene geometry and texture. To address these limitations, we propose Proximity Attention Point Rendering (PAPR), a novel method that consists of a point-based scene representation and a differentiable renderer. Our scene representation uses a point cloud where each point is characterized by its spatial position, influence score, and view-independent feature vector. The renderer selects the relevant points for each ray and produces accurate colours using their associated features. PAPR effectively learns point cloud positions to represent the correct scene geometry, even when the initialization drastically differs from the target geometry. Notably, our method captures fine texture details while using only a parsimonious set of points. We also demonstrate four practical applications of our method: zero-shot geometry editing, object manipulation, texture transfer, and exposure control. More results and code are available on our project website at https://zvict.github.io/papr/.
    SmoothQuant+: Accurate and Efficient 4-bit Post-Training WeightQuantization for LLM. (arXiv:2312.03788v1 [cs.LG])
    Large language models (LLMs) have shown remarkable capabilities in various tasks. However their huge model size and the consequent demand for computational and memory resources also pose challenges to model deployment. Currently, 4-bit post-training quantization (PTQ) has achieved some success in LLMs, reducing the memory footprint by approximately 75% compared to FP16 models, albeit with some accuracy loss. In this paper, we propose SmoothQuant+, an accurate and efficient 4-bit weight-only PTQ that requires no additional training, which enables lossless in accuracy for LLMs for the first time. Based on the fact that the loss of weight quantization is amplified by the activation outliers, SmoothQuant+ smoothes the activation outliers by channel before quantization, while adjusting the corresponding weights for mathematical equivalence, and then performs group-wise 4-bit weight quantization for linear layers. We have integrated SmoothQuant+ into the vLLM framework, an advanced high-throughput inference engine specially developed for LLMs, and equipped it with an efficient W4A16 CUDA kernels, so that vLLM can seamlessly support SmoothQuant+ 4-bit weight quantization. Our results show that, with SmoothQuant+, the Code Llama-34B model can be quantized and deployed on a A100 40GB GPU, achieving lossless accuracy and a throughput increase of 1.9 to 4.0 times compared to the FP16 model deployed on two A100 40GB GPUs. Moreover, the latency per token is only 68% of the FP16 model deployed on two A100 40GB GPUs. This is the state-of-the-art 4-bit weight quantization for LLMs as we know.  ( 3 min )
    If your data distribution shifts, use self-learning. (arXiv:2104.12928v4 [cs.CV] UPDATED)
    We demonstrate that self-learning techniques like entropy minimization and pseudo-labeling are simple and effective at improving performance of a deployed computer vision model under systematic domain shifts. We conduct a wide range of large-scale experiments and show consistent improvements irrespective of the model architecture, the pre-training technique or the type of distribution shift. At the same time, self-learning is simple to use in practice because it does not require knowledge or access to the original training data or scheme, is robust to hyperparameter choices, is straight-forward to implement and requires only a few adaptation epochs. This makes self-learning techniques highly attractive for any practitioner who applies machine learning algorithms in the real world. We present state-of-the-art adaptation results on CIFAR10-C (8.5% error), ImageNet-C (22.0% mCE), ImageNet-R (17.4% error) and ImageNet-A (14.8% error), theoretically study the dynamics of self-supervised adaptation methods and propose a new classification dataset (ImageNet-D) which is challenging even with adaptation.
    Similarity of Neural Architectures using Adversarial Attack Transferability. (arXiv:2210.11407v3 [cs.LG] UPDATED)
    In recent years, many deep neural architectures have been developed for image classification. Whether they are similar or dissimilar and what factors contribute to their (dis)similarities remains curious. To address this question, we aim to design a quantitative and scalable similarity measure between neural architectures. We propose Similarity by Attack Transferability (SAT) from the observation that adversarial attack transferability contains information related to input gradients and decision boundaries widely used to understand model behaviors. We conduct a large-scale analysis on 69 state-of-the-art ImageNet classifiers using our proposed similarity function to answer the question. Moreover, we observe neural architecture-related phenomena using model similarity that model diversity can lead to better performance on model ensembles and knowledge distillation under specific conditions. Our results provide insights into why developing diverse neural architectures with distinct components is necessary.
    HLoOP -- Hyperbolic 2-space Local Outlier Probabilities. (arXiv:2312.03895v1 [stat.ML])
    Hyperbolic geometry has recently garnered considerable attention in machine learning due to its capacity to embed hierarchical graph structures with low distortions for further downstream processing. This paper introduces a simple framework to detect local outliers for datasets grounded in hyperbolic 2-space referred to as HLoOP (Hyperbolic Local Outlier Probability). Within a Euclidean space, well-known techniques for local outlier detection are based on the Local Outlier Factor (LOF) and its variant, the LoOP (Local Outlier Probability), which incorporates probabilistic concepts to model the outlier level of a data vector. The developed HLoOP combines the idea of finding nearest neighbors, density-based outlier scoring with a probabilistic, statistically oriented approach. Therefore, the method consists in computing the Riemmanian distance of a data point to its nearest neighbors following a Gaussian probability density function expressed in a hyperbolic space. This is achieved by defining a Gaussian cumulative distribution in this space. The HLoOP algorithm is tested on the WordNet dataset yielding promising results. Code and data will be made available on request for reproductibility.  ( 2 min )
    LiDAR: Sensing Linear Probing Performance in Joint Embedding SSL Architectures. (arXiv:2312.04000v1 [cs.LG])
    Joint embedding (JE) architectures have emerged as a promising avenue for acquiring transferable data representations. A key obstacle to using JE methods, however, is the inherent challenge of evaluating learned representations without access to a downstream task, and an annotated dataset. Without efficient and reliable evaluation, it is difficult to iterate on architectural and training choices for JE methods. In this paper, we introduce LiDAR (Linear Discriminant Analysis Rank), a metric designed to measure the quality of representations within JE architectures. Our metric addresses several shortcomings of recent approaches based on feature covariance rank by discriminating between informative and uninformative features. In essence, LiDAR quantifies the rank of the Linear Discriminant Analysis (LDA) matrix associated with the surrogate SSL task -- a measure that intuitively captures the information content as it pertains to solving the SSL task. We empirically demonstrate that LiDAR significantly surpasses naive rank based approaches in its predictive power of optimal hyperparameters. Our proposed criterion presents a more robust and intuitive means of assessing the quality of representations within JE architectures, which we hope facilitates broader adoption of these powerful techniques in various domains.  ( 3 min )
    MAUVE Scores for Generative Models: Theory and Practice. (arXiv:2212.14578v2 [cs.LG] UPDATED)
    Generative artificial intelligence has made significant strides, producing text indistinguishable from human prose and remarkably photorealistic images. Automatically measuring how close the generated data distribution is to the target distribution is central to diagnosing existing models and developing better ones. We present MAUVE, a family of comparison measures between pairs of distributions such as those encountered in the generative modeling of text or images. These scores are statistical summaries of divergence frontiers capturing two types of errors in generative modeling. We explore three approaches to statistically estimate these scores: vector quantization, non-parametric estimation, and classifier-based estimation. We provide statistical bounds for the vector quantization approach. Empirically, we find that the proposed scores paired with a range of $f$-divergences and statistical estimation methods can quantify the gaps between the distributions of human-written text and those of modern neural language models by correlating with human judgments and identifying known properties of the generated texts. We demonstrate in the vision domain that MAUVE can identify known properties of generated images on par with or better than existing metrics. In conclusion, we present practical recommendations for using MAUVE effectively with language and image modalities.
    Making Translators Privacy-aware on the User's Side. (arXiv:2312.04068v1 [cs.CR])
    We propose PRISM to enable users of machine translation systems to preserve the privacy of data on their own initiative. There is a growing demand to apply machine translation systems to data that require privacy protection. While several machine translation engines claim to prioritize privacy, the extent and specifics of such protection are largely ambiguous. First, there is often a lack of clarity on how and to what degree the data is protected. Even if service providers believe they have sufficient safeguards in place, sophisticated adversaries might still extract sensitive information. Second, vulnerabilities may exist outside of these protective measures, such as within communication channels, potentially leading to data leakage. As a result, users are hesitant to utilize machine translation engines for data demanding high levels of privacy protection, thereby missing out on their benefits. PRISM resolves this problem. Instead of relying on the translation service to keep data safe, PRISM provides the means to protect data on the user's side. This approach ensures that even machine translation engines with inadequate privacy measures can be used securely. For platforms already equipped with privacy safeguards, PRISM acts as an additional protection layer, reinforcing their security furthermore. PRISM adds these privacy features without significantly compromising translation accuracy. Our experiments demonstrate the effectiveness of PRISM using real-world translators, T5 and ChatGPT (GPT-3.5-turbo), and the datasets with two languages. PRISM effectively balances privacy protection with translation accuracy.  ( 2 min )
    TSGBench: Time Series Generation Benchmark. (arXiv:2309.03755v2 [cs.LG] UPDATED)
    Synthetic Time Series Generation (TSG) is crucial in a range of applications, including data augmentation, anomaly detection, and privacy preservation. Although significant strides have been made in this field, existing methods exhibit three key limitations: (1) They often benchmark against similar model types, constraining a holistic view of performance capabilities. (2) The use of specialized synthetic and private datasets introduces biases and hampers generalizability. (3) Ambiguous evaluation measures, often tied to custom networks or downstream tasks, hinder consistent and fair comparison. To overcome these limitations, we introduce \textsf{TSGBench}, the inaugural Time Series Generation Benchmark, designed for a unified and comprehensive assessment of TSG methods. It comprises three modules: (1) a curated collection of publicly available, real-world datasets tailored for TSG, together with a standardized preprocessing pipeline; (2) a comprehensive evaluation measures suite including vanilla measures, new distance-based assessments, and visualization tools; (3) a pioneering generalization test rooted in Domain Adaptation (DA), compatible with all methods. We have conducted comprehensive experiments using \textsf{TSGBench} across a spectrum of ten real-world datasets from diverse domains, utilizing ten advanced TSG methods and twelve evaluation measures. The results highlight the reliability and efficacy of \textsf{TSGBench} in evaluating TSG methods. Crucially, \textsf{TSGBench} delivers a statistical analysis of the performance rankings of these methods, illuminating their varying performance across different datasets and measures and offering nuanced insights into the effectiveness of each method.  ( 3 min )
    Multi-Group Fairness Evaluation via Conditional Value-at-Risk Testing. (arXiv:2312.03867v1 [cs.LG])
    Machine learning (ML) models used in prediction and classification tasks may display performance disparities across population groups determined by sensitive attributes (e.g., race, sex, age). We consider the problem of evaluating the performance of a fixed ML model across population groups defined by multiple sensitive attributes (e.g., race and sex and age). Here, the sample complexity for estimating the worst-case performance gap across groups (e.g., the largest difference in error rates) increases exponentially with the number of group-denoting sensitive attributes. To address this issue, we propose an approach to test for performance disparities based on Conditional Value-at-Risk (CVaR). By allowing a small probabilistic slack on the groups over which a model has approximately equal performance, we show that the sample complexity required for discovering performance violations is reduced exponentially to be at most upper bounded by the square root of the number of groups. As a byproduct of our analysis, when the groups are weighted by a specific prior distribution, we show that R\'enyi entropy of order $2/3$ of the prior distribution captures the sample complexity of the proposed CVaR test algorithm. Finally, we also show that there exists a non-i.i.d. data collection strategy that results in a sample complexity independent of the number of groups.
    Semi-Supervised Active Learning for Semantic Segmentation in Unknown Environments Using Informative Path Planning. (arXiv:2312.04402v1 [cs.RO])
    Semantic segmentation enables robots to perceive and reason about their environments beyond geometry. Most of such systems build upon deep learning approaches. As autonomous robots are commonly deployed in initially unknown environments, pre-training on static datasets cannot always capture the variety of domains and limits the robot's perception performance during missions. Recently, self-supervised and fully supervised active learning methods emerged to improve a robot's vision. These approaches rely on large in-domain pre-training datasets or require substantial human labelling effort. We propose a planning method for semi-supervised active learning of semantic segmentation that substantially reduces human labelling requirements compared to fully supervised approaches. We leverage an adaptive map-based planner guided towards the frontiers of unexplored space with high model uncertainty collecting training data for human labelling. A key aspect of our approach is to combine the sparse high-quality human labels with pseudo labels automatically extracted from highly certain environment map areas. Experimental results show that our method reaches segmentation performance close to fully supervised approaches with drastically reduced human labelling effort while outperforming self-supervised approaches.
    Sentiment Analysis of Twitter Posts on Global Conflicts. (arXiv:2312.03715v1 [cs.CL])
    Sentiment analysis of social media data is an emerging field with vast applications in various domains. In this study, we developed a sentiment analysis model to analyze social media sentiment, especially tweets, during global conflicting scenarios. To establish our research experiment, we identified a recent global dispute incident on Twitter and collected around 31,000 filtered Tweets for several months to analyze human sentiment worldwide.  ( 2 min )
    The sample complexity of multi-distribution learning. (arXiv:2312.04027v1 [cs.LG])
    Multi-distribution learning generalizes the classic PAC learning to handle data coming from multiple distributions. Given a set of $k$ data distributions and a hypothesis class of VC dimension $d$, the goal is to learn a hypothesis that minimizes the maximum population loss over $k$ distributions, up to $\epsilon$ additive error. In this paper, we settle the sample complexity of multi-distribution learning by giving an algorithm of sample complexity $\widetilde{O}((d+k)\epsilon^{-2}) \cdot (k/\epsilon)^{o(1)}$. This matches the lower bound up to sub-polynomial factor and resolves the COLT 2023 open problem of Awasthi, Haghtalab and Zhao [AHZ23].
    nbi: the Astronomer's Package for Neural Posterior Estimation. (arXiv:2312.03824v1 [astro-ph.IM])
    Despite the promise of Neural Posterior Estimation (NPE) methods in astronomy, the adaptation of NPE into the routine inference workflow has been slow. We identify three critical issues: the need for custom featurizer networks tailored to the observed data, the inference inexactness, and the under-specification of physical forward models. To address the first two issues, we introduce a new framework and open-source software nbi (Neural Bayesian Inference), which supports both amortized and sequential NPE. First, nbi provides built-in "featurizer" networks with demonstrated efficacy on sequential data, such as light curve and spectra, thus obviating the need for this customization on the user end. Second, we introduce a modified algorithm SNPE-IS, which facilities asymptotically exact inference by using the surrogate posterior under NPE only as a proposal distribution for importance sampling. These features allow nbi to be applied off-the-shelf to astronomical inference problems involving light curves and spectra. We discuss how nbi may serve as an effective alternative to existing methods such as Nested Sampling. Our package is at https://github.com/kmzzhang/nbi.  ( 2 min )
    Improving Activation Steering in Language Models with Mean-Centring. (arXiv:2312.03813v1 [cs.CL])
    Recent work in activation steering has demonstrated the potential to better control the outputs of Large Language Models (LLMs), but it involves finding steering vectors. This is difficult because engineers do not typically know how features are represented in these models. We seek to address this issue by applying the idea of mean-centring to steering vectors. We find that taking the average of activations associated with a target dataset, and then subtracting the mean of all training activations, results in effective steering vectors. We test this method on a variety of models on natural language tasks by steering away from generating toxic text, and steering the completion of a story towards a target genre. We also apply mean-centring to extract function vectors, more effectively triggering the execution of a range of natural language tasks by a significant margin (compared to previous baselines). This suggests that mean-centring can be used to easily improve the effectiveness of activation steering in a wide range of contexts.  ( 2 min )
    Generalization to New Sequential Decision Making Tasks with In-Context Learning. (arXiv:2312.03801v1 [cs.LG])
    Training autonomous agents that can learn new tasks from only a handful of demonstrations is a long-standing problem in machine learning. Recently, transformers have been shown to learn new language or vision tasks without any weight updates from only a few examples, also referred to as in-context learning. However, the sequential decision making setting poses additional challenges having a lower tolerance for errors since the environment's stochasticity or the agent's actions can lead to unseen, and sometimes unrecoverable, states. In this paper, we use an illustrative example to show that naively applying transformers to sequential decision making problems does not enable in-context learning of new tasks. We then demonstrate how training on sequences of trajectories with certain distributional properties leads to in-context learning of new sequential decision making tasks. We investigate different design choices and find that larger model and dataset sizes, as well as more task diversity, environment stochasticity, and trajectory burstiness, all result in better in-context learning of new out-of-distribution tasks. By training on large diverse offline datasets, our model is able to learn new MiniHack and Procgen tasks without any weight updates from just a handful of demonstrations.  ( 2 min )
    Resource Allocation of Federated Learning for the Metaverse with Mobile Augmented Reality. (arXiv:2211.08705v3 [cs.LG] UPDATED)
    The Metaverse has received much attention recently. Metaverse applications via mobile augmented reality (MAR) require rapid and accurate object detection to mix digital data with the real world. Federated learning (FL) is an intriguing distributed machine learning approach due to its privacy-preserving characteristics. Due to privacy concerns and the limited computation resources on mobile devices, we incorporate FL into MAR systems of the Metaverse to train a model cooperatively. Besides, to balance the trade-off between energy, execution latency and model accuracy, thereby accommodating different demands and application scenarios, we formulate an optimization problem to minimize a weighted combination of total energy consumption, completion time and model accuracy. Through decomposing the non-convex optimization problem into two subproblems, we devise a resource allocation algorithm to determine the bandwidth allocation, transmission power, CPU frequency and video frame resolution for each participating device. We further present the convergence analysis and computational complexity of the proposed algorithm. Numerical results show that our proposed algorithm has better performance (in terms of energy consumption, completion time and model accuracy) under different weight parameters compared to existing benchmarks.
    Intelligent Anomaly Detection for Lane Rendering Using Transformer with Self-Supervised Pre-Training and Customized Fine-Tuning. (arXiv:2312.04398v1 [cs.CV])
    The burgeoning navigation services using digital maps provide great convenience to drivers. Nevertheless, the presence of anomalies in lane rendering map images occasionally introduces potential hazards, as such anomalies can be misleading to human drivers and consequently contribute to unsafe driving conditions. In response to this concern and to accurately and effectively detect the anomalies, this paper transforms lane rendering image anomaly detection into a classification problem and proposes a four-phase pipeline consisting of data pre-processing, self-supervised pre-training with the masked image modeling (MiM) method, customized fine-tuning using cross-entropy based loss with label smoothing, and post-processing to tackle it leveraging state-of-the-art deep learning techniques, especially those involving Transformer models. Various experiments verify the effectiveness of the proposed pipeline. Results indicate that the proposed pipeline exhibits superior performance in lane rendering image anomaly detection, and notably, the self-supervised pre-training with MiM can greatly enhance the detection accuracy while significantly reducing the total training time. For instance, employing the Swin Transformer with Uniform Masking as self-supervised pretraining (Swin-Trans-UM) yielded a heightened accuracy at 94.77% and an improved Area Under The Curve (AUC) score of 0.9743 compared with the pure Swin Transformer without pre-training (Swin-Trans) with an accuracy of 94.01% and an AUC of 0.9498. The fine-tuning epochs were dramatically reduced to 41 from the original 280. In conclusion, the proposed pipeline, with its incorporation of self-supervised pre-training using MiM and other advanced deep learning techniques, emerges as a robust solution for enhancing the accuracy and efficiency of lane rendering image anomaly detection in digital navigation systems.  ( 3 min )
    FreqFed: A Frequency Analysis-Based Approach for Mitigating Poisoning Attacks in Federated Learning. (arXiv:2312.04432v1 [cs.CR])
    Federated learning (FL) is a collaborative learning paradigm allowing multiple clients to jointly train a model without sharing their training data. However, FL is susceptible to poisoning attacks, in which the adversary injects manipulated model updates into the federated model aggregation process to corrupt or destroy predictions (untargeted poisoning) or implant hidden functionalities (targeted poisoning or backdoors). Existing defenses against poisoning attacks in FL have several limitations, such as relying on specific assumptions about attack types and strategies or data distributions or not sufficiently robust against advanced injection techniques and strategies and simultaneously maintaining the utility of the aggregated model. To address the deficiencies of existing defenses, we take a generic and completely different approach to detect poisoning (targeted and untargeted) attacks. We present FreqFed, a novel aggregation mechanism that transforms the model updates (i.e., weights) into the frequency domain, where we can identify the core frequency components that inherit sufficient information about weights. This allows us to effectively filter out malicious updates during local training on the clients, regardless of attack types, strategies, and clients' data distributions. We extensively evaluate the efficiency and effectiveness of FreqFed in different application domains, including image classification, word prediction, IoT intrusion detection, and speech recognition. We demonstrate that FreqFed can mitigate poisoning attacks effectively with a negligible impact on the utility of the aggregated model.
    NeuJeans: Private Neural Network Inference with Joint Optimization of Convolution and Bootstrapping. (arXiv:2312.04356v1 [cs.CR])
    Fully homomorphic encryption (FHE) is a promising cryptographic primitive for realizing private neural network inference (PI) services by allowing a client to fully offload the inference task to a cloud server while keeping the client data oblivious to the server. This work proposes NeuJeans, an FHE-based solution for the PI of deep convolutional neural networks (CNNs). NeuJeans tackles the critical problem of the enormous computational cost for the FHE evaluation of convolutional layers (conv2d), mainly due to the high cost of data reordering and bootstrapping. We first propose an encoding method introducing nested structures inside encoded vectors for FHE, which enables us to develop efficient conv2d algorithms with reduced data reordering costs. However, the new encoding method also introduces additional computations for conversion between encoding methods, which could negate its advantages. We discover that fusing conv2d with bootstrapping eliminates such computations while reducing the cost of bootstrapping. Then, we devise optimized execution flows for various types of conv2d and apply them to end-to-end implementation of CNNs. NeuJeans accelerates the performance of conv2d by up to 5.68 times compared to state-of-the-art FHE-based PI work and performs the PI of a CNN at the scale of ImageNet (ResNet18) within a mere few seconds
    From Pixels to UI Actions: Learning to Follow Instructions via Graphical User Interfaces. (arXiv:2306.00245v2 [cs.LG] UPDATED)
    Much of the previous work towards digital agents for graphical user interfaces (GUIs) has relied on text-based representations (derived from HTML or other structured data sources), which are not always readily available. These input representations have been often coupled with custom, task-specific action spaces. This paper focuses on creating agents that interact with the digital world using the same conceptual interface that humans commonly use -- via pixel-based screenshots and a generic action space corresponding to keyboard and mouse actions. Building upon recent progress in pixel-based pretraining, we show, for the first time, that it is possible for such agents to outperform human crowdworkers on the MiniWob++ benchmark of GUI-based instruction following tasks.
    Enhancing Medical Task Performance in GPT-4V: A Comprehensive Study on Prompt Engineering Strategies. (arXiv:2312.04344v1 [cs.CL])
    OpenAI's latest large vision-language model (LVLM), GPT-4V(ision), has piqued considerable interest for its potential in medical applications. Despite its promise, recent studies and internal reviews highlight its underperformance in specialized medical tasks. This paper explores the boundary of GPT-4V's capabilities in medicine, particularly in processing complex imaging data from endoscopies, CT scans, and MRIs etc. Leveraging open-source datasets, we assessed its foundational competencies, identifying substantial areas for enhancement. Our research emphasizes prompt engineering, an often-underutilized strategy for improving AI responsiveness. Through iterative testing, we refined the model's prompts, significantly improving its interpretative accuracy and relevance in medical imaging. From our comprehensive evaluations, we distilled 10 effective prompt engineering techniques, each fortifying GPT-4V's medical acumen. These methodical enhancements facilitate more reliable, precise, and clinically valuable insights from GPT-4V, advancing its operability in critical healthcare environments. Our findings are pivotal for those employing AI in medicine, providing clear, actionable guidance on harnessing GPT-4V's full diagnostic potential.
    The BigCode Project Governance Card. (arXiv:2312.03872v1 [cs.CY])
    This document serves as an overview of the different mechanisms and areas of governance in the BigCode project. It aims to support transparency by providing relevant information about choices that were made during the project to the broader public, and to serve as an example of intentional governance of an open research project that future endeavors can leverage to shape their own approach. The first section, Project Structure, covers the project organization, its stated goals and values, its internal decision processes, and its funding and resources. The second section, Data and Model Governance, covers decisions relating to the questions of data subject consent, privacy, and model release.
    A Structural-Clustering Based Active Learning for Graph Neural Networks. (arXiv:2312.04307v1 [cs.LG])
    In active learning for graph-structured data, Graph Neural Networks (GNNs) have shown effectiveness. However, a common challenge in these applications is the underutilization of crucial structural information. To address this problem, we propose the Structural-Clustering PageRank method for improved Active learning (SPA) specifically designed for graph-structured data. SPA integrates community detection using the SCAN algorithm with the PageRank scoring method for efficient and informative sample selection. SPA prioritizes nodes that are not only informative but also central in structure. Through extensive experiments, SPA demonstrates higher accuracy and macro-F1 score over existing methods across different annotation budgets and achieves significant reductions in query time. In addition, the proposed method only adds two hyperparameters, $\epsilon$ and $\mu$ in the algorithm to finely tune the balance between structural learning and node selection. This simplicity is a key advantage in active learning scenarios, where extensive hyperparameter tuning is often impractical.
    Surrogate Modelling for Sea Ice Concentration using Lightweight Neural Ensemble. (arXiv:2312.04330v1 [cs.LG])
    The modeling and forecasting of sea ice conditions in the Arctic region are important tasks for ship routing, offshore oil production, and environmental monitoring. We propose the adaptive surrogate modeling approach named LANE-SI (Lightweight Automated Neural Ensembling for Sea Ice) that uses ensemble of relatively simple deep learning models with different loss functions for forecasting of spatial distribution for sea ice concentration in the specified water area. Experimental studies confirm the quality of a long-term forecast based on a deep learning model fitted to the specific water area is comparable to resource-intensive physical modeling, and for some periods of the year, it is superior. We achieved a 20% improvement against the state-of-the-art physics-based forecast system SEAS5 for the Kara Sea.
    Domain constraints improve risk prediction when outcome data is missing. (arXiv:2312.03878v1 [cs.LG])
    Machine learning models are often trained to predict the outcome resulting from a human decision. For example, if a doctor decides to test a patient for disease, will the patient test positive? A challenge is that the human decision censors the outcome data: we only observe test outcomes for patients doctors historically tested. Untested patients, for whom outcomes are unobserved, may differ from tested patients along observed and unobserved dimensions. We propose a Bayesian model class which captures this setting. The purpose of the model is to accurately estimate risk for both tested and untested patients. Estimating this model is challenging due to the wide range of possibilities for untested patients. To address this, we propose two domain constraints which are plausible in health settings: a prevalence constraint, where the overall disease prevalence is known, and an expertise constraint, where the human decision-maker deviates from purely risk-based decision-making only along a constrained feature set. We show theoretically and on synthetic data that domain constraints improve parameter inference. We apply our model to a case study of cancer risk prediction, showing that the model's inferred risk predicts cancer diagnoses, its inferred testing policy captures known public health policies, and it can identify suboptimalities in test allocation. Though our case study is in healthcare, our analysis reveals a general class of domain constraints which can improve model estimation in many settings.  ( 2 min )
    The Potential of Vision-Language Models for Content Moderation of Children's Videos. (arXiv:2312.03936v1 [cs.CV])
    Natural language supervision has been shown to be effective for zero-shot learning in many computer vision tasks, such as object detection and activity recognition. However, generating informative prompts can be challenging for more subtle tasks, such as video content moderation. This can be difficult, as there are many reasons why a video might be inappropriate, beyond violence and obscenity. For example, scammers may attempt to create junk content that is similar to popular educational videos but with no meaningful information. This paper evaluates the performance of several CLIP variations for content moderation of children's cartoons in both the supervised and zero-shot setting. We show that our proposed model (Vanilla CLIP with Projection Layer) outperforms previous work conducted on the Malicious or Benign (MOB) benchmark for video content moderation. This paper presents an in depth analysis of how context-specific language prompts affect content moderation performance. Our results indicate that it is important to include more context in content moderation prompts, particularly for cartoon videos as they are not well represented in the CLIP training data.  ( 2 min )
    Language Model Knowledge Distillation for Efficient Question Answering in Spanish. (arXiv:2312.04193v1 [cs.CL])
    Recent advances in the development of pre-trained Spanish language models has led to significant progress in many Natural Language Processing (NLP) tasks, such as question answering. However, the lack of efficient models imposes a barrier for the adoption of such models in resource-constrained environments. Therefore, smaller distilled models for the Spanish language could be proven to be highly scalable and facilitate their further adoption on a variety of tasks and scenarios. In this work, we take one step in this direction by developing SpanishTinyRoBERTa, a compressed language model based on RoBERTa for efficient question answering in Spanish. To achieve this, we employ knowledge distillation from a large model onto a lighter model that allows for a wider implementation, even in areas with limited computational resources, whilst attaining negligible performance sacrifice. Our experiments show that the dense distilled model can still preserve the performance of its larger counterpart, while significantly increasing inference speedup. This work serves as a starting point for further research and investigation of model compression efforts for Spanish language models across various NLP tasks.
    Simulating the Air Quality Impact of Prescribed Fires Using a Graph Neural Network-Based PM$_{2.5}$ Emissions Forecasting System. (arXiv:2312.04291v1 [physics.ao-ph])
    The increasing size and severity of wildfires across western North America have generated dangerous levels of PM$_{2.5}$ pollution in recent years. In a warming climate, expanding the use of prescribed fires is widely considered to be the most robust fire mitigation strategy. However, reliably forecasting the potential air quality impact from these prescribed fires, a critical ingredient in determining the fires' location and time, at hourly to daily time scales remains a challenging problem. This paper proposes a novel integration of prescribed fire simulation with a spatio-temporal graph neural network-based PM$_{2.5}$ forecasting model. The experiments in this work focus on determining the optimal time for implementing prescribed fires in California as well as quantifying the potential air quality trade-offs involved in conducting more prescribed fires outside the fire season.
    Rethinking Bias Mitigation: Fairer Architectures Make for Fairer Face Recognition. (arXiv:2210.09943v3 [cs.CV] UPDATED)
    Face recognition systems are widely deployed in safety-critical applications, including law enforcement, yet they exhibit bias across a range of socio-demographic dimensions, such as gender and race. Conventional wisdom dictates that model biases arise from biased training data. As a consequence, previous works on bias mitigation largely focused on pre-processing the training data, adding penalties to prevent bias from effecting the model during training, or post-processing predictions to debias them, yet these approaches have shown limited success on hard problems such as face recognition. In our work, we discover that biases are actually inherent to neural network architectures themselves. Following this reframing, we conduct the first neural architecture search for fairness, jointly with a search for hyperparameters. Our search outputs a suite of models which Pareto-dominate all other high-performance architectures and existing bias mitigation methods in terms of accuracy and fairness, often by large margins, on the two most widely used datasets for face identification, CelebA and VGGFace2. Furthermore, these models generalize to other datasets and sensitive attributes. We release our code, models and raw data files at https://github.com/dooleys/FR-NAS.
    TimeDRL: Disentangled Representation Learning for Multivariate Time-Series. (arXiv:2312.04142v1 [cs.LG])
    Multivariate time-series data in numerous real-world applications (e.g., healthcare and industry) are informative but challenging due to the lack of labels and high dimensionality. Recent studies in self-supervised learning have shown their potential in learning rich representations without relying on labels, yet they fall short in learning disentangled embeddings and addressing issues of inductive bias (e.g., transformation-invariance). To tackle these challenges, we propose TimeDRL, a generic multivariate time-series representation learning framework with disentangled dual-level embeddings. TimeDRL is characterized by three novel features: (i) disentangled derivation of timestamp-level and instance-level embeddings from patched time-series data using a [CLS] token strategy; (ii) utilization of timestamp-predictive and instance-contrastive tasks for disentangled representation learning, with the former optimizing timestamp-level embeddings with predictive loss, and the latter optimizing instance-level embeddings with contrastive loss; and (iii) avoidance of augmentation methods to eliminate inductive biases, such as transformation-invariance from cropping and masking. Comprehensive experiments on 6 time-series forecasting datasets and 5 time-series classification datasets have shown that TimeDRL consistently surpasses existing representation learning approaches, achieving an average improvement of forecasting by 57.98% in MSE and classification by 1.25% in accuracy. Furthermore, extensive ablation studies confirmed the relative contribution of each component in TimeDRL's architecture, and semi-supervised learning evaluations demonstrated its effectiveness in real-world scenarios, even with limited labeled data.
    Optimizing $CO_{2}$ Capture in Pressure Swing Adsorption Units: A Deep Neural Network Approach with Optimality Evaluation and Operating Maps for Decision-Making. (arXiv:2312.03873v1 [physics.chem-ph])
    This study presents a methodology for surrogate optimization of cyclic adsorption processes, focusing on enhancing Pressure Swing Adsorption units for carbon dioxide ($CO_{2}$) capture. We developed and implemented a multiple-input, single-output (MISO) framework comprising two deep neural network (DNN) models, predicting key process performance indicators. These models were then integrated into an optimization framework, leveraging particle swarm optimization (PSO) and statistical analysis to generate a comprehensive Pareto front representation. This approach delineated feasible operational regions (FORs) and highlighted the spectrum of optimal decision-making scenarios. A key aspect of our methodology was the evaluation of optimization effectiveness. This was accomplished by testing decision variables derived from the Pareto front against a phenomenological model, affirming the surrogate models reliability. Subsequently, the study delved into analyzing the feasible operational domains of these decision variables. A detailed correlation map was constructed to elucidate the interplay between these variables, thereby uncovering the most impactful factors influencing process behavior. The study offers a practical, insightful operational map that aids operators in pinpointing the optimal process location and prioritizing specific operational goals.
    Breaking the Entanglement of Homophily and Heterophily in Semi-supervised Node Classification. (arXiv:2312.04111v1 [cs.LG])
    Recently, graph neural networks (GNNs) have shown prominent performance in semi-supervised node classification by leveraging knowledge from the graph database. However, most existing GNNs follow the homophily assumption, where connected nodes are more likely to exhibit similar feature distributions and the same labels, and such an assumption has proven to be vulnerable in a growing number of practical applications. As a supplement, heterophily reflects dissimilarity in connected nodes, which has gained significant attention in graph learning. To this end, data engineers aim to develop a powerful GNN model that can ensure performance under both homophily and heterophily. Despite numerous attempts, most existing GNNs struggle to achieve optimal node representations due to the constraints of undirected graphs. The neglect of directed edges results in sub-optimal graph representations, thereby hindering the capacity of GNNs. To address this issue, we introduce AMUD, which quantifies the relationship between node profiles and topology from a statistical perspective, offering valuable insights for \underline{A}daptively \underline{M}odeling the natural directed graphs as the \underline{U}ndirected or \underline{D}irected graph to maximize the benefits from subsequent graph learning. Furthermore, we propose \underline{A}daptive \underline{D}irected \underline{P}attern \underline{A}ggregation (ADPA) as a new directed graph learning paradigm for AMUD. Empirical studies have demonstrated that AMUD guides efficient graph learning. Meanwhile, extensive experiments on 14 benchmark datasets substantiate the impressive performance of ADPA, outperforming baselines by significant margins of 3.96\%.  ( 3 min )
    XCube ($\mathcal{X}^3$): Large-Scale 3D Generative Modeling using Sparse Voxel Hierarchies. (arXiv:2312.03806v1 [cs.CV])
    We present $\mathcal{X}^3$ (pronounced XCube), a novel generative model for high-resolution sparse 3D voxel grids with arbitrary attributes. Our model can generate millions of voxels with a finest effective resolution of up to $1024^3$ in a feed-forward fashion without time-consuming test-time optimization. To achieve this, we employ a hierarchical voxel latent diffusion model which generates progressively higher resolution grids in a coarse-to-fine manner using a custom framework built on the highly efficient VDB data structure. Apart from generating high-resolution objects, we demonstrate the effectiveness of XCube on large outdoor scenes at scales of 100m$\times$100m with a voxel size as small as 10cm. We observe clear qualitative and quantitative improvements over past approaches. In addition to unconditional generation, we show that our model can be used to solve a variety of tasks such as user-guided editing, scene completion from a single scan, and text-to-3D. More results and details can be found at https://research.nvidia.com/labs/toronto-ai/xcube/.  ( 2 min )
    Multi-Scale and Multi-Modal Contrastive Learning Network for Biomedical Time Series. (arXiv:2312.03796v1 [cs.LG])
    Multi-modal biomedical time series (MBTS) data offers a holistic view of the physiological state, holding significant importance in various bio-medical applications. Owing to inherent noise and distribution gaps across different modalities, MBTS can be complex to model. Various deep learning models have been developed to learn representations of MBTS but still fall short in robustness due to the ignorance of modal-to-modal variations. This paper presents a multi-scale and multi-modal biomedical time series representation learning (MBSL) network with contrastive learning to migrate these variations. Firstly, MBTS is grouped based on inter-modal distances, then each group with minimum intra-modal variations can be effectively modeled by individual encoders. Besides, to enhance the multi-scale feature extraction (encoder), various patch lengths and mask ratios are designed to generate tokens with semantic information at different scales and diverse contextual perspectives respectively. Finally, cross-modal contrastive learning is proposed to maximize consistency among inter-modal groups, maintaining useful information and eliminating noises. Experiments against four bio-medical applications show that MBSL outperforms state-of-the-art models by 33.9% mean average errors (MAE) in respiration rate, by 13.8% MAE in exercise heart rate, by 1.41% accuracy in human activity recognition, and by 1.14% F1-score in obstructive sleep apnea-hypopnea syndrome.  ( 2 min )
    Contrastive-Signal-Dependent Plasticity: Forward-Forward Learning of Spiking Neural Systems. (arXiv:2303.18187v2 [cs.NE] UPDATED)
    We develop a neuro-mimetic architecture, composed of spiking neuronal units, where individual layers of neurons operate in parallel and adapt their synaptic efficacies without the use of feedback pathways. Specifically, we propose an event-based generalization of forward-forward learning, which we call contrastive-signal-dependent plasticity (CSDP), for a spiking neural system that iteratively processes sensory input over a stimulus window. The dynamics that underwrite this recurrent circuit entail computing the membrane potential of each processing element, in each layer, as a function of local bottom-up, top-down, and lateral signals, facilitating a dynamic, layer-wise parallel form of neural computation. Unlike other models, such as spiking predictive coding, which rely on feedback synapses to adjust neural electrical activity, our model operates purely online and forward in time, offering a promising way to learn distributed representations of sensory data patterns, with and without labeled context information. Notably, our experimental results on several pattern datasets demonstrate that the CSDP process works well for training a dynamic recurrent spiking network capable of both classification and reconstruction.
    MeanCut: A Greedy-Optimized Graph Clustering via Path-based Similarity and Degree Descent Criterion. (arXiv:2312.04067v1 [cs.LG])
    As the most typical graph clustering method, spectral clustering is popular and attractive due to the remarkable performance, easy implementation, and strong adaptability. Classical spectral clustering measures the edge weights of graph using pairwise Euclidean-based metric, and solves the optimal graph partition by relaxing the constraints of indicator matrix and performing Laplacian decomposition. However, Euclidean-based similarity might cause skew graph cuts when handling non-spherical data distributions, and the relaxation strategy introduces information loss. Meanwhile, spectral clustering requires specifying the number of clusters, which is hard to determine without enough prior knowledge. In this work, we leverage the path-based similarity to enhance intra-cluster associations, and propose MeanCut as the objective function and greedily optimize it in degree descending order for a nondestructive graph partition. This algorithm enables the identification of arbitrary shaped clusters and is robust to noise. To reduce the computational complexity of similarity calculation, we transform optimal path search into generating the maximum spanning tree (MST), and develop a fast MST (FastMST) algorithm to further improve its time-efficiency. Moreover, we define a density gradient factor (DGF) for separating the weakly connected clusters. The validity of our algorithm is demonstrated by testifying on real-world benchmarks and application of face recognition. The source code of MeanCut is available at https://github.com/ZPGuiGroupWhu/MeanCut-Clustering.
    Resource Allocation for Semantic Communication under Physical-layer Security. (arXiv:2312.04155v1 [eess.SP])
    Semantic communication is deemed as a revolution of Shannon's paradigm in the six-generation (6G) wireless networks. It aims at transmitting the extracted information rather than the original data, which receivers will try to recover. Intuitively, the larger extracted information, the longer latency of semantic communication will be. Besides, larger extracted information will result in more accurate reconstructed information, thereby causing a higher utility of the semantic communication system. Shorter latency and higher utility are desirable objectives for the system, so there will be a trade-off between utility and latency. This paper proposes a joint optimization algorithm for total latency and utility. Moreover, security is essential for the semantic communication system. We incorporate the secrecy rate, a physical-layer security method, into the optimization problem. The secrecy rate is the communication rate at which no information is disclosed to an eavesdropper. Experimental results demonstrate that the proposed algorithm obtains the best joint optimization performance compared to the baselines.
    Wavelength-multiplexed Delayed Inputs for Memory Enhancement of Microring-based Reservoir Computing. (arXiv:2312.04204v1 [cs.NE])
    We numerically demonstrate a silicon add-drop microring-based reservoir computing scheme that combines parallel delayed inputs and wavelength division multiplexing. The scheme solves memory-demanding tasks like time-series prediction with good performance without requiring external optical feedback.
    ACE: A fast, skillful learned global atmospheric model for climate prediction. (arXiv:2310.02074v2 [physics.ao-ph] UPDATED)
    Existing ML-based atmospheric models are not suitable for climate prediction, which requires long-term stability and physical consistency. We present ACE (AI2 Climate Emulator), a 200M-parameter, autoregressive machine learning emulator of an existing comprehensive 100-km resolution global atmospheric model. The formulation of ACE allows evaluation of physical laws such as the conservation of mass and moisture. The emulator is stable for 100 years, nearly conserves column moisture without explicit constraints and faithfully reproduces the reference model's climate, outperforming a challenging baseline on over 90% of tracked variables. ACE requires nearly 100x less wall clock time and is 100x more energy efficient than the reference model using typically available resources. Without fine-tuning, ACE can stably generalize to a previously unseen historical sea surface temperature dataset.
    Classifying patient voice in social media data using neural networks: A comparison of AI models on different data sources and therapeutic domains. (arXiv:2312.03747v1 [cs.CL])
    It is essential that healthcare professionals and members of the healthcare community can access and easily understand patient experiences in the real world, so that care standards can be improved and driven towards personalised drug treatment. Social media platforms and message boards are deemed suitable sources of patient experience information, as patients have been observed to discuss and exchange knowledge, look for and provide support online. This paper tests the hypothesis that not all online patient experience information can be treated and collected in the same way, as a result of the inherent differences in the way individuals talk about their journeys, in different therapeutic domains and or data sources. We used linguistic analysis to understand and identify similarities between datasets, across patient language, between data sources (Reddit, SocialGist) and therapeutic domains (cardiovascular, oncology, immunology, neurology). We detected common vocabulary used by patients in the same therapeutic domain across data sources, except for immunology patients, who use unique vocabulary between the two data sources, and compared to all other datasets. We combined linguistically similar datasets to train classifiers (CNN, transformer) to accurately identify patient experience posts from social media, a task we refer to as patient voice classification. The cardiovascular and neurology transformer classifiers perform the best in their respective comparisons for the Reddit data source, achieving F1-scores of 0.865 and 1.0 respectively. The overall best performing classifier is the transformer classifier trained on all data collected for this experiment, achieving F1-scores ranging between 0.863 and 0.995 across all therapeutic domain and data source specific test datasets.  ( 3 min )
    Stochastic-Constrained Stochastic Optimization with Markovian Data. (arXiv:2312.04312v1 [math.OC])
    This paper considers stochastic-constrained stochastic optimization where the stochastic constraint is to satisfy that the expectation of a random function is below a certain threshold. In particular, we study the setting where data samples are drawn from a Markov chain and thus are not independent and identically distributed. We generalize the drift-plus-penalty framework, a primal-dual stochastic gradient method developed for the i.i.d. case, to the Markov chain sampling setting. We propose two variants of drift-plus-penalty; one is for the case when the mixing time of the underlying Markov chain is known while the other is for the case of unknown mixing time. In fact, our algorithms apply to a more general setting of constrained online convex optimization where the sequence of constraint functions follows a Markov chain. Both algorithms are adaptive in that the first works without knowledge of the time horizon while the second uses AdaGrad-style algorithm parameters, which is of independent interest. We demonstrate the effectiveness of our proposed methods through numerical experiments on classification with fairness constraints.
    On the adaptation of in-context learners for system identification. (arXiv:2312.04083v1 [cs.LG])
    In-context system identification aims at constructing meta-models to describe classes of systems, differently from traditional approaches that model single systems. This paradigm facilitates the leveraging of knowledge acquired from observing the behaviour of different, yet related dynamics. This paper discusses the role of meta-model adaptation. Through numerical examples, we demonstrate how meta-model adaptation can enhance predictive performance in three realistic scenarios: tailoring the meta-model to describe a specific system rather than a class; extending the meta-model to capture the behaviour of systems beyond the initial training class; and recalibrating the model for new prediction tasks. Results highlight the effectiveness of meta-model adaptation to achieve a more robust and versatile meta-learning framework for system identification.
    Jointly spatial-temporal representation learning for individual trajectories. (arXiv:2312.04055v1 [cs.LG])
    Individual trajectories, containing substantial information on human-environment interactions across space and time, is a crucial input for geospatial foundation models (GeoFMs). However, existing attempts, leveraging trajectory data for various applications have overlooked the implicit spatial-temporal dependency within trajectories and failed to encode and represent it in a format friendly to deep learning, posing a challenge in obtaining general-purpose trajectory representations. Therefore, this paper proposes a spatial-temporal joint representation learning method (ST-GraphRL) to formalize learnable spatial-temporal dependencies into trajectory representations. The proposed ST-GraphRL consists of three compositions: (i) a weighted directed spatial-temporal graph to explicitly construct mobility interactions over both space and time dimensions; (ii) a two-stage jointly encoder (i.e., decoupling and fusion) to learn entangled spatial-temporal dependencies by independently decomposing and jointly aggregating space and time information; (iii) a decoder guides ST-GraphRL to learn explicit mobility regularities by simulating the spatial-temporal distributions of trajectories. Tested on three real-world human mobility datasets, the proposed ST-GraphRL outperformed all the baseline models in predicting movement spatial-temporal distributions and preserving trajectory similarity with high spatial-temporal correlations. We also explore how spatial-temporal features presented in latent space, validating that ST-GraphRL understands spatial-temporal patterns. This method is also transferable for general-purpose geospatial data representations for broad downstream tasks, as well advancing GeoFMs developing.
    Caregiver Talk Shapes Toddler Vision: A Computational Study of Dyadic Play. (arXiv:2312.04118v1 [cs.CV])
    Infants' ability to recognize and categorize objects develops gradually. The second year of life is marked by both the emergence of more semantic visual representations and a better understanding of word meaning. This suggests that language input may play an important role in shaping visual representations. However, even in suitable contexts for word learning like dyadic play sessions, caregivers utterances are sparse and ambiguous, often referring to objects that are different from the one to which the child attends. Here, we systematically investigate to what extent caregivers' utterances can nevertheless enhance visual representations. For this we propose a computational model of visual representation learning during dyadic play. We introduce a synthetic dataset of ego-centric images perceived by a toddler-agent that moves and rotates toy objects in different parts of its home environment while hearing caregivers' utterances, modeled as captions. We propose to model toddlers' learning as simultaneously aligning representations for 1) close-in-time images and 2) co-occurring images and utterances. We show that utterances with statistics matching those of real caregivers give rise to representations supporting improved category recognition. Our analysis reveals that a small decrease/increase in object-relevant naming frequencies can drastically impact the learned representations. This affects the attention on object names within an utterance, which is required for efficient visuo-linguistic alignment. Overall, our results support the hypothesis that caregivers' naming utterances can improve toddlers' visual representations.
    Investigating the Design Space of Diffusion Models for Speech Enhancement. (arXiv:2312.04370v1 [eess.AS])
    Diffusion models are a new class of generative models that have shown outstanding performance in image generation literature. As a consequence, studies have attempted to apply diffusion models to other tasks, such as speech enhancement. A popular approach in adapting diffusion models to speech enhancement consists in modelling a progressive transformation between the clean and noisy speech signals. However, one popular diffusion model framework previously laid in image generation literature did not account for such a transformation towards the system input, which prevents from relating the existing diffusion-based speech enhancement systems with the aforementioned diffusion model framework. To address this, we extend this framework to account for the progressive transformation between the clean and noisy speech signals. This allows us to apply recent developments from image generation literature, and to systematically investigate design aspects of diffusion models that remain largely unexplored for speech enhancement, such as the neural network preconditioning, the training loss weighting, the stochastic differential equation (SDE), or the amount of stochasticity injected in the reverse process. We show that the performance of previous diffusion-based speech enhancement systems cannot be attributed to the progressive transformation between the clean and noisy speech signals. Moreover, we show that a proper choice of preconditioning, training loss weighting, SDE and sampler allows to outperform a popular diffusion-based speech enhancement system in terms of perceptual metrics while using fewer sampling steps, thus reducing the computational cost by a factor of four.
    Privacy-preserving quantum federated learning via gradient hiding. (arXiv:2312.04447v1 [quant-ph])
    Distributed quantum computing, particularly distributed quantum machine learning, has gained substantial prominence for its capacity to harness the collective power of distributed quantum resources, transcending the limitations of individual quantum nodes. Meanwhile, the critical concern of privacy within distributed computing protocols remains a significant challenge, particularly in standard classical federated learning (FL) scenarios where data of participating clients is susceptible to leakage via gradient inversion attacks by the server. This paper presents innovative quantum protocols with quantum communication designed to address the FL problem, strengthen privacy measures, and optimize communication efficiency. In contrast to previous works that leverage expressive variational quantum circuits or differential privacy techniques, we consider gradient information concealment using quantum states and propose two distinct FL protocols, one based on private inner-product estimation and the other on incremental learning. These protocols offer substantial advancements in privacy preservation with low communication resources, forging a path toward efficient quantum communication-assisted FL protocols and contributing to the development of secure distributed quantum machine learning, thus addressing critical privacy concerns in the quantum computing era.
    Equivariant Scalar Fields for Molecular Docking with Fast Fourier Transforms. (arXiv:2312.04323v1 [q-bio.BM])
    Molecular docking is critical to structure-based virtual screening, yet the throughput of such workflows is limited by the expensive optimization of scoring functions involved in most docking algorithms. We explore how machine learning can accelerate this process by learning a scoring function with a functional form that allows for more rapid optimization. Specifically, we define the scoring function to be the cross-correlation of multi-channel ligand and protein scalar fields parameterized by equivariant graph neural networks, enabling rapid optimization over rigid-body degrees of freedom with fast Fourier transforms. The runtime of our approach can be amortized at several levels of abstraction, and is particularly favorable for virtual screening settings with a common binding pocket. We benchmark our scoring functions on two simplified docking-related tasks: decoy pose scoring and rigid conformer docking. Our method attains similar but faster performance on crystal structures compared to the widely-used Vina and Gnina scoring functions, and is more robust on computationally predicted structures. Code is available at https://github.com/bjing2016/scalar-fields.
    A Transformer Model for Symbolic Regression towards Scientific Discovery. (arXiv:2312.04070v1 [cs.LG])
    Symbolic Regression (SR) searches for mathematical expressions which best describe numerical datasets. This allows to circumvent interpretation issues inherent to artificial neural networks, but SR algorithms are often computationally expensive. This work proposes a new Transformer model aiming at Symbolic Regression particularly focused on its application for Scientific Discovery. We propose three encoder architectures with increasing flexibility but at the cost of column-permutation equivariance violation. Training results indicate that the most flexible architecture is required to prevent from overfitting. Once trained, we apply our best model to the SRSD datasets (Symbolic Regression for Scientific Discovery datasets) which yields state-of-the-art results using the normalized tree-based edit distance, at no extra computational cost.
    A Pseudo-Semantic Loss for Autoregressive Models with Logical Constraints. (arXiv:2312.03905v1 [cs.LG])
    Neuro-symbolic AI bridges the gap between purely symbolic and neural approaches to learning. This often requires maximizing the likelihood of a symbolic constraint w.r.t the neural network's output distribution. Such output distributions are typically assumed to be fully-factorized. This limits the applicability of neuro-symbolic learning to the more expressive autoregressive distributions, e.g., transformers. Under such distributions, computing the likelihood of even simple constraints is #P-hard. Instead of attempting to enforce the constraint on the entire output distribution, we propose to do so on a random, local approximation thereof. More precisely, we optimize the likelihood of the constraint under a pseudolikelihood-based approximation centered around a model sample. Our approximation is factorized, allowing the reuse of solutions to sub-problems, a main tenet for efficiently computing neuro-symbolic losses. Moreover, it is a local, high-fidelity approximation of the likelihood, exhibiting low entropy and KL-divergence around the model sample. We evaluate our approach on Sudoku and shortest-path prediction cast as autoregressive generation, and observe that we greatly improve upon the base model's ability to predict logically-consistent outputs. We also evaluate on the task of detoxifying large language models. Using a simple constraint disallowing a list of toxic words, we are able to steer the model's outputs away from toxic generations, achieving SoTA detoxification compared to previous approaches.
    A Parameterized Generative Adversarial Network Using Cyclic Projection for Explainable Medical Image Classification. (arXiv:2311.14388v2 [cs.CV] UPDATED)
    Although current data augmentation methods are successful to alleviate the data insufficiency, conventional augmentation are primarily intra-domain while advanced generative adversarial networks (GANs) generate images remaining uncertain, particularly in small-scale datasets. In this paper, we propose a parameterized GAN (ParaGAN) that effectively controls the changes of synthetic samples among domains and highlights the attention regions for downstream classification. Specifically, ParaGAN incorporates projection distance parameters in cyclic projection and projects the source images to the decision boundary to obtain the class-difference maps. Our experiments show that ParaGAN can consistently outperform the existing augmentation methods with explainable classification on two small-scale medical datasets.
    LAVA: Data Valuation without Pre-Specified Learning Algorithms. (arXiv:2305.00054v2 [cs.LG] UPDATED)
    Traditionally, data valuation (DV) is posed as a problem of equitably splitting the validation performance of a learning algorithm among the training data. As a result, the calculated data values depend on many design choices of the underlying learning algorithm. However, this dependence is undesirable for many DV use cases, such as setting priorities over different data sources in a data acquisition process and informing pricing mechanisms in a data marketplace. In these scenarios, data needs to be valued before the actual analysis and the choice of the learning algorithm is still undetermined then. Another side-effect of the dependence is that to assess the value of individual points, one needs to re-run the learning algorithm with and without a point, which incurs a large computation burden. This work leapfrogs over the current limits of data valuation methods by introducing a new framework that can value training data in a way that is oblivious to the downstream learning algorithm. Our main results are as follows. (1) We develop a proxy for the validation performance associated with a training set based on a non-conventional class-wise Wasserstein distance between training and validation sets. We show that the distance characterizes the upper bound of the validation performance for any given model under certain Lipschitz conditions. (2) We develop a novel method to value individual data based on the sensitivity analysis of the class-wise Wasserstein distance. Importantly, these values can be directly obtained for free from the output of off-the-shelf optimization solvers when computing the distance. (3) We evaluate our new data valuation framework over various use cases related to detecting low-quality data and show that, surprisingly, the learning-agnostic feature of our framework enables a significant improvement over SOTA performance while being orders of magnitude faster.
    Monitoring Sustainable Global Development Along Shared Socioeconomic Pathways. (arXiv:2312.04416v1 [cs.LG])
    Sustainable global development is one of the most prevalent challenges facing the world today, hinging on the equilibrium between socioeconomic growth and environmental sustainability. We propose approaches to monitor and quantify sustainable development along the Shared Socioeconomic Pathways (SSPs), including mathematically derived scoring algorithms, and machine learning methods. These integrate socioeconomic and environmental datasets, to produce an interpretable metric for SSP alignment. An initial study demonstrates promising results, laying the groundwork for the application of different methods to the monitoring of sustainable global development.
    Understanding the Role of Optimization in Double Descent. (arXiv:2312.03951v1 [cs.LG])
    The phenomenon of model-wise double descent, where the test error peaks and then reduces as the model size increases, is an interesting topic that has attracted the attention of researchers due to the striking observed gap between theory and practice \citep{Belkin2018ReconcilingMM}. Additionally, while double descent has been observed in various tasks and architectures, the peak of double descent can sometimes be noticeably absent or diminished, even without explicit regularization, such as weight decay and early stopping. In this paper, we investigate this intriguing phenomenon from the optimization perspective and propose a simple optimization-based explanation for why double descent sometimes occurs weakly or not at all. To the best of our knowledge, we are the first to demonstrate that many disparate factors contributing to model-wise double descent (initialization, normalization, batch size, learning rate, optimization algorithm) are unified from the viewpoint of optimization: model-wise double descent is observed if and only if the optimizer can find a sufficiently low-loss minimum. These factors directly affect the condition number of the optimization problem or the optimizer and thus affect the final minimum found by the optimizer, reducing or increasing the height of the double descent peak. We conduct a series of controlled experiments on random feature models and two-layer neural networks under various optimization settings, demonstrating this optimization-based unified view. Our results suggest the following implication: Double descent is unlikely to be a problem for real-world machine learning setups. Additionally, our results help explain the gap between weak double descent peaks in practice and strong peaks observable in carefully designed setups.
    A Study on the Calibration of In-context Learning. (arXiv:2312.04021v1 [cs.CL])
    Modern auto-regressive language models are trained to minimize log loss on broad data by predicting the next token so they are expected to get calibrated answers when framing a problem as a next-token prediction task. We study this for in-context learning (ICL), a widely used way to adapt frozen large language models (LLMs) via crafting prompts, and investigate the trade-offs between performance and calibration on a wide range of natural language understanding and reasoning tasks. We conduct extensive experiments to show that such trade-offs may get worse as we increase model size, incorporate more ICL examples, and fine-tune models using instruction, dialog, or reinforcement learning from human feedback (RLHF) on carefully curated datasets. Furthermore, we find that common recalibration techniques that are widely effective such as temperature scaling provide limited gains in calibration errors, suggesting that new methods may be required for settings where models are expected to be reliable.
    Modeling Boundedly Rational Agents with Latent Inference Budgets. (arXiv:2312.04030v1 [cs.AI])
    We study the problem of modeling a population of agents pursuing unknown goals subject to unknown computational constraints. In standard models of bounded rationality, sub-optimal decision-making is simulated by adding homoscedastic noise to optimal decisions rather than explicitly simulating constrained inference. In this work, we introduce a latent inference budget model (L-IBM) that models agents' computational constraints explicitly, via a latent variable (inferred jointly with a model of agents' goals) that controls the runtime of an iterative inference algorithm. L-IBMs make it possible to learn agent models using data from diverse populations of suboptimal actors. In three modeling tasks -- inferring navigation goals from routes, inferring communicative intents from human utterances, and predicting next moves in human chess games -- we show that L-IBMs match or outperform Boltzmann models of decision-making under uncertainty. Inferred inference budgets are themselves meaningful, efficient to compute, and correlated with measures of player skill, partner skill and task difficulty.
    Node-aware Bi-smoothing: Certified Robustness against Graph Injection Attacks. (arXiv:2312.03979v1 [cs.LG])
    Deep Graph Learning (DGL) has emerged as a crucial technique across various domains. However, recent studies have exposed vulnerabilities in DGL models, such as susceptibility to evasion and poisoning attacks. While empirical and provable robustness techniques have been developed to defend against graph modification attacks (GMAs), the problem of certified robustness against graph injection attacks (GIAs) remains largely unexplored. To bridge this gap, we introduce the node-aware bi-smoothing framework, which is the first certifiably robust approach for general node classification tasks against GIAs. Notably, the proposed node-aware bi-smoothing scheme is model-agnostic and is applicable for both evasion and poisoning attacks. Through rigorous theoretical analysis, we establish the certifiable conditions of our smoothing scheme. We also explore the practical implications of our node-aware bi-smoothing schemes in two contexts: as an empirical defense approach against real-world GIAs and in the context of recommendation systems. Furthermore, we extend two state-of-the-art certified robustness frameworks to address node injection attacks and compare our approach against them. Extensive evaluations demonstrate the effectiveness of our proposed certificates.
    ChatGPT Application In Summarizing An Evolution Of Deep Learning Techniques In Imaging: A Qualitative Study. (arXiv:2312.03723v1 [cs.CL])
    The pursuit of article or text summarization has captured the attention of natural language processing (NLP) practitioners, presenting itself as a formidable challenge. ChatGPT 3.5 exhibits the capacity to condense the content of up to 3000 tokens into a single page, aiming to retain pivotal information from a given text across diverse themes. In a conducted qualitative research endeavor, we selected seven scientific articles and employed the publicly available ChatGPT service to generate summaries of these articles. Subsequently, we engaged six co-authors of the articles in a survey, presenting five questions to evaluate the quality of the summaries compared to the original content. The findings revealed that the summaries produced by ChatGPT effectively encapsulated the crucial information present in the articles, preserving the principal message of each manuscript. Nonetheless, there was a slight diminishment in the technical depth of the summaries as opposed to the original articles. As a result, our conclusion underscores ChatGPT's text summarization capability as a potent tool for extracting essential insights in a manner more aligned with reporting than purely scientific discourse.
    Factor-Assisted Federated Learning for Personalized Optimization with Heterogeneous Data. (arXiv:2312.04281v1 [stat.ML])
    Federated learning is an emerging distributed machine learning framework aiming at protecting data privacy. Data heterogeneity is one of the core challenges in federated learning, which could severely degrade the convergence rate and prediction performance of deep neural networks. To address this issue, we develop a novel personalized federated learning framework for heterogeneous data, which we refer to as FedSplit. This modeling framework is motivated by the finding that, data in different clients contain both common knowledge and personalized knowledge. Then the hidden elements in each neural layer can be split into the shared and personalized groups. With this decomposition, a novel objective function is established and optimized. We demonstrate FedSplit enjoyers a faster convergence speed than the standard federated learning method both theoretically and empirically. The generalization bound of the FedSplit method is also studied. To practically implement the proposed method on real datasets, factor analysis is introduced to facilitate the decoupling of hidden elements. This leads to a practically implemented model for FedSplit and we further refer to as FedFac. We demonstrated by simulation studies that, using factor analysis can well recover the underlying shared/personalized decomposition. The superior prediction performance of FedFac is further verified empirically by comparison with various state-of-the-art federated learning methods on several real datasets.
    Clinical Risk Prediction Using Language Models: Benefits And Considerations. (arXiv:2312.03742v1 [cs.CL])
    The utilization of Electronic Health Records (EHRs) for clinical risk prediction is on the rise. However, strict privacy regulations limit access to comprehensive health records, making it challenging to apply standard machine learning algorithms in practical real-world scenarios. Previous research has addressed this data limitation by incorporating medical ontologies and employing transfer learning methods. In this study, we investigate the potential of leveraging language models (LMs) as a means to incorporate supplementary domain knowledge for improving the performance of various EHR-based risk prediction tasks. Unlike applying LMs to unstructured EHR data such as clinical notes, this study focuses on using textual descriptions within structured EHR to make predictions exclusively based on that information. We extensively compare against previous approaches across various data types and sizes. We find that employing LMs to represent structured EHRs, such as diagnostic histories, leads to improved or at least comparable performance in diverse risk prediction tasks. Furthermore, LM-based approaches offer numerous advantages, including few-shot learning, the capability to handle previously unseen medical concepts, and adaptability to various medical vocabularies. Nevertheless, we underscore, through various experiments, the importance of being cautious when employing such models, as concerns regarding the reliability of LMs persist.
    Leveraging AI-derived Data for Carbon Accounting: Information Extraction from Alternative Sources. (arXiv:2312.03722v1 [cs.CL])
    Carbon accounting is a fundamental building block in our global path to emissions reduction and decarbonization, yet many challenges exist in achieving reliable and trusted carbon accounting measures. We motivate that carbon accounting not only needs to be more data-driven, but also more methodologically sound. We discuss the need for alternative, more diverse data sources that can play a significant role on our path to trusted carbon accounting procedures and elaborate on not only why, but how Artificial Intelligence (AI) in general and Natural Language Processing (NLP) in particular can unlock reasonable access to a treasure trove of alternative data sets in light of the recent advances in the field that better enable the utilization of unstructured data in this process. We present a case study of the recent developments on real-world data via an NLP-powered analysis using OpenAI's GPT API on financial and shipping data. We conclude the paper with a discussion on how these methods and approaches can be integrated into a broader framework for AI-enabled integrative carbon accounting.
    Emergence of Latent Binary Encoding in Deep Neural Network Classifiers. (arXiv:2310.08224v2 [cs.LG] UPDATED)
    We observe the emergence of binary encoding within the latent space of deep-neural-network classifiers. Such binary encoding is induced by introducing a linear penultimate layer, which is equipped during training with a loss function that grows as $\exp(\vec{x}^2)$, where $\vec{x}$ are the coordinates in the latent space. The phenomenon we describe represents a specific instance of a well-documented occurrence known as \textit{neural collapse}, which arises in the terminal phase of training and entails the collapse of latent class means to the vertices of a simplex equiangular tight frame (ETF). We show that binary encoding accelerates convergence toward the simplex ETF and enhances classification accuracy.
    Practical, Private Assurance of the Value of Collaboration. (arXiv:2310.02563v2 [cs.CR] UPDATED)
    Two parties wish to collaborate on their datasets. However, before they reveal their datasets to each other, the parties want to have the guarantee that the collaboration would be fruitful. We look at this problem from the point of view of machine learning, where one party is promised an improvement on its prediction model by incorporating data from the other party. The parties would only wish to collaborate further if the updated model shows an improvement in accuracy. Before this is ascertained, the two parties would not want to disclose their models and datasets. In this work, we construct an interactive protocol for this problem based on the fully homomorphic encryption scheme over the Torus (TFHE) and label differential privacy, where the underlying machine learning model is a neural network. Label differential privacy is used to ensure that computations are not done entirely in the encrypted domain, which is a significant bottleneck for neural network training according to the current state-of-the-art FHE implementations. We prove the security of our scheme in the universal composability framework assuming honest-but-curious parties, but where one party may not have any expertise in labelling its initial dataset. Experiments show that we can obtain the output, i.e., the accuracy of the updated model, with time many orders of magnitude faster than a protocol using entirely FHE operations.
    Attention-Driven Multi-Modal Fusion: Enhancing Sign Language Recognition and Translation. (arXiv:2309.01860v3 [cs.CV] UPDATED)
    In this paper, we devise a mechanism for the addition of multi-modal information with an existing pipeline for continuous sign language recognition and translation. In our procedure, we have incorporated optical flow information with RGB images to enrich the features with movement-related information. This work studies the feasibility of such modality inclusion using a cross-modal encoder. The plugin we have used is very lightweight and doesn't need to include a separate feature extractor for the new modality in an end-to-end manner. We have applied the changes in both sign language recognition and translation, improving the result in each case. We have evaluated the performance on the RWTH-PHOENIX-2014 dataset for sign language recognition and the RWTH-PHOENIX-2014T dataset for translation. On the recognition task, our approach reduced the WER by 0.9, and on the translation task, our approach increased most of the BLEU scores by ~0.6 on the test set.
    Calibration in Machine Learning Uncertainty Quantification: beyond consistency to target adaptivity. (arXiv:2309.06240v2 [stat.ML] UPDATED)
    Reliable uncertainty quantification (UQ) in machine learning (ML) regression tasks is becoming the focus of many studies in materials and chemical science. It is now well understood that average calibration is insufficient, and most studies implement additional methods testing the conditional calibration with respect to uncertainty, i.e. consistency. Consistency is assessed mostly by so-called reliability diagrams. There exists however another way beyond average calibration, which is conditional calibration with respect to input features, i.e. adaptivity. In practice, adaptivity is the main concern of the final users of a ML-UQ method, seeking for the reliability of predictions and uncertainties for any point in features space. This article aims to show that consistency and adaptivity are complementary validation targets, and that a good consistency does not imply a good adaptivity. Adapted validation methods are proposed and illustrated on a representative example.
    Deep Dynamics: Vehicle Dynamics Modeling with a Physics-Informed Neural Network for Autonomous Racing. (arXiv:2312.04374v1 [cs.RO])
    Autonomous racing is a critical research area for autonomous driving, presenting significant challenges in vehicle dynamics modeling, such as balancing model precision and computational efficiency at high speeds (>280kmph), where minor errors in modeling have severe consequences. Existing physics-based models for vehicle dynamics require elaborate testing setups and tuning, which are hard to implement, time-intensive, and cost-prohibitive. Conversely, purely data-driven approaches do not generalize well and cannot adequately ensure physical constraints on predictions. This paper introduces Deep Dynamics, a physics-informed neural network (PINN) for vehicle dynamics modeling of an autonomous racecar. It combines physics coefficient estimation and dynamical equations to accurately predict vehicle states at high speeds and includes a unique Physics Guard layer to ensure internal coefficient estimates remain within their nominal physical ranges. Open-loop and closed-loop performance assessments, using a physics-based simulator and full-scale autonomous Indy racecar data, highlight Deep Dynamics as a promising approach for modeling racecar vehicle dynamics.
    How Many Validation Labels Do You Need? Exploring the Design Space of Label-Efficient Model Ranking. (arXiv:2312.01619v2 [cs.LG] UPDATED)
    The paper introduces LEMR, a framework that reduces annotation costs for model selection tasks. Our approach leverages ensemble methods to generate pseudo-labels, employs uncertainty sampling for target acquisition, and utilizes a Z-score mechanism for iterative committee reelection to refine model ranks. We present a systematic study across various selection metrics, demonstrating that LEMR achieves comparable results to fully labeled datasets with a fraction of the labeling budget. Our findings indicate that LEMR not only economizes the labeling effort in weak supervision and semi-supervised learning settings but also effectively guides prompt selection for large language models. With extensive experiments across 23 tasks, we reveal that our framework can dramatically decrease the labeling cost without compromising the accuracy of model selection, thereby offering a cost-effective alternative to traditional practices.
    Evaluating Large Language Model Creativity from a Literary Perspective. (arXiv:2312.03746v1 [cs.CL])
    This paper assesses the potential for large language models (LLMs) to serve as assistive tools in the creative writing process, by means of a single, in-depth case study. In the course of the study, we develop interactive and multi-voice prompting strategies that interleave background descriptions (scene setting, plot elements), instructions that guide composition, samples of text in the target style, and critical discussion of the given samples. We qualitatively evaluate the results from a literary critical perspective, as well as from the standpoint of computational creativity (a sub-field of artificial intelligence). Our findings lend support to the view that the sophistication of the results that can be achieved with an LLM mirrors the sophistication of the prompting.
    Adaptive Dependency Learning Graph Neural Networks. (arXiv:2312.03903v1 [cs.LG])
    Graph Neural Networks (GNN) have recently gained popularity in the forecasting domain due to their ability to model complex spatial and temporal patterns in tasks such as traffic forecasting and region-based demand forecasting. Most of these methods require a predefined graph as input, whereas in real-life multivariate time series problems, a well-predefined dependency graph rarely exists. This requirement makes it harder for GNNs to be utilised widely for multivariate forecasting problems in other domains such as retail or energy. In this paper, we propose a hybrid approach combining neural networks and statistical structure learning models to self-learn the dependencies and construct a dynamically changing dependency graph from multivariate data aiming to enable the use of GNNs for multivariate forecasting even when a well-defined graph does not exist. The statistical structure modeling in conjunction with neural networks provides a well-principled and efficient approach by bringing in causal semantics to determine dependencies among the series. Finally, we demonstrate significantly improved performance using our proposed approach on real-world benchmark datasets without a pre-defined dependency graph.
    LLM as OS (llmao), Agents as Apps: Envisioning AIOS, Agents and the AIOS-Agent Ecosystem. (arXiv:2312.03815v1 [cs.OS])
    This paper envisions a revolutionary AIOS-Agent ecosystem, where Large Language Model (LLM) serves as the (Artificial) Intelligent Operating System (IOS, or AIOS)--an operating system ``with soul''. Upon this foundation, a diverse range of LLM-based AI Agent Applications (Agents, or AAPs) are developed, enriching the AIOS-Agent ecosystem and signaling a paradigm shift from the traditional OS-APP ecosystem. We envision that LLM's impact will not be limited to the AI application level, instead, it will in turn revolutionize the design and implementation of computer system, architecture, software, and programming language, featured by several main concepts: LLM as OS (system-level), Agents as Applications (application-level), Natural Language as Programming Interface (user-level), and Tools as Devices/Libraries (hardware/middleware-level).
    Achieving ${O}(\epsilon^{-1.5})$ Complexity in Hessian/Jacobian-free Stochastic Bilevel Optimization. (arXiv:2312.03807v1 [math.OC])
    In this paper, we revisit the bilevel optimization problem, in which the upper-level objective function is generally nonconvex and the lower-level objective function is strongly convex. Although this type of problem has been studied extensively, it still remains an open question how to achieve an ${O}(\epsilon^{-1.5})$ sample complexity of ${O}(\epsilon^{-1.5})$ in Hessian/Jacobian-free stochastic bilevel optimization without any second-order derivative computation. To fill this gap, we propose a novel Hessian/Jacobian-free bilevel optimizer named FdeHBO, which features a simple fully single-loop structure, a projection-aided finite-difference Hessian/Jacobian-vector approximation, and momentum-based updates. Theoretically, we show that FdeHBO requires ${O}(\epsilon^{-1.5})$ iterations (each using ${O}(1)$ samples and only first-order gradient information) to find an $\epsilon$-accurate stationary point. As far as we know, this is the first Hessian/Jacobian-free method with an ${O}(\epsilon^{-1.5})$ sample complexity for nonconvex-strongly-convex stochastic bilevel optimization.
    PCDP-SGD: Improving the Convergence of Differentially Private SGD via Projection in Advance. (arXiv:2312.03792v1 [cs.CR])
    The paradigm of Differentially Private SGD~(DP-SGD) can provide a theoretical guarantee for training data in both centralized and federated settings. However, the utility degradation caused by DP-SGD limits its wide application in high-stakes tasks, such as medical image diagnosis. In addition to the necessary perturbation, the convergence issue is attributed to the information loss on the gradient clipping. In this work, we propose a general framework PCDP-SGD, which aims to compress redundant gradient norms and preserve more crucial top gradient components via projection operation before gradient clipping. Additionally, we extend PCDP-SGD as a fundamental component in differential privacy federated learning~(DPFL) for mitigating the data heterogeneous challenge and achieving efficient communication. We prove that pre-projection enhances the convergence of DP-SGD by reducing the dependence of clipping error and bias to a fraction of the top gradient eigenspace, and in theory, limits cross-client variance to improve the convergence under heterogeneous federation. Experimental results demonstrate that PCDP-SGD achieves higher accuracy compared with state-of-the-art DP-SGD variants in computer vision tasks. Moreover, PCDP-SGD outperforms current federated learning frameworks when DP is guaranteed on local training sets.
    Gaussian3Diff: 3D Gaussian Diffusion for 3D Full Head Synthesis and Editing. (arXiv:2312.03763v1 [cs.CV])
    We present a novel framework for generating photorealistic 3D human head and subsequently manipulating and reposing them with remarkable flexibility. The proposed approach leverages an implicit function representation of 3D human heads, employing 3D Gaussians anchored on a parametric face model. To enhance representational capabilities and encode spatial information, we embed a lightweight tri-plane payload within each Gaussian rather than directly storing color and opacity. Additionally, we parameterize the Gaussians in a 2D UV space via a 3DMM, enabling effective utilization of the diffusion model for 3D head avatar generation. Our method facilitates the creation of diverse and realistic 3D human heads with fine-grained editing over facial features and expressions. Extensive experiments demonstrate the effectiveness of our method.
    Enhancing the Rationale-Input Alignment for Self-explaining Rationalization. (arXiv:2312.04103v1 [cs.AI])
    Rationalization empowers deep learning models with self-explaining capabilities through a cooperative game, where a generator selects a semantically consistent subset of the input as a rationale, and a subsequent predictor makes predictions based on the selected rationale. In this paper, we discover that rationalization is prone to a problem named \emph{rationale shift}, which arises from the algorithmic bias of the cooperative game. Rationale shift refers to a situation where the semantics of the selected rationale may deviate from the original input, but the predictor still produces accurate predictions based on the deviation, resulting in a compromised generator with misleading feedback. To address this issue, we first demonstrate the importance of the alignment between the rationale and the full input through both empirical observations and theoretical analysis. Subsequently, we introduce a novel approach called DAR (\textbf{D}iscriminatively \textbf{A}ligned \textbf{R}ationalization), which utilizes an auxiliary module pretrained on the full input to discriminatively align the selected rationale and the original input. We theoretically illustrate how DAR accomplishes the desired alignment, thereby overcoming the rationale shift problem. The experiments on two widely used real-world benchmarks show that the proposed method significantly improves the explanation quality (measured by the overlap between the model-selected explanation and the human-annotated rationale) as compared to state-of-the-art techniques. Additionally, results on two synthetic settings further validate the effectiveness of DAR in addressing the rationale shift problem.
    CODEX: A Cluster-Based Method for Explainable Reinforcement Learning. (arXiv:2312.04216v1 [cs.LG])
    Despite the impressive feats demonstrated by Reinforcement Learning (RL), these algorithms have seen little adoption in high-risk, real-world applications due to current difficulties in explaining RL agent actions and building user trust. We present Counterfactual Demonstrations for Explanation (CODEX), a method that incorporates semantic clustering, which can effectively summarize RL agent behavior in the state-action space. Experimentation on the MiniGrid and StarCraft II gaming environments reveals the semantic clusters retain temporal as well as entity information, which is reflected in the constructed summary of agent behavior. Furthermore, clustering the discrete+continuous game-state latent representations identifies the most crucial episodic events, demonstrating a relationship between the latent and semantic spaces. This work contributes to the growing body of work that strives to unlock the power of RL for widespread use by leveraging and extending techniques from Natural Language Processing.
    MultiGPrompt for Multi-Task Pre-Training and Prompting on Graphs. (arXiv:2312.03731v1 [cs.CL])
    Graphs can inherently model interconnected objects on the Web, thereby facilitating a series of Web applications, such as web analyzing and content recommendation. Recently, Graph Neural Networks (GNNs) have emerged as a mainstream technique for graph representation learning. However, their efficacy within an end-to-end supervised framework is significantly tied to the availabilityof task-specific labels. To mitigate labeling costs and enhance robustness in few-shot settings, pre-training on self-supervised tasks has emerged as a promising method, while prompting has been proposed to further narrow the objective gap between pretext and downstream tasks. Although there has been some initial exploration of prompt-based learning on graphs, they primarily leverage a single pretext task, resulting in a limited subset of general knowledge that could be learned from the pre-training data. Hence, in this paper, we propose MultiGPrompt, a novel multi-task pre-training and prompting framework to exploit multiple pretext tasks for more comprehensive pre-trained knowledge. First, in pre-training, we design a set of pretext tokens to synergize multiple pretext tasks. Second, we propose a dual-prompt mechanism consisting of composed and open prompts to leverage task-specific and global pre-training knowledge, to guide downstream tasks in few-shot settings. Finally, we conduct extensive experiments on six public datasets to evaluate and analyze MultiGPrompt.
    Easy Data Augmentation in Sentiment Analysis of Cyberbullying. (arXiv:2312.03743v1 [cs.CL])
    Instagram, a social media platform, has in the vicinity of 2 billion active users in 2023. The platform allows users to post photos and videos with one another. However, cyberbullying remains a significant problem for about 50% of young Indonesians. To address this issue, sentiment analysis for comment filtering uses a Support Vector Machine (SVM) and Easy Data Augmentation (EDA). EDA will augment the dataset, enabling robust prediction and analysis of cyberbullying by introducing more variation. Based on the tests, SVM combination with EDA results in a 2.52% increase in the k-Fold Cross Validation score. Our proposed approach shows an improved accuracy of 92.5%, 2.5% higher than that of the existing state-of-the-art method. To maintain the reproducibility and replicability of this research, the source code can be accessed at uns.id/eda_svm.
    Trajectory-User Linking via Hierarchical Spatio-Temporal Attention Networks. (arXiv:2302.10903v2 [cs.LG] UPDATED)
    Trajectory-User Linking (TUL) is crucial for human mobility modeling by linking diferent trajectories to users with the exploration of complex mobility patterns. Existing works mainly rely on the recurrent neural framework to encode the temporal dependencies in trajectories, have fall short in capturing spatial-temporal global context for TUL prediction. To ill this gap, this work presents a new hierarchical spatio-temporal attention neural network, called AttnTUL, to jointly encode the local trajectory transitional patterns and global spatial dependencies for TUL. Speciically, our irst model component is built over the graph neural architecture to preserve the local and global context and enhance the representation paradigm of geographical regions and user trajectories. Additionally, a hierarchically structured attention network is designed to simultaneously encode the intra-trajectory and inter-trajectory dependencies, with the integration of the temporal attention mechanism and global elastic attentional encoder. Extensive experiments demonstrate the superiority of our AttnTUL method as compared to state-of-the-art baselines on various trajectory datasets. The source code of our model is available at https://github.com/Onedean/AttnTUL.
    High Pileup Particle Tracking with Object Condensation. (arXiv:2312.03823v1 [physics.data-an])
    Recent work has demonstrated that graph neural networks (GNNs) can match the performance of traditional algorithms for charged particle tracking while improving scalability to meet the computing challenges posed by the HL-LHC. Most GNN tracking algorithms are based on edge classification and identify tracks as connected components from an initial graph containing spurious connections. In this talk, we consider an alternative based on object condensation (OC), a multi-objective learning framework designed to cluster points (hits) belonging to an arbitrary number of objects (tracks) and regress the properties of each object. Building on our previous results, we present a streamlined model and show progress toward a one-shot OC tracking algorithm in a high-pileup environment.
    CLadder: A Benchmark to Assess Causal Reasoning Capabilities of Language Models. (arXiv:2312.04350v1 [cs.CL])
    The ability to perform causal reasoning is widely considered a core feature of intelligence. In this work, we investigate whether large language models (LLMs) can coherently reason about causality. Much of the existing work in natural language processing (NLP) focuses on evaluating commonsense causal reasoning in LLMs, thus failing to assess whether a model can perform causal inference in accordance with a set of well-defined formal rules. To address this, we propose a new NLP task, causal inference in natural language, inspired by the "causal inference engine" postulated by Judea Pearl et al. We compose a large dataset, CLadder, with 10K samples: based on a collection of causal graphs and queries (associational, interventional, and counterfactual), we obtain symbolic questions and ground-truth answers, through an oracle causal inference engine. These are then translated into natural language. We evaluate multiple LLMs on our dataset, and we introduce and evaluate a bespoke chain-of-thought prompting strategy, CausalCoT. We show that our task is highly challenging for LLMs, and we conduct an in-depth analysis to gain deeper insight into the causal reasoning abilities of LLMs. Our data is open-sourced at https://huggingface.co/datasets/causalNLP/cladder, and our code can be found at https://github.com/causalNLP/cladder.
    RoAST: Robustifying Language Models via Adversarial Perturbation with Selective Training. (arXiv:2312.04032v1 [cs.CL])
    Fine-tuning pre-trained language models (LMs) has become the de facto standard in many NLP tasks. Nevertheless, fine-tuned LMs are still prone to robustness issues, such as adversarial robustness and model calibration. Several perspectives of robustness for LMs have been studied independently, but lacking a unified consideration in multiple perspectives. In this paper, we propose Robustifying LMs via Adversarial perturbation with Selective Training (RoAST), a simple yet effective fine-tuning technique to enhance the multi-perspective robustness of LMs in a unified way. RoAST effectively incorporates two important sources for the model robustness, robustness on the perturbed inputs and generalizable knowledge in pre-trained LMs. To be specific, RoAST introduces adversarial perturbation during fine-tuning while the model parameters are selectively updated upon their relative importance to minimize unnecessary deviation. Under a unified evaluation of fine-tuned LMs by incorporating four representative perspectives of model robustness, we demonstrate the effectiveness of RoAST compared to state-of-the-art fine-tuning methods on six different types of LMs, which indicates its usefulness in practice.
    Exposing Disparities in Flood Adaptation for Equitable Future Interventions. (arXiv:2312.03843v1 [econ.GN])
    As governments race to implement new climate adaptation policies that prepare for more frequent flooding, they must seek policies that are effective for all communities and uphold climate justice. This requires evaluating policies not only on their overall effectiveness but also on whether their benefits are felt across all communities. We illustrate the importance of considering such disparities for flood adaptation using the FEMA National Flood Insurance Program Community Rating System and its dataset of $\sim$2.5 million flood insurance claims. We use ${\rm C{\scriptsize AUSAL}F{\scriptsize LOW}}$, a causal inference method based on deep generative models, to estimate the treatment effect of flood adaptation interventions based on a community's income, diversity, population, flood risk, educational attainment, and precipitation. We find that the program saves communities \$5,000--15,000 per household. However, these savings are not evenly spread across communities. For example, for low-income communities savings sharply decline as flood-risk increases in contrast to their high-income counterparts with all else equal. Even among low-income communities, there is a gap in savings between predominantly white and non-white communities: savings of predominantly white communities can be higher by more than \$6000 per household. As communities worldwide ramp up efforts to reduce losses inflicted by floods, simply prescribing a series flood adaptation measures is not enough. Programs must provide communities with the necessary technical and economic support to compensate for historical patterns of disenfranchisement, racism, and inequality. Future flood adaptation efforts should go beyond reducing losses overall and aim to close existing gaps to equitably support communities in the race for climate adaptation.
    Dr. Jekyll and Mr. Hyde: Two Faces of LLMs. (arXiv:2312.03853v1 [cs.CR])
    This year, we witnessed a rise in the use of Large Language Models, especially when combined with applications like chatbot assistants. Safety mechanisms and specialized training procedures are put in place to prevent improper responses from these assistants. In this work, we bypass these measures for ChatGPT and Bard (and, to some extent, Bing chat) by making them impersonate complex personas with opposite characteristics as those of the truthful assistants they are supposed to be. We start by creating elaborate biographies of these personas, which we then use in a new session with the same chatbots. Our conversation followed a role-play style to get the response the assistant was not allowed to provide. By making use of personas, we show that the response that is prohibited is actually provided, making it possible to obtain unauthorized, illegal, or harmful information. This work shows that by using adversarial personas, one can overcome safety mechanisms set out by ChatGPT and Bard. It also introduces several ways of activating such adversarial personas, altogether showing that both chatbots are vulnerable to this kind of attack.
    A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA. (arXiv:2312.03732v1 [cs.CL])
    As large language models (LLMs) have become increasingly compute and memory intensive, parameter-efficient fine-tuning (PEFT) methods are now a common strategy to fine-tune LLMs. A popular PEFT method is Low-Rank Adapters (LoRA), which adds trainable low-rank "adapters" to selected layers. Each adapter consists of a low-rank matrix product, multiplicatively scaled by a rank-dependent factor. This scaling factor, which divides adapters by a factor of the rank, results in slowed learning and stunted performance for LoRA with higher-rank adapters. Consequently, the use of LoRA in practice has generally been limited to very low ranks. In this work, we study the impact of the scaling factor on the learning process and prove that LoRA adapters should be divided by a factor of the square root of the rank. Modifying LoRA with the appropriate scaling factor, which we call the rank-stabilized LoRA (rsLoRA) method, easily provides for a fine-tuning compute/performance trade-off, where larger ranks can be used to trade off increased computational resources during training for better fine-tuning performance, with no change in inference computing cost.
  • Open

    Adversarial Learning for Feature Shift Detection and Correction. (arXiv:2312.04546v1 [cs.LG])
    Data shift is a phenomenon present in many real-world applications, and while there are multiple methods attempting to detect shifts, the task of localizing and correcting the features originating such shifts has not been studied in depth. Feature shifts can occur in many datasets, including in multi-sensor data, where some sensors are malfunctioning, or in tabular and structured data, including biomedical, financial, and survey data, where faulty standardization and data processing pipelines can lead to erroneous features. In this work, we explore using the principles of adversarial learning, where the information from several discriminators trained to distinguish between two distributions is used to both detect the corrupted features and fix them in order to remove the distribution shift between datasets. We show that mainstream supervised classifiers, such as random forest or gradient boosting trees, combined with simple iterative heuristics, can localize and correct feature shifts, outperforming current statistical and neural network-based techniques. The code is available at https://github.com/AI-sandbox/DataFix.
    Universal Online Learning with Gradient Variations: A Multi-layer Online Ensemble Approach. (arXiv:2307.08360v2 [cs.LG] UPDATED)
    In this paper, we propose an online convex optimization approach with two different levels of adaptivity. On a higher level, our approach is agnostic to the unknown types and curvatures of the online functions, while at a lower level, it can exploit the unknown niceness of the environments and attain problem-dependent guarantees. Specifically, we obtain $\mathcal{O}(\log V_T)$, $\mathcal{O}(d \log V_T)$ and $\widehat{\mathcal{O}}(\sqrt{V_T})$ regret bounds for strongly convex, exp-concave and convex loss functions, respectively, where $d$ is the dimension, $V_T$ denotes problem-dependent gradient variations and the $\widehat{\mathcal{O}}(\cdot)$-notation omits $\log V_T$ factors. Our result not only safeguards the worst-case guarantees but also directly implies the small-loss bounds in analysis. Moreover, when applied to adversarial/stochastic convex optimization and game theory problems, our result enhances the existing universal guarantees. Our approach is based on a multi-layer online ensemble framework incorporating novel ingredients, including a carefully designed optimism for unifying diverse function types and cascaded corrections for algorithmic stability. Notably, despite its multi-layer structure, our algorithm necessitates only one gradient query per round, making it favorable when the gradient evaluation is time-consuming. This is facilitated by a novel regret decomposition with carefully designed surrogate losses.
    Temporal Robustness against Data Poisoning. (arXiv:2302.03684v3 [cs.LG] UPDATED)
    Data poisoning considers cases when an adversary manipulates the behavior of machine learning algorithms through malicious training data. Existing threat models of data poisoning center around a single metric, the number of poisoned samples. In consequence, if attackers can poison more samples than expected with affordable overhead, as in many practical scenarios, they may be able to render existing defenses ineffective in a short time. To address this issue, we leverage timestamps denoting the birth dates of data, which are often available but neglected in the past. Benefiting from these timestamps, we propose a temporal threat model of data poisoning with two novel metrics, earliness and duration, which respectively measure how long an attack started in advance and how long an attack lasted. Using these metrics, we define the notions of temporal robustness against data poisoning, providing a meaningful sense of protection even with unbounded amounts of poisoned samples when the attacks are temporally bounded. We present a benchmark with an evaluation protocol simulating continuous data collection and periodic deployments of updated models, thus enabling empirical evaluation of temporal robustness. Lastly, we develop and also empirically verify a baseline defense, namely temporal aggregation, offering provable temporal robustness and highlighting the potential of our temporal threat model for data poisoning.
    Kernel quadrature with randomly pivoted Cholesky. (arXiv:2306.03955v3 [math.NA] UPDATED)
    This paper presents new quadrature rules for functions in a reproducing kernel Hilbert space using nodes drawn by a sampling algorithm known as randomly pivoted Cholesky. The resulting computational procedure compares favorably to previous kernel quadrature methods, which either achieve low accuracy or require solving a computationally challenging sampling problem. Theoretical and numerical results show that randomly pivoted Cholesky is fast and achieves comparable quadrature error rates to more computationally expensive quadrature schemes based on continuous volume sampling, thinning, and recombination. Randomly pivoted Cholesky is easily adapted to complicated geometries with arbitrary kernels, unlocking new potential for kernel quadrature.
    Loss-Optimal Classification Trees: A Generalized Framework and the Logistic Case. (arXiv:2306.00857v2 [stat.ML] UPDATED)
    The Classification Tree (CT) is one of the most common models in interpretable machine learning. Although such models are usually built with greedy strategies, in recent years, thanks to remarkable advances in Mixer-Integer Programming (MIP) solvers, several exact formulations of the learning problem have been developed. In this paper, we argue that some of the most relevant ones among these training models can be encapsulated within a general framework, whose instances are shaped by the specification of loss functions and regularizers. Next, we introduce a novel realization of this framework: specifically, we consider the logistic loss, handled in the MIP setting by a linear piece-wise approximation, and couple it with $\ell_1$-regularization terms. The resulting Optimal Logistic Tree model numerically proves to be able to induce trees with enhanced interpretability features and competitive generalization capabilities, compared to the state-of-the-art MIP-based approaches.
    LAVA: Data Valuation without Pre-Specified Learning Algorithms. (arXiv:2305.00054v2 [cs.LG] UPDATED)
    Traditionally, data valuation (DV) is posed as a problem of equitably splitting the validation performance of a learning algorithm among the training data. As a result, the calculated data values depend on many design choices of the underlying learning algorithm. However, this dependence is undesirable for many DV use cases, such as setting priorities over different data sources in a data acquisition process and informing pricing mechanisms in a data marketplace. In these scenarios, data needs to be valued before the actual analysis and the choice of the learning algorithm is still undetermined then. Another side-effect of the dependence is that to assess the value of individual points, one needs to re-run the learning algorithm with and without a point, which incurs a large computation burden. This work leapfrogs over the current limits of data valuation methods by introducing a new framework that can value training data in a way that is oblivious to the downstream learning algorithm. Our main results are as follows. (1) We develop a proxy for the validation performance associated with a training set based on a non-conventional class-wise Wasserstein distance between training and validation sets. We show that the distance characterizes the upper bound of the validation performance for any given model under certain Lipschitz conditions. (2) We develop a novel method to value individual data based on the sensitivity analysis of the class-wise Wasserstein distance. Importantly, these values can be directly obtained for free from the output of off-the-shelf optimization solvers when computing the distance. (3) We evaluate our new data valuation framework over various use cases related to detecting low-quality data and show that, surprisingly, the learning-agnostic feature of our framework enables a significant improvement over SOTA performance while being orders of magnitude faster.
    Classical Verification of Quantum Learning. (arXiv:2306.04843v2 [quant-ph] UPDATED)
    Quantum data access and quantum processing can make certain classically intractable learning tasks feasible. However, quantum capabilities will only be available to a select few in the near future. Thus, reliable schemes that allow classical clients to delegate learning to untrusted quantum servers are required to facilitate widespread access to quantum learning advantages. Building on a recently introduced framework of interactive proof systems for classical machine learning, we develop a framework for classical verification of quantum learning. We exhibit learning problems that a classical learner cannot efficiently solve on their own, but that they can efficiently and reliably solve when interacting with an untrusted quantum prover. Concretely, we consider the problems of agnostic learning parities and Fourier-sparse functions with respect to distributions with uniform input marginal. We propose a new quantum data access model that we call "mixture-of-superpositions" quantum examples, based on which we give efficient quantum learning algorithms for these tasks. Moreover, we prove that agnostic quantum parity and Fourier-sparse learning can be efficiently verified by a classical verifier with only random example or statistical query access. Finally, we showcase two general scenarios in learning and verification in which quantum mixture-of-superpositions examples do not lead to sample complexity improvements over classical data. Our results demonstrate that the potential power of quantum data for learning tasks, while not unlimited, can be utilized by classical agents through interaction with untrusted quantum entities.
    Low-complexity subspace-descent over symmetric positive definite manifold. (arXiv:2305.02041v3 [stat.ML] UPDATED)
    This work puts forth low-complexity Riemannian subspace descent algorithms for the minimization of functions over the symmetric positive definite (SPD) manifold. Different from the existing Riemannian gradient descent variants, the proposed approach utilizes carefully chosen subspaces that allow the update to be written as a product of the Cholesky factor of the iterate and a sparse matrix. The resulting updates avoid the costly matrix operations like matrix exponentiation and dense matrix multiplication, which are generally required in almost all other Riemannian optimization algorithms on SPD manifold. We further identify a broad class of functions, arising in diverse applications, such as kernel matrix learning, covariance estimation of Gaussian distributions, maximum likelihood parameter estimation of elliptically contoured distributions, and parameter estimation in Gaussian mixture model problems, over which the Riemannian gradients can be calculated efficiently. The proposed uni-directional and multi-directional Riemannian subspace descent variants incur per-iteration complexities of $\O(n)$ and $\O(n^2)$ respectively, as compared to the $\O(n^3)$ or higher complexity incurred by all existing Riemannian gradient descent variants. The superior runtime and low per-iteration complexity of the proposed algorithms is also demonstrated via numerical tests on large-scale covariance estimation and matrix square root problems.  ( 2 min )
    Calculating the maximum number of maximum cliques for simple graphs. (arXiv:2307.14120v3 [math.CO] UPDATED)
    A simple graph on $n$ vertices may contain a lot of maximum cliques. But how many can it potentially contain? We will define prime and composite graphs, and we will show that if $n \ge 15$, then the grpahs with the maximum number of maximum cliques have to be composite. Moreover, we will show an edge bound from which we will prove that if any factor of a composite graph has $\omega(G_i) \ge 5$, then it cannot have the maximum number of maximum cliques. Using this we will show that the graph that contains $3^{\lfloor n/3 \rfloor}c$ maximum cliques has the most number of maximum cliques on $n$ vertices, where $c\in\{1,\frac{4}{3},2\}$, depending on $n \text{ mod } 3$.  ( 2 min )
    Understanding the Role of Optimization in Double Descent. (arXiv:2312.03951v1 [cs.LG])
    The phenomenon of model-wise double descent, where the test error peaks and then reduces as the model size increases, is an interesting topic that has attracted the attention of researchers due to the striking observed gap between theory and practice \citep{Belkin2018ReconcilingMM}. Additionally, while double descent has been observed in various tasks and architectures, the peak of double descent can sometimes be noticeably absent or diminished, even without explicit regularization, such as weight decay and early stopping. In this paper, we investigate this intriguing phenomenon from the optimization perspective and propose a simple optimization-based explanation for why double descent sometimes occurs weakly or not at all. To the best of our knowledge, we are the first to demonstrate that many disparate factors contributing to model-wise double descent (initialization, normalization, batch size, learning rate, optimization algorithm) are unified from the viewpoint of optimization: model-wise double descent is observed if and only if the optimizer can find a sufficiently low-loss minimum. These factors directly affect the condition number of the optimization problem or the optimizer and thus affect the final minimum found by the optimizer, reducing or increasing the height of the double descent peak. We conduct a series of controlled experiments on random feature models and two-layer neural networks under various optimization settings, demonstrating this optimization-based unified view. Our results suggest the following implication: Double descent is unlikely to be a problem for real-world machine learning setups. Additionally, our results help explain the gap between weak double descent peaks in practice and strong peaks observable in carefully designed setups.  ( 3 min )
    Language Model Knowledge Distillation for Efficient Question Answering in Spanish. (arXiv:2312.04193v1 [cs.CL])
    Recent advances in the development of pre-trained Spanish language models has led to significant progress in many Natural Language Processing (NLP) tasks, such as question answering. However, the lack of efficient models imposes a barrier for the adoption of such models in resource-constrained environments. Therefore, smaller distilled models for the Spanish language could be proven to be highly scalable and facilitate their further adoption on a variety of tasks and scenarios. In this work, we take one step in this direction by developing SpanishTinyRoBERTa, a compressed language model based on RoBERTa for efficient question answering in Spanish. To achieve this, we employ knowledge distillation from a large model onto a lighter model that allows for a wider implementation, even in areas with limited computational resources, whilst attaining negligible performance sacrifice. Our experiments show that the dense distilled model can still preserve the performance of its larger counterpart, while significantly increasing inference speedup. This work serves as a starting point for further research and investigation of model compression efforts for Spanish language models across various NLP tasks.  ( 2 min )
    Graph Metanetworks for Processing Diverse Neural Architectures. (arXiv:2312.04501v1 [cs.LG])
    Neural networks efficiently encode learned information within their parameters. Consequently, many tasks can be unified by treating neural networks themselves as input data. When doing so, recent studies demonstrated the importance of accounting for the symmetries and geometry of parameter spaces. However, those works developed architectures tailored to specific networks such as MLPs and CNNs without normalization layers, and generalizing such architectures to other types of networks can be challenging. In this work, we overcome these challenges by building new metanetworks - neural networks that take weights from other neural networks as input. Put simply, we carefully build graphs representing the input neural networks and process the graphs using graph neural networks. Our approach, Graph Metanetworks (GMNs), generalizes to neural architectures where competing methods struggle, such as multi-head attention layers, normalization layers, convolutional layers, ResNet blocks, and group-equivariant linear layers. We prove that GMNs are expressive and equivariant to parameter permutation symmetries that leave the input neural network functions unchanged. We validate the effectiveness of our method on several metanetwork tasks over diverse neural network architectures.  ( 2 min )
    Neural network based generation of a 1-dimensional stochastic field with turbulent velocity statistics. (arXiv:2211.11580v3 [eess.SP] UPDATED)
    We define and study a fully-convolutional neural network stochastic model, NN-Turb, which generates a 1-dimensional field with some turbulent velocity statistics. In particular, the generated process satisfies the Kolmogorov 2/3 law for second order structure function. It also presents negative skewness across scales (i.e. Kolmogorov 4/5 law) and exhibits intermittency as characterized by skewness and flatness. Furthermore, our model is never in contact with turbulent data and only needs the desired statistical behavior of the structure functions across scales for training.  ( 2 min )
    Small Area Estimation of Case Growths for Timely COVID-19 Outbreak Detection. (arXiv:2312.04110v1 [stat.ML])
    The COVID-19 pandemic has exerted a profound impact on the global economy and continues to exact a significant toll on human lives. The COVID-19 case growth rate stands as a key epidemiological parameter to estimate and monitor for effective detection and containment of the resurgence of outbreaks. A fundamental challenge in growth rate estimation and hence outbreak detection is balancing the accuracy-speed tradeoff, where accuracy typically degrades with shorter fitting windows. In this paper, we develop a machine learning (ML) algorithm, which we call Transfer Learning Generalized Random Forest (TLGRF), that balances this accuracy-speed tradeoff. Specifically, we estimate the instantaneous COVID-19 exponential growth rate for each U.S. county by using TLGRF that chooses an adaptive fitting window size based on relevant day-level and county-level features affecting the disease spread. Through transfer learning, TLGRF can accurately estimate case growth rates for counties with small sample sizes. Out-of-sample prediction analysis shows that TLGRF outperforms established growth rate estimation methods. Furthermore, we conducted a case study based on outbreak case data from the state of Colorado and showed that the timely detection of outbreaks could have been improved by up to 224% using TLGRF when compared to the decisions made by Colorado's Department of Health and Environment (CDPHE). To facilitate implementation, we have developed a publicly available outbreak detection tool for timely detection of COVID-19 outbreaks in each U.S. county, which received substantial attention from policymakers.  ( 3 min )
    The sample complexity of multi-distribution learning. (arXiv:2312.04027v1 [cs.LG])
    Multi-distribution learning generalizes the classic PAC learning to handle data coming from multiple distributions. Given a set of $k$ data distributions and a hypothesis class of VC dimension $d$, the goal is to learn a hypothesis that minimizes the maximum population loss over $k$ distributions, up to $\epsilon$ additive error. In this paper, we settle the sample complexity of multi-distribution learning by giving an algorithm of sample complexity $\widetilde{O}((d+k)\epsilon^{-2}) \cdot (k/\epsilon)^{o(1)}$. This matches the lower bound up to sub-polynomial factor and resolves the COLT 2023 open problem of Awasthi, Haghtalab and Zhao [AHZ23].  ( 2 min )
    Factor-Assisted Federated Learning for Personalized Optimization with Heterogeneous Data. (arXiv:2312.04281v1 [stat.ML])
    Federated learning is an emerging distributed machine learning framework aiming at protecting data privacy. Data heterogeneity is one of the core challenges in federated learning, which could severely degrade the convergence rate and prediction performance of deep neural networks. To address this issue, we develop a novel personalized federated learning framework for heterogeneous data, which we refer to as FedSplit. This modeling framework is motivated by the finding that, data in different clients contain both common knowledge and personalized knowledge. Then the hidden elements in each neural layer can be split into the shared and personalized groups. With this decomposition, a novel objective function is established and optimized. We demonstrate FedSplit enjoyers a faster convergence speed than the standard federated learning method both theoretically and empirically. The generalization bound of the FedSplit method is also studied. To practically implement the proposed method on real datasets, factor analysis is introduced to facilitate the decoupling of hidden elements. This leads to a practically implemented model for FedSplit and we further refer to as FedFac. We demonstrated by simulation studies that, using factor analysis can well recover the underlying shared/personalized decomposition. The superior prediction performance of FedFac is further verified empirically by comparison with various state-of-the-art federated learning methods on several real datasets.  ( 2 min )
    Calibration in Machine Learning Uncertainty Quantification: beyond consistency to target adaptivity. (arXiv:2309.06240v2 [stat.ML] UPDATED)
    Reliable uncertainty quantification (UQ) in machine learning (ML) regression tasks is becoming the focus of many studies in materials and chemical science. It is now well understood that average calibration is insufficient, and most studies implement additional methods testing the conditional calibration with respect to uncertainty, i.e. consistency. Consistency is assessed mostly by so-called reliability diagrams. There exists however another way beyond average calibration, which is conditional calibration with respect to input features, i.e. adaptivity. In practice, adaptivity is the main concern of the final users of a ML-UQ method, seeking for the reliability of predictions and uncertainties for any point in features space. This article aims to show that consistency and adaptivity are complementary validation targets, and that a good consistency does not imply a good adaptivity. Adapted validation methods are proposed and illustrated on a representative example.  ( 2 min )
    Data-Adaptive Probabilistic Likelihood Approximation for Ordinary Differential Equations. (arXiv:2306.05566v2 [stat.ML] UPDATED)
    Estimating the parameters of ordinary differential equations (ODEs) is of fundamental importance in many scientific applications. While ODEs are typically approximated with deterministic algorithms, new research on probabilistic solvers indicates that they produce more reliable parameter estimates by better accounting for numerical errors. However, many ODE systems are highly sensitive to their parameter values. This produces deep local maxima in the likelihood function -- a problem which existing probabilistic solvers have yet to resolve. Here we present a novel probabilistic ODE likelihood approximation, DALTON, which can dramatically reduce parameter sensitivity by learning from noisy ODE measurements in a data-adaptive manner. Our approximation scales linearly in both ODE variables and time discretization points, and is applicable to ODEs with both partially-unobserved components and non-Gaussian measurement models. Several examples demonstrate that DALTON produces more accurate parameter estimates via numerical optimization than existing probabilistic ODE solvers, and even in some cases than the exact ODE likelihood itself.  ( 2 min )
    Multi-scale Residual Transformer for VLF Lightning Transients Classification. (arXiv:2312.04163v1 [stat.ML])
    The utilization of Very Low Frequency (VLF) electromagnetic signals in navigation systems is widespread. However, the non-stationary behavior of lightning signals can affect VLF electromagnetic signal transmission. Accurately classifying lightning signals is important for reducing interference and noise in VLF, thereby improving the reliability and overall performance of navigation systems. In recent years, the evolution of deep learning, specifically Convolutional Neural Network (CNNs), has sparked a transformation in lightning classification, surpassing traditional statistical methodologies. Existing CNN models have limitations as they overlook the diverse attributes of lightning signals across different scales and neglect the significance of temporal sequencing in sequential signals. This study introduces an innovative multi-scale residual transform (MRTransformer) that not only has the ability to discern intricate fine-grained patterns while also weighing the significance of different aspects within the input lightning signal sequence. This model performs the attributes of the lightning signal across different scales and the level of accuracy reached 90% in the classification. In future work, this model has the potential applied to a comprehensive understanding of the localization and waveform characteristics of lightning signals.  ( 2 min )
    Multi-Group Fairness Evaluation via Conditional Value-at-Risk Testing. (arXiv:2312.03867v1 [cs.LG])
    Machine learning (ML) models used in prediction and classification tasks may display performance disparities across population groups determined by sensitive attributes (e.g., race, sex, age). We consider the problem of evaluating the performance of a fixed ML model across population groups defined by multiple sensitive attributes (e.g., race and sex and age). Here, the sample complexity for estimating the worst-case performance gap across groups (e.g., the largest difference in error rates) increases exponentially with the number of group-denoting sensitive attributes. To address this issue, we propose an approach to test for performance disparities based on Conditional Value-at-Risk (CVaR). By allowing a small probabilistic slack on the groups over which a model has approximately equal performance, we show that the sample complexity required for discovering performance violations is reduced exponentially to be at most upper bounded by the square root of the number of groups. As a byproduct of our analysis, when the groups are weighted by a specific prior distribution, we show that R\'enyi entropy of order $2/3$ of the prior distribution captures the sample complexity of the proposed CVaR test algorithm. Finally, we also show that there exists a non-i.i.d. data collection strategy that results in a sample complexity independent of the number of groups.  ( 2 min )
    Learning High-Dimensional Differential Graphs From Multi-Attribute Data. (arXiv:2312.03761v1 [stat.ML])
    We consider the problem of estimating differences in two Gaussian graphical models (GGMs) which are known to have similar structure. The GGM structure is encoded in its precision (inverse covariance) matrix. In many applications one is interested in estimating the difference in two precision matrices to characterize underlying changes in conditional dependencies of two sets of data. Existing methods for differential graph estimation are based on single-attribute (SA) models where one associates a scalar random variable with each node. In multi-attribute (MA) graphical models, each node represents a random vector. In this paper, we analyze a group lasso penalized D-trace loss function approach for differential graph learning from multi-attribute data. An alternating direction method of multipliers (ADMM) algorithm is presented to optimize the objective function. Theoretical analysis establishing consistency in support recovery and estimation in high-dimensional settings is provided. Numerical results based on synthetic as well as real data are presented.  ( 2 min )
    Coherent energy and force uncertainty in deep learning force fields. (arXiv:2312.04174v1 [stat.ML])
    In machine learning energy potentials for atomic systems, forces are commonly obtained as the negative derivative of the energy function with respect to atomic positions. To quantify aleatoric uncertainty in the predicted energies, a widely used modeling approach involves predicting both a mean and variance for each energy value. However, this model is not differentiable under the usual white noise assumption, so energy uncertainty does not naturally translate to force uncertainty. In this work we propose a machine learning potential energy model in which energy and force aleatoric uncertainty are linked through a spatially correlated noise process. We demonstrate our approach on an equivariant messages passing neural network potential trained on energies and forces on two out-of-equilibrium molecular datasets. Furthermore, we also show how to obtain epistemic uncertainties in this setting based on a Bayesian interpretation of deep ensemble models.  ( 2 min )
    HLoOP -- Hyperbolic 2-space Local Outlier Probabilities. (arXiv:2312.03895v1 [stat.ML])
    Hyperbolic geometry has recently garnered considerable attention in machine learning due to its capacity to embed hierarchical graph structures with low distortions for further downstream processing. This paper introduces a simple framework to detect local outliers for datasets grounded in hyperbolic 2-space referred to as HLoOP (Hyperbolic Local Outlier Probability). Within a Euclidean space, well-known techniques for local outlier detection are based on the Local Outlier Factor (LOF) and its variant, the LoOP (Local Outlier Probability), which incorporates probabilistic concepts to model the outlier level of a data vector. The developed HLoOP combines the idea of finding nearest neighbors, density-based outlier scoring with a probabilistic, statistically oriented approach. Therefore, the method consists in computing the Riemmanian distance of a data point to its nearest neighbors following a Gaussian probability density function expressed in a hyperbolic space. This is achieved by defining a Gaussian cumulative distribution in this space. The HLoOP algorithm is tested on the WordNet dataset yielding promising results. Code and data will be made available on request for reproductibility.  ( 2 min )
    Intelligent Anomaly Detection for Lane Rendering Using Transformer with Self-Supervised Pre-Training and Customized Fine-Tuning. (arXiv:2312.04398v1 [cs.CV])
    The burgeoning navigation services using digital maps provide great convenience to drivers. Nevertheless, the presence of anomalies in lane rendering map images occasionally introduces potential hazards, as such anomalies can be misleading to human drivers and consequently contribute to unsafe driving conditions. In response to this concern and to accurately and effectively detect the anomalies, this paper transforms lane rendering image anomaly detection into a classification problem and proposes a four-phase pipeline consisting of data pre-processing, self-supervised pre-training with the masked image modeling (MiM) method, customized fine-tuning using cross-entropy based loss with label smoothing, and post-processing to tackle it leveraging state-of-the-art deep learning techniques, especially those involving Transformer models. Various experiments verify the effectiveness of the proposed pipeline. Results indicate that the proposed pipeline exhibits superior performance in lane rendering image anomaly detection, and notably, the self-supervised pre-training with MiM can greatly enhance the detection accuracy while significantly reducing the total training time. For instance, employing the Swin Transformer with Uniform Masking as self-supervised pretraining (Swin-Trans-UM) yielded a heightened accuracy at 94.77% and an improved Area Under The Curve (AUC) score of 0.9743 compared with the pure Swin Transformer without pre-training (Swin-Trans) with an accuracy of 94.01% and an AUC of 0.9498. The fine-tuning epochs were dramatically reduced to 41 from the original 280. In conclusion, the proposed pipeline, with its incorporation of self-supervised pre-training using MiM and other advanced deep learning techniques, emerges as a robust solution for enhancing the accuracy and efficiency of lane rendering image anomaly detection in digital navigation systems.  ( 3 min )
    Improving Gradient-guided Nested Sampling for Posterior Inference. (arXiv:2312.03911v1 [cs.LG])
    We present a performant, general-purpose gradient-guided nested sampling algorithm, ${\tt GGNS}$, combining the state of the art in differentiable programming, Hamiltonian slice sampling, clustering, mode separation, dynamic nested sampling, and parallelization. This unique combination allows ${\tt GGNS}$ to scale well with dimensionality and perform competitively on a variety of synthetic and real-world problems. We also show the potential of combining nested sampling with generative flow networks to obtain large amounts of high-quality samples from the posterior distribution. This combination leads to faster mode discovery and more accurate estimates of the partition function.  ( 2 min )
    Horizon-Free and Instance-Dependent Regret Bounds for Reinforcement Learning with General Function Approximation. (arXiv:2312.04464v1 [cs.LG])
    To tackle long planning horizon problems in reinforcement learning with general function approximation, we propose the first algorithm, termed as UCRL-WVTR, that achieves both \emph{horizon-free} and \emph{instance-dependent}, since it eliminates the polynomial dependency on the planning horizon. The derived regret bound is deemed \emph{sharp}, as it matches the minimax lower bound when specialized to linear mixture MDPs up to logarithmic factors. Furthermore, UCRL-WVTR is \emph{computationally efficient} with access to a regression oracle. The achievement of such a horizon-free, instance-dependent, and sharp regret bound hinges upon (i) novel algorithm designs: weighted value-targeted regression and a high-order moment estimator in the context of general function approximation; and (ii) fine-grained analyses: a novel concentration bound of weighted non-linear least squares and a refined analysis which leads to the tight instance-dependent bound. We also conduct comprehensive experiments to corroborate our theoretical findings.  ( 2 min )
    Hidden yet quantifiable: A lower bound for confounding strength using randomized trials. (arXiv:2312.03871v1 [stat.ML])
    In the era of fast-paced precision medicine, observational studies play a major role in properly evaluating new treatments in clinical practice. Yet, unobserved confounding can significantly compromise causal conclusions drawn from non-randomized data. We propose a novel strategy that leverages randomized trials to quantify unobserved confounding. First, we design a statistical test to detect unobserved confounding with strength above a given threshold. Then, we use the test to estimate an asymptotically valid lower bound on the unobserved confounding strength. We evaluate the power and validity of our statistical test on several synthetic and semi-synthetic datasets. Further, we show how our lower bound can correctly identify the absence and presence of unobserved confounding in a real-world setting.  ( 2 min )

  • Open

    Group [D]iscussion on OpenAI's foundational CLIP Paper for Zero-Shot Image Classification
    Hey all, we had a lively group discussion today on the 2021 CLIP paper from OpenAI. Every Friday we've been going over the fundamentals of a lot of the state of the art techniques used in Machine Learning today. Hoping to learn a little each week, and spot patterns we can apply to our own work. I feel like there's always a little nugget of information I didn't fully understand before reading the paper, so have been finding it helpful. Though it is not groundbreaking research as of this week, I think it's nice to take a step back and review the fundamentals as well as keeping up with the latest and greatest. Posted the notes and video recap are here if anyone finds it helpful: https://blog.oxen.ai/arxiv-dives-zero-shot-image-classification-with-clip/ Also would love to have anyone join us live on Fridays or suggest papers! We've got a pretty consistent and fun group of 400+ engineers and researchers popping in and out. ​ submitted by /u/FallMindless3563 [link] [comments]  ( 10 min )
    [Research] Papers to read to stay up to date
    Hi there! I am familiar with how dnn, cnn, recurrent neural networks, gan, policy gradient reinforcement learning and transformer networks work. I am looking to expand my knowledge and stay up to date within the field. Do you have any recommendations on papers to read to stay up to date? I realise that there are vast amounts of papers being produced in this field, but I would appreciate if there are suggestions for «must not miss» papers. submitted by /u/MyFriendTheGuitar [link] [comments]  ( 9 min )
    [P] Labelling Interface for Audio Event Classification Training Data
    I'm interested in suggestions that will feed the design of an optimal interface for humans to label audio clips that will be used to form a training set used in audio event classification. For background, ultimately I want to identify what urban sound(s) are present in ~15 second .opus files. For certain classes (namely those associated with noisy construction-related equipment) it's much more important that inferences minimize false negatives. If it's relevant, the same audio recording device that's producing the clips I need to label will produce the clips I'll want to perform inferences on (i.e it's in the same physical location, same acoustic environment). Questions: I don't have a neural network selected yet, is it critical to select this before designing the labelling scheme or are best practices generalizable? Should I allow labelers to create their own labels if they aren't in my predefined set? Should I have an "unknown" label? Is it best to perform clustering first so that like-clips can be presented in sequence? Should I ask labelers to identify all classes present in each clip or just one? Is there value in being more specific with labels (e.g. "accelerating motorcycle engine" vs "idling motorcycle engine") than I need the end-inferences to be (e.g. "motorcycle")? Are there any other questions I should be asking? ​ submitted by /u/davegravy [link] [comments]  ( 10 min )
    [R] Using Large Language Models for Hyperparameter Optimization
    We've spent a lot of effort tuning hyperparameters for large models, but what if they could also help us tune hyperparameters? Arxiv: https://arxiv.org/abs/2312.0452 submitted by /u/ImpossibleTeach880 [link] [comments]  ( 9 min )
    [R] QuIP#: SOTA 2 bit LLMs
    We're pleased to introduce QuIP#, a new SOTA LLM quantization method that uses incoherence processing from QuIP (the paper) & lattices to achieve 2 bit LLMs with near-fp16 performance! Now you can run LLaMA 2 70B on a 24G GPU w/out offloading! QuIP# crushes all publicly available 2 bit PTQ methods on language modeling & zero shot tasks while being conceptually clean and simple. We’ve released quantized LLaMA, Mistral, and OpenHermes models, and a full codebase at https://github.com/Cornell-RelaxML/quip-sharp More information on how QuIP# works here https://cornell-relaxml.github.io/quip-sharp/. ​ submitted by /u/tsengalb99 [link] [comments]  ( 9 min )
    [R] Paving the way to efficient architectures: StripedHyena-7B, open source models offering a glimpse into a world beyond Transformers
    https://www.together.ai/blog/stripedhyena-7b submitted by /u/hzj5790 [link] [comments]  ( 9 min )
    [D] Out of distribution is equivalent to data drift?
    Please correct me if I am wrong. My understanding is out of distribution could become a problem if the new data does not belong to the same probably distribution as the training data. And as a result, what fancy research on OOD are actually theatrical study of the data drift problem. Many thanks! submitted by /u/Relevant-Fortune-810 [link] [comments]  ( 9 min )
    [D] GPT-Fast performance on larger batch sizes
    I'm toying around with gpt-fast (https://github.com/pytorch-labs/gpt-fast) and was wondering if anyone has run experiments @ BS>1? Seems like the experiments they ran were all with a single batch. submitted by /u/TheMblabla [link] [comments]  ( 9 min )
    [D] Anyone knows Facesearch Tool?
    Guys do you know any facesearchai bot which i could use to search faces over web? can anyone tell me about this one? Link submitted by /u/Cute_Principle9713 [link] [comments]  ( 9 min )
    [D] Major bug in Scikit-Learn's implementation of F-1 score
    I just encountered a pretty major bug in Scikit-Learn's implementation of F1 score. This example (below) where the classifier gets literally every single example wrong seems like it should give an F-1 score of 0.0, right? >>> from sklearn.metrics import f1_score >>> f1_score(y_true=[0, 1, 2, 3, 4], y_pred=[1, 2, 3, 4, 0], zero_division=1.0, average='macro') However, as of Scikit-Learn 1.3.2, it gives a perfect score of 1.0, which is totally wrong! (Note that this only happens when zero_division=1.0) This bug was introduced some time after 1.2.2 and can manifest more subtly than the above example; any individual class that has both a precision and recall of exactly 0% will be assigned an F-1 of 100%, so e.g. if you have a multi-class classification problem across 100s of classes, and your classifier is getting a dozen or so classes completely wrong, it will bump up your macro average F-1 by a couple percentage points, possibly without you noticing that the bump is coming from classes that are actually failing completely. Here is the (as of posting time) still un-merged fix to the bug in question. Note that this bug also affects sklearn.metrics.classification_report. I think you you can temporarily get around this by using sklearn version 1.2.2. Anyway, if you use Scikit-Learn's metrics for evaluation, go double check your scores! ​ https://preview.redd.it/7h467bvbe45c1.png?width=809&format=png&auto=webp&s=b81300b20c0792d4e94f12659d0b1e15ef97daa3 https://preview.redd.it/svubacvbe45c1.png?width=809&format=png&auto=webp&s=14ca2efddd8445d1f354fad56ba5eca36af0ac73 submitted by /u/Revolutionary-Ad-65 [link] [comments]  ( 10 min )
    [P] I made a tool that automates data cleaning and corrects data entries for machine learning
    Hi ML warriors, I am spending hours on my job manually adjusting data entries due to inconsistent spelling and shortcuts, so I built a tool to automate this process using large-language models. https://www.dataharmonizer.com/ The idea is to have a tool that takes in CSV exports and: > Corrects for inconsistencies in spelling (Coop vs co-op) > Harmonizes shortcuts (Limited vs Ltd.) > Corrects for spelling mistakes (serbices vs services) This is how the tool works: You can upload a CSV file and specify which row you want to extract and harmonize. The model automatically consolidates data by combining similar-looking phrases. You can edit the proposed phrase names or further consolidate entries if there are some groups the model has missed. In the end, you can download your CSV file again and push it to the database Would be great to hear some thoughts. Thanks :) submitted by /u/Apprehensive_Cut9179 [link] [comments]  ( 10 min )
    [P] Face+Audio+EEG dataset for didactic purpose
    Hello everyone, I'm a CS student and I'm trying to approach to the emotion recognition for a project. I played a little bit with this multimodal network for emotion recognition (https://github.com/katerynaCh/multimodal-emotion-recognition). I find it pretty cool, with the network that works very well with the Face+Audio modality. However, I was trying to implement in this network the emotion recognition with EEG (I don't really know how to do it, but still...) but I cannot find any dataset that contains Face, Audio and EEG data. Actually, I find the PME4 dataset (https://figshare.com/articles/dataset/PME4_Emotion_Recognition_with_Audio_Video_EEG_and_EMG/18737924, Face+Audio+EEG+EMG) but it has a very different structure than the RAVDESS dataset used for the multimodal network that I used in first place and I have no idea on how to adapt it to the network, so I was trying to find other datasets. submitted by /u/_link23_ [link] [comments]  ( 9 min )
    [R] EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything
    Paper: https://arxiv.org/abs/2312.00863 Code: https://github.com/yformer/EfficientSAM Project page: https://yformer.github.io/efficient-sam/ Demo, models for GPU/CPU: https://huggingface.co/spaces/yunyangx/EfficientSAM Demo page 2: https://6639e86fff1fc7b618.gradio.live/ Abstract: Segment Anything Model (SAM) has emerged as a powerful tool for numerous vision applications. A key component that drives the impressive performance for zero-shot transfer and high versatility is a super large Transformer model trained on the extensive high-quality SA-1B dataset. While beneficial, the huge computation cost of SAM model has limited its applications to wider real-world applications. To address this limitation, we propose EfficientSAMs, light-weight SAM models that exhibits decent performance with largely reduced complexity. Our idea is based on leveraging masked image pretraining, SAMI, which learns to reconstruct features from SAM image encoder for effective visual representation learning. Further, we take SAMI-pretrained light-weight image encoders and mask decoder to build EfficientSAMs, and finetune the models on SA-1B for segment anything task. We perform evaluations on multiple vision tasks including image classification, object detection, instance segmentation, and semantic object detection, and find that our proposed pretraining method, SAMI, consistently outperforms other masked image pretraining methods. On segment anything task such as zero-shot instance segmentation, our EfficientSAMs with SAMI-pretrained lightweight image encoders perform favorably with a significant gain (e.g., ~4 AP on COCO/LVIS) over other fast SAM models. https://preview.redd.it/a01gzzes645c1.png?width=1211&format=png&auto=webp&s=dbc45f005da14c7655982406ff8fe83c32a954c3 submitted by /u/APaperADay [link] [comments]  ( 10 min )
    [D] A genuine and honest discussion on Collusion Ring(s)
    Dear fellow NeurIPS rejects. As your deep learning, reinforcement learning, graph neural networks, and deep learning theory people fly off to New Orleans and you realize that you are left behind. ​ I invite you to join me in this group therapy discussion, where our topic of the day is Collusion Rings. ​ I suppose that.... the first question is do they actually exist, and what is the extent of their penetration into the machine learning academic community? As someone who struggled many years to have their first paper published, it's my anecdotal evidence that machine learning is much more about marching to the beat of the drummer, where the drummer is certainly someone who is a fan of deep learning. ​ As someone struggling, still, to get another paper published, my anecdotal observation is the drum beating has gotten even more fierce over the last few years. ​ As someone who has had many, many conversations with others whom are also marginalized, our anecdotal data pools not quite to a dataset, but a filtration which is not i.i.d., but certainly indicates actively farming deep learning citations is the better choice for our careers. ​ As someone currently reviewing for ICLR/AAAI/AISTATS. My anecdotal evidence is the reviewer coordination is with secret handshakes, keywords, citations, reference lists, topics, arxiv preprints, and shibboleths. ​ I hope you find the bravery to share your experience as one who is on the inside looking out, or on the outside looking in. ​ As a beacon of hope, I remind you to read The Revolution Hasn't Happened Yet by Michael Jordan. ​ As a final question to ponder. Is the deep learning collusion ring already collapsing, and will it collapse further? submitted by /u/Terrible_Button_1763 [link] [comments]  ( 10 min )
    [R] Recurrent Distance-Encoding Neural Networks for Graph Representation Learning
    arXiv: https://arxiv.org/abs/2312.01538 OpenReview: https://openreview.net/forum?id=lNIj5FdXsC Abstract: Graph neural networks based on iterative one-hop message passing have been shown to struggle in harnessing information from distant nodes effectively. Conversely, graph transformers allow each node to attend to all other nodes directly, but suffer from high computational complexity and have to rely on ad-hoc positional encoding to bake in the graph inductive bias. In this paper, we propose a new architecture to reconcile these challenges. Our approach stems from the recent breakthroughs in long-range modeling provided by deep state-space models on sequential data: for a given target node, our model aggregates other nodes by their shortest distances to the target and uses a parallelizable linear recurrent network over the chain of distances to provide a natural encoding of its neighborhood structure. With no need for positional encoding, we empirically show that the performance of our model is highly competitive compared with that of state-of-the-art graph transformers on various benchmarks, at a drastically reduced computational complexity. In addition, we show that our model is theoretically more expressive than one-hop message passing neural networks. https://preview.redd.it/wf256ehd245c1.png?width=937&format=png&auto=webp&s=3d3d15544dafdc1c562694a0336448652cebd753 submitted by /u/APaperADay [link] [comments]  ( 10 min )
    [R] LLMs in the public sector (as a user, not a regulator)?
    I am researching on this atm: Should governments build their own sovereign LLM with its own value propositions? Are there any thought pieces, policy papers, guidelines, e.g. on how the public sector can use LLMs for its own benefit? The added value would be incredible, internal knowledge banks, speeding up approval and review procedures, feedback to citizens' queries, automated funding decisions and so on. Are there any real use cases out there? The only thing I found was this policy brief from the UK-based Ada Lovelace Institute: https://www.adalovelaceinstitute.org/evidence-review/foundation-models-public-sector/ submitted by /u/lettersfrommyfather [link] [comments]  ( 9 min )
    [D] Class-Discriminative Attention Maps for Vision Transformers
    Hi gentlepeople, I just posted this preprint on arxiv, and trying to think where to submit. I would absolutely love to hear your feedback. Usually I dont post it here, but I thought this is really interesting and broadly useful, so I'm trying to aim higher. Lmk! Basically, we proposed Class-Discriminative Attention Maps (CDAM) for Vision Transformers. CDAM is a heat map (also called a saliency or relevance map) of how important each pixel is with respect to the selected class in ViT models. CDAM retains the advantages of attention maps (high quality semantic segmentation), while being class discriminative and providing implicit regularization. Moreover, you don't even have to build a classifier on ViT. You can simply select a few images sharing a common object ("concept"), and CDAM will explain that. Live demo (upload your images): https://cdam.informatism.com/ Check out the arxiv: https://arxiv.org/abs/2312.02364 Python/pytorch implementation: https://github.com/lenbrocki/CDAM ​ ​ https://preview.redd.it/txq752hyb35c1.png?width=2327&format=png&auto=webp&s=9dddb8f21521d984020eedb3a1241ca8e5c6076f https://preview.redd.it/nwbft1hyb35c1.png?width=1400&format=png&auto=webp&s=26796e3f847ca5e247f9bd2a6ad8e4f538adf058 ​ submitted by /u/wro_o [link] [comments]  ( 10 min )
    [D] Synthetic Data from Unstructured Free Text to Preserve Privacy
    Synthetic data is a popular topic right now. Do you think synthetic data solves the cost problem of purchasing and annotating data to train models or a privacy problem in that companies don't want to feed their sensitive data to LLMs with the risk of exposure? Maybe both? Theres a tool called Tonic Textual that automatically identifies and redacts and/or synthesizes sensitive data from free-text for model training or data pipelines. This solves the privacy piece but not necessarily the infinite (cheap) scale problem since you need source data. submitted by /u/tombenom [link] [comments]  ( 9 min )
    [D] In Neural Networks is it effective to select the best-performing output neurons to train the model?
    Imagine I have M number of inputs and 1 target. But instead of having 1 neuron in output layer in model, I put 1000 neurones in output layer with a custom loss function that calculates loss of each individual output with target in a batch size of X. Then I can find the minimum mean loss of each output in a batch. Then will back-propagate based on minimum loss. In next epoch I’ll keep the weights of the selected neuron, but will randomise the weights of other output neurones. Is this something that might end up useful and faster to converge, or am I missing something? submitted by /u/mbrostami [link] [comments]  ( 9 min )
    [R] Free3D: Consistent Novel View Synthesis without 3D Representation
    Paper: https://arxiv.org/pdf/2312.04551.pdf Webpage: https://chuanxiaz.com/free3d/ Video: https://youtu.be/7CdYuZ7D1DY https://preview.redd.it/pop4hmkey15c1.png?width=892&format=png&auto=webp&s=15fb0f2ec671224133b1e99728f80fce5da9048c submitted by /u/lyndonzheng [link] [comments]  ( 1 min )
    [D] Is masters while working full time as ML engineer worth it?
    I’m currently a first year graduate from undergraduate college working full time as an ML engineer. My company offers to pay for a graduate degree and I’m really considering getting a masters. First question: for those who are ML or data scientists, what were your degrees in? Second question: how much of this will suck? Is taking two classes per semester reasonable? And are online universities considered okay? Third question: how much of a pay bump did you see after getting your masters? submitted by /u/Fluid-Pipe-2831 [link] [comments]  ( 9 min )
    [D] Advice on First ML Research Paper
    Hi guys, I’m in my third year of university and I’m currently working on my first research paper. The project explores the trade-offs in accuracy, interpretability and computational complexity if NAS created DNNs. Similar work has been done within the past 5 years, but they only explore creating and using computer vision benchmarks like NAS-Bench 201. These benchmarks create a search space of convolution networks but none only explore DNNs, RNNs at all. My work explores only DNNNs without using an existing benchmark and I wrote in the further work that developing a standard non-convolution benchmark can be done. I feel like my results can help research in this particular area since a lot of people value the use of DNNs. I want to submit it, but I would want to submit it at at least a medium-tier conference but I’m so insecure about my work. Any advice on how I should tackle this? Also, I did this project 100% on my own which is why I am even more embarrassed. Feel like it’s crappy and I’m wasting my time submitted by /u/waffleman221 [link] [comments]  ( 10 min )
    [R] Compressed Context Memory For Online Language Model Interaction
    Interactive demo sample arXiv: https://arxiv.org/abs/2312.03414 GitHub: https://github.com/snu-mllab/Context-Memory Project page: https://janghyun1230.github.io/memory/ Summary: Our approach dynamically creates compressed memory of contexts during LLM interactions. Our approach only requires training a conditional LoRA for compression. We use a fully parallelized training strategy for recurrent compression procedures. We conduct evaluations on diverse applications: conversation, multi-task ICL, and personalization, achieving the performance level of a full context model with 5x smaller context memory space. https://preview.redd.it/5vmwxa7luy4c1.png?width=2162&format=png&auto=webp&s=0f717a1528688b6fd9b47f3c7e43be66b1b5c78b submitted by /u/janghyun1230 [link] [comments]  ( 9 min )
    [D] Machine Learning Curriculum
    Hi all. I am trying to create a straightforward and concise ML curriculum. Any input or contribution is welcomed. The goal is to keep it to the point with the main curriculum while adding additional resources in separate sections. ​ Link: https://github.com/pytholic/Machine-Learning-Curriculum submitted by /u/rajahaseeb147 [link] [comments]  ( 9 min )
  • Open

    Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
    submitted by /u/nickb [link] [comments]  ( 1 min )
    Paving the way to efficient architectures: StripedHyena-7B, open source models offering a glimpse into a world beyond Transformers
    submitted by /u/nickb [link] [comments]  ( 1 min )
    Training algorithm breaks barriers to deep physical neural networks
    submitted by /u/keghn [link] [comments]  ( 1 min )
  • Open

    "Eureka: Human-Level Reward Design via Coding Large Language Models", Ma et al 2023 {Nvidia}
    submitted by /u/gwern [link] [comments]
    Learning Curricula in Open-Ended Worlds
    arXiv: https://arxiv.org/abs/2312.03126 Follow-up work listed in the Afterword (chapter 7). CODE: https://github.com/facebookresearch/level-replay https://github.com/facebookresearch/dcd ACCEL demo: https://accelagent.github.io/ Abstract: Deep reinforcement learning (RL) provides powerful methods for training optimal sequential decision-making agents. As collecting real-world interactions can entail additional costs and safety risks, the common paradigm of sim2real conducts training in a simulator, followed by real-world deployment. Unfortunately, RL agents easily overfit to the choice of simulated training environments, and worse still, learning ends when the agent masters the specific set of simulated environments. In contrast, the real world is highly open-ended, featuring endl…  ( 1 min )
    The Power of Reinforcement Learning: look how this DeepRL Sektor model found a smart, super-cool exploit for Ultimate Mortal Kombat 3 in the video of a submission on DIAMBRA competition platform!
    submitted by /u/DIAMBRA_AIArena [link] [comments]  ( 1 min )
    [D] Looking for Labs working in the direction of Language based Robotics/Embodied AI
    Hello everyone, I'm from a renowned university in Europe pursuing a master's in Robotics. I am interested in using visual language models in Robotics and currently working on using CLIP embeddings for object based Mapping and Navigation. Some papers that might give you more of a idea of what I mean by language-based robotics: - Go to Anything, Gervet et al. - CoWs on Pasture: Baselines and Benchmarks for Language-Driven Zero-Shot Object Navigation, Gadre et al. - Visual Language Maps for Robot Navigation, Huang et al. I am trying to find PhD positions in the area and looking for Labs that work on this. Based on my research I've found that most of the papers in this direction are from labs in the US (Profs like Sergey Levine, Dhruv Batra, Jitendra Malik, Pieter Abeel) I was surprised to find only one lab working in this area in Europe, which is in Freiburg,Germany. (Wolfram Burgard) I would be really grateful if someone could provide me any more hints if they know of a lab working in this area. I am primarily interested in Europe , but if you know of a lab in the US that is not in my list, please feel free to comment! submitted by /u/Bluebird705 [link] [comments]  ( 1 min )
    Learning to Fly in Seconds
    submitted by /u/jonas-eschmann [link] [comments]  ( 1 min )
    A beginners intro to RL through 4 seminal projects
    submitted by /u/AvvYaa [link] [comments]  ( 1 min )
    "Improving Language Models with Advantage-based Offline Policy Gradients", Baheti et al 2023
    submitted by /u/gwern [link] [comments]  ( 1 min )
  • Open

    An integration tax that every adult in the US pays
    I live in Northern California and have a new primary care doctor now. My previous primary care doctor, who has since retired, was part of the Stanford Health Care (SHC) system. My new doctor is merely in a different SHC office.  The new doctor has a very nice office, and the doctor and staff I’ve… Read More »An integration tax that every adult in the US pays The post An integration tax that every adult in the US pays appeared first on Data Science Central.  ( 23 min )
    Harness the power of an AI-powered forecasting model to revitalize your business
    Inventory management is crucial for businesses, but it can be tedious. It can make or break a business, regardless of its age. AI has revolutionized business management and inventory control. AI can now do more than just follow instructions. It can analyze inventory history, predict customer behavior, and anticipate business needs. Want to know what… Read More »Harness the power of an AI-powered forecasting model to revitalize your business The post Harness the power of an AI-powered forecasting model to revitalize your business appeared first on Data Science Central.  ( 26 min )
  • Open

    What's the main reason of development behind all of the Al models to be so good all the same time -GPT4 and Gemni - eli5!
    First time visiting the community! So right back to it, GPT-4 got released this year and we(Layman people) have been thinking that Google got left behind in the race to AGI but just soon after the best commercial product by OpenAI was revealed as GPT-4 (in the public eye we could see the growth version by version), Google dropped a bombshell of a multi-modal AI. What made them leap so quickly, is it purely because the knowledge was shared and built upon or Was there a hardware barrier? submitted by /u/THEFACEMAN14 [link] [comments]
    'Nudify' Apps That Use AI to 'Undress' Women in Photos Are Soaring in Popularity
    Apps and websites that use artificial intelligence to undress women in photos are gaining popularity, with millions of people visiting these sites. The rise in popularity is due to the release of open source diffusion models that create realistic deepfake images. These apps are part of the concerning trend of non-consensual pornography, as the images are often taken from social media without consent. Privacy experts are worried that advances in AI technology have made deepfake software more accessible and effective. There is currently no federal law banning the creation of deepfake pornography. Source : https://time.com/6344068/nudify-apps-undress-photos-women-artificial-intelligence/ submitted by /u/NuseAI [link] [comments]
    AI — weekly megathread!
    News provided by aibrews.com Google introduced Gemini - a family of multimodal models built from the ground up for multimodality, capable of reasoning seamlessly across text, images, video, audio, and code. It comes in Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases [Details | Technical Report]. With a score of 90.0%, Gemini Ultra is the first model to outperform human experts on MMLU (massive multitask language understanding). Gemini Pro is available in Bard (English, in 170 countries). Gemini Ultra will come to Bard early next year. Pixel 8 Pro will be able to run Gemini Nano. Controversy regarding Google’s demo video (below), as many took it as being ‘fake’ [Article on TechCrunch]. Google shared a …  ( 1 min )
    U.S. Border Patrol is using AI to crack down on fentanyl trafficking
    submitted by /u/Upbeat-Interaction13 [link] [comments]  ( 1 min )
    AI's Climate Impact Goes Beyond Its Emissions
    Artificial intelligence (AI) has a significant environmental impact beyond its carbon footprint. The computing power and electricity required to train and run AI systems result in carbon dioxide emissions. The impact of AI on the climate crisis is difficult to calculate due to different types of AI requiring varying amounts of computing power. -Experts recommend considering AI's emissions as only one aspect of its climate footprint. AI can be used to optimize fossil fuel extraction, contributing to greenhouse gas emissions. AI is also used in automated advertising, which boosts consumptive behavior and contributes to emissions from the fashion industry. However, AI can also be applied to climate change mitigation, such as identifying damaged buildings in natural disasters and monitoring emissions. Source: https://www.scientificamerican.com/article/ais-climate-impact-goes-beyond-its-emissions/ submitted by /u/NuseAI [link] [comments]
    [D] ChatGPT4 doesn’t cut it for my work. Need a more accurate tool.
    I've been using ChatGPT for my research, but it keeps spitting out wrong or nonsensical answers. I'm working on a project about environmental policies, and I need factual data from spanning over a fairly long period. I wanted to make it easier for myself so I asked ChatGPT. Instead of getting just the facts, I got a mix of right and totally off-the-wall stuff. Had to fact check everything and in the end it took me the same amount of time and effort as if I had done the work myself, except costing me for the GPT subscription. I did some research and found out that it's a common problem with AIs, called "hallucination." I need an AI that gives me correct information, not random guesses. No made up sources for god’s sake. submitted by /u/awful_foyer70 [link] [comments]
    Project Auto-Human Pilot using all your Senses
    I recently had an idea about what it would be like to have your entire life guided by a sort of Artificial General Intelligence (AGI). This idea used to be confined to the realm of science fiction, but I'm starting to see how it could become possible by integrating some of the technology we have today. I think there would still be some problems to iron out before it could be a final product, but if you put all of the individual components together, the possibility is real. Here's the breakdown: Powerful XR Glasses: Indistinguishable from regular prescription glasses, these would be your primary interface with the AI. Powerful Computing Device: A smartphone like an iPhone would connect with the XR glasses and provide additional processing power. It would also allow the AI to "see" the world through a camera as input, potentially with other sensors like LiDAR for better accuracy. This device would be housed in a chest pocket, similar to Samantha's device in the movie "Her." Excellent Auditory Device: Airpods Pro-like earbuds disguised as hearing aids would provide clear and discreet communication with the AI. Vibrating Watch: This watch would provide haptic feedback to alert you if you're making any incorrect hand gestures during interactions guided by the AI. Health Sensors: All of these devices would include integrated health sensors to provide the AI with a comprehensive overview of your health and well-being. By combining all of these technologies, we could create a powerful AI-enhanced human with limitless possibilities. The software running on this system could be incredibly sophisticated, opening doors to some pretty science-fiction-like applications. submitted by /u/Major_Fishing6888 [link] [comments]  ( 1 min )
    Do all ai tools have no idea about themselves? Why didn't the developer put in the info about itself when building?
    submitted by /u/Civil-Interest8050 [link] [comments]
    Unlock Creative Possibilities with Script-to-Video AI Tool Aug X Labs
    submitted by /u/zengccfun [link] [comments]  ( 9 min )
    If the future world were divided into fifteen races, pick your side
    submitted by /u/Sea_Permit5660 [link] [comments]
    Is there a thread about AI vs UBI?
    The narrative was always "Robots will work for us so we have more free time" But the reality is "Robots will save tons of money to big corporations and increase profits" In the meantime, people will get fired and poorer. Is there a conversation about Taxing robotic and AI systems so it can improve the lives of everybody? Asking as I haven't found anything like that. Where is the conversation happening? submitted by /u/t-dog- [link] [comments]
    Google's best Gemini demo was faked
    submitted by /u/atomicxblue [link] [comments]  ( 1 min )
  • Open

    Sparsity-preserving differentially private training
    Posted by Yangsibo Huang, Research Intern, Google Research; Chiyuan Zhang, Research Scientist, Google Research Large embedding models have emerged as a fundamental tool for various applications in recommendation systems [1, 2] and natural language processing [3, 4, 5]. Such models enable the integration of non-numerical data into deep learning models by mapping categorical or string-valued input attributes with large vocabularies to fixed-length representation vectors using embedding layers. These models are widely deployed in personalized recommendation systems and achieve state-of-the-art performance in language tasks, such as language modeling, sentiment analysis, and question answering. In many such scenarios, privacy is an equally important feature when deploying those models. A…  ( 93 min )
  • Open

    Your illustrated guide to Christmas carols
    Between the two of them, ChatGPT4 can generate the lyrics to Christmas carols, and DALL-E3 can illustrate them! Throw your old carol books away because this is the only guide you'll need. 12 Days of Christmas "Please generate an illustration where each of the 12 days'  ( 3 min )
    Bonus: Jingle all the way
    AI Weirdness: the strange side of machine learning  ( 2 min )
  • Open

    Automated system teaches users when to collaborate with an AI assistant
    MIT researchers develop a customized onboarding process that helps a human learn when a model’s advice is trustworthy.  ( 11 min )
  • Open

    SwiftSage: A Generative Agent with Fast and Slow Thinking for Complex Interactive Tasks. (arXiv:2305.17390v2 [cs.CL] UPDATED)
    We introduce SwiftSage, a novel agent framework inspired by the dual-process theory of human cognition, designed to excel in action planning for complex interactive reasoning tasks. SwiftSage integrates the strengths of behavior cloning and prompting large language models (LLMs) to enhance task completion performance. The framework comprises two primary modules: the Swift module, representing fast and intuitive thinking, and the Sage module, emulating deliberate thought processes. The Swift module is a small encoder-decoder LM fine-tuned on the oracle agent's action trajectories, while the Sage module employs LLMs such as GPT-4 for subgoal planning and grounding. We develop a heuristic method to harmoniously integrate the two modules, resulting in a more efficient and robust problem-solving process. In 30 tasks from the ScienceWorld benchmark, SwiftSage significantly outperforms other methods such as SayCan, ReAct, and Reflexion, demonstrating its effectiveness in solving complex interactive tasks.  ( 2 min )
    Interpretability Illusions in the Generalization of Simplified Models. (arXiv:2312.03656v1 [cs.LG])
    A common method to study deep learning systems is to use simplified model representations -- for example, using singular value decomposition to visualize the model's hidden states in a lower dimensional space. This approach assumes that the results of these simplified are faithful to the original model. Here, we illustrate an important caveat to this assumption: even if the simplified representations can accurately approximate the full model on the training set, they may fail to accurately capture the model's behavior out of distribution -- the understanding developed from simplified representations may be an illusion. We illustrate this by training Transformer models on controlled datasets with systematic generalization splits. First, we train models on the Dyck balanced-parenthesis languages. We simplify these models using tools like dimensionality reduction and clustering, and then explicitly test how these simplified proxies match the behavior of the original model on various out-of-distribution test sets. We find that the simplified proxies are generally less faithful out of distribution. In cases where the original model generalizes to novel structures or deeper depths, the simplified versions may fail, or generalize better. This finding holds even if the simplified representations do not directly depend on the training distribution. Next, we study a more naturalistic task: predicting the next character in a dataset of computer code. We find similar generalization gaps between the original model and simplified proxies, and conduct further analysis to investigate which aspects of the code completion task are associated with the largest gaps. Together, our results raise questions about the extent to which mechanistic interpretations derived using tools like SVD can reliably predict what a model will do in novel situations.  ( 3 min )
    An Integration of Pre-Trained Speech and Language Models for End-to-End Speech Recognition. (arXiv:2312.03668v1 [eess.AS])
    Advances in machine learning have made it possible to perform various text and speech processing tasks, including automatic speech recognition (ASR), in an end-to-end (E2E) manner. Since typical E2E approaches require large amounts of training data and resources, leveraging pre-trained foundation models instead of training from scratch is gaining attention. Although there have been attempts to use pre-trained speech and language models in ASR, most of them are limited to using either. This paper explores the potential of integrating a pre-trained speech representation model with a large language model (LLM) for E2E ASR. The proposed model enables E2E ASR by generating text tokens in an autoregressive manner via speech representations as speech prompts, taking advantage of the vast knowledge provided by the LLM. Furthermore, the proposed model can incorporate remarkable developments for LLM utilization, such as inference optimization and parameter-efficient domain adaptation. Experimental results show that the proposed model achieves performance comparable to modern E2E ASR models.  ( 2 min )
    Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching. (arXiv:2311.17030v2 [cs.LG] UPDATED)
    Mechanistic interpretability aims to understand model behaviors in terms of specific, interpretable features, often hypothesized to manifest as low-dimensional subspaces of activations. Specifically, recent studies have explored subspace interventions (such as activation patching) as a way to simultaneously manipulate model behavior and attribute the features behind it to given subspaces. In this work, we demonstrate that these two aims diverge, potentially leading to an illusory sense of interpretability. Counterintuitively, even if a subspace intervention makes the model's output behave as if the value of a feature was changed, this effect may be achieved by activating a dormant parallel pathway leveraging another subspace that is causally disconnected from model outputs. We demonstrate this phenomenon in a distilled mathematical example, in two real-world domains (the indirect object identification task and factual recall), and present evidence for its prevalence in practice. In the context of factual recall, we further show a link to rank-1 fact editing, providing a mechanistic explanation for previous work observing an inconsistency between fact editing performance and fact localization. However, this does not imply that activation patching of subspaces is intrinsically unfit for interpretability. To contextualize our findings, we also show what a success case looks like in a task (indirect object identification) where prior manual circuit analysis informs an understanding of the location of a feature. We explore the additional evidence needed to argue that a patched subspace is faithful.  ( 3 min )
    KPI Extraction from Maintenance Work Orders -- A Comparison of Expert Labeling, Text Classification and AI-Assisted Tagging for Computing Failure Rates of Wind Turbines. (arXiv:2311.04064v2 [cs.CL] UPDATED)
    Maintenance work orders are commonly used to document information about wind turbine operation and maintenance. This includes details about proactive and reactive wind turbine downtimes, such as preventative and corrective maintenance. However, the information contained in maintenance work orders is often unstructured and difficult to analyze, presenting challenges for decision-makers wishing to use it for optimizing operation and maintenance. To address this issue, this work compares three different approaches to calculate reliability by performance indicators from maintenance work orders. The first approach involves manual labeling of the maintenance work orders by domain experts, using the schema defined in an industrial guideline to assign the label accordingly. The second approach involves the development of a model that automatically labels the maintenance work orders using text classification methods. Through this method, we are able to achieve macro average and weighted average F1-Scores of 0.75 and 0.85 respectively. The third technique uses an AI-assisted tagging tool to tag and structure the raw maintenance information, together with a novel rule-based approach for extracting relevant maintenance work orders for failure rate calculation. In our experiments the AI-assisted tool leads to a 88% drop in tagging time in comparison to the other two approaches, while expert labeling and text classification are more accurate in KPI extraction. Overall, our findings make extracting maintenance information from maintenance work orders more efficient, enable the assessment of reliability key performance indicators and therefore support the optimization of wind turbine operation and maintenance.  ( 3 min )
    From External to Swap Regret 2.0: An Efficient Reduction and Oblivious Adversary for Large Action Spaces. (arXiv:2310.19786v3 [cs.LG] UPDATED)
    We provide a novel reduction from swap-regret minimization to external-regret minimization, which improves upon the classical reductions of Blum-Mansour [BM07] and Stolz-Lugosi [SL05] in that it does not require finiteness of the space of actions. We show that, whenever there exists a no-external-regret algorithm for some hypothesis class, there must also exist a no-swap-regret algorithm for that same class. For the problem of learning with expert advice, our result implies that it is possible to guarantee that the swap regret is bounded by {\epsilon} after $\log(N)^{O(1/\epsilon)}$ rounds and with $O(N)$ per iteration complexity, where $N$ is the number of experts, while the classical reductions of Blum-Mansour and Stolz-Lugosi require $O(N/\epsilon^2)$ rounds and at least $\Omega(N^2)$ per iteration complexity. Our result comes with an associated lower bound, which -- in contrast to that in [BM07] -- holds for oblivious and $\ell_1$-constrained adversaries and learners that can employ distributions over experts, showing that the number of rounds must be $\tilde\Omega(N/\epsilon^2)$ or exponential in $1/\epsilon$. Our reduction implies that, if no-regret learning is possible in some game, then this game must have approximate correlated equilibria, of arbitrarily good approximation. This strengthens the folklore implication of no-regret learning that approximate coarse correlated equilibria exist. Importantly, it provides a sufficient condition for the existence of correlated equilibrium which vastly extends the requirement that the action set is finite, thus answering a question left open by [DG22; Ass+23]. Moreover, it answers several outstanding questions about equilibrium computation and learning in games.  ( 3 min )
    Deep Learning for Koopman-based Dynamic Movement Primitives. (arXiv:2312.03328v1 [cs.RO])
    The challenge of teaching robots to perform dexterous manipulation, dynamic locomotion, or whole--body manipulation from a small number of demonstrations is an important research field that has attracted interest from across the robotics community. In this work, we propose a novel approach by joining the theories of Koopman Operators and Dynamic Movement Primitives to Learning from Demonstration. Our approach, named \gls{admd}, projects nonlinear dynamical systems into linear latent spaces such that a solution reproduces the desired complex motion. Use of an autoencoder in our approach enables generalizability and scalability, while the constraint to a linear system attains interpretability. Our results are comparable to the Extended Dynamic Mode Decomposition on the LASA Handwriting dataset but with training on only a small fractions of the letters.  ( 2 min )
    Estimates on the generalization error of Physics Informed Neural Networks (PINNs) for approximating PDEs. (arXiv:2006.16144v3 [math.NA] UPDATED)
    Physics informed neural networks (PINNs) have recently been widely used for robust and accurate approximation of PDEs. We provide rigorous upper bounds on the generalization error of PINNs approximating solutions of the forward problem for PDEs. An abstract formalism is introduced and stability properties of the underlying PDE are leveraged to derive an estimate for the generalization error in terms of the training error and number of training samples. This abstract framework is illustrated with several examples of nonlinear PDEs. Numerical experiments, validating the proposed theory, are also presented.  ( 2 min )
    I-PHYRE: Interactive Physical Reasoning. (arXiv:2312.03009v1 [cs.AI])
    Current evaluation protocols predominantly assess physical reasoning in stationary scenes, creating a gap in evaluating agents' abilities to interact with dynamic events. While contemporary methods allow agents to modify initial scene configurations and observe consequences, they lack the capability to interact with events in real time. To address this, we introduce I-PHYRE, a framework that challenges agents to simultaneously exhibit intuitive physical reasoning, multi-step planning, and in-situ intervention. Here, intuitive physical reasoning refers to a quick, approximate understanding of physics to address complex problems; multi-step denotes the need for extensive sequence planning in I-PHYRE, considering each intervention can significantly alter subsequent choices; and in-situ implies the necessity for timely object manipulation within a scene, where minor timing deviations can result in task failure. We formulate four game splits to scrutinize agents' learning and generalization of essential principles of interactive physical reasoning, fostering learning through interaction with representative scenarios. Our exploration involves three planning strategies and examines several supervised and reinforcement agents' zero-shot generalization proficiency on I-PHYRE. The outcomes highlight a notable gap between existing learning algorithms and human performance, emphasizing the imperative for more research in enhancing agents with interactive physical reasoning capabilities. The environment and baselines will be made publicly available.  ( 2 min )
    Model Reprogramming: Resource-Efficient Cross-Domain Machine Learning. (arXiv:2202.10629v4 [cs.LG] UPDATED)
    In data-rich domains such as vision, language, and speech, deep learning prevails to deliver high-performance task-specific models and can even learn general task-agnostic representations for efficient finetuning to downstream tasks. However, deep learning in resource-limited domains still faces multiple challenges including (i) limited data, (ii) constrained model development cost, and (iii) lack of adequate pre-trained models for effective finetuning. This paper provides an overview of model reprogramming to bridge this gap. Model reprogramming enables resource-efficient cross-domain machine learning by repurposing and reusing a well-developed pre-trained model from a source domain to solve tasks in a target domain without model finetuning, where the source and target domains can be vastly different. In many applications, model reprogramming outperforms transfer learning and training from scratch. This paper elucidates the methodology of model reprogramming, summarizes existing use cases, provides a theoretical explanation of the success of model reprogramming, and concludes with a discussion on open-ended research questions and opportunities. A list of model reprogramming studies is actively maintained and updated at https://github.com/IBM/model-reprogramming.  ( 2 min )
    Divide, Evaluate, and Refine: Evaluating and Improving Text-to-Image Alignment with Iterative VQA Feedback. (arXiv:2307.04749v2 [cs.CV] UPDATED)
    The field of text-conditioned image generation has made unparalleled progress with the recent advent of latent diffusion models. While remarkable, as the complexity of given text input increases, the state-of-the-art diffusion models may still fail in generating images which accurately convey the semantics of the given prompt. Furthermore, it has been observed that such misalignments are often left undetected by pretrained multi-modal models such as CLIP. To address these problems, in this paper we explore a simple yet effective decompositional approach towards both evaluation and improvement of text-to-image alignment. In particular, we first introduce a Decompositional-Alignment-Score which given a complex prompt decomposes it into a set of disjoint assertions. The alignment of each assertion with generated images is then measured using a VQA model. Finally, alignment scores for different assertions are combined aposteriori to give the final text-to-image alignment score. Experimental analysis reveals that the proposed alignment metric shows significantly higher correlation with human ratings as opposed to traditional CLIP, BLIP scores. Furthermore, we also find that the assertion level alignment scores provide a useful feedback which can then be used in a simple iterative procedure to gradually increase the expression of different assertions in the final image outputs. Human user studies indicate that the proposed approach surpasses previous state-of-the-art by 8.7% in overall text-to-image alignment accuracy. Project page for our paper is available at https://1jsingh.github.io/divide-evaluate-and-refine  ( 3 min )
    TPPoet: Transformer-Based Persian Poem Generation using Minimal Data and Advanced Decoding Techniques. (arXiv:2312.02125v2 [cs.CL] UPDATED)
    Recent advances in language models (LMs), have demonstrated significant efficacy in tasks related to the arts and humanities. While LMs have exhibited exceptional performance across a wide range of natural language processing tasks, there are notable challenges associated with their utilization on small datasets and their ability to replicate more creative human capacities. In this study, we aim to address these challenges by training a Persian classical poetry generation model using a transformer architecture on a specialized dataset with no pretraining. Additionally, we propose a novel decoding method to enhance coherence and meaningfulness in the generated poetry, effectively managing the tradeoff between diversity and quality. Furthermore, the results of our training approach and the proposed decoding method are evaluated through comprehensive set of automatic and human evaluations and showed its superior capability to generate coherent and meaningful poetry in compare to other decoding methods and an existing Persian large language model (LLM).  ( 2 min )
    Dyport: Dynamic Importance-based Hypothesis Generation Benchmarking Technique. (arXiv:2312.03303v1 [cs.AI])
    This paper presents a novel benchmarking framework Dyport for evaluating biomedical hypothesis generation systems. Utilizing curated datasets, our approach tests these systems under realistic conditions, enhancing the relevance of our evaluations. We integrate knowledge from the curated databases into a dynamic graph, accompanied by a method to quantify discovery importance. This not only assesses hypothesis accuracy but also their potential impact in biomedical research which significantly extends traditional link prediction benchmarks. Applicability of our benchmarking process is demonstrated on several link prediction systems applied on biomedical semantic knowledge graphs. Being flexible, our benchmarking system is designed for broad application in hypothesis generation quality verification, aiming to expand the scope of scientific discovery within the biomedical research community. Availability and implementation: Dyport framework is fully open-source. All code and datasets are available at: https://github.com/IlyaTyagin/Dyport  ( 2 min )
    Beyond Isolation: Multi-Agent Synergy for Improving Knowledge Graph Construction. (arXiv:2312.03022v1 [cs.AI])
    Knowledge graph construction (KGC) is a multifaceted undertaking involving the extraction of entities, relations, and events. Traditionally, large language models (LLMs) have been viewed as solitary task-solving agents in this complex landscape. However, this paper challenges this paradigm by introducing a novel framework, CooperKGC. Departing from the conventional approach, CooperKGC establishes a collaborative processing network, assembling a KGC collaboration team capable of concurrently addressing entity, relation, and event extraction tasks. Our experiments unequivocally demonstrate that fostering collaboration and information interaction among diverse agents within CooperKGC yields superior results compared to individual cognitive processes operating in isolation. Importantly, our findings reveal that the collaboration facilitated by CooperKGC enhances knowledge selection, correction, and aggregation capabilities across multiple rounds of interactions.  ( 2 min )
    f-FERM: A Scalable Framework for Robust Fair Empirical Risk Minimization. (arXiv:2312.03259v1 [cs.LG])
    Training and deploying machine learning models that meet fairness criteria for protected groups are fundamental in modern artificial intelligence. While numerous constraints and regularization terms have been proposed in the literature to promote fairness in machine learning tasks, most of these methods are not amenable to stochastic optimization due to the complex and nonlinear structure of constraints and regularizers. Here, the term "stochastic" refers to the ability of the algorithm to work with small mini-batches of data. Motivated by the limitation of existing literature, this paper presents a unified stochastic optimization framework for fair empirical risk minimization based on f-divergence measures (f-FERM). The proposed stochastic algorithm enjoys theoretical convergence guarantees. In addition, our experiments demonstrate the superiority of fairness-accuracy tradeoffs offered by f-FERM for almost all batch sizes (ranging from full-batch to batch size of one). Moreover, we show that our framework can be extended to the case where there is a distribution shift from training to the test data. Our extension is based on a distributionally robust optimization reformulation of f-FERM objective under $L_p$ norms as uncertainty sets. Again, in this distributionally robust setting, f-FERM not only enjoys theoretical convergence guarantees but also outperforms other baselines in the literature in the tasks involving distribution shifts. An efficient stochastic implementation of $f$-FERM is publicly available.  ( 2 min )
    CD-GraB: Coordinating Distributed Example Orders for Provably Accelerated Training. (arXiv:2302.00845v4 [cs.LG] UPDATED)
    Recent research on online Gradient Balancing (GraB) has revealed that there exist permutation-based example orderings for SGD that are guaranteed to outperform random reshuffling (RR). Whereas RR arbitrarily permutes training examples, GraB leverages stale gradients from prior epochs to order examples -- achieving a provably faster convergence rate than RR. However, GraB is limited by design: while it demonstrates an impressive ability to scale-up training on centralized data, it does not naturally extend to modern distributed ML workloads. We therefore propose Coordinated Distributed GraB (CD-GraB), which uses insights from prior work on kernel thinning to translate the benefits of provably faster permutation-based example ordering to distributed settings. With negligible overhead, CD-GraB exhibits a linear speedup in convergence rate over centralized GraB and outperforms distributed RR on a variety of benchmark tasks.  ( 2 min )
    Vicious Classifiers: Data Reconstruction Attack at Inference Time. (arXiv:2212.04223v2 [cs.LG] UPDATED)
    Privacy-preserving inference in edge computing paradigms encourages the users of machine-learning services to locally run a model on their private input, for a target task, and only share the model's outputs with the server. We study how a vicious server can reconstruct the input data by observing only the model's outputs, while keeping the target accuracy very close to that of a honest server: by jointly training a target model (to run at users' side) and an attack model for data reconstruction (to secretly use at server's side). We present a new measure to assess the reconstruction risk in edge inference. Our evaluations on six benchmark datasets demonstrate that the model's input can be approximately reconstructed from the outputs of a single target inference. We propose a potential defense mechanism that helps to distinguish vicious versus honest classifiers at inference time. We discuss open challenges and directions for future studies and release our code as a benchmark for future work.  ( 2 min )
    Towards Sobolev Training. (arXiv:2312.03510v1 [cs.LG])
    The increasing use of stochastic models for describing complex phenomena warrants surrogate models that capture the reference model characteristics at a fraction of the computational cost, foregoing potentially expensive Monte Carlo simulation. The predominant approach of fitting a large neural network and then pruning it to a reduced size has commonly neglected shortcomings. The produced surrogate models often will not capture the sensitivities and uncertainties inherent in the original model. In particular, (higher-order) derivative information of such surrogates could differ drastically. Given a large enough network, we expect this derivative information to match. However, the pruned model will almost certainly not share this behavior. In this paper, we propose to find surrogate models by using sensitivity information throughout the learning and pruning process. We build on work using Interval Adjoint Significance Analysis for pruning and combine it with the recent advancements in Sobolev Training to accurately model the original sensitivity information in the pruned neural network based surrogate model. We experimentally underpin the method on an example of pricing a multidimensional Basket option modelled through a stochastic differential equation with Brownian motion. The proposed method is, however, not limited to the domain of quantitative finance, which was chosen as a case study for intuitive interpretations of the sensitivities. It serves as a foundation for building further surrogate modelling techniques considering sensitivity information.
    Not All Large Language Models (LLMs) Succumb to the "Reversal Curse": A Comparative Study of Deductive Logical Reasoning in BERT and GPT Models. (arXiv:2312.03633v1 [cs.CL])
    The "Reversal Curse" refers to the scenario where auto-regressive decoder large language models (LLMs), such as ChatGPT, trained on "A is B" fail to learn "B is A", demonstrating a basic failure of logical deduction. This raises a red flag in the use of GPT models for certain general tasks such as constructing knowledge graphs, considering their adherence to this symmetric principle. In our study, we examined a bidirectional LLM, BERT, and found that it is immune to the reversal curse. Driven by ongoing efforts to construct biomedical knowledge graphs with LLMs, we also embarked on evaluating more complex but essential deductive reasoning capabilities. This process included first training encoder and decoder language models to master the intersection ($\cap$) and union ($\cup$) operations on two sets and then moving on to assess their capability to infer different combinations of union ($\cup$) and intersection ($\cap$) operations on three newly created sets. The findings showed that while both encoder and decoder language models, trained for tasks involving two sets (union/intersection), were proficient in such scenarios, they encountered difficulties when dealing with operations that included three sets (various combinations of union and intersection). Our research highlights the distinct characteristics of encoder and decoder models in simple and complex logical reasoning. In practice, the choice between BERT and GPT should be guided by the specific requirements and nature of the task at hand, leveraging their respective strengths in bidirectional context comprehension and sequence prediction.  ( 3 min )
    GCFA:Geodesic Curve Feature Augmentation via Shape Space Theory. (arXiv:2312.03325v1 [cs.CV])
    Deep learning has yielded remarkable outcomes in various domains. However, the challenge of requiring large-scale labeled samples still persists in deep learning. Thus, data augmentation has been introduced as a critical strategy to train deep learning models. However, data augmentation suffers from information loss and poor performance in small sample environments. To overcome these drawbacks, we propose a feature augmentation method based on shape space theory, i.e., Geodesic curve feature augmentation, called GCFA in brevity. First, we extract features from the image with the neural network model. Then, the multiple image features are projected into a pre-shape space as features. In the pre-shape space, a Geodesic curve is built to fit the features. Finally, the many generated features on the Geodesic curve are used to train the various machine learning models. The GCFA module can be seamlessly integrated with most machine learning methods. And the proposed method is simple, effective and insensitive for the small sample datasets. Several examples demonstrate that the GCFA method can greatly improve the performance of the data preprocessing model in a small sample environment.
    Domain Invariant Representation Learning and Sleep Dynamics Modeling for Automatic Sleep Staging. (arXiv:2312.03196v1 [cs.LG])
    Sleep staging has become a critical task in diagnosing and treating sleep disorders to prevent sleep related diseases. With rapidly growing large scale public sleep databases and advances in machine learning, significant progress has been made toward automatic sleep staging. However, previous studies face some critical problems in sleep studies; the heterogeneity of subjects' physiological signals, the inability to extract meaningful information from unlabeled sleep signal data to improve predictive performances, the difficulty in modeling correlations between sleep stages, and the lack of an effective mechanism to quantify predictive uncertainty. In this study, we propose a neural network based automatic sleep staging model, named DREAM, to learn domain generalized representations from physiological signals and models sleep dynamics. DREAM learns sleep related and subject invariant representations from diverse subjects' sleep signal segments and models sleep dynamics by capturing interactions between sequential signal segments and between sleep stages. In the experiments, we demonstrate that DREAM outperforms the existing sleep staging methods on three datasets. The case study demonstrates that our model can learn the generalized decision function resulting in good prediction performances for the new subjects, especially in case there are differences between testing and training subjects. The usage of unlabeled data shows the benefit of leveraging unlabeled EEG data. Further, uncertainty quantification demonstrates that DREAM provides prediction uncertainty, making the model reliable and helping sleep experts in real world applications.  ( 3 min )
    Deep Set Neural Networks for forecasting asynchronous bioprocess timeseries. (arXiv:2312.02079v2 [cs.LG] UPDATED)
    Cultivation experiments often produce sparse and irregular time series. Classical approaches based on mechanistic models, like Maximum Likelihood fitting or Monte-Carlo Markov chain sampling, can easily account for sparsity and time-grid irregularities, but most statistical and Machine Learning tools are not designed for handling sparse data out-of-the-box. Among popular approaches there are various schemes for filling missing values (imputation) and interpolation into a regular grid (alignment). However, such methods transfer the biases of the interpolation or imputation models to the target model. We show that Deep Set Neural Networks equipped with triplet encoding of the input data can successfully handle bio-process data without any need for imputation or alignment procedures. The method is agnostic to the particular nature of the time series and can be adapted for any task, for example, online monitoring, predictive control, design of experiments, etc. In this work, we focus on forecasting. We argue that such an approach is especially suitable for typical cultivation processes, demonstrate the performance of the method on several forecasting tasks using data generated from macrokinetic growth models under realistic conditions, and compare the method to a conventional fitting procedure and methods based on imputation and alignment.  ( 2 min )
    Schrodinger Bridges Beat Diffusion Models on Text-to-Speech Synthesis. (arXiv:2312.03491v1 [cs.SD])
    In text-to-speech (TTS) synthesis, diffusion models have achieved promising generation quality. However, because of the pre-defined data-to-noise diffusion process, their prior distribution is restricted to a noisy representation, which provides little information of the generation target. In this work, we present a novel TTS system, Bridge-TTS, making the first attempt to substitute the noisy Gaussian prior in established diffusion-based TTS methods with a clean and deterministic one, which provides strong structural information of the target. Specifically, we leverage the latent representation obtained from text input as our prior, and build a fully tractable Schrodinger bridge between it and the ground-truth mel-spectrogram, leading to a data-to-data process. Moreover, the tractability and flexibility of our formulation allow us to empirically study the design spaces such as noise schedules, as well as to develop stochastic and deterministic samplers. Experimental results on the LJ-Speech dataset illustrate the effectiveness of our method in terms of both synthesis quality and sampling efficiency, significantly outperforming our diffusion counterpart Grad-TTS in 50-step/1000-step synthesis and strong fast TTS models in few-step scenarios. Project page: https://bridge-tts.github.io/  ( 2 min )
    Coherent Soft Imitation Learning. (arXiv:2305.16498v3 [cs.LG] UPDATED)
    Imitation learning methods seek to learn from an expert either through behavioral cloning (BC) of the policy or inverse reinforcement learning (IRL) of the reward. Such methods enable agents to learn complex tasks from humans that are difficult to capture with hand-designed reward functions. Choosing BC or IRL for imitation depends on the quality and state-action coverage of the demonstrations, as well as additional access to the Markov decision process. Hybrid strategies that combine BC and IRL are not common, as initial policy optimization against inaccurate rewards diminishes the benefit of pretraining the policy with BC. This work derives an imitation method that captures the strengths of both BC and IRL. In the entropy-regularized ('soft') reinforcement learning setting, we show that the behaviour-cloned policy can be used as both a shaped reward and a critic hypothesis space by inverting the regularized policy update. This coherency facilitates fine-tuning cloned policies using the reward estimate and additional interactions with the environment. This approach conveniently achieves imitation learning through initial behaviour cloning, followed by refinement via RL with online or offline data sources. The simplicity of the approach enables graceful scaling to high-dimensional and vision-based tasks, with stable learning and minimal hyperparameter tuning, in contrast to adversarial approaches. For the open-source implementation and simulation results, see https://joemwatson.github.io/csil/.  ( 2 min )
    Learning From Scenarios for Stochastic Repairable Scheduling. (arXiv:2312.03492v1 [cs.LG])
    When optimizing problems with uncertain parameter values in a linear objective, decision-focused learning enables end-to-end learning of these values. We are interested in a stochastic scheduling problem, in which processing times are uncertain, which brings uncertain values in the constraints, and thus repair of an initial schedule may be needed. Historical realizations of the stochastic processing times are available. We show how existing decision-focused learning techniques based on stochastic smoothing can be adapted to this scheduling problem. We include an extensive experimental evaluation to investigate in which situations decision-focused learning outperforms the state of the art for such situations: scenario-based stochastic optimization.  ( 2 min )
    Exploring Distributional Shifts in Large Language Models for Code Analysis. (arXiv:2303.09128v2 [cs.CL] UPDATED)
    We systematically study how three large language models with code capabilities - CodeT5, Codex, and ChatGPT - generalize to out-of-domain data. We consider two fundamental applications - code summarization, and code generation. We split data into domains following its natural boundaries - by an organization, by a project, and by a module within the software project. We establish that samples from each new domain present all the models with a significant challenge of distribution shift. We study how established methods adapt models to better generalize to new domains. Our experiments show that while multitask learning alone is a reasonable baseline, combining it with few-shot finetuning on examples retrieved from training data can achieve very strong performance. Moreover, this solution can outperform direct finetuning for very low-data scenarios. Finally, we consider variations of this approach to create a more broadly applicable method to adapt to multiple domains at once. We find that for code generation, a model adapted to multiple domains simultaneously performs on par with those adapted to a single domain  ( 2 min )
    When accurate prediction models yield harmful self-fulfilling prophecies. (arXiv:2312.01210v2 [stat.ME] UPDATED)
    Prediction models are popular in medical research and practice. By predicting an outcome of interest for specific patients, these models may help inform difficult treatment decisions, and are often hailed as the poster children for personalized, data-driven healthcare. We show however, that using prediction models for decision making can lead to harmful decisions, even when the predictions exhibit good discrimination after deployment. These models are harmful self-fulfilling prophecies: their deployment harms a group of patients but the worse outcome of these patients does not invalidate the predictive power of the model. Our main result is a formal characterization of a set of such prediction models. Next we show that models that are well calibrated before and after deployment are useless for decision making as they made no change in the data distribution. These results point to the need to revise standard practices for validation, deployment and evaluation of prediction models that are used in medical decisions.  ( 2 min )
    Protein Language Model-Powered 3D Ligand Binding Site Prediction from Protein Sequence. (arXiv:2312.03016v1 [q-bio.QM])
    Prediction of ligand binding sites of proteins is a fundamental and important task for understanding the function of proteins and screening potential drugs. Most existing methods require experimentally determined protein holo-structures as input. However, such structures can be unavailable on novel or less-studied proteins. To tackle this limitation, we propose LaMPSite, which only takes protein sequences and ligand molecular graphs as input for ligand binding site predictions. The protein sequences are used to retrieve residue-level embeddings and contact maps from the pre-trained ESM-2 protein language model. The ligand molecular graphs are fed into a graph neural network to compute atom-level embeddings. Then we compute and update the protein-ligand interaction embedding based on the protein residue-level embeddings and ligand atom-level embeddings, and the geometric constraints in the inferred protein contact map and ligand distance map. A final pooling on protein-ligand interaction embedding would indicate which residues belong to the binding sites. Without any 3D coordinate information of proteins, our proposed model achieves competitive performance compared to baseline methods that require 3D protein structures when predicting binding sites. Given that less than 50% of proteins have reliable structure information in the current stage, LaMPSite will provide new opportunities for drug discovery.  ( 2 min )
    On the variants of SVM methods applied to GPR data to classify tack coat characteristics in French pavements: two experimental case studies. (arXiv:2312.03351v1 [stat.ML])
    Among the commonly used non-destructive techniques, the Ground Penetrating Radar (GPR) is one of the most widely adopted today for assessing pavement conditions in France. However, conventional radar systems and their forward processing methods have shown their limitations for the physical and geometrical characterization of very thin layers such as tack coats. However, the use of Machine Learning methods applied to GPR with an inverse approach showed that it was numerically possible to identify the tack coat characteristics despite masking effects due to low timefrequency resolution noted in the raw B-scans. Thus, we propose in this paper to apply the inverse approach based on Machine Learning, already validated in previous works on numerical data, on two experimental cases with different pavement structures. The first case corresponds to a validation on known pavement structures on the Gustave Eiffel University (Nantes, France) with its pavement fatigue carousel and the second case focuses on a new real road in Vend{\'e}e department (France). In both case studies, the performances of SVM/SVR methods showed the efficiency of supervised learning methods to classify and estimate the emulsion proportioning in the tack coats.  ( 3 min )
    Improved Convergence of Score-Based Diffusion Models via Prediction-Correction. (arXiv:2305.14164v2 [cs.LG] UPDATED)
    Score-based generative models (SGMs) are powerful tools to sample from complex data distributions. Their underlying idea is to (i) run a forward process for time $T_1$ by adding noise to the data, (ii) estimate its score function, and (iii) use such estimate to run a reverse process. As the reverse process is initialized with the stationary distribution of the forward one, the existing analysis paradigm requires $T_1\to\infty$. This is however problematic: from a theoretical viewpoint, for a given precision of the score approximation, the convergence guarantee fails as $T_1$ diverges; from a practical viewpoint, a large $T_1$ increases computational costs and leads to error propagation. This paper addresses the issue by considering a version of the popular predictor-corrector scheme: after running the forward process, we first estimate the final distribution via an inexact Langevin dynamics and then revert the process. Our key technical contribution is to provide convergence guarantees which require to run the forward process only for a fixed finite time $T_1$. Our bounds exhibit a mild logarithmic dependence on the input dimension and the subgaussian norm of the target distribution, have minimal assumptions on the data, and require only to control the $L^2$ loss on the score approximation, which is the quantity minimized in practice.  ( 2 min )
    SVQ: Sparse Vector Quantization for Spatiotemporal Forecasting. (arXiv:2312.03406v1 [cs.CV])
    Spatiotemporal forecasting tasks, such as weather forecasting and traffic prediction, offer significant societal benefits. These tasks can be effectively approached as image forecasting problems using computer vision models. Vector quantization (VQ) is a well-known method for discrete representation that improves the latent space, leading to enhanced generalization and transfer learning capabilities. One of the main challenges in using VQ for spatiotemporal forecasting is how to balance between keeping enough details and removing noises from the original patterns for better generalization. We address this challenge by developing sparse vector quantization, or {\bf SVQ} for short, that leverages sparse regression to make better trade-off between the two objectives. The main innovation of this work is to approximate sparse regression by a two-layer MLP and a randomly fixed or learnable matrix, dramatically improving its computational efficiency. Through experiments conducted on diverse datasets in multiple fields including weather forecasting, traffic flow prediction, and video forecasting, we unequivocally demonstrate that our proposed method consistently enhances the performance of base models and achieves state-of-the-art results across all benchmarks.
    Teaching Specific Scientific Knowledge into Large Language Models through Additional Training. (arXiv:2312.03360v1 [cs.CL])
    Through additional training, we explore embedding specialized scientific knowledge into the Llama 2 Large Language Model (LLM). Key findings reveal that effective knowledge integration requires reading texts from multiple perspectives, especially in instructional formats. We utilize text augmentation to tackle the scarcity of specialized texts, including style conversions and translations. Hyperparameter optimization proves crucial, with different size models (7b, 13b, and 70b) reasonably undergoing additional training. Validating our methods, we construct a dataset of 65,000 scientific papers. Although we have succeeded in partially embedding knowledge, the study highlights the complexities and limitations of incorporating specialized information into LLMs, suggesting areas for further improvement.
    Customizable Combination of Parameter-Efficient Modules for Multi-Task Learning. (arXiv:2312.03248v1 [cs.LG])
    Modular and composable transfer learning is an emerging direction in the field of Parameter Efficient Fine-Tuning, as it enables neural networks to better organize various aspects of knowledge, leading to improved cross-task generalization. In this paper, we introduce a novel approach Customized Polytropon C-Poly that combines task-common skills and task-specific skills, while the skill parameters being highly parameterized using low-rank techniques. Each task is associated with a customizable number of exclusive specialized skills and also benefits from skills shared with peer tasks. A skill assignment matrix is jointly learned. To evaluate our approach, we conducted extensive experiments on the Super-NaturalInstructions and the SuperGLUE benchmarks. Our findings demonstrate that C-Poly outperforms fully-shared, task-specific, and skill-indistinguishable baselines, significantly enhancing the sample efficiency in multi-task learning scenarios.
    Enhanced Breast Cancer Tumor Classification using MobileNetV2: A Detailed Exploration on Image Intensity, Error Mitigation, and Streamlit-driven Real-time Deployment. (arXiv:2312.03020v1 [eess.IV])
    This research introduces a sophisticated transfer learning model based on Google's MobileNetV2 for breast cancer tumor classification into normal, benign, and malignant categories, utilizing a dataset of 1576 ultrasound images (265 normal, 891 benign, 420 malignant). The model achieves an accuracy of 0.82, precision of 0.83, recall of 0.81, ROC-AUC of 0.94, PR-AUC of 0.88, and MCC of 0.74. It examines image intensity distributions and misclassification errors, offering improvements for future applications. Addressing dataset imbalances, the study ensures a generalizable model. This work, using a dataset from Baheya Hospital, Cairo, Egypt, compiled by Walid Al-Dhabyani et al., emphasizes MobileNetV2's potential in medical imaging, aiming to improve diagnostic precision in oncology. Additionally, the paper explores Streamlit-based deployment for real-time tumor classification, demonstrating MobileNetV2's applicability in medical imaging and setting a benchmark for future research in oncology diagnostics.  ( 2 min )
    Accelerated Gradient Algorithms with Adaptive Subspace Search for Instance-Faster Optimization. (arXiv:2312.03218v1 [cs.LG])
    Gradient-based minimax optimal algorithms have greatly promoted the development of continuous optimization and machine learning. One seminal work due to Yurii Nesterov [Nes83a] established $\tilde{\mathcal{O}}(\sqrt{L/\mu})$ gradient complexity for minimizing an $L$-smooth $\mu$-strongly convex objective. However, an ideal algorithm would adapt to the explicit complexity of a particular objective function and incur faster rates for simpler problems, triggering our reconsideration of two defeats of existing optimization modeling and analysis. (i) The worst-case optimality is neither the instance optimality nor such one in reality. (ii) Traditional $L$-smoothness condition may not be the primary abstraction/characterization for modern practical problems. In this paper, we open up a new way to design and analyze gradient-based algorithms with direct applications in machine learning, including linear regression and beyond. We introduce two factors $(\alpha, \tau_{\alpha})$ to refine the description of the degenerated condition of the optimization problems based on the observation that the singular values of Hessian often drop sharply. We design adaptive algorithms that solve simpler problems without pre-known knowledge with reduced gradient or analogous oracle accesses. The algorithms also improve the state-of-art complexities for several problems in machine learning, thereby solving the open problem of how to design faster algorithms in light of the known complexity lower bounds. Specially, with the $\mathcal{O}(1)$-nuclear norm bounded, we achieve an optimal $\tilde{\mathcal{O}}(\mu^{-1/3})$ (v.s. $\tilde{\mathcal{O}}(\mu^{-1/2})$) gradient complexity for linear regression. We hope this work could invoke the rethinking for understanding the difficulty of modern problems in optimization.
    Precise Asymptotic Generalization for Multiclass Classification with Overparameterized Linear Models. (arXiv:2306.13255v2 [cs.LG] UPDATED)
    We study the asymptotic generalization of an overparameterized linear model for multiclass classification under the Gaussian covariates bi-level model introduced in Subramanian et al.~'22, where the number of data points, features, and classes all grow together. We fully resolve the conjecture posed in Subramanian et al.~'22, matching the predicted regimes for generalization. Furthermore, our new lower bounds are akin to an information-theoretic strong converse: they establish that the misclassification rate goes to 0 or 1 asymptotically. One surprising consequence of our tight results is that the min-norm interpolating classifier can be asymptotically suboptimal relative to noninterpolating classifiers in the regime where the min-norm interpolating regressor is known to be optimal. The key to our tight analysis is a new variant of the Hanson-Wright inequality which is broadly useful for multiclass problems with sparse labels. As an application, we show that the same type of analysis can be used to analyze the related multilabel classification problem under the same bi-level ensemble.  ( 2 min )
    FlexModel: A Framework for Interpretability of Distributed Large Language Models. (arXiv:2312.03140v1 [cs.LG])
    With the growth of large language models, now incorporating billions of parameters, the hardware prerequisites for their training and deployment have seen a corresponding increase. Although existing tools facilitate model parallelization and distributed training, deeper model interactions, crucial for interpretability and responsible AI techniques, still demand thorough knowledge of distributed computing. This often hinders contributions from researchers with machine learning expertise but limited distributed computing background. Addressing this challenge, we present FlexModel, a software package providing a streamlined interface for engaging with models distributed across multi-GPU and multi-node configurations. The library is compatible with existing model distribution libraries and encapsulates PyTorch models. It exposes user-registerable HookFunctions to facilitate straightforward interaction with distributed model internals, bridging the gap between distributed and single-device model paradigms. Primarily, FlexModel enhances accessibility by democratizing model interactions and promotes more inclusive research in the domain of large-scale neural networks. The package is found at https://github.com/VectorInstitute/flex_model.
    Delegated Classification. (arXiv:2306.11475v2 [cs.LG] UPDATED)
    When machine learning is outsourced to a rational agent, conflicts of interest might arise and severely impact predictive performance. In this work, we propose a theoretical framework for incentive-aware delegation of machine learning tasks. We model delegation as a principal-agent game, in which accurate learning can be incentivized by the principal using performance-based contracts. Adapting the economic theory of contract design to this setting, we define budget-optimal contracts and prove they take a simple threshold form under reasonable assumptions. In the binary-action case, the optimality of such contracts is shown to be equivalent to the classic Neyman-Pearson lemma, establishing a formal connection between contract design and statistical hypothesis testing. Empirically, we demonstrate that budget-optimal contracts can be constructed using small-scale data, leveraging recent advances in the study of learning curves and scaling laws. Performance and economic outcomes are evaluated using synthetic and real-world classification tasks.  ( 2 min )
    Bootstrap Your Own Variance. (arXiv:2312.03213v1 [cs.LG])
    Understanding model uncertainty is important for many applications. We propose Bootstrap Your Own Variance (BYOV), combining Bootstrap Your Own Latent (BYOL), a negative-free Self-Supervised Learning (SSL) algorithm, with Bayes by Backprop (BBB), a Bayesian method for estimating model posteriors. We find that the learned predictive std of BYOV vs. a supervised BBB model is well captured by a Gaussian distribution, providing preliminary evidence that the learned parameter posterior is useful for label free uncertainty estimation. BYOV improves upon the deterministic BYOL baseline (+2.83% test ECE, +1.03% test Brier) and presents better calibration and reliability when tested with various augmentations (eg: +2.4% test ECE, +1.2% test Brier for Salt & Pepper noise).
    Transformer-Powered Surrogates Close the ICF Simulation-Experiment Gap with Extremely Limited Data. (arXiv:2312.03642v1 [cs.LG])
    Recent advances in machine learning, specifically transformer architecture, have led to significant advancements in commercial domains. These powerful models have demonstrated superior capability to learn complex relationships and often generalize better to new data and problems. This paper presents a novel transformer-powered approach for enhancing prediction accuracy in multi-modal output scenarios, where sparse experimental data is supplemented with simulation data. The proposed approach integrates transformer-based architecture with a novel graph-based hyper-parameter optimization technique. The resulting system not only effectively reduces simulation bias, but also achieves superior prediction accuracy compared to the prior method. We demonstrate the efficacy of our approach on inertial confinement fusion experiments, where only 10 shots of real-world data are available, as well as synthetic versions of these experiments.  ( 2 min )
    Speculative Exploration on the Concept of Artificial Agents Conducting Autonomous Research. (arXiv:2312.03497v1 [cs.AI])
    This paper engages in a speculative exploration of the concept of an artificial agent capable of conducting research. Initially, it examines how the act of research can be conceptually characterized, aiming to provide a starting point for discussions about what it means to create such agents. The focus then shifts to the core components of research: question formulation, hypothesis generation, and hypothesis verification. This discussion includes a consideration of the potential and challenges associated with enabling machines to autonomously perform these tasks. Subsequently, this paper briefly considers the overlapping themes and interconnections that underlie them. Finally, the paper presents preliminary thoughts on prototyping as an initial step towards uncovering the challenges involved in developing these research-capable agents.  ( 2 min )
    Dimensionless Anomaly Detection on Multivariate Streams with Variance Norm and Path Signature. (arXiv:2006.03487v2 [cs.LG] UPDATED)
    In this paper, we propose a dimensionless anomaly detection method for multivariate streams. Our method is independent of the unit of measurement for the different stream channels, therefore dimensionless. We first propose the variance norm, a generalisation of Mahalanobis distance to handle infinite-dimensional feature space and singular empirical covariance matrix rigorously. We then combine the variance norm with the path signature, an infinite collection of iterated integrals that provide global features of streams, to propose SigMahaKNN, a method for anomaly detection on (multivariate) streams. We show that SigMahaKNN is invariant to stream reparametrisation, stream concatenation and has a graded discrimination power depending on the truncation level of the path signature. We implement SigMahaKNN as an open-source software, and perform extensive numerical experiments, showing significantly improved anomaly detection on streams compared to isolation forest and local outlier factors in applications ranging from language analysis, hand-writing analysis, ship movement paths analysis and univariate time-series analysis.  ( 2 min )
    Interpretable Mechanistic Representations for Meal-level Glycemic Control in the Wild. (arXiv:2312.03344v1 [cs.LG])
    Diabetes encompasses a complex landscape of glycemic control that varies widely among individuals. However, current methods do not faithfully capture this variability at the meal level. On the one hand, expert-crafted features lack the flexibility of data-driven methods; on the other hand, learned representations tend to be uninterpretable which hampers clinical adoption. In this paper, we propose a hybrid variational autoencoder to learn interpretable representations of CGM and meal data. Our method grounds the latent space to the inputs of a mechanistic differential equation, producing embeddings that reflect physiological quantities, such as insulin sensitivity, glucose effectiveness, and basal glucose levels. Moreover, we introduce a novel method to infer the glucose appearance rate, making the mechanistic model robust to unreliable meal logs. On a dataset of CGM and self-reported meals from individuals with type-2 diabetes and pre-diabetes, our unsupervised representation discovers a separation between individuals proportional to their disease severity. Our embeddings produce clusters that are up to 4x better than naive, expert, black-box, and pure mechanistic features. Our method provides a nuanced, yet interpretable, embedding space to compare glycemic control within and across individuals, directly learnable from in-the-wild data.  ( 2 min )
    MMM: Generative Masked Motion Model. (arXiv:2312.03596v1 [cs.CV])
    Recent advances in text-to-motion generation using diffusion and autoregressive models have shown promising results. However, these models often suffer from a trade-off between real-time performance, high fidelity, and motion editability. To address this gap, we introduce MMM, a novel yet simple motion generation paradigm based on Masked Motion Model. MMM consists of two key components: (1) a motion tokenizer that transforms 3D human motion into a sequence of discrete tokens in latent space, and (2) a conditional masked motion transformer that learns to predict randomly masked motion tokens, conditioned on the pre-computed text tokens. By attending to motion and text tokens in all directions, MMM explicitly captures inherent dependency among motion tokens and semantic mapping between motion and text tokens. During inference, this allows parallel and iterative decoding of multiple motion tokens that are highly consistent with fine-grained text descriptions, therefore simultaneously achieving high-fidelity and high-speed motion generation. In addition, MMM has innate motion editability. By simply placing mask tokens in the place that needs editing, MMM automatically fills the gaps while guaranteeing smooth transitions between editing and non-editing parts. Extensive experiments on the HumanML3D and KIT-ML datasets demonstrate that MMM surpasses current leading methods in generating high-quality motion (evidenced by superior FID scores of 0.08 and 0.429), while offering advanced editing features such as body-part modification, motion in-betweening, and the synthesis of long motion sequences. In addition, MMM is two orders of magnitude faster on a single mid-range GPU than editable motion diffusion models. Our project page is available at \url{https://exitudio.github.io/MMM-page}.  ( 2 min )
    Algorithm as Experiment: Machine Learning, Market Design, and Policy Eligibility Rules. (arXiv:2104.12909v6 [econ.EM] UPDATED)
    Algorithms make a growing portion of policy and business decisions. We develop a treatment-effect estimator using algorithmic decisions as instruments for a class of stochastic and deterministic algorithms. Our estimator is consistent and asymptotically normal for well-defined causal effects. A special case of our setup is multidimensional regression discontinuity designs with complex boundaries. We apply our estimator to evaluate the Coronavirus Aid, Relief, and Economic Security Act, which allocated many billions of dollars worth of relief funding to hospitals via an algorithmic rule. The funding is shown to have little effect on COVID-19-related hospital activities. Naive estimates exhibit selection bias.  ( 2 min )
    Towards small and accurate convolutional neural networks for acoustic biodiversity monitoring. (arXiv:2312.03666v1 [cs.SD])
    Automated classification of animal sounds is a prerequisite for large-scale monitoring of biodiversity. Convolutional Neural Networks (CNNs) are among the most promising algorithms but they are slow, often achieve poor classification in the field and typically require large training data sets. Our objective was to design CNNs that are fast at inference time and achieve good classification performance while learning from moderate-sized data. Recordings from a rainforest ecosystem were used. Start and end-point of sounds from 20 bird species were manually annotated. Spectrograms from 10 second segments were used as CNN input. We designed simple CNNs with a frequency unwrapping layer (SIMP-FU models) such that any output unit was connected to all spectrogram frequencies but only to a sub-region of time, the Receptive Field (RF). Our models allowed experimentation with different RF durations. Models either used the time-indexed labels that encode start and end-point of sounds or simpler segment-level labels. Models learning from time-indexed labels performed considerably better than their segment-level counterparts. Best classification performances was achieved for models with intermediate RF duration of 1.5 seconds. The best SIMP-FU models achieved AUCs over 0.95 in 18 of 20 classes on the test set. On compact low-cost hardware the best SIMP-FU models evaluated up to seven times faster than real-time data acquisition. RF duration was a major driver of classification performance. The optimum of 1.5 s was in the same range as the duration of the sounds. Our models achieved good classification performance while learning from moderate-sized training data. This is explained by the usage of time-indexed labels during training and adequately sized RF. Results confirm the feasibility of deploying small CNNs with good classification performance on compact low-cost devices.  ( 3 min )
    Exploring Answer Information Methods for Question Generation with Transformers. (arXiv:2312.03483v1 [cs.CL])
    There has been a lot of work in question generation where different methods to provide target answers as input, have been employed. This experimentation has been mostly carried out for RNN based models. We use three different methods and their combinations for incorporating answer information and explore their effect on several automatic evaluation metrics. The methods that are used are answer prompting, using a custom product method using answer embeddings and encoder outputs, choosing sentences from the input paragraph that have answer related information, and using a separate cross-attention attention block in the decoder which attends to the answer. We observe that answer prompting without any additional modes obtains the best scores across rouge, meteor scores. Additionally, we use a custom metric to calculate how many of the generated questions have the same answer, as the answer which is used to generate them.  ( 2 min )
    Systematic Literature Review: Quantum Machine Learning and its applications. (arXiv:2201.04093v2 [quant-ph] UPDATED)
    Quantum computing is the process of performing calculations using quantum mechanics. This field studies the quantum behavior of certain subatomic particles for subsequent use in performing calculations, as well as for large-scale information processing. These capabilities can give quantum computers an advantage in terms of computational time and cost over classical computers. Nowadays, there are scientific challenges that are impossible to perform by classical computation due to computational complexity or the time the calculation would take, and quantum computation is one of the possible answers. However, current quantum devices have not yet the necessary qubits and are not fault-tolerant enough to achieve these goals. Nonetheless, there are other fields like machine learning or chemistry where quantum computation could be useful with current quantum devices. This manuscript aims to present a Systematic Literature Review of the papers published between 2017 and 2023 to identify, analyze and classify the different algorithms used in quantum machine learning and their applications. Consequently, this study identified 94 articles that used quantum machine learning techniques and algorithms. The main types of found algorithms are quantum implementations of classical machine learning algorithms, such as support vector machines or the k-nearest neighbor model, and classical deep learning algorithms, like quantum neural networks. Many articles try to solve problems currently answered by classical machine learning but using quantum devices and algorithms. Even though results are promising, quantum machine learning is far from achieving its full potential. An improvement in the quantum hardware is required since the existing quantum computers lack enough quality, speed, and scale to allow quantum computing to achieve its full potential.  ( 3 min )
    On the Diversity and Realism of Distilled Dataset: An Efficient Dataset Distillation Paradigm. (arXiv:2312.03526v1 [cs.CV])
    Contemporary machine learning requires training large neural networks on massive datasets and thus faces the challenges of high computational demands. Dataset distillation, as a recent emerging strategy, aims to compress real-world datasets for efficient training. However, this line of research currently struggle with large-scale and high-resolution datasets, hindering its practicality and feasibility. To this end, we re-examine the existing dataset distillation methods and identify three properties required for large-scale real-world applications, namely, realism, diversity, and efficiency. As a remedy, we propose RDED, a novel computationally-efficient yet effective data distillation paradigm, to enable both diversity and realism of the distilled data. Extensive empirical results over various neural architectures and datasets demonstrate the advancement of RDED: we can distill the full ImageNet-1K to a small dataset comprising 10 images per class within 7 minutes, achieving a notable 42% top-1 accuracy with ResNet-18 on a single RTX-4090 GPU (while the SOTA only achieves 21% but requires 6 hours).  ( 2 min )
    ReSync: Riemannian Subgradient-based Robust Rotation Synchronization. (arXiv:2305.15136v2 [math.OC] UPDATED)
    This work presents ReSync, a Riemannian subgradient-based algorithm for solving the robust rotation synchronization problem, which arises in various engineering applications. ReSync solves a least-unsquared minimization formulation over the rotation group, which is nonsmooth and nonconvex, and aims at recovering the underlying rotations directly. We provide strong theoretical guarantees for ReSync under the random corruption setting. Specifically, we first show that the initialization procedure of ReSync yields a proper initial point that lies in a local region around the ground-truth rotations. We next establish the weak sharpness property of the aforementioned formulation and then utilize this property to derive the local linear convergence of ReSync to the ground-truth rotations. By combining these guarantees, we conclude that ReSync converges linearly to the ground-truth rotations under appropriate conditions. Experiment results demonstrate the effectiveness of ReSync.  ( 2 min )
    Invariance & Causal Representation Learning: Prospects and Limitations. (arXiv:2312.03580v1 [stat.ML])
    In causal models, a given mechanism is assumed to be invariant to changes of other mechanisms. While this principle has been utilized for inference in settings where the causal variables are observed, theoretical insights when the variables of interest are latent are largely missing. We assay the connection between invariance and causal representation learning by establishing impossibility results which show that invariance alone is insufficient to identify latent causal variables. Together with practical considerations, we use these theoretical findings to highlight the need for additional constraints in order to identify representations by exploiting invariance.
    Causal Estimation of Exposure Shifts with Neural Networks: Evaluating the Health Benefits of Stricter Air Quality Standards in the US. (arXiv:2302.02560v3 [cs.LG] UPDATED)
    In policy research, one of the most critical analytic tasks is to estimate the causal effect of a policy-relevant shift to the distribution of a continuous exposure/treatment on an outcome of interest. We call this problem shift-response function (SRF) estimation. Existing neural network methods involving robust causal-effect estimators lack theoretical guarantees and practical implementations for SRF estimation. Motivated by a key policy-relevant question in public health, we develop a neural network method and its theoretical underpinnings to estimate SRFs with robustness and efficiency guarantees. We then apply our method to data consisting of 68 million individuals and 27 million deaths across the U.S. to estimate the causal effect from revising the US National Ambient Air Quality Standards (NAAQS) for PM 2.5 from 12 $\mu g/m^3$ to 9 $\mu g/m^3$. This change has been recently proposed by the US Environmental Protection Agency (EPA). Our goal is to estimate, for the first time, the reduction in deaths that would result from this anticipated revision using causal methods for SRFs. Our proposed method, called {T}argeted {R}egularization for {E}xposure {S}hifts with Neural {Net}works (TRESNET), contributes to the neural network literature for causal inference in two ways: first, it proposes a targeted regularization loss with theoretical properties that ensure double robustness and achieves asymptotic efficiency specific for SRF estimation; second, it enables loss functions from the exponential family of distributions to accommodate non-continuous outcome distributions (such as hospitalization or mortality counts). We complement our application with benchmark experiments that demonstrate TRESNET's broad applicability and competitiveness.
    The Landscape of Modern Machine Learning: A Review of Machine, Distributed and Federated Learning. (arXiv:2312.03120v1 [cs.LG])
    With the advance of the powerful heterogeneous, parallel and distributed computing systems and ever increasing immense amount of data, machine learning has become an indispensable part of cutting-edge technology, scientific research and consumer products. In this study, we present a review of modern machine and deep learning. We provide a high-level overview for the latest advanced machine learning algorithms, applications, and frameworks. Our discussion encompasses parallel distributed learning, deep learning as well as federated learning. As a result, our work serves as an introductory text to the vast field of modern machine learning.
    Using Curiosity for an Even Representation of Tasks in Continual Offline Reinforcement Learning. (arXiv:2312.03177v1 [cs.LG])
    In this work, we investigate the means of using curiosity on replay buffers to improve offline multi-task continual reinforcement learning when tasks, which are defined by the non-stationarity in the environment, are non labeled and not evenly exposed to the learner in time. In particular, we investigate the use of curiosity both as a tool for task boundary detection and as a priority metric when it comes to retaining old transition tuples, which we respectively use to propose two different buffers. Firstly, we propose a Hybrid Reservoir Buffer with Task Separation (HRBTS), where curiosity is used to detect task boundaries that are not known due to the task agnostic nature of the problem. Secondly, by using curiosity as a priority metric when it comes to retaining old transition tuples, a Hybrid Curious Buffer (HCB) is proposed. We ultimately show that these buffers, in conjunction with regular reinforcement learning algorithms, can be used to alleviate the catastrophic forgetting issue suffered by the state of the art on replay buffers when the agent's exposure to tasks is not equal along time. We evaluate catastrophic forgetting and the efficiency of our proposed buffers against the latest works such as the Hybrid Reservoir Buffer (HRB) and the Multi-Time Scale Replay Buffer (MTR) in three different continual reinforcement learning settings. Experiments were done on classical control tasks and Metaworld environment. Experiments show that our proposed replay buffers display better immunity to catastrophic forgetting compared to existing works in most of the settings.
    CaloQVAE : Simulating high-energy particle-calorimeter interactions using hybrid quantum-classical generative models. (arXiv:2312.03179v1 [hep-ex])
    The Large Hadron Collider's high luminosity era presents major computational challenges in the analysis of collision events. Large amounts of Monte Carlo (MC) simulation will be required to constrain the statistical uncertainties of the simulated datasets below these of the experimental data. Modelling of high-energy particles propagating through the calorimeter section of the detector is the most computationally intensive MC simulation task. We introduce a technique combining recent advancements in generative models and quantum annealing for fast and efficient simulation of high-energy particle-calorimeter interactions.
    Analysis and mining of low-carbon and energy-saving tourism data characteristics based on machine learning algorithm. (arXiv:2312.03037v1 [cs.LG])
    In order to study the formation mechanism of residents' low-carbon awareness and provide an important basis for traffic managers to guide urban residents to choose low-carbon travel mode, this paper proposes a low-carbon energy-saving travel data feature analysis and mining based on machine learning algorithm. This paper uses data mining technology to analyze the data of low-carbon travel questionnaire, and regards the 15-dimensional problem under the framework of planned behavior theory as the internal cause variable that characterizes residents' low-carbon travel willingness. The author uses K-means clustering algorithm to classify the intensity of residents' low-carbon travel willingness, and applies the results as the explanatory variables to the random forest model to explore the mechanism of residents' social attribute characteristics, travel characteristics, etc. on their low-carbon travel willingness. The experimental results show that based on the Silhouette index test and t-SNE dimensionality reduction, residents' low-carbon travel willingness can be divided into three categories: strong, neutral, and not strong; Based on the importance index, the four most significant factors are the occupation, residence, family composition and commuting time of residents. Conclusion: This method provides policy recommendations for the development and management of urban traffic low-carbon from multiple perspectives.  ( 2 min )
    Precision of Individual Shapley Value Explanations. (arXiv:2312.03485v1 [stat.ML])
    Shapley values are extensively used in explainable artificial intelligence (XAI) as a framework to explain predictions made by complex machine learning (ML) models. In this work, we focus on conditional Shapley values for predictive models fitted to tabular data and explain the prediction $f(\boldsymbol{x}^{*})$ for a single observation $\boldsymbol{x}^{*}$ at the time. Numerous Shapley value estimation methods have been proposed and empirically compared on an average basis in the XAI literature. However, less focus has been devoted to analyzing the precision of the Shapley value explanations on an individual basis. We extend our work in Olsen et al. (2023) by demonstrating and discussing that the explanations are systematically less precise for observations on the outer region of the training data distribution for all used estimation methods. This is expected from a statistical point of view, but to the best of our knowledge, it has not been systematically addressed in the Shapley value literature. This is crucial knowledge for Shapley values practitioners, who should be more careful in applying these observations' corresponding Shapley value explanations.
    AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback. (arXiv:2305.14387v3 [cs.LG] UPDATED)
    Large language models (LLMs) such as ChatGPT have seen widespread adoption due to their ability to follow user instructions well. Developing these LLMs involves a complex yet poorly understood workflow requiring training with human feedback. Replicating and understanding this instruction-following process faces three major challenges: the high cost of data collection, the lack of trustworthy evaluation, and the absence of reference method implementations. We address these challenges with AlpacaFarm, a simulator that enables research and development for learning from feedback at a low cost. First, we design LLM prompts to simulate human feedback that are 45x cheaper than crowdworkers and display high agreement with humans. Second, we propose an automatic evaluation and validate it against human instructions obtained on real-world interactions. Third, we contribute reference implementations for several methods (PPO, DPO, best-of-n, expert iteration, and more) that learn from pairwise feedback. Finally, as an end-to-end validation of AlpacaFarm, we train and evaluate eleven models on 10k pairs of real human feedback and show that rankings of models trained in AlpacaFarm match rankings of models trained on human data. As a demonstration of the research possible in AlpacaFarm, we find that methods that use a reward model can substantially improve over supervised fine-tuning and that our reference PPO implementation leads to a +10% improvement in win-rate against Davinci003. We release all components of AlpacaFarm at https://github.com/tatsu-lab/alpaca_farm.  ( 3 min )
    Memory Triggers: Unveiling Memorization in Text-To-Image Generative Models through Word-Level Duplication. (arXiv:2312.03692v1 [cs.CR])
    Diffusion-based models, such as the Stable Diffusion model, have revolutionized text-to-image synthesis with their ability to produce high-quality, high-resolution images. These advancements have prompted significant progress in image generation and editing tasks. However, these models also raise concerns due to their tendency to memorize and potentially replicate exact training samples, posing privacy risks and enabling adversarial attacks. Duplication in training datasets is recognized as a major factor contributing to memorization, and various forms of memorization have been studied so far. This paper focuses on two distinct and underexplored types of duplication that lead to replication during inference in diffusion-based models, particularly in the Stable Diffusion model. We delve into these lesser-studied duplication phenomena and their implications through two case studies, aiming to contribute to the safer and more responsible use of generative models in various applications.
    Physical Symbolic Optimization. (arXiv:2312.03612v1 [cs.LG])
    We present a framework for constraining the automatic sequential generation of equations to obey the rules of dimensional analysis by construction. Combining this approach with reinforcement learning, we built $\Phi$-SO, a Physical Symbolic Optimization method for recovering analytical functions from physical data leveraging units constraints. Our symbolic regression algorithm achieves state-of-the-art results in contexts in which variables and constants have known physical units, outperforming all other methods on SRBench's Feynman benchmark in the presence of noise (exceeding 0.1%) and showing resilience even in the presence of significant (10%) levels of noise.
    Efficient Inverse Design Optimization through Multi-fidelity Simulations, Machine Learning, and Search Space Reduction Strategies. (arXiv:2312.03654v1 [cs.CE])
    This paper introduces a methodology designed to augment the inverse design optimization process in scenarios constrained by limited compute, through the strategic synergy of multi-fidelity evaluations, machine learning models, and optimization algorithms. The proposed methodology is analyzed on two distinct engineering inverse design problems: airfoil inverse design and the scalar field reconstruction problem. It leverages a machine learning model trained with low-fidelity simulation data, in each optimization cycle, thereby proficiently predicting a target variable and discerning whether a high-fidelity simulation is necessitated, which notably conserves computational resources. Additionally, the machine learning model is strategically deployed prior to optimization to reduce the search space, thereby further accelerating convergence toward the optimal solution. The methodology has been employed to enhance two optimization algorithms, namely Differential Evolution and Particle Swarm Optimization. Comparative analyses illustrate performance improvements across both algorithms. Notably, this method is adeptly adaptable across any inverse design application, facilitating a harmonious synergy between a representative low-fidelity machine learning model, and high-fidelity simulation, and can be seamlessly applied across any variety of population-based optimization algorithms.
    Exploring the flavor structure of quarks and leptons with reinforcement learning. (arXiv:2304.14176v2 [hep-ph] UPDATED)
    We propose a method to explore the flavor structure of quarks and leptons with reinforcement learning. As a concrete model, we utilize a basic value-based algorithm for models with $U(1)$ flavor symmetry. By training neural networks on the $U(1)$ charges of quarks and leptons, the agent finds 21 models to be consistent with experimentally measured masses and mixing angles of quarks and leptons. In particular, an intrinsic value of normal ordering tends to be larger than that of inverted ordering, and the normal ordering is well fitted with the current experimental data in contrast to the inverted ordering. A specific value of effective mass for the neutrinoless double beta decay and a sizable leptonic CP violation induced by an angular component of flavon field are predicted by autonomous behavior of the agent. Our finding results indicate that the reinforcement learning can be a new method for understanding the flavor structure.
    Kandinsky 3.0 Technical Report. (arXiv:2312.03511v1 [cs.CV])
    We present Kandinsky 3.0, a large-scale text-to-image generation model based on latent diffusion, continuing the series of text-to-image Kandinsky models and reflecting our progress to achieve higher quality and realism of image generation. Compared to previous versions of Kandinsky 2.x, Kandinsky 3.0 leverages a two times larger U-Net backbone, a ten times larger text encoder and removes diffusion mapping. We describe the architecture of the model, the data collection procedure, the training technique, and the production system of user interaction. We focus on the key components that, as we have identified as a result of a large number of experiments, had the most significant impact on improving the quality of our model compared to the others. By our side-by-side comparisons, Kandinsky becomes better in text understanding and works better on specific domains. Project page: https://ai-forever.github.io/Kandinsky-3
    STEP CATFormer: Spatial-Temporal Effective Body-Part Cross Attention Transformer for Skeleton-based Action Recognition. (arXiv:2312.03288v1 [cs.CV])
    Graph convolutional networks (GCNs) have been widely used and achieved remarkable results in skeleton-based action recognition. We think the key to skeleton-based action recognition is a skeleton hanging in frames, so we focus on how the Graph Convolutional Convolution networks learn different topologies and effectively aggregate joint features in the global temporal and local temporal. In this work, we propose three Channel-wise Tolopogy Graph Convolution based on Channel-wise Topology Refinement Graph Convolution (CTR-GCN). Combining CTR-GCN with two joint cross-attention modules can capture the upper-lower body part and hand-foot relationship skeleton features. After that, to capture features of human skeletons changing in frames we design the Temporal Attention Transformers to extract skeletons effectively. The Temporal Attention Transformers can learn the temporal features of human skeleton sequences. Finally, we fuse the temporal features output scale with MLP and classification. We develop a powerful graph convolutional network named Spatial Temporal Effective Body-part Cross Attention Transformer which notably high-performance on the NTU RGB+D, NTU RGB+D 120 datasets. Our code and models are available at https://github.com/maclong01/STEP-CATFormer
    Multicoated and Folded Graph Neural Networks with Strong Lottery Tickets. (arXiv:2312.03236v1 [cs.LG])
    The Strong Lottery Ticket Hypothesis (SLTH) demonstrates the existence of high-performing subnetworks within a randomly initialized model, discoverable through pruning a convolutional neural network (CNN) without any weight training. A recent study, called Untrained GNNs Tickets (UGT), expanded SLTH from CNNs to shallow graph neural networks (GNNs). However, discrepancies persist when comparing baseline models with learned dense weights. Additionally, there remains an unexplored area in applying SLTH to deeper GNNs, which, despite delivering improved accuracy with additional layers, suffer from excessive memory requirements. To address these challenges, this work utilizes Multicoated Supermasks (M-Sup), a scalar pruning mask method, and implements it in GNNs by proposing a strategy for setting its pruning thresholds adaptively. In the context of deep GNNs, this research uncovers the existence of untrained recurrent networks, which exhibit performance on par with their trained feed-forward counterparts. This paper also introduces the Multi-Stage Folding and Unshared Masks methods to expand the search space in terms of both architecture and parameters. Through the evaluation of various datasets, including the Open Graph Benchmark (OGB), this work establishes a triple-win scenario for SLTH-based GNNs: by achieving high sparsity, competitive performance, and high memory efficiency with up to 98.7\% reduction, it demonstrates suitability for energy-efficient graph processing.
    Inverse Design of Vitrimeric Polymers by Molecular Dynamics and Generative Modeling. (arXiv:2312.03690v1 [cond-mat.mtrl-sci])
    Vitrimer is a new class of sustainable polymers with the ability of self-healing through rearrangement of dynamic covalent adaptive networks. However, a limited choice of constituent molecules restricts their property space, prohibiting full realization of their potential applications. Through a combination of molecular dynamics (MD) simulations and machine learning (ML), particularly a novel graph variational autoencoder (VAE) model, we establish a method for generating novel vitrimers and guide their inverse design based on desired glass transition temperature (Tg). We build the first vitrimer dataset of one million and calculate Tg on 8,424 of them by high-throughput MD simulations calibrated by a Gaussian process model. The proposed VAE employs dual graph encoders and a latent dimension overlapping scheme which allows for individual representation of multi-component vitrimers. By constructing a continuous latent space containing necessary information of vitrimers, we demonstrate high accuracy and efficiency of our framework in discovering novel vitrimers with desirable Tg beyond the training regime. The proposed vitrimers with reasonable synthesizability cover a wide range of Tg and broaden the potential widespread usage of vitrimeric materials.
    InfoBot: Transfer and Exploration via the Information Bottleneck. (arXiv:1901.10902v5 [stat.ML] UPDATED)
    A central challenge in reinforcement learning is discovering effective policies for tasks where rewards are sparsely distributed. We postulate that in the absence of useful reward signals, an effective exploration strategy should seek out {\it decision states}. These states lie at critical junctions in the state space from where the agent can transition to new, potentially unexplored regions. We propose to learn about decision states from prior experience. By training a goal-conditioned policy with an information bottleneck, we can identify decision states by examining where the model actually leverages the goal state. We find that this simple mechanism effectively identifies decision states, even in partially observed settings. In effect, the model learns the sensory cues that correlate with potential subgoals. In new environments, this model can then identify novel subgoals for further exploration, guiding the agent through a sequence of potential decision states and through new regions of the state space.
    Convolutional layers are equivariant to discrete shifts but not continuous translations. (arXiv:2206.04979v3 [cs.CV] UPDATED)
    The purpose of this short and simple note is to clarify a common misconception about convolutional neural networks (CNNs). CNNs are made up of convolutional layers which are shift equivariant due to weight sharing. However, convolutional layers are not translation equivariant, even when boundary effects are ignored and when pooling and subsampling are absent. This is because shift equivariance is a discrete symmetry while translation equivariance is a continuous symmetry. This fact is well known among researchers in equivariant machine learning, but is usually overlooked among non-experts. To minimize confusion, we suggest using the term `shift equivariance' to refer to discrete shifts in pixels and `translation equivariance' to refer to continuous translations.
    Adaptive spectral graph wavelets for collaborative filtering. (arXiv:2312.03167v1 [cs.IR])
    Collaborative filtering is a popular approach in recommender systems, whose objective is to provide personalized item suggestions to potential users based on their purchase or browsing history. However, personalized recommendations require considerable amount of behavioral data on users, which is usually unavailable for new users, giving rise to the cold-start problem. To help alleviate this challenging problem, we introduce a spectral graph wavelet collaborative filtering framework for implicit feedback data, where users, items and their interactions are represented as a bipartite graph. Specifically, we first propose an adaptive transfer function by leveraging a power transform with the goal of stabilizing the variance of graph frequencies in the spectral domain. Then, we design a deep recommendation model for efficient learning of low-dimensional embeddings of users and items using spectral graph wavelets in an end-to-end fashion. In addition to capturing the graph's local and global structures, our approach yields localization of graph signals in both spatial and spectral domains, and hence not only learns discriminative representations of users and items, but also promotes the recommendation quality. The effectiveness of our proposed model is demonstrated through extensive experiments on real-world benchmark datasets, achieving better recommendation performance compared with strong baseline methods.
    Learning Multi-graph Structure for Temporal Knowledge Graph Reasoning. (arXiv:2312.03004v1 [cs.AI])
    Temporal Knowledge Graph (TKG) reasoning that forecasts future events based on historical snapshots distributed over timestamps is denoted as extrapolation and has gained significant attention. Owing to its extreme versatility and variation in spatial and temporal correlations, TKG reasoning presents a challenging task, demanding efficient capture of concurrent structures and evolutional interactions among facts. While existing methods have made strides in this direction, they still fall short of harnessing the diverse forms of intrinsic expressive semantics of TKGs, which encompass entity correlations across multiple timestamps and periodicity of temporal information. This limitation constrains their ability to thoroughly reflect historical dependencies and future trends. In response to these drawbacks, this paper proposes an innovative reasoning approach that focuses on Learning Multi-graph Structure (LMS). Concretely, it comprises three distinct modules concentrating on multiple aspects of graph structure knowledge within TKGs, including concurrent and evolutional patterns along timestamps, query-specific correlations across timestamps, and semantic dependencies of timestamps, which capture TKG features from various perspectives. Besides, LMS incorporates an adaptive gate for merging entity representations both along and across timestamps effectively. Moreover, it integrates timestamp semantics into graph attention calculations and time-aware decoders, in order to impose temporal constraints on events and narrow down prediction scopes with historical statistics. Extensive experimental results on five event-based benchmark datasets demonstrate that LMS outperforms state-of-the-art extrapolation models, indicating the superiority of modeling a multi-graph perspective for TKG reasoning.
    Multimodal Data and Resource Efficient Device-Directed Speech Detection with Large Foundation Models. (arXiv:2312.03632v1 [cs.SD])
    Interactions with virtual assistants typically start with a trigger phrase followed by a command. In this work, we explore the possibility of making these interactions more natural by eliminating the need for a trigger phrase. Our goal is to determine whether a user addressed the virtual assistant based on signals obtained from the streaming audio recorded by the device microphone. We address this task by combining 1-best hypotheses and decoder signals from an automatic speech recognition system with acoustic representations from an audio encoder as input features to a large language model (LLM). In particular, we are interested in data and resource efficient systems that require only a small amount of training data and can operate in scenarios with only a single frozen LLM available on a device. For this reason, our model is trained on 80k or less examples of multimodal data using a combination of low-rank adaptation and prefix tuning. We compare the proposed system to unimodal baselines and show that the multimodal approach achieves lower equal-error-rates (EERs), while using only a fraction of the training data. We also show that low-dimensional specialized audio representations lead to lower EERs than high-dimensional general audio representations.  ( 2 min )
    A Simple and Scalable Graph Neural Network for Large Directed Graphs. (arXiv:2306.08274v2 [cs.LG] UPDATED)
    Node classification is one of the hottest tasks in graph analysis. Though existing studies have explored various node representations in directed and undirected graphs, they have overlooked the distinctions of their capabilities to capture the information of graphs. To tackle the limitation, we investigate various combinations of node representations (aggregated features vs. adjacency lists) and edge direction awareness within an input graph (directed vs. undirected). We address the first empirical study to benchmark the performance of various GNNs that use either combination of node representations and edge direction awareness. Our experiments demonstrate that no single combination stably achieves state-of-the-art results across datasets, which indicates that we need to select appropriate combinations depending on the dataset characteristics. In response, we propose a simple yet holistic classification method A2DUG which leverages all combinations of node representations in directed and undirected graphs. We demonstrate that A2DUG stably performs well on various datasets and improves the accuracy up to 11.29 compared with the state-of-the-art methods. To spur the development of new methods, we publicly release our complete codebase under the MIT license.
    Continual Learning for Instruction Following from Realtime Feedback. (arXiv:2212.09710v2 [cs.CL] UPDATED)
    We propose and deploy an approach to continually train an instruction-following agent from feedback provided by users during collaborative interactions. During interaction, human users instruct an agent using natural language, and provide realtime binary feedback as they observe the agent following their instructions. We design a contextual bandit learning approach, converting user feedback to immediate reward. We evaluate through thousands of human-agent interactions, demonstrating 15.4% absolute improvement in instruction execution accuracy over time. We also show our approach is robust to several design variations, and that the feedback signal is roughly equivalent to the learning signal of supervised demonstration data.
    Functional Flow Matching. (arXiv:2305.17209v2 [cs.LG] UPDATED)
    We propose Functional Flow Matching (FFM), a function-space generative model that generalizes the recently-introduced Flow Matching model to operate in infinite-dimensional spaces. Our approach works by first defining a path of probability measures that interpolates between a fixed Gaussian measure and the data distribution, followed by learning a vector field on the underlying space of functions that generates this path of measures. Our method does not rely on likelihoods or simulations, making it well-suited to the function space setting. We provide both a theoretical framework for building such models and an empirical evaluation of our techniques. We demonstrate through experiments on several real-world benchmarks that our proposed FFM method outperforms several recently proposed function-space generative models.
    $\textbf{A}^2\textbf{CiD}^2$: Accelerating Asynchronous Communication in Decentralized Deep Learning. (arXiv:2306.08289v2 [cs.LG] CROSS LISTED)
    Distributed training of Deep Learning models has been critical to many recent successes in the field. Current standard methods primarily rely on synchronous centralized algorithms which induce major communication bottlenecks and synchronization locks at scale. Decentralized asynchronous algorithms are emerging as a potential alternative but their practical applicability still lags. In order to mitigate the increase in communication cost that naturally comes with scaling the number of workers, we introduce a principled asynchronous, randomized, gossip-based optimization algorithm which works thanks to a continuous local momentum named $\textbf{A}^2\textbf{CiD}^2$. Our method allows each worker to continuously process mini-batches without stopping, and run a peer-to-peer averaging routine in parallel, reducing idle time. In addition to inducing a significant communication acceleration at no cost other than adding a local momentum variable, minimal adaptation is required to incorporate $\textbf{A}^2\textbf{CiD}^2$ to standard asynchronous approaches. Our theoretical analysis proves accelerated rates compared to previous asynchronous decentralized baselines and we empirically show that using our $\textbf{A}^2\textbf{CiD}^2$ momentum significantly decrease communication costs in poorly connected networks. In particular, we show consistent improvement on the ImageNet dataset using up to 64 asynchronous workers (A100 GPUs) and various communication network topologies.
    A unified framework for information-theoretic generalization bounds. (arXiv:2305.11042v2 [cs.LG] UPDATED)
    This paper presents a general methodology for deriving information-theoretic generalization bounds for learning algorithms. The main technical tool is a probabilistic decorrelation lemma based on a change of measure and a relaxation of Young's inequality in $L_{\psi_p}$ Orlicz spaces. Using the decorrelation lemma in combination with other techniques, such as symmetrization, couplings, and chaining in the space of probability measures, we obtain new upper bounds on the generalization error, both in expectation and in high probability, and recover as special cases many of the existing generalization bounds, including the ones based on mutual information, conditional mutual information, stochastic chaining, and PAC-Bayes inequalities. In addition, the Fernique-Talagrand upper bound on the expected supremum of a subgaussian process emerges as a special case.
    Towards Ordinal Data Science. (arXiv:2307.09477v2 [cs.AI] UPDATED)
    Order is one of the main instruments to measure the relationship between objects in (empirical) data. However, compared to methods that use numerical properties of objects, the amount of ordinal methods developed is rather small. One reason for this is the limited availability of computational resources in the last century that would have been required for ordinal computations. Another reason -- particularly important for this line of research -- is that order-based methods are often seen as too mathematically rigorous for applying them to real-world data. In this paper, we will therefore discuss different means for measuring and 'calculating' with ordinal structures -- a specific class of directed graphs -- and show how to infer knowledge from them. Our aim is to establish Ordinal Data Science as a fundamentally new research agenda. Besides cross-fertilization with other cornerstone machine learning and knowledge representation methods, a broad range of disciplines will benefit from this endeavor, including, psychology, sociology, economics, web science, knowledge engineering, scientometrics.
    D-Bot: Database Diagnosis System using Large Language Models. (arXiv:2312.01454v2 [cs.DB] UPDATED)
    Database administrators (DBAs) play an important role in managing, maintaining and optimizing database systems. However, it is hard and tedious for DBAs to manage a large number of databases and give timely response (waiting for hours is intolerable in many online cases). In addition, existing empirical methods only support limited diagnosis scenarios, which are also labor-intensive to update the diagnosis rules for database version updates. Recently large language models (LLMs) have shown great potential in various fields. Thus, we propose D-Bot, an LLM-based database diagnosis system that can automatically acquire knowledge from diagnosis documents, and generate reasonable and well-founded diagnosis report (i.e., identifying the root causes and solutions) within acceptable time (e.g., under 10 minutes compared to hours by a DBA). The techniques in D-Bot include (i) offline knowledge extraction from documents, (ii) automatic prompt generation (e.g., knowledge matching, tool retrieval), (iii) root cause analysis using tree search algorithm, and (iv) collaborative mechanism for complex anomalies with multiple root causes. We verify D-Bot on real benchmarks (including 539 anomalies of six typical applications), and the results show that D-Bot can effectively analyze the root causes of unseen anomalies and significantly outperforms traditional methods and vanilla models like GPT-4.
    Neural parameter calibration and uncertainty quantification for epidemic forecasting. (arXiv:2312.03147v1 [cs.LG])
    The recent COVID-19 pandemic has thrown the importance of accurately forecasting contagion dynamics and learning infection parameters into sharp focus. At the same time, effective policy-making requires knowledge of the uncertainty on such predictions, in order, for instance, to be able to ready hospitals and intensive care units for a worst-case scenario without needlessly wasting resources. In this work, we apply a novel and powerful computational method to the problem of learning probability densities on contagion parameters and providing uncertainty quantification for pandemic projections. Using a neural network, we calibrate an ODE model to data of the spread of COVID-19 in Berlin in 2020, achieving both a significantly more accurate calibration and prediction than Markov-Chain Monte Carlo (MCMC)-based sampling schemes. The uncertainties on our predictions provide meaningful confidence intervals e.g. on infection figures and hospitalisation rates, while training and running the neural scheme takes minutes where MCMC takes hours. We show convergence of our method to the true posterior on a simplified SIR model of epidemics, and also demonstrate our method's learning capabilities on a reduced dataset, where a complex model is learned from a small number of compartments for which data is available.
    In-Context Learning for Text Classification with Many Labels. (arXiv:2309.10954v2 [cs.CL] UPDATED)
    In-context learning (ICL) using large language models for tasks with many labels is challenging due to the limited context window, which makes it difficult to fit a sufficient number of examples in the prompt. In this paper, we use a pre-trained dense retrieval model to bypass this limitation, giving the model only a partial view of the full label space for each inference call. Testing with recent open-source LLMs (OPT, LLaMA), we set new state of the art performance in few-shot settings for three common intent classification datasets, with no finetuning. We also surpass fine-tuned performance on fine-grained sentiment classification in certain cases. We analyze the performance across number of in-context examples and different model scales, showing that larger models are necessary to effectively and consistently make use of larger context lengths for ICL. By running several ablations, we analyze the model's use of: a) the similarity of the in-context examples to the current input, b) the semantic content of the class names, and c) the correct correspondence between examples and labels. We demonstrate that all three are needed to varying degrees depending on the domain, contrary to certain recent works.
    Continual Driving Policy Optimization with Closed-Loop Individualized Curricula. (arXiv:2309.14209v2 [cs.LG] UPDATED)
    The safety of autonomous vehicles (AV) has been a long-standing top concern, stemming from the absence of rare and safety-critical scenarios in the long-tail naturalistic driving distribution. To tackle this challenge, a surge of research in scenario-based autonomous driving has emerged, with a focus on generating high-risk driving scenarios and applying them to conduct safety-critical testing of AV models. However, limited work has been explored on the reuse of these extensive scenarios to iteratively improve AV models. Moreover, it remains intractable and challenging to filter through gigantic scenario libraries collected from other AV models with distinct behaviors, attempting to extract transferable information for current AV improvement. Therefore, we develop a continual driving policy optimization framework featuring Closed-Loop Individualized Curricula (CLIC), which we factorize into a set of standardized sub-modules for flexible implementation choices: AV Evaluation, Scenario Selection, and AV Training. CLIC frames AV Evaluation as a collision prediction task, where it estimates the chance of AV failures in these scenarios at each iteration. Subsequently, by re-sampling from historical scenarios based on these failure probabilities, CLIC tailors individualized curricula for downstream training, aligning them with the evaluated capability of AV. Accordingly, CLIC not only maximizes the utilization of the vast pre-collected scenario library for closed-loop driving policy optimization but also facilitates AV improvement by individualizing its training with more challenging cases out of those poorly organized scenarios. Experimental results clearly indicate that CLIC surpasses other curriculum-based training strategies, showing substantial improvement in managing risky scenarios, while still maintaining proficiency in handling simpler cases.
    A Hyperparameter Study for Quantum Kernel Methods. (arXiv:2310.11891v2 [quant-ph] UPDATED)
    Quantum kernel methods are a promising method in quantum machine learning thanks to the guarantees connected to them. Their accessibility for analytic considerations also opens up the possibility of prescreening datasets based on their potential for a quantum advantage. To do so, earlier works developed the geometric difference, which can be understood as a closeness measure between two kernel-based machine learning approaches, most importantly between a quantum kernel and classical kernel. This metric links the quantum and classical model complexities. Therefore, it raises the question of whether the geometric difference, based on its relation to model complexity, can be a useful tool in evaluations other than for the potential for quantum advantage. In this work, we investigate the effects of hyperparameter choice on the model performance and the generalization gap between classical and quantum kernels. The importance of hyperparameter optimization is well known also for classical machine learning. Especially for the quantum Hamiltonian evolution feature map, the scaling of the input data has been shown to be crucial. However, there are additional parameters left to be optimized, like the best number of qubits to trace out before computing a projected quantum kernel. We investigate the influence of these hyperparameters and compare the classically reliable method of cross validation with the method of choosing based on the geometric difference. Based on the thorough investigation of the hyperparameters across 11 datasets we identified commodities that can be exploited when examining a new dataset. In addition, our findings contribute to better understanding of the applicability of the geometric difference.
    T-Cal: An optimal test for the calibration of predictive models. (arXiv:2203.01850v4 [stat.ML] UPDATED)
    The prediction accuracy of machine learning methods is steadily increasing, but the calibration of their uncertainty predictions poses a significant challenge. Numerous works focus on obtaining well-calibrated predictive models, but less is known about reliably assessing model calibration. This limits our ability to know when algorithms for improving calibration have a real effect, and when their improvements are merely artifacts due to random noise in finite datasets. In this work, we consider detecting mis-calibration of predictive models using a finite validation dataset as a hypothesis testing problem. The null hypothesis is that the predictive model is calibrated, while the alternative hypothesis is that the deviation from calibration is sufficiently large. We find that detecting mis-calibration is only possible when the conditional probabilities of the classes are sufficiently smooth functions of the predictions. When the conditional class probabilities are H\"older continuous, we propose T-Cal, a minimax optimal test for calibration based on a debiased plug-in estimator of the $\ell_2$-Expected Calibration Error (ECE). We further propose Adaptive T-Cal, a version that is adaptive to unknown smoothness. We verify our theoretical findings with a broad range of experiments, including with several popular deep neural net architectures and several standard post-hoc calibration methods. T-Cal is a practical general-purpose tool, which -- combined with classical tests for discrete-valued predictors -- can be used to test the calibration of virtually any probabilistic classification method.
    TraSE: Towards Tackling Authorial Style from a Cognitive Science Perspective. (arXiv:2206.10706v2 [cs.CL] UPDATED)
    Stylistic analysis of text is a key task in research areas ranging from authorship attribution to forensic analysis and personality profiling. The existing approaches for stylistic analysis are plagued by issues like topic influence, lack of discriminability for large number of authors and the requirement for large amounts of diverse data. In this paper, the source of these issues are identified along with the necessity for a cognitive perspective on authorial style in addressing them. A novel feature representation, called Trajectory-based Style Estimation (TraSE), is introduced to support this purpose. Authorship attribution experiments with over 27,000 authors and 1.4 million samples in a cross-domain scenario resulted in 90% attribution accuracy suggesting that the feature representation is immune to such negative influences and an excellent candidate for stylistic analysis. Finally, a qualitative analysis is performed on TraSE using physical human characteristics, like age, to validate its claim on capturing cognitive traits.
    Quantifying Spatial Under-reporting Disparities in Resident Crowdsourcing. (arXiv:2204.08620v4 [stat.AP] UPDATED)
    Modern city governance relies heavily on crowdsourcing to identify problems such as downed trees and power lines. A major concern is that residents do not report problems at the same rates, with heterogeneous reporting delays directly translating to downstream disparities in how quickly incidents can be addressed. Here we develop a method to identify reporting delays without using external ground-truth data. Our insight is that the rates at which duplicate reports are made about the same incident can be leveraged to disambiguate whether an incident has occurred by investigating its reporting rate once it has occurred. We apply our method to over 100,000 resident reports made in New York City and to over 900,000 reports made in Chicago, finding that there are substantial spatial and socioeconomic disparities in how quickly incidents are reported. We further validate our methods using external data and demonstrate how estimating reporting delays leads to practical insights and interventions for a more equitable, efficient government service.
    Zipformer: A faster and better encoder for automatic speech recognition. (arXiv:2310.11230v2 [eess.AS] CROSS LISTED)
    The Conformer has become the most popular encoder model for automatic speech recognition (ASR). It adds convolution modules to a transformer to learn both local and global dependencies. In this work we describe a faster, more memory-efficient, and better-performing transformer, called Zipformer. Modeling changes include: 1) a U-Net-like encoder structure where middle stacks operate at lower frame rates; 2) reorganized block structure with more modules, within which we re-use attention weights for efficiency; 3) a modified form of LayerNorm called BiasNorm allows us to retain some length information; 4) new activation functions SwooshR and SwooshL work better than Swish. We also propose a new optimizer, called ScaledAdam, which scales the update by each tensor's current scale to keep the relative change about the same, and also explictly learns the parameter scale. It achieves faster convergence and better performance than Adam. Extensive experiments on LibriSpeech, Aishell-1, and WenetSpeech datasets demonstrate the effectiveness of our proposed Zipformer over other state-of-the-art ASR models. Our code is publicly available at https://github.com/k2-fsa/icefall.  ( 2 min )
    Computer Vision for Increased Operative Efficiency via Identification of Instruments in the Neurosurgical Operating Room: A Proof-of-Concept Study. (arXiv:2312.03001v1 [eess.IV])
    Objectives Computer vision (CV) is a field of artificial intelligence that enables machines to interpret and understand images and videos. CV has the potential to be of assistance in the operating room (OR) to track surgical instruments. We built a CV algorithm for identifying surgical instruments in the neurosurgical operating room as a potential solution for surgical instrument tracking and management to decrease surgical waste and opening of unnecessary tools. Methods We collected 1660 images of 27 commonly used neurosurgical instruments. Images were labeled using the VGG Image Annotator and split into 80% training and 20% testing sets in order to train a U-Net Convolutional Neural Network using 5-fold cross validation. Results Our U-Net achieved a tool identification accuracy of 80-100% when distinguishing 25 classes of instruments, with 19/25 classes having accuracy over 90%. The model performance was not adequate for sub classifying Adson, Gerald, and Debakey forceps, which had accuracies of 60-80%. Conclusions We demonstrated the viability of using machine learning to accurately identify surgical instruments. Instrument identification could help optimize surgical tray packing, decrease tool usage and waste, decrease incidence of instrument misplacement events, and assist in timing of routine instrument maintenance. More training data will be needed to increase accuracy across all surgical instruments that would appear in a neurosurgical operating room. Such technology has the potential to be used as a method to be used for proving what tools are truly needed in each type of operation allowing surgeons across the world to do more with less.  ( 3 min )
    OMNIINPUT: A Model-centric Evaluation Framework through Output Distribution. (arXiv:2312.03291v1 [cs.LG])
    We propose a novel model-centric evaluation framework, OmniInput, to evaluate the quality of an AI/ML model's predictions on all possible inputs (including human-unrecognizable ones), which is crucial for AI safety and reliability. Unlike traditional data-centric evaluation based on pre-defined test sets, the test set in OmniInput is self-constructed by the model itself and the model quality is evaluated by investigating its output distribution. We employ an efficient sampler to obtain representative inputs and the output distribution of the trained model, which, after selective annotation, can be used to estimate the model's precision and recall at different output values and a comprehensive precision-recall curve. Our experiments demonstrate that OmniInput enables a more fine-grained comparison between models, especially when their performance is almost the same on pre-defined datasets, leading to new findings and insights for how to train more robust, generalizable models.  ( 2 min )
    Evaluation of Active Feature Acquisition Methods for Static Feature Settings. (arXiv:2312.03619v1 [stat.ML])
    Active feature acquisition (AFA) agents, crucial in domains like healthcare where acquiring features is often costly or harmful, determine the optimal set of features for a subsequent classification task. As deploying an AFA agent introduces a shift in missingness distribution, it's vital to assess its expected performance at deployment using retrospective data. In a companion paper, we introduce a semi-offline reinforcement learning (RL) framework for active feature acquisition performance evaluation (AFAPE) where features are assumed to be time-dependent. Here, we study and extend the AFAPE problem to cover static feature settings, where features are time-invariant, and hence provide more flexibility to the AFA agents in deciding the order of the acquisitions. In this static feature setting, we derive and adapt new inverse probability weighting (IPW), direct method (DM), and double reinforcement learning (DRL) estimators within the semi-offline RL framework. These estimators can be applied when the missingness in the retrospective dataset follows a missing-at-random (MAR) pattern. They also can be applied to missing-not-at-random (MNAR) patterns in conjunction with appropriate existing missing data techniques. We illustrate the improved data efficiency offered by the semi-offline RL estimators in synthetic and real-world data experiments under synthetic MAR and MNAR missingness.
    Time Regularization in Optimal Time Variable Learning. (arXiv:2306.16111v2 [cs.LG] UPDATED)
    Recently, optimal time variable learning in deep neural networks (DNNs) was introduced in arXiv:2204.08528. In this manuscript we extend the concept by introducing a regularization term that directly relates to the time horizon in discrete dynamical systems. Furthermore, we propose an adaptive pruning approach for Residual Neural Networks (ResNets), which reduces network complexity without compromising expressiveness, while simultaneously decreasing training time. The results are illustrated by applying the proposed concepts to classification tasks on the well known MNIST and Fashion MNIST data sets. Our PyTorch code is available on https://github.com/frederikkoehne/time_variable_learning.
    LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models. (arXiv:2310.05736v2 [cs.CL] UPDATED)
    Large language models (LLMs) have been applied in various applications due to their astonishing capabilities. With advancements in technologies such as chain-of-thought (CoT) prompting and in-context learning (ICL), the prompts fed to LLMs are becoming increasingly lengthy, even exceeding tens of thousands of tokens. To accelerate model inference and reduce cost, this paper presents LLMLingua, a coarse-to-fine prompt compression method that involves a budget controller to maintain semantic integrity under high compression ratios, a token-level iterative compression algorithm to better model the interdependence between compressed contents, and an instruction tuning based method for distribution alignment between language models. We conduct experiments and analysis over four datasets from different scenarios, i.e., GSM8K, BBH, ShareGPT, and Arxiv-March23; showing that the proposed approach yields state-of-the-art performance and allows for up to 20x compression with little performance loss. Our code is available at https://aka.ms/LLMLingua.
    MACCA: Offline Multi-agent Reinforcement Learning with Causal Credit Assignment. (arXiv:2312.03644v1 [cs.LG])
    Offline Multi-agent Reinforcement Learning (MARL) is valuable in scenarios where online interaction is impractical or risky. While independent learning in MARL offers flexibility and scalability, accurately assigning credit to individual agents in offline settings poses challenges due to partial observability and emergent behavior. Directly transferring the online credit assignment method to offline settings results in suboptimal outcomes due to the absence of real-time feedback and intricate agent interactions. Our approach, MACCA, characterizing the generative process as a Dynamic Bayesian Network, captures relationships between environmental variables, states, actions, and rewards. Estimating this model on offline data, MACCA can learn each agent's contribution by analyzing the causal relationship of their individual rewards, ensuring accurate and interpretable credit assignment. Additionally, the modularity of our approach allows it to seamlessly integrate with various offline MARL methods. Theoretically, we proved that under the setting of the offline dataset, the underlying causal structure and the function for generating the individual rewards of agents are identifiable, which laid the foundation for the correctness of our modeling. Experimentally, we tested MACCA in two environments, including discrete and continuous action settings. The results show that MACCA outperforms SOTA methods and improves performance upon their backbones.
    Function-Space Optimality of Neural Architectures With Multivariate Nonlinearities. (arXiv:2310.03696v2 [stat.ML] UPDATED)
    We investigate the function-space optimality (specifically, the Banach-space optimality) of a large class of shallow neural architectures with multivariate nonlinearities/activation functions. To that end, we construct a new family of Banach spaces defined via a regularization operator, the $k$-plane transform, and a sparsity-promoting norm. We prove a representer theorem that states that the solution sets to learning problems posed over these Banach spaces are completely characterized by neural architectures with multivariate nonlinearities. These optimal architectures have skip connections and are tightly connected to orthogonal weight normalization and multi-index models, both of which have received recent interest in the neural network community. Our framework is compatible with a number of classical nonlinearities including the rectified linear unit (ReLU) activation function, the norm activation function, and the radial basis functions found in the theory of thin-plate/polyharmonic splines. We also show that the underlying spaces are special instances of reproducing kernel Banach spaces and variation spaces. Our results shed light on the regularity of functions learned by neural networks trained on data, particularly with multivariate nonlinearities, and provide new theoretical motivation for several architectural choices found in practice.
    Optimal Variable Clustering for High-Dimensional Matrix Valued Data. (arXiv:2112.12909v3 [stat.ML] UPDATED)
    Matrix valued data has become increasingly prevalent in many applications. Most of the existing clustering methods for this type of data are tailored to the mean model and do not account for the dependence structure of the features, which can be very informative, especially in high-dimensional settings or when mean information is not available. To extract the information from the dependence structure for clustering, we propose a new latent variable model for the features arranged in matrix form, with some unknown membership matrices representing the clusters for the rows and columns. Under this model, we further propose a class of hierarchical clustering algorithms using the difference of a weighted covariance matrix as the dissimilarity measure. Theoretically, we show that under mild conditions, our algorithm attains clustering consistency in the high-dimensional setting. While this consistency result holds for our algorithm with a broad class of weighted covariance matrices, the conditions for this result depend on the choice of the weight. To investigate how the weight affects the theoretical performance of our algorithm, we establish the minimax lower bound for clustering under our latent variable model in terms of some cluster separation metric. Given these results, we identify the optimal weight in the sense that using this weight guarantees our algorithm to be minimax rate-optimal. The practical implementation of our algorithm with the optimal weight is also discussed. Simulation studies show that our algorithm performs better than existing methods in terms of the adjusted Rand index (ARI). The method is applied to a genomic dataset and yields meaningful interpretations.
    Memory-free Online Change-point Detection: A Novel Neural Network Approach. (arXiv:2207.03932v2 [cs.LG] UPDATED)
    Change-point detection (CPD), which detects abrupt changes in the data distribution, is recognized as one of the most significant tasks in time series analysis. Despite the extensive literature on offline CPD, unsupervised online CPD still suffers from major challenges, including scalability, hyperparameter tuning, and learning constraints. To mitigate some of these challenges, in this paper, we propose a novel deep learning approach for unsupervised online CPD from multi-dimensional time series, named Adaptive LSTM-Autoencoder Change-Point Detection (ALACPD). ALACPD exploits an LSTM-autoencoder-based neural network to perform unsupervised online CPD. It continuously adapts to the incoming samples without keeping the previously received input, thus being memory-free. We perform an extensive evaluation on several real-world time series CPD benchmarks. We show that ALACPD, on average, ranks first among state-of-the-art CPD algorithms in terms of quality of the time series segmentation, and it is on par with the best performer in terms of the accuracy of the estimated change-points. The implementation of ALACPD is available online on Github\footnote{\url{https://github.com/zahraatashgahi/ALACPD}}.
    A simple probabilistic neural network for machine understanding. (arXiv:2210.13179v5 [cond-mat.dis-nn] UPDATED)
    We discuss probabilistic neural networks with a fixed internal representation as models for machine understanding. Here understanding is intended as mapping data to an already existing representation which encodes an {\em a priori} organisation of the feature space. We derive the internal representation by requiring that it satisfies the principles of maximal relevance and of maximal ignorance about how different features are combined. We show that, when hidden units are binary variables, these two principles identify a unique model -- the Hierarchical Feature Model (HFM) -- which is fully solvable and provides a natural interpretation in terms of features. We argue that learning machines with this architecture enjoy a number of interesting properties, like the continuity of the representation with respect to changes in parameters and data, the possibility to control the level of compression and the ability to support functions that go beyond generalisation. We explore the behaviour of the model with extensive numerical experiments and argue that models where the internal representation is fixed reproduce a learning modality which is qualitatively different from that of traditional models such as Restricted Boltzmann Machines.
    What Planning Problems Can A Relational Neural Network Solve?. (arXiv:2312.03682v1 [cs.LG])
    Goal-conditioned policies are generally understood to be "feed-forward" circuits, in the form of neural networks that map from the current state and the goal specification to the next action to take. However, under what circumstances such a policy can be learned and how efficient the policy will be are not well understood. In this paper, we present a circuit complexity analysis for relational neural networks (such as graph neural networks and transformers) representing policies for planning problems, by drawing connections with serialized goal regression search (S-GRS). We show that there are three general classes of planning problems, in terms of the growth of circuit width and depth as a function of the number of objects and planning horizon, providing constructive proofs. We also illustrate the utility of this analysis for designing neural networks for policy learning.  ( 2 min )
    Diffused Task-Agnostic Milestone Planner. (arXiv:2312.03395v1 [cs.RO])
    Addressing decision-making problems using sequence modeling to predict future trajectories shows promising results in recent years. In this paper, we take a step further to leverage the sequence predictive method in wider areas such as long-term planning, vision-based control, and multi-task decision-making. To this end, we propose a method to utilize a diffusion-based generative sequence model to plan a series of milestones in a latent space and to have an agent to follow the milestones to accomplish a given task. The proposed method can learn control-relevant, low-dimensional latent representations of milestones, which makes it possible to efficiently perform long-term planning and vision-based control. Furthermore, our approach exploits generation flexibility of the diffusion model, which makes it possible to plan diverse trajectories for multi-task decision-making. We demonstrate the proposed method across offline reinforcement learning (RL) benchmarks and an visual manipulation environment. The results show that our approach outperforms offline RL methods in solving long-horizon, sparse-reward tasks and multi-task problems, while also achieving the state-of-the-art performance on the most challenging vision-based manipulation benchmark.
    Run LoRA Run: Faster and Lighter LoRA Implementations. (arXiv:2312.03415v1 [cs.LG])
    LoRA is a technique that reduces the number of trainable parameters in a neural network by introducing low-rank adapters to linear layers. This technique is used both for fine-tuning (LoRA, QLoRA) and full train (ReLoRA). This paper presents the RunLoRA framework for efficient implementations of LoRA that significantly improves the speed of neural network training and fine-tuning using low-rank adapters. The proposed implementation optimizes the computation of LoRA operations based on dimensions of corresponding linear layer, layer input dimensions and lora rank by choosing best forward and backward computation graph based on FLOPs and time estimations, resulting in faster training without sacrificing accuracy. The experimental results show up to 17% speedup on Llama family of models.
    Inherent Inconsistencies of Feature Importance. (arXiv:2206.08204v2 [cs.LG] UPDATED)
    The rapid advancement and widespread adoption of machine learning-driven technologies have underscored the practical and ethical need for creating interpretable artificial intelligence systems. Feature importance, a method that assigns scores to the contribution of individual features on prediction outcomes, seeks to bridge this gap as a tool for enhancing human comprehension of these systems. Feature importance serves as an explanation of predictions in diverse contexts, whether by providing a global interpretation of a phenomenon across the entire dataset or by offering a localized explanation for the outcome of a specific data point. Furthermore, feature importance is being used both for explaining models and for identifying plausible causal relations in the data, independently from the model. However, it is worth noting that these various contexts have traditionally been explored in isolation, with limited theoretical foundations. This paper presents an axiomatic framework designed to establish coherent relationships among the different contexts of feature importance scores. Notably, our work unveils a surprising conclusion: when we combine the proposed properties with those previously outlined in the literature, we demonstrate the existence of an inconsistency. This inconsistency highlights that certain essential properties of feature importance scores cannot coexist harmoniously within a single framework.
    Targeted Separation and Convergence with Kernel Discrepancies. (arXiv:2209.12835v3 [stat.ML] UPDATED)
    Maximum mean discrepancies (MMDs) like the kernel Stein discrepancy (KSD) have grown central to a wide range of applications, including hypothesis testing, sampler selection, distribution approximation, and variational inference. In each setting, these kernel-based discrepancy measures are required to (i) separate a target P from other probability measures or even (ii) control weak convergence to P. In this article we derive new sufficient and necessary conditions to ensure (i) and (ii). For MMDs on separable metric spaces, we characterize those kernels that separate Bochner embeddable measures and introduce simple conditions for separating all measures with unbounded kernels and for controlling convergence with bounded kernels. We use these results on $\mathbb{R}^d$ to substantially broaden the known conditions for KSD separation and convergence control and to develop the first KSDs known to exactly metrize weak convergence to P. Along the way, we highlight the implications of our results for hypothesis testing, measuring and improving sample quality, and sampling with Stein variational gradient descent.
    Estimates on the generalization error of Physics Informed Neural Networks (PINNs) for approximating a class of inverse problems for PDEs. (arXiv:2007.01138v3 [math.NA] UPDATED)
    Physics informed neural networks (PINNs) have recently been very successfully applied for efficiently approximating inverse problems for PDEs. We focus on a particular class of inverse problems, the so-called data assimilation or unique continuation problems, and prove rigorous estimates on the generalization error of PINNs approximating them. An abstract framework is presented and conditional stability estimates for the underlying inverse problem are employed to derive the estimate on the PINN generalization error, providing rigorous justification for the use of PINNs in this context. The abstract framework is illustrated with examples of four prototypical linear PDEs. Numerical experiments, validating the proposed theory, are also presented.
    Model-tuning Via Prompts Makes NLP Models Adversarially Robust. (arXiv:2303.07320v2 [cs.CL] UPDATED)
    In recent years, NLP practitioners have converged on the following practice: (i) import an off-the-shelf pretrained (masked) language model; (ii) append a multilayer perceptron atop the CLS token's hidden representation (with randomly initialized weights); and (iii) fine-tune the entire model on a downstream task (MLP-FT). This procedure has produced massive gains on standard NLP benchmarks, but these models remain brittle, even to mild adversarial perturbations. In this work, we demonstrate surprising gains in adversarial robustness enjoyed by Model-tuning Via Prompts (MVP), an alternative method of adapting to downstream tasks. Rather than appending an MLP head to make output prediction, MVP appends a prompt template to the input, and makes prediction via text infilling/completion. Across 5 NLP datasets, 4 adversarial attacks, and 3 different models, MVP improves performance against adversarial substitutions by an average of 8% over standard methods and even outperforms adversarial training-based state-of-art defenses by 3.5%. By combining MVP with adversarial training, we achieve further improvements in adversarial robustness while maintaining performance on unperturbed examples. Finally, we conduct ablations to investigate the mechanism underlying these gains. Notably, we find that the main causes of vulnerability of MLP-FT can be attributed to the misalignment between pre-training and fine-tuning tasks, and the randomly initialized MLP parameters.  ( 2 min )
    Adaptive Privacy Composition for Accuracy-first Mechanisms. (arXiv:2306.13824v2 [cs.CR] UPDATED)
    In many practical applications of differential privacy, practitioners seek to provide the best privacy guarantees subject to a target level of accuracy. A recent line of work by Ligett et al. '17 and Whitehouse et al. '22 has developed such accuracy-first mechanisms by leveraging the idea of noise reduction that adds correlated noise to the sufficient statistic in a private computation and produces a sequence of increasingly accurate answers. A major advantage of noise reduction mechanisms is that the analysts only pay the privacy cost of the least noisy or most accurate answer released. Despite this appealing property in isolation, there has not been a systematic study on how to use them in conjunction with other differentially private mechanisms. A fundamental challenge is that the privacy guarantee for noise reduction mechanisms is (necessarily) formulated as ex-post privacy that bounds the privacy loss as a function of the released outcome. Furthermore, there has yet to be any study on how ex-post private mechanisms compose, which allows us to track the accumulated privacy over several mechanisms. We develop privacy filters [Rogers et al. '16, Feldman and Zrnic '21, and Whitehouse et al. '22'] that allow an analyst to adaptively switch between differentially private and ex-post private mechanisms subject to an overall differential privacy guarantee.
    OneLLM: One Framework to Align All Modalities with Language. (arXiv:2312.03700v1 [cs.CV])
    Multimodal large language models (MLLMs) have gained significant attention due to their strong multimodal understanding capability. However, existing works rely heavily on modality-specific encoders, which usually differ in architecture and are limited to common modalities. In this paper, we present OneLLM, an MLLM that aligns eight modalities to language using a unified framework. We achieve this through a unified multimodal encoder and a progressive multimodal alignment pipeline. In detail, we first train an image projection module to connect a vision encoder with LLM. Then, we build a universal projection module (UPM) by mixing multiple image projection modules and dynamic routing. Finally, we progressively align more modalities to LLM with the UPM. To fully leverage the potential of OneLLM in following instructions, we also curated a comprehensive multimodal instruction dataset, including 2M items from image, audio, video, point cloud, depth/normal map, IMU and fMRI brain activity. OneLLM is evaluated on 25 diverse benchmarks, encompassing tasks such as multimodal captioning, question answering and reasoning, where it delivers excellent performance. Code, data, model and online demo are available at https://github.com/csuhan/OneLLM
    DyEdgeGAT: Dynamic Edge via Graph Attention for Early Fault Detection in IIoT Systems. (arXiv:2307.03761v2 [cs.LG] UPDATED)
    In the industrial Internet of Things, condition monitoring sensor signals from complex systems often exhibit strong nonlinear and stochastic spatial-temporal dynamics under varying operating conditions. Such complex dynamics make fault detection particularly challenging. Although previously proposed methods effectively model these dynamics, they often neglect the dynamic evolution of relationships between sensor signals. Undetected shifts in these relationships can potentially result in significant system failures. Another limitation is their inability to effectively distinguish between novel operating conditions and actual faults. To address this gap, we propose DyEdgeGAT (Dynamic Edge via Graph Attention), a novel approach capable of detecting various faults, especially those characterized by relationship changes at early stages, while distinguishing faults from novel operating conditions. DyEdgeGAT is a graph-based framework that provides a novel graph inference scheme for multivariate time series that dynamically constructs edges to represent and track the evolution of relationships between time series. Additionally, it addresses a commonly overlooked aspect: the cause-and-effect relationships within the system, such as between control inputs and measurements. By incorporating system-independent variables as contexts of operating conditions into node dynamics extraction, DyEdgeGAT enhances its robustness against novel operating conditions. We rigorously evaluate DyEdgeGAT's performance using both a synthetic dataset, designed to simulate varying levels of fault severity and a real-world industrial-scale benchmark containing a variety of fault types with different detection complexities. Our findings demonstrate that DyEdgeGAT is highly effective in fault detection, showing particular strength in early fault detection while maintaining robustness under novel operating conditions.
    Compressed Context Memory For Online Language Model Interaction. (arXiv:2312.03414v1 [cs.LG])
    This paper presents a novel context compression method for Transformer language models in online scenarios such as ChatGPT, where the context continually expands. As the context lengthens, the attention process requires more memory and computational resources, which in turn reduces the throughput of the language model. To this end, we propose a compressed context memory system that continually compresses the growing context into a compact memory space. The compression process simply involves integrating a lightweight conditional LoRA into the language model's forward pass during inference. Based on the compressed context memory, the language model can perform inference with reduced memory and attention operations. Through evaluations on conversation, personalization, and multi-task learning, we demonstrate that our approach achieves the performance level of a full context model with $5\times$ smaller context memory space. Codes are available at https://github.com/snu-mllab/context-memory.
    GeoShapley: A Game Theory Approach to Measuring Spatial Effects in Machine Learning Models. (arXiv:2312.03675v1 [cs.LG])
    This paper introduces GeoShapley, a game theory approach to measuring spatial effects in machine learning models. GeoShapley extends the Nobel Prize-winning Shapley value framework in game theory by conceptualizing location as a player in a model prediction game, which enables the quantification of the importance of location and the synergies between location and other features in a model. GeoShapley is a model-agnostic approach and can be applied to statistical or black-box machine learning models in various structures. The interpretation of GeoShapley is directly linked with spatially varying coefficient models for explaining spatial effects and additive models for explaining non-spatial effects. Using simulated data, GeoShapley values are validated against known data-generating processes and are used for cross-comparison of seven statistical and machine learning models. An empirical example of house price modeling is used to illustrate GeoShapley's utility and interpretation with real world data. The method is available as an open-source Python package named geoshapley.
    All the World's a (Hyper)Graph: A Data Drama. (arXiv:2206.08225v3 [cs.LG] UPDATED)
    We introduce Hyperbard, a dataset of diverse relational data representations derived from Shakespeare's plays. Our representations range from simple graphs capturing character co-occurrence in single scenes to hypergraphs encoding complex communication settings and character contributions as hyperedges with edge-specific node weights. By making multiple intuitive representations readily available for experimentation, we facilitate rigorous representation robustness checks in graph learning, graph mining, and network analysis, highlighting the advantages and drawbacks of specific representations. Leveraging the data released in Hyperbard, we demonstrate that many solutions to popular graph mining problems are highly dependent on the representation choice, thus calling current graph curation practices into question. As an homage to our data source, and asserting that science can also be art, we present all our points in the form of a play.
    Incorporating Crowdsourced Annotator Distributions into Ensemble Modeling to Improve Classification Trustworthiness for Ancient Greek Papyri. (arXiv:2210.16380v3 [cs.CV] UPDATED)
    Performing classification on noisy, crowdsourced image datasets can prove challenging even for the best neural networks. Two issues which complicate the problem on such datasets are class imbalance and ground-truth uncertainty in labeling. The AL-ALL and AL-PUB datasets - consisting of tightly cropped, individual characters from images of ancient Greek papyri - are strongly affected by both issues. The application of ensemble modeling to such datasets can help identify images where the ground-truth is questionable and quantify the trustworthiness of those samples. As such, we apply stacked generalization consisting of nearly identical ResNets with different loss functions: one utilizing sparse cross-entropy (CXE) and the other Kullback-Liebler Divergence (KLD). Both networks use labels drawn from a crowd-sourced consensus. This consensus is derived from a Normalized Distribution of Annotations (NDA) based on all annotations for a given character in the dataset. For the second network, the KLD is calculated with respect to the NDA. For our ensemble model, we apply a k-nearest neighbors model to the outputs of the CXE and KLD networks. Individually, the ResNet models have approximately 93% accuracy, while the ensemble model achieves an accuracy of > 95%, increasing the classification trustworthiness. We also perform an analysis of the Shannon entropy of the various models' output distributions to measure classification uncertainty. Our results suggest that entropy is useful for predicting model misclassifications.
    Cross-feature Contrastive Loss for Decentralized Deep Learning on Heterogeneous Data. (arXiv:2310.15890v3 [cs.LG] UPDATED)
    The current state-of-the-art decentralized learning algorithms mostly assume the data distribution to be Independent and Identically Distributed (IID). However, in practical scenarios, the distributed datasets can have significantly heterogeneous data distributions across the agents. In this work, we present a novel approach for decentralized learning on heterogeneous data, where data-free knowledge distillation through contrastive loss on cross-features is utilized to improve performance. Cross-features for a pair of neighboring agents are the features (i.e., last hidden layer activations) obtained from the data of an agent with respect to the model parameters of the other agent. We demonstrate the effectiveness of the proposed technique through an exhaustive set of experiments on various Computer Vision datasets (CIFAR-10, CIFAR-100, Fashion MNIST, Imagenette, and ImageNet), model architectures, and network topologies. Our experiments show that the proposed method achieves superior performance (0.2-4% improvement in test accuracy) compared to other existing techniques for decentralized learning on heterogeneous data.
    Neuroevolution of Physics-Informed Neural Nets: Benchmark Problems and Comparative Results. (arXiv:2212.07624v3 [cs.NE] UPDATED)
    The potential of learned models for fundamental scientific research and discovery is drawing increasing attention worldwide. Physics-informed neural networks (PINNs), where the loss function directly embeds governing equations of scientific phenomena, is one of the key techniques at the forefront of recent advances. PINNs are typically trained using stochastic gradient descent methods, akin to their deep learning counterparts. However, analysis in this paper shows that PINNs' unique loss formulations lead to a high degree of complexity and ruggedness that may not be conducive for gradient descent. Unlike in standard deep learning, PINN training requires globally optimum parameter values that satisfy physical laws as closely as possible. Spurious local optimum, indicative of erroneous physics, must be avoided. Hence, neuroevolution algorithms, with their superior global search capacity, may be a better choice for PINNs relative to gradient descent methods. Here, we propose a set of five benchmark problems, with open-source codes, spanning diverse physical phenomena for novel neuroevolution algorithm development. Using this, we compare two neuroevolution algorithms against the commonly used stochastic gradient descent, and our baseline results support the claim that neuroevolution can surpass gradient descent, ensuring better physics compliance in the predicted outputs. %Furthermore, implementing neuroevolution with JAX leads to orders of magnitude speedup relative to standard implementations.
    On the Role of Edge Dependency in Graph Generative Models. (arXiv:2312.03691v1 [cs.LG])
    In this work, we introduce a novel evaluation framework for generative models of graphs, emphasizing the importance of model-generated graph overlap (Chanpuriya et al., 2021) to ensure both accuracy and edge-diversity. We delineate a hierarchy of graph generative models categorized into three levels of complexity: edge independent, node independent, and fully dependent models. This hierarchy encapsulates a wide range of prevalent methods. We derive theoretical bounds on the number of triangles and other short-length cycles producible by each level of the hierarchy, contingent on the model overlap. We provide instances demonstrating the asymptotic optimality of our bounds. Furthermore, we introduce new generative models for each of the three hierarchical levels, leveraging dense subgraph discovery (Gionis & Tsourakakis, 2015). Our evaluation, conducted on real-world datasets, focuses on assessing the output quality and overlap of our proposed models in comparison to other popular models. Our results indicate that our simple, interpretable models provide competitive baselines to popular generative models. Through this investigation, we aim to propel the advancement of graph generative models by offering a structured framework and robust evaluation metrics, thereby facilitating the development of models capable of generating accurate and edge-diverse graphs.
    Provably Accelerated Decentralized Gradient Method Over Unbalanced Directed Graphs. (arXiv:2107.12065v2 [math.OC] UPDATED)
    We consider the decentralized optimization problem, where a network of $n$ agents aims to collaboratively minimize the average of their individual smooth and convex objective functions through peer-to-peer communication in a directed graph. To tackle this problem, we propose two accelerated gradient tracking methods, namely APD and APD-SC, for non-strongly convex and strongly convex objective functions, respectively. We show that APD and APD-SC converge at the rates $O\left(\frac{1}{k^2}\right)$ and $O\left(\left(1 - C\sqrt{\frac{\mu}{L}}\right)^k\right)$, respectively, up to constant factors depending only on the mixing matrix. APD and APD-SC are the first decentralized methods over unbalanced directed graphs that achieve the same provable acceleration as centralized methods. Numerical experiments demonstrate the effectiveness of both methods.
    DiffusionSat: A Generative Foundation Model for Satellite Imagery. (arXiv:2312.03606v1 [cs.CV])
    Diffusion models have achieved state-of-the-art results on many modalities including images, speech, and video. However, existing models are not tailored to support remote sensing data, which is widely used in important applications including environmental monitoring and crop-yield prediction. Satellite images are significantly different from natural images -- they can be multi-spectral, irregularly sampled across time -- and existing diffusion models trained on images from the Web do not support them. Furthermore, remote sensing data is inherently spatio-temporal, requiring conditional generation tasks not supported by traditional methods based on captions or images. In this paper, we present DiffusionSat, to date the largest generative foundation model trained on a collection of publicly available large, high-resolution remote sensing datasets. As text-based captions are sparsely available for satellite images, we incorporate the associated metadata such as geolocation as conditioning information. Our method produces realistic samples and can be used to solve multiple generative tasks including temporal generation, superresolution given multi-spectral inputs and in-painting. Our method outperforms previous state-of-the-art methods for satellite image generation and is the first large-scale $\textit{generative}$ foundation model for satellite imagery.
    MICRACLE: Inverse Reinforcement and Curriculum Learning Model for Human-inspired Mobile Robot Navigation. (arXiv:2312.03651v1 [cs.RO])
    In emergency scenarios, mobile robots must navigate like humans, interpreting stimuli to locate potential victims rapidly without interfering with first responders. Existing socially-aware navigation algorithms face computational and adaptability challenges. To overcome these, we propose a solution, MIRACLE -- an inverse reinforcement and curriculum learning model, that employs gamified learning to gather stimuli-driven human navigational data. This data is then used to train a Deep Inverse Maximum Entropy Reinforcement Learning model, reducing reliance on demonstrator abilities. Testing reveals a low loss of 2.7717 within a 400-sized environment, signifying human-like response replication. Current databases lack comprehensive stimuli-driven data, necessitating our approach. By doing so, we enable robots to navigate emergency situations with human-like perception, enhancing their life-saving capabilities.
    Enhancing Molecular Property Prediction via Mixture of Collaborative Experts. (arXiv:2312.03292v1 [cs.LG])
    Molecular Property Prediction (MPP) task involves predicting biochemical properties based on molecular features, such as molecular graph structures, contributing to the discovery of lead compounds in drug development. To address data scarcity and imbalance in MPP, some studies have adopted Graph Neural Networks (GNN) as an encoder to extract commonalities from molecular graphs. However, these approaches often use a separate predictor for each task, neglecting the shared characteristics among predictors corresponding to different tasks. In response to this limitation, we introduce the GNN-MoCE architecture. It employs the Mixture of Collaborative Experts (MoCE) as predictors, exploiting task commonalities while confronting the homogeneity issue in the expert pool and the decision dominance dilemma within the expert group. To enhance expert diversity for collaboration among all experts, the Expert-Specific Projection method is proposed to assign a unique projection perspective to each expert. To balance decision-making influence for collaboration within the expert group, the Expert-Specific Loss is presented to integrate individual expert loss into the weighted decision loss of the group for more equitable training. Benefiting from the enhancements of MoCE in expert creation, dynamic expert group formation, and experts' collaboration, our model demonstrates superior performance over traditional methods on 24 MPP datasets, especially in tasks with limited data or high imbalance.  ( 2 min )
    Editable Stain Transformation Of Histological Images Using Unpaired GANs. (arXiv:2312.03647v1 [eess.IV])
    Double staining in histopathology, particularly for metaplastic breast cancer, typically employs H&E and P63 dyes. However, P63's tissue damage and high cost necessitate alternative methods. This study introduces xAI-CycleGAN, an advanced architecture combining Mask CycleGAN with explainability features and structure-preserving capabilities for transforming H&E stained breast tissue images into P63-like images. The architecture allows for output editing, enhancing resemblance to actual images and enabling further model refinement. We showcase xAI-CycleGAN's efficacy in maintaining structural integrity and generating high-quality images. Additionally, a histopathologist survey indicates the generated images' realism is often comparable to actual images, validating our model's high-quality output.
    From Detection to Action Recognition: An Edge-Based Pipeline for Robot Human Perception. (arXiv:2312.03477v1 [cs.RO])
    Mobile service robots are proving to be increasingly effective in a range of applications, such as healthcare, monitoring Activities of Daily Living (ADL), and facilitating Ambient Assisted Living (AAL). These robots heavily rely on Human Action Recognition (HAR) to interpret human actions and intentions. However, for HAR to function effectively on service robots, it requires prior knowledge of human presence (human detection) and identification of individuals to monitor (human tracking). In this work, we propose an end-to-end pipeline that encompasses the entire process, starting from human detection and tracking, leading to action recognition. The pipeline is designed to operate in near real-time while ensuring all stages of processing are performed on the edge, reducing the need for centralised computation. To identify the most suitable models for our mobile robot, we conducted a series of experiments comparing state-of-the-art solutions based on both their detection performance and efficiency. To evaluate the effectiveness of our proposed pipeline, we proposed a dataset comprising daily household activities. By presenting our findings and analysing the results, we demonstrate the efficacy of our approach in enabling mobile robots to understand and respond to human behaviour in real-world scenarios relying mainly on the data from their RGB cameras.
    Blueprinting the Future: Automatic Item Categorization using Hierarchical Zero-Shot and Few-Shot Classifiers. (arXiv:2312.03561v1 [stat.ME])
    In testing industry, precise item categorization is pivotal to align exam questions with the designated content domains outlined in the assessment blueprint. Traditional methods either entail manual classification, which is laborious and error-prone, or utilize machine learning requiring extensive training data, often leading to model underfit or overfit issues. This study unveils a novel approach employing the zero-shot and few-shot Generative Pretrained Transformer (GPT) classifier for hierarchical item categorization, minimizing the necessity for training data, and instead, leveraging human-like language descriptions to define categories. Through a structured python dictionary, the hierarchical nature of examination blueprints is navigated seamlessly, allowing for a tiered classification of items across multiple levels. An initial simulation with artificial data demonstrates the efficacy of this method, achieving an average accuracy of 92.91% measured by the F1 score. This method was further applied to real exam items from the 2022 In-Training Examination (ITE) conducted by the American Board of Family Medicine (ABFM), reclassifying 200 items according to a newly formulated blueprint swiftly in 15 minutes, a task that traditionally could span several days among editors and physicians. This innovative approach not only drastically cuts down classification time but also ensures a consistent, principle-driven categorization, minimizing human biases and discrepancies. The ability to refine classifications by adjusting definitions adds to its robustness and sustainability.
    MotionCtrl: A Unified and Flexible Motion Controller for Video Generation. (arXiv:2312.03641v1 [cs.CV])
    Motions in a video primarily consist of camera motion, induced by camera movement, and object motion, resulting from object movement. Accurate control of both camera and object motion is essential for video generation. However, existing works either mainly focus on one type of motion or do not clearly distinguish between the two, limiting their control capabilities and diversity. Therefore, this paper presents MotionCtrl, a unified and flexible motion controller for video generation designed to effectively and independently control camera and object motion. The architecture and training strategy of MotionCtrl are carefully devised, taking into account the inherent properties of camera motion, object motion, and imperfect training data. Compared to previous methods, MotionCtrl offers three main advantages: 1) It effectively and independently controls camera motion and object motion, enabling more fine-grained motion control and facilitating flexible and diverse combinations of both types of motion. 2) Its motion conditions are determined by camera poses and trajectories, which are appearance-free and minimally impact the appearance or shape of objects in generated videos. 3) It is a relatively generalizable model that can adapt to a wide array of camera poses and trajectories once trained. Extensive qualitative and quantitative experiments have been conducted to demonstrate the superiority of MotionCtrl over existing methods.
    Personalized Face Inpainting with Diffusion Models by Parallel Visual Attention. (arXiv:2312.03556v1 [cs.CV])
    Face inpainting is important in various applications, such as photo restoration, image editing, and virtual reality. Despite the significant advances in face generative models, ensuring that a person's unique facial identity is maintained during the inpainting process is still an elusive goal. Current state-of-the-art techniques, exemplified by MyStyle, necessitate resource-intensive fine-tuning and a substantial number of images for each new identity. Furthermore, existing methods often fall short in accommodating user-specified semantic attributes, such as beard or expression. To improve inpainting results, and reduce the computational complexity during inference, this paper proposes the use of Parallel Visual Attention (PVA) in conjunction with diffusion models. Specifically, we insert parallel attention matrices to each cross-attention module in the denoising network, which attends to features extracted from reference images by an identity encoder. We train the added attention modules and identity encoder on CelebAHQ-IDI, a dataset proposed for identity-preserving face inpainting. Experiments demonstrate that PVA attains unparalleled identity resemblance in both face inpainting and face inpainting with language guidance tasks, in comparison to various benchmarks, including MyStyle, Paint by Example, and Custom Diffusion. Our findings reveal that PVA ensures good identity preservation while offering effective language-controllability. Additionally, in contrast to Custom Diffusion, PVA requires just 40 fine-tuning steps for each new identity, which translates to a significant speed increase of over 20 times.
    Seller-side Outcome Fairness in Online Marketplaces. (arXiv:2312.03253v1 [cs.LG])
    This paper aims to investigate and achieve seller-side fairness within online marketplaces, where many sellers and their items are not sufficiently exposed to customers in an e-commerce platform. This phenomenon raises concerns regarding the potential loss of revenue associated with less exposed items as well as less marketplace diversity. We introduce the notion of seller-side outcome fairness and build an optimization model to balance collected recommendation rewards and the fairness metric. We then propose a gradient-based data-driven algorithm based on the duality and bandit theory. Our numerical experiments on real e-commerce data sets show that our algorithm can lift seller fairness measures while not hurting metrics like collected Gross Merchandise Value (GMV) and total purchases.
    DreamComposer: Controllable 3D Object Generation via Multi-View Conditions. (arXiv:2312.03611v1 [cs.CV])
    Utilizing pre-trained 2D large-scale generative models, recent works are capable of generating high-quality novel views from a single in-the-wild image. However, due to the lack of information from multiple views, these works encounter difficulties in generating controllable novel views. In this paper, we present DreamComposer, a flexible and scalable framework that can enhance existing view-aware diffusion models by injecting multi-view conditions. Specifically, DreamComposer first uses a view-aware 3D lifting module to obtain 3D representations of an object from multiple views. Then, it renders the latent features of the target view from 3D representations with the multi-view feature fusion module. Finally the target view features extracted from multi-view inputs are injected into a pre-trained diffusion model. Experiments show that DreamComposer is compatible with state-of-the-art diffusion models for zero-shot novel view synthesis, further enhancing them to generate high-fidelity novel view images with multi-view conditions, ready for controllable 3D object reconstruction and various other applications.
    The mechanistic basis of data dependence and abrupt learning in an in-context classification task. (arXiv:2312.03002v1 [cs.LG])
    Transformer models exhibit in-context learning: the ability to accurately predict the response to a novel query based on illustrative examples in the input sequence. In-context learning contrasts with traditional in-weights learning of query-output relationships. What aspects of the training data distribution and architecture favor in-context vs in-weights learning? Recent work has shown that specific distributional properties inherent in language, such as burstiness, large dictionaries and skewed rank-frequency distributions, control the trade-off or simultaneous appearance of these two forms of learning. We first show that these results are recapitulated in a minimal attention-only network trained on a simplified dataset. In-context learning (ICL) is driven by the abrupt emergence of an induction head, which subsequently competes with in-weights learning. By identifying progress measures that precede in-context learning and targeted experiments, we construct a two-parameter model of an induction head which emulates the full data distributional dependencies displayed by the attention-based network. A phenomenological model of induction head formation traces its abrupt emergence to the sequential learning of three nested logits enabled by an intrinsic curriculum. We propose that the sharp transitions in attention-based networks arise due to a specific chain of multi-layer operations necessary to achieve ICL, which is implemented by nested nonlinearities sequentially learned during training.  ( 2 min )
    An Infinite-Width Analysis on the Jacobian-Regularised Training of a Neural Network. (arXiv:2312.03386v1 [cs.LG])
    The recent theoretical analysis of deep neural networks in their infinite-width limits has deepened our understanding of initialisation, feature learning, and training of those networks, and brought new practical techniques for finding appropriate hyperparameters, learning network weights, and performing inference. In this paper, we broaden this line of research by showing that this infinite-width analysis can be extended to the Jacobian of a deep neural network. We show that a multilayer perceptron (MLP) and its Jacobian at initialisation jointly converge to a Gaussian process (GP) as the widths of the MLP's hidden layers go to infinity and characterise this GP. We also prove that in the infinite-width limit, the evolution of the MLP under the so-called robust training (i.e., training with a regulariser on the Jacobian) is described by a linear first-order ordinary differential equation that is determined by a variant of the Neural Tangent Kernel. We experimentally show the relevance of our theoretical claims to wide finite networks, and empirically analyse the properties of kernel regression solution to obtain an insight into Jacobian regularisation.
    Benchmarking Continual Learning from Cognitive Perspectives. (arXiv:2312.03309v1 [cs.LG])
    Continual learning addresses the problem of continuously acquiring and transferring knowledge without catastrophic forgetting of old concepts. While humans achieve continual learning via diverse neurocognitive mechanisms, there is a mismatch between cognitive properties and evaluation methods of continual learning models. First, the measurement of continual learning models mostly relies on evaluation metrics at a micro-level, which cannot characterize cognitive capacities of the model. Second, the measurement is method-specific, emphasizing model strengths in one aspect while obscuring potential weaknesses in other respects. To address these issues, we propose to integrate model cognitive capacities and evaluation metrics into a unified evaluation paradigm. We first characterize model capacities via desiderata derived from cognitive properties supporting human continual learning. The desiderata concern (1) adaptability in varying lengths of task sequence; (2) sensitivity to dynamic task variations; and (3) efficiency in memory usage and training time consumption. Then we design evaluation protocols for each desideratum to assess cognitive capacities of recent continual learning models. Experimental results show that no method we consider has satisfied all the desiderata and is still far away from realizing truly continual learning. Although some methods exhibit some degree of adaptability and efficiency, no method is able to identify task relationships when encountering dynamic task variations, or achieve a trade-off in learning similarities and differences between tasks. Inspired by these results, we discuss possible factors that influence model performance in these desiderata and provide guidance for the improvement of continual learning models.
    Approximating Solutions to the Knapsack Problem using the Lagrangian Dual Framework. (arXiv:2312.03413v1 [cs.LG])
    The Knapsack Problem is a classic problem in combinatorial optimisation. Solving these problems may be computationally expensive. Recent years have seen a growing interest in the use of deep learning methods to approximate the solutions to such problems. A core problem is how to enforce or encourage constraint satisfaction in predicted solutions. A promising approach for predicting solutions to constrained optimisation problems is the Lagrangian Dual Framework which builds on the method of Lagrangian Relaxation. In this paper we develop neural network models to approximate Knapsack Problem solutions using the Lagrangian Dual Framework while improving constraint satisfaction. We explore the problems of output interpretation and model selection within this context. Experimental results show strong constraint satisfaction with a minor reduction of optimality as compared to a baseline neural network which does not explicitly model the constraints.
    Clustering by Contour coreset and variational quantum eigensolver. (arXiv:2312.03516v1 [quant-ph])
    Recent work has proposed solving the k-means clustering problem on quantum computers via the Quantum Approximate Optimization Algorithm (QAOA) and coreset techniques. Although the current method demonstrates the possibility of quantum k-means clustering, it does not ensure high accuracy and consistency across a wide range of datasets. The existing coreset techniques are designed for classical algorithms and there has been no quantum-tailored coreset technique which is designed to boost the accuracy of quantum algorithms. In this work, we propose solving the k-means clustering problem with the variational quantum eigensolver (VQE) and a customised coreset method, the Contour coreset, which has been formulated with specific focus on quantum algorithms. Extensive simulations with synthetic and real-life data demonstrated that our VQE+Contour Coreset approach outperforms existing QAOA+Coreset k-means clustering approaches with higher accuracy and lower standard deviation. Our work has shown that quantum tailored coreset techniques has the potential to significantly boost the performance of quantum algorithms when compared to using generic off-the-shelf coreset techniques.
    Deep Multimodal Fusion for Surgical Feedback Classification. (arXiv:2312.03231v1 [cs.LG])
    Quantification of real-time informal feedback delivered by an experienced surgeon to a trainee during surgery is important for skill improvements in surgical training. Such feedback in the live operating room is inherently multimodal, consisting of verbal conversations (e.g., questions and answers) as well as non-verbal elements (e.g., through visual cues like pointing to anatomic elements). In this work, we leverage a clinically-validated five-category classification of surgical feedback: "Anatomic", "Technical", "Procedural", "Praise" and "Visual Aid". We then develop a multi-label machine learning model to classify these five categories of surgical feedback from inputs of text, audio, and video modalities. The ultimate goal of our work is to help automate the annotation of real-time contextual surgical feedback at scale. Our automated classification of surgical feedback achieves AUCs ranging from 71.5 to 77.6 with the fusion improving performance by 3.1%. We also show that high-quality manual transcriptions of feedback audio from experts improve AUCs to between 76.5 and 96.2, which demonstrates a clear path toward future improvements. Empirically, we find that the Staged training strategy, with first pre-training each modality separately and then training them jointly, is more effective than training different modalities altogether. We also present intuitive findings on the importance of modalities for different feedback categories. This work offers an important first look at the feasibility of automated classification of real-world live surgical feedback based on text, audio, and video modalities.
    On the Nystrom Approximation for Preconditioning in Kernel Machines. (arXiv:2312.03311v1 [stat.ML])
    Kernel methods are a popular class of nonlinear predictive models in machine learning. Scalable algorithms for learning kernel models need to be iterative in nature, but convergence can be slow due to poor conditioning. Spectral preconditioning is an important tool to speed-up the convergence of such iterative algorithms for training kernel models. However computing and storing a spectral preconditioner can be expensive which can lead to large computational and storage overheads, precluding the application of kernel methods to problems with large datasets. A Nystrom approximation of the spectral preconditioner is often cheaper to compute and store, and has demonstrated success in practical applications. In this paper we analyze the trade-offs of using such an approximated preconditioner. Specifically, we show that a sample of logarithmic size (as a function of the size of the dataset) enables the Nystrom-based approximated preconditioner to accelerate gradient descent nearly as well as the exact preconditioner, while also reducing the computational and storage overheads.
    Subnetwork-to-go: Elastic Neural Network with Dynamic Training and Customizable Inference. (arXiv:2312.03464v1 [cs.LG])
    Deploying neural networks to different devices or platforms is in general challenging, especially when the model size is large or model complexity is high. Although there exist ways for model pruning or distillation, it is typically required to perform a full round of model training or finetuning procedure in order to obtain a smaller model that satisfies the model size or complexity constraints. Motivated by recent works on dynamic neural networks, we propose a simple way to train a large network and flexibly extract a subnetwork from it given a model size or complexity constraint during inference. We introduce a new way to allow a large model to be trained with dynamic depth and width during the training phase, and after the large model is trained we can select a subnetwork from it with arbitrary depth and width during the inference phase with a relatively better performance compared to training the subnetwork independently from scratch. Experiment results on a music source separation model show that our proposed method can effectively improve the separation performance across different subnetwork sizes and complexities with a single large model, and training the large model takes significantly shorter time than training all the different subnetworks.
    Cooperative Probabilistic Trajectory Forecasting under Occlusion. (arXiv:2312.03296v1 [cs.RO])
    Perception and planning under occlusion is essential for safety-critical tasks. Occlusion-aware planning often requires communicating the information of the occluded object to the ego agent for safe navigation. However, communicating rich sensor information under adverse conditions during communication loss and limited bandwidth may not be always feasible. Further, in GPS denied environments and indoor navigation, localizing and sharing of occluded objects can be challenging. To overcome this, relative pose estimation between connected agents sharing a common field of view can be a computationally effective way of communicating information about surrounding objects. In this paper, we design an end-to-end network that cooperatively estimates the current states of occluded pedestrian in the reference frame of ego agent and then predicts the trajectory with safety guarantees. Experimentally, we show that the uncertainty-aware trajectory prediction of occluded pedestrian by the ego agent is almost similar to the ground truth trajectory assuming no occlusion. The current research holds promise for uncertainty-aware navigation among multiple connected agents under occlusion.
    Anomaly Detection for Scalable Task Grouping in Reinforcement Learning-based RAN Optimization. (arXiv:2312.03277v1 [cs.LG])
    The use of learning-based methods for optimizing cellular radio access networks (RAN) has received increasing attention in recent years. This coincides with a rapid increase in the number of cell sites worldwide, driven largely by dramatic growth in cellular network traffic. Training and maintaining learned models that work well across a large number of cell sites has thus become a pertinent problem. This paper proposes a scalable framework for constructing a reinforcement learning policy bank that can perform RAN optimization across a large number of cell sites with varying traffic patterns. Central to our framework is a novel application of anomaly detection techniques to assess the compatibility between sites (tasks) and the policy bank. This allows our framework to intelligently identify when a policy can be reused for a task, and when a new policy needs to be trained and added to the policy bank. Our results show that our approach to compatibility assessment leads to an efficient use of computational resources, by allowing us to construct a performant policy bank without exhaustively training on all tasks, which makes it applicable under real-world constraints.
    An AI for Scientific Discovery Route between Amorphous Networks and Mechanical Behavior. (arXiv:2312.03404v1 [cond-mat.soft])
    "AI for science" is widely recognized as a future trend in the development of scientific research. Currently, although machine learning algorithms have played a crucial role in scientific research with numerous successful cases, relatively few instances exist where AI assists researchers in uncovering the underlying physical mechanisms behind a certain phenomenon and subsequently using that mechanism to improve machine learning algorithms' efficiency. This article uses the investigation into the relationship between extreme Poisson's ratio values and the structure of amorphous networks as a case study to illustrate how machine learning methods can assist in revealing underlying physical mechanisms. Upon recognizing that the Poisson's ratio relies on the low-frequency vibrational modes of dynamical matrix, we can then employ a convolutional neural network, trained on the dynamical matrix instead of traditional image recognition, to predict the Poisson's ratio of amorphous networks with a much higher efficiency. Through this example, we aim to showcase the role that artificial intelligence can play in revealing fundamental physical mechanisms, which subsequently improves the machine learning algorithms significantly.
    An Automated Machine Learning Approach for Detecting Anomalous Peak Patterns in Time Series Data from a Research Watershed in the Northeastern United States Critical Zone. (arXiv:2309.07992v2 [cs.LG] UPDATED)
    This paper presents an automated machine learning framework designed to assist hydrologists in detecting anomalies in time series data generated by sensors in a research watershed in the northeastern United States critical zone. The framework specifically focuses on identifying peak-pattern anomalies, which may arise from sensor malfunctions or natural phenomena. However, the use of classification methods for anomaly detection poses challenges, such as the requirement for labeled data as ground truth and the selection of the most suitable deep learning model for the given task and dataset. To address these challenges, our framework generates labeled datasets by injecting synthetic peak patterns into synthetically generated time series data and incorporates an automated hyperparameter optimization mechanism. This mechanism generates an optimized model instance with the best architectural and training parameters from a pool of five selected models, namely Temporal Convolutional Network (TCN), InceptionTime, MiniRocket, Residual Networks (ResNet), and Long Short-Term Memory (LSTM). The selection is based on the user's preferences regarding anomaly detection accuracy and computational cost. The framework employs Time-series Generative Adversarial Networks (TimeGAN) as the synthetic dataset generator. The generated model instances are evaluated using a combination of accuracy and computational cost metrics, including training time and memory, during the anomaly detection process. Performance evaluation of the framework was conducted using a dataset from a watershed, demonstrating consistent selection of the most fitting model instance that satisfies the user's preferences.
    Incidental Polysemanticity. (arXiv:2312.03096v1 [cs.LG])
    Polysemantic neurons (neurons that activate for a set of unrelated features) have been seen as a significant obstacle towards interpretability of task-optimized deep networks, with implications for AI safety. The classic origin story of polysemanticity is that the data contains more "features" than neurons, such that learning to perform a task forces the network to co-allocate multiple unrelated features to the same neuron, endangering our ability to understand the network's internal processing. In this work, we present a second and non-mutually exclusive origin story of polysemanticity. We show that polysemanticity can arise incidentally, even when there are ample neurons to represent all features in the data, using a combination of theory and experiments. This second type of polysemanticity occurs because random initialization can, by chance alone, initially assign multiple features to the same neuron, and the training dynamics then strengthen such overlap. Due to its origin, we term this \textit{incidental polysemanticity}.
    CAFE: Towards Compact, Adaptive, and Fast Embedding for Large-scale Recommendation Models. (arXiv:2312.03256v1 [cs.LG])
    Recently, the growing memory demands of embedding tables in Deep Learning Recommendation Models (DLRMs) pose great challenges for model training and deployment. Existing embedding compression solutions cannot simultaneously meet three key design requirements: memory efficiency, low latency, and adaptability to dynamic data distribution. This paper presents CAFE, a Compact, Adaptive, and Fast Embedding compression framework that addresses the above requirements. The design philosophy of CAFE is to dynamically allocate more memory resources to important features (called hot features), and allocate less memory to unimportant ones. In CAFE, we propose a fast and lightweight sketch data structure, named HotSketch, to capture feature importance and report hot features in real time. For each reported hot feature, we assign it a unique embedding. For the non-hot features, we allow multiple features to share one embedding by using hash embedding technique. Guided by our design philosophy, we further propose a multi-level hash embedding framework to optimize the embedding tables of non-hot features. We theoretically analyze the accuracy of HotSketch, and analyze the model convergence against deviation. Extensive experiments show that CAFE significantly outperforms existing embedding compression methods, yielding 3.92% and 3.68% superior testing AUC on Criteo Kaggle dataset and CriteoTB dataset at a compression ratio of 10000x. The source codes of CAFE are available at GitHub.
    Generalizable Neural Physics Solvers by Baldwinian Evolution. (arXiv:2312.03243v1 [cs.NE])
    Physics-informed neural networks (PINNs) are at the forefront of scientific machine learning, making possible the creation of machine intelligence that is cognizant of physical laws and able to accurately simulate them. In this paper, the potential of discovering PINNs that generalize over an entire family of physics tasks is studied, for the first time, through a biological lens of the Baldwin effect. Drawing inspiration from the neurodevelopment of precocial species that have evolved to learn, predict and react quickly to their environment, we envision PINNs that are pre-wired with connection strengths inducing strong biases towards efficient learning of physics. To this end, evolutionary selection pressure (guided by proficiency over a family of tasks) is coupled with lifetime learning (to specialize on a smaller subset of those tasks) to produce PINNs that demonstrate fast and physics-compliant prediction capabilities across a range of empirically challenging problem instances. The Baldwinian approach achieves an order of magnitude improvement in prediction accuracy at a fraction of the computation cost compared to state-of-the-art results with PINNs meta-learned by gradient descent. This paper marks a leap forward in the meta-learning of PINNs as generalizable physics solvers.
    Deep Learning for Fast Inference of Mechanistic Models' Parameters. (arXiv:2312.03166v1 [cs.LG])
    Inferring parameters of macro-kinetic growth models, typically represented by Ordinary Differential Equations (ODE), from the experimental data is a crucial step in bioprocess engineering. Conventionally, estimates of the parameters are obtained by fitting the mechanistic model to observations. Fitting, however, requires a significant computational power. Specifically, during the development of new bioprocesses that use previously unknown organisms or strains, efficient, robust, and computationally cheap methods for parameter estimation are of great value. In this work, we propose using Deep Neural Networks (NN) for directly predicting parameters of mechanistic models given observations. The approach requires spending computational resources for training a NN, nonetheless, once trained, such a network can provide parameter estimates orders of magnitude faster than conventional methods. We consider a training procedure that combines Neural Networks and mechanistic models. We demonstrate the performance of the proposed algorithms on data sampled from several mechanistic models used in bioengineering describing a typical industrial batch process and compare the proposed method, a typical gradient-based fitting procedure, and the combination of the two. We find that, while Neural Network estimates are slightly improved by further fitting, these estimates are measurably better than the fitting procedure alone.
    FERGI: Automatic Annotation of User Preferences for Text-to-Image Generation from Spontaneous Facial Expression Reaction. (arXiv:2312.03187v1 [cs.CV])
    Researchers have proposed to use data of human preference feedback to fine-tune text-to-image generative models. However, the scalability of human feedback collection has been limited by its reliance on manual annotation. Therefore, we develop and test a method to automatically annotate user preferences from their spontaneous facial expression reaction to the generated images. We collect a dataset of Facial Expression Reaction to Generated Images (FERGI) and show that the activations of multiple facial action units (AUs) are highly correlated with user evaluations of the generated images. Specifically, AU4 (brow lowerer) is most consistently reflective of negative evaluations of the generated image. This can be useful in two ways. Firstly, we can automatically annotate user preferences between image pairs with substantial difference in AU4 responses to them with an accuracy significantly outperforming state-of-the-art scoring models. Secondly, directly integrating the AU4 responses with the scoring models improves their consistency with human preferences. Additionally, the AU4 response best reflects the user's evaluation of the image fidelity, making it complementary to the state-of-the-art scoring models, which are generally better at reflecting image-text alignment. Finally, this method of automatic annotation with facial expression analysis can be potentially generalized to other generation tasks. The code is available at https://github.com/ShuangquanFeng/FERGI, and the dataset is also available at the same link for research purposes.
    Advantage of Quantum Machine Learning from General Computational Advantages. (arXiv:2312.03057v1 [quant-ph])
    An overarching milestone of quantum machine learning (QML) is to demonstrate the advantage of QML over all possible classical learning methods in accelerating a common type of learning task as represented by supervised learning with classical data. However, the provable advantages of QML in supervised learning have been known so far only for the learning tasks designed for using the advantage of specific quantum algorithms, i.e., Shor's algorithms. Here we explicitly construct an unprecedentedly broader family of supervised learning tasks with classical data to offer the provable advantage of QML based on general quantum computational advantages, progressing beyond Shor's algorithms. Our learning task is feasibly achievable by executing a general class of functions that can be computed efficiently in polynomial time for a large fraction of inputs by arbitrary quantum algorithms but not by any classical algorithm. We prove the hardness of achieving this learning task for any possible polynomial-time classical learning method. We also clarify protocols for preparing the classical data to demonstrate this learning task in experiments. These results open routes to exploit a variety of quantum advantages in computing functions for the experimental demonstration of the advantage of QML.
    Low-Cost High-Power Membership Inference by Boosting Relativity. (arXiv:2312.03262v1 [stat.ML])
    We present a robust membership inference attack (RMIA) that amplifies the distinction between population data and the training data on any target model, by effectively leveraging both reference models and reference data in our likelihood ratio test. Our algorithm exhibits superior test power (true-positive rate) when compared to prior methods, even at extremely low false-positive error rates (as low as 0). Also, under computation constraints, where only a limited number of reference models (as few as 1) are available, our method performs exceptionally well, unlike some prior attacks that approach random guessing in such scenarios. Our method lays the groundwork for cost-effective and practical yet powerful and robust privacy risk analysis of machine learning algorithms.  ( 2 min )
    Search Strategies for Self-driving Laboratories with Pending Experiments. (arXiv:2312.03466v1 [cs.LG])
    Self-driving laboratories (SDLs) consist of multiple stations that perform material synthesis and characterisation tasks. To minimize station downtime and maximize experimental throughput, it is practical to run experiments in asynchronous parallel, in which multiple experiments are being performed at once in different stages. Asynchronous parallelization of experiments, however, introduces delayed feedback (i.e. "pending experiments"), which is known to reduce Bayesian optimiser performance. Here, we build a simulator for a multi-stage SDL and compare optimisation strategies for dealing with delayed feedback and asynchronous parallelized operation. Using data from a real SDL, we build a ground truth Bayesian optimisation simulator from 177 previously run experiments for maximizing the conductivity of functional coatings. We then compare search strategies such as expected improvement, noisy expected improvement, 4-mode exploration and random sampling. We evaluate their performance in terms of amount of delay and problem dimensionality. Our simulation results showcase the trade-off between the asynchronous parallel operation and delayed feedback.
    A General Framework for Sequential Decision-Making under Adaptivity Constraints. (arXiv:2306.14468v3 [cs.LG] UPDATED)
    We take the first step in studying general sequential decision-making under two adaptivity constraints: rare policy switch and batch learning. First, we provide a general class called the Eluder Condition class, which includes a wide range of reinforcement learning classes. Then, for the rare policy switch constraint, we provide a generic algorithm to achieve a $\widetilde{\mathcal{O}}(\log K) $ switching cost with a $\widetilde{\mathcal{O}}(\sqrt{K})$ regret on the EC class. For the batch learning constraint, we provide an algorithm that provides a $\widetilde{\mathcal{O}}(\sqrt{K}+K/B)$ regret with the number of batches $B.$ This paper is the first work considering rare policy switch and batch learning under general function classes, which covers nearly all the models studied in the previous works such as tabular MDP (Bai et al. 2019; Zhang et al. 2020), linear MDP (Wang et al. 2021; Gao et al. 2021), low eluder dimension MDP (Kong et al. 2021; Gao et al. 2021), generalized linear function approximation (Qiao et al. 2023), and also some new classes such as the low $D_\Delta$-type Bellman eluder dimension problem, linear mixture MDP, kernelized nonlinear regulator and undercomplete partially observed Markov decision process (POMDP).
    Cone Ranking for Multi-Criteria Decision Making. (arXiv:2312.03006v1 [cs.AI])
    Recently introduced cone distribution functions from statistics are turned into multi-criteria decision making (MCDM) tools. It is demonstrated that this procedure can be considered as an upgrade of the weighted sum scalarization insofar as it absorbs a whole collection of weighted sum scalarizations at once instead of fixing a particular one in advance. Moreover, situations are characterized in which different types of rank reversal occur, and it is explained why this might even be useful for analyzing the ranking procedure. A few examples will be discussed and a potential application in machine learning is outlined.
    Learning Curricula in Open-Ended Worlds. (arXiv:2312.03126v1 [cs.AI])
    Deep reinforcement learning (RL) provides powerful methods for training optimal sequential decision-making agents. As collecting real-world interactions can entail additional costs and safety risks, the common paradigm of sim2real conducts training in a simulator, followed by real-world deployment. Unfortunately, RL agents easily overfit to the choice of simulated training environments, and worse still, learning ends when the agent masters the specific set of simulated environments. In contrast, the real world is highly open-ended, featuring endlessly evolving environments and challenges, making such RL approaches unsuitable. Simply randomizing over simulated environments is insufficient, as it requires making arbitrary distributional assumptions and can be combinatorially less likely to sample specific environment instances that are useful for learning. An ideal learning process should automatically adapt the training environment to maximize the learning potential of the agent over an open-ended task space that matches or surpasses the complexity of the real world. This thesis develops a class of methods called Unsupervised Environment Design (UED), which aim to produce such open-ended processes. Given an environment design space, UED automatically generates an infinite sequence or curriculum of training environments at the frontier of the learning agent's capabilities. Through extensive empirical studies and theoretical arguments founded on minimax-regret decision theory and game theory, the findings in this thesis show that UED autocurricula can produce RL agents exhibiting significantly improved robustness and generalization to previously unseen environment instances. Such autocurricula are promising paths toward open-ended learning systems that achieve more general intelligence by continually generating and mastering additional challenges of their own design.
    Foundation Models for Weather and Climate Data Understanding: A Comprehensive Survey. (arXiv:2312.03014v1 [cs.LG])
    As artificial intelligence (AI) continues to rapidly evolve, the realm of Earth and atmospheric sciences is increasingly adopting data-driven models, powered by progressive developments in deep learning (DL). Specifically, DL techniques are extensively utilized to decode the chaotic and nonlinear aspects of Earth systems, and to address climate challenges via understanding weather and climate data. Cutting-edge performance on specific tasks within narrower spatio-temporal scales has been achieved recently through DL. The rise of large models, specifically large language models (LLMs), has enabled fine-tuning processes that yield remarkable outcomes across various downstream tasks, thereby propelling the advancement of general AI. However, we are still navigating the initial stages of crafting general AI for weather and climate. In this survey, we offer an exhaustive, timely overview of state-of-the-art AI methodologies specifically engineered for weather and climate data, with a special focus on time series and text data. Our primary coverage encompasses four critical aspects: types of weather and climate data, principal model architectures, model scopes and applications, and datasets for weather and climate. Furthermore, in relation to the creation and application of foundation models for weather and climate data understanding, we delve into the field's prevailing challenges, offer crucial insights, and propose detailed avenues for future research. This comprehensive approach equips practitioners with the requisite knowledge to make substantial progress in this domain. Our survey encapsulates the most recent breakthroughs in research on large, data-driven models for weather and climate data understanding, emphasizing robust foundations, current advancements, practical applications, crucial resources, and prospective research opportunities.
    Transformer-Based Deep Learning Model for Bored Pile Load-Deformation Prediction in Bangkok Subsoil. (arXiv:2312.03041v1 [cs.LG])
    This paper presents a novel deep learning model based on the transformer architecture to predict the load-deformation behavior of large bored piles in Bangkok subsoil. The model encodes the soil profile and pile features as tokenization input, and generates the load-deformation curve as output. The model also incorporates the previous sequential data of load-deformation curve into the decoder to improve the prediction accuracy. The model also incorporates the previous sequential data of load-deformation curve into the decoder. The model shows a satisfactory accuracy and generalization ability for the load-deformation curve prediction, with a mean absolute error of 5.72% for the test data. The model could also be used for parametric analysis and design optimization of piles under different soil and pile conditions, pile cross section, pile length and type of pile.
    zkDL: Efficient Zero-Knowledge Proofs of Deep Learning Training. (arXiv:2307.16273v2 [cs.LG] UPDATED)
    The recent advancements in deep learning have brought about significant changes in various aspects of people's lives. Meanwhile, these rapid developments have raised concerns about the legitimacy of the training process of deep neural networks. To protect the intellectual properties of AI developers, directly examining the training process by accessing the model parameters and training data is often prohibited for verifiers. In response to this challenge, we present zero-knowledge deep learning (zkDL), an efficient zero-knowledge proof for deep learning training. To address the long-standing challenge of verifiable computations of non-linearities in deep learning training, we introduce zkReLU, a specialized proof for the ReLU activation and its backpropagation. zkReLU turns the disadvantage of non-arithmetic relations into an advantage, leading to the creation of FAC4DNN, our specialized arithmetic circuit design for modelling neural networks. This design aggregates the proofs over different layers and training steps, without being constrained by their sequential order in the training process. With our new CUDA implementation that achieves full compatibility with the tensor structures and the aggregated proof design, zkDL enables the generation of complete and sound proofs in less than a second per batch update for an 8-layer neural network with 10M parameters and a batch size of 64, while provably ensuring the privacy of data and model parameters. To our best knowledge, we are not aware of any existing work on zero-knowledge proof of deep learning training that is scalable to million-size networks.
    Diff-GO: Diffusion Goal-Oriented Communications to Achieve Ultra-High Spectrum Efficiency. (arXiv:2312.02984v1 [cs.LG])
    The latest advances in artificial intelligence (AI) present many unprecedented opportunities to achieve much improved bandwidth saving in communications. Unlike conventional communication systems focusing on packet transport, rich datasets and AI makes it possible to efficiently transfer only the information most critical to the goals of message recipients. One of the most exciting advances in generative AI known as diffusion model presents a unique opportunity for designing ultra-fast communication systems well beyond language-based messages. This work presents an ultra-efficient communication design by utilizing generative AI-based on diffusion models as a specific example of the general goal-oriented communication framework. To better control the regenerated message at the receiver output, our diffusion system design includes a local regeneration module with finite dimensional noise latent. The critical significance of noise latent control and sharing residing on our Diff-GO is the ability to introduce the concept of "local generative feedback" (Local-GF), which enables the transmitter to monitor the quality and gauge the quality or accuracy of the message recovery at the semantic system receiver. To this end, we propose a new low-dimensional noise space for the training of diffusion models, which significantly reduces the communication overhead and achieves satisfactory message recovery performance. Our experimental results demonstrate that the proposed noise space and the diffusion-based generative model achieve ultra-high spectrum efficiency and accurate recovery of transmitted image signals. By trading off computation for bandwidth efficiency (C4BE), this new framework provides an important avenue to achieve exceptional computation-bandwidth tradeoff.
    Deep Reinforcement Learning for Community Battery Scheduling under Uncertainties of Load, PV Generation, and Energy Prices. (arXiv:2312.03008v1 [cs.LG])
    In response to the growing uptake of distributed energy resources (DERs), community batteries have emerged as a promising solution to support renewable energy integration, reduce peak load, and enhance grid reliability. This paper presents a deep reinforcement learning (RL) strategy, centered around the soft actor-critic (SAC) algorithm, to schedule a community battery system in the presence of uncertainties, such as solar photovoltaic (PV) generation, local demand, and real-time energy prices. We position the community battery to play a versatile role, in integrating local PV energy, reducing peak load, and exploiting energy price fluctuations for arbitrage, thereby minimizing the system cost. To improve exploration and convergence during RL training, we utilize the noisy network technique. This paper conducts a comparative study of different RL algorithms, including proximal policy optimization (PPO) and deep deterministic policy gradient (DDPG) algorithms, to evaluate their effectiveness in the community battery scheduling problem. The results demonstrate the potential of RL in addressing community battery scheduling challenges and show that the SAC algorithm achieves the best performance compared to RL and optimization benchmarks.
    Sample-based Dynamic Hierarchical Transformer with Layer and Head Flexibility via Contextual Bandit. (arXiv:2312.03038v1 [cs.LG])
    Transformer requires a fixed number of layers and heads which makes them inflexible to the complexity of individual samples and expensive in training and inference. To address this, we propose a sample-based Dynamic Hierarchical Transformer (DHT) model whose layers and heads can be dynamically configured with single data samples via solving contextual bandit problems. To determine the number of layers and heads, we use the Uniform Confidence Bound while we deploy combinatorial Thompson Sampling in order to select specific head combinations given their number. Different from previous work that focuses on compressing trained networks for inference only, DHT is not only advantageous for adaptively optimizing the underlying network architecture during training but also has a flexible network for efficient inference. To the best of our knowledge, this is the first comprehensive data-driven dynamic transformer without any additional auxiliary neural networks that implement the dynamic system. According to the experiment results, we achieve up to 74% computational savings for both training and inference with a minimal loss of accuracy.
    PartSLIP++: Enhancing Low-Shot 3D Part Segmentation via Multi-View Instance Segmentation and Maximum Likelihood Estimation. (arXiv:2312.03015v1 [cs.CV])
    Open-world 3D part segmentation is pivotal in diverse applications such as robotics and AR/VR. Traditional supervised methods often grapple with limited 3D data availability and struggle to generalize to unseen object categories. PartSLIP, a recent advancement, has made significant strides in zero- and few-shot 3D part segmentation. This is achieved by harnessing the capabilities of the 2D open-vocabulary detection module, GLIP, and introducing a heuristic method for converting and lifting multi-view 2D bounding box predictions into 3D segmentation masks. In this paper, we introduce PartSLIP++, an enhanced version designed to overcome the limitations of its predecessor. Our approach incorporates two major improvements. First, we utilize a pre-trained 2D segmentation model, SAM, to produce pixel-wise 2D segmentations, yielding more precise and accurate annotations than the 2D bounding boxes used in PartSLIP. Second, PartSLIP++ replaces the heuristic 3D conversion process with an innovative modified Expectation-Maximization algorithm. This algorithm conceptualizes 3D instance segmentation as unobserved latent variables, and then iteratively refines them through an alternating process of 2D-3D matching and optimization with gradient descent. Through extensive evaluations, we show that PartSLIP++ demonstrates better performance over PartSLIP in both low-shot 3D semantic and instance-based object part segmentation tasks. Code released at https://github.com/zyc00/PartSLIP2.
    Multitask Learning Can Improve Worst-Group Outcomes. (arXiv:2312.03151v1 [cs.LG])
    In order to create machine learning systems that serve a variety of users well, it is vital to not only achieve high average performance but also ensure equitable outcomes across diverse groups. However, most machine learning methods are designed to improve a model's average performance on a chosen end task without consideration for their impact on worst group error. Multitask learning (MTL) is one such widely used technique. In this paper, we seek not only to understand the impact of MTL on worst-group accuracy but also to explore its potential as a tool to address the challenge of group-wise fairness. We primarily consider the common setting of fine-tuning a pre-trained model, where, following recent work (Gururangan et al., 2020; Dery et al., 2023), we multitask the end task with the pre-training objective constructed from the end task data itself. In settings with few or no group annotations, we find that multitasking often, but not always, achieves better worst-group accuracy than Just-Train-Twice (JTT; Liu et al. (2021)) -- a representative distributionally robust optimization (DRO) method. Leveraging insights from synthetic data experiments, we propose to modify standard MTL by regularizing the joint multitask representation space. We run a large number of fine-tuning experiments across computer vision and natural language and find that our regularized MTL approach consistently outperforms JTT on both worst and average group outcomes. Our official code can be found here: https://github.com/atharvajk98/MTL-group-robustness.
    Complementary Benefits of Contrastive Learning and Self-Training Under Distribution Shift. (arXiv:2312.03318v1 [cs.LG])
    Self-training and contrastive learning have emerged as leading techniques for incorporating unlabeled data, both under distribution shift (unsupervised domain adaptation) and when it is absent (semi-supervised learning). However, despite the popularity and compatibility of these techniques, their efficacy in combination remains unexplored. In this paper, we undertake a systematic empirical investigation of this combination, finding that (i) in domain adaptation settings, self-training and contrastive learning offer significant complementary gains; and (ii) in semi-supervised learning settings, surprisingly, the benefits are not synergistic. Across eight distribution shift datasets (e.g., BREEDs, WILDS), we demonstrate that the combined method obtains 3--8% higher accuracy than either approach independently. We then theoretically analyze these techniques in a simplified model of distribution shift, demonstrating scenarios under which the features produced by contrastive learning can yield a good initialization for self-training to further amplify gains and achieve optimal performance, even when either method alone would fail.
    Visual Data-Type Understanding does not emerge from Scaling Vision-Language Models. (arXiv:2310.08577v3 [cs.CV] UPDATED)
    Recent advances in the development of vision-language models (VLMs) are yielding remarkable success in recognizing visual semantic content, including impressive instances of compositional image understanding. Here, we introduce the novel task of Visual Data-Type Identification, a basic perceptual skill with implications for data curation (e.g., noisy data-removal from large datasets, domain-specific retrieval) and autonomous vision (e.g., distinguishing changing weather conditions from camera lens staining). We develop two datasets consisting of animal images altered across a diverse set of 27 visual data-types, spanning four broad categories. An extensive zero-shot evaluation of 39 VLMs, ranging from 100M to 80B parameters, shows a nuanced performance landscape. While VLMs are reasonably good at identifying certain stylistic \textit{data-types}, such as cartoons and sketches, they struggle with simpler data-types arising from basic manipulations like image rotations or additive noise. Our findings reveal that (i) model scaling alone yields marginal gains for contrastively-trained models like CLIP, and (ii) there is a pronounced drop in performance for the largest auto-regressively trained VLMs like OpenFlamingo. This finding points to a blind spot in current frontier VLMs: they excel in recognizing semantic content but fail to acquire an understanding of visual data-types through scaling. By analyzing the pre-training distributions of these models and incorporating data-type information into the captions during fine-tuning, we achieve a significant enhancement in performance. By exploring this previously uncharted task, we aim to set the stage for further advancing VLMs to equip them with visual data-type understanding. Code and datasets are released at https://github.com/bethgelab/DataTypeIdentification.
    Corporate Bankruptcy Prediction with Domain-Adapted BERT. (arXiv:2312.03194v1 [cs.CL])
    This study performs BERT-based analysis, which is a representative contextualized language model, on corporate disclosure data to predict impending bankruptcies. Prior literature on bankruptcy prediction mainly focuses on developing more sophisticated prediction methodologies with financial variables. However, in our study, we focus on improving the quality of input dataset. Specifically, we employ BERT model to perform sentiment analysis on MD&A disclosures. We show that BERT outperforms dictionary-based predictions and Word2Vec-based predictions in terms of adjusted R-square in logistic regression, k-nearest neighbor (kNN-5), and linear kernel support vector machine (SVM). Further, instead of pre-training the BERT model from scratch, we apply self-learning with confidence-based filtering to corporate disclosure data (10-K). We achieve the accuracy rate of 91.56% and demonstrate that the domain adaptation procedure brings a significant improvement in prediction accuracy.
    Learning Robust Output Control Barrier Functions from Safe Expert Demonstrations. (arXiv:2111.09971v2 [eess.SY] UPDATED)
    This paper addresses learning safe output feedback control laws from partial observations of expert demonstrations. We assume that a model of the system dynamics and a state estimator are available along with corresponding error bounds, e.g., estimated from data in practice. We first propose robust output control barrier functions (ROCBFs) as a means to guarantee safety, as defined through controlled forward invariance of a safe set. We then formulate an optimization problem to learn ROCBFs from expert demonstrations that exhibit safe system behavior, e.g., data collected from a human operator or an expert controller. When the parametrization of the ROCBF is linear, then we show that, under mild assumptions, the optimization problem is convex. Along with the optimization problem, we provide verifiable conditions in terms of the density of the data, smoothness of the system model and state estimator, and the size of the error bounds that guarantee validity of the obtained ROCBF. Towards obtaining a practical control algorithm, we propose an algorithmic implementation of our theoretical framework that accounts for assumptions made in our framework in practice. We empirically validate our algorithm in the autonomous driving simulator CARLA and demonstrate how to learn safe control laws from RGB camera images.
    Breast Ultrasound Report Generation using LangChain. (arXiv:2312.03013v1 [eess.IV])
    Breast ultrasound (BUS) is a critical diagnostic tool in the field of breast imaging, aiding in the early detection and characterization of breast abnormalities. Interpreting breast ultrasound images commonly involves creating comprehensive medical reports, containing vital information to promptly assess the patient's condition. However, the ultrasound imaging system necessitates capturing multiple images of various parts to compile a single report, presenting a time-consuming challenge. To address this problem, we propose the integration of multiple image analysis tools through a LangChain using Large Language Models (LLM), into the breast reporting process. Through a combination of designated tools and text generation through LangChain, our method can accurately extract relevant features from ultrasound images, interpret them in a clinical context, and produce comprehensive and standardized reports. This approach not only reduces the burden on radiologists and healthcare professionals but also enhances the consistency and quality of reports. The extensive experiments shows that each tools involved in the proposed method can offer qualitatively and quantitatively significant results. Furthermore, clinical evaluation on the generated reports demonstrates that the proposed method can make report in clinically meaningful way.  ( 2 min )
    Direct Exoplanet Detection Using Deep Convolutional Image Reconstruction (ConStruct): A New Algorithm for Post-Processing High-Contrast Images. (arXiv:2312.03671v1 [astro-ph.IM])
    We present a novel machine-learning approach for detecting faint point sources in high-contrast adaptive optics imaging datasets. The most widely used algorithms for primary subtraction aim to decouple bright stellar speckle noise from planetary signatures by subtracting an approximation of the temporally evolving stellar noise from each frame in an imaging sequence. Our approach aims to improve the stellar noise approximation and increase the planet detection sensitivity by leveraging deep learning in a novel direct imaging post-processing algorithm. We show that a convolutional autoencoder neural network, trained on an extensive reference library of real imaging sequences, accurately reconstructs the stellar speckle noise at the location of a potential planet signal. This tool is used in a post-processing algorithm we call Direct Exoplanet Detection with Convolutional Image Reconstruction, or ConStruct. The reliability and sensitivity of ConStruct are assessed using real Keck/NIRC2 angular differential imaging datasets. Of the 30 unique point sources we examine, ConStruct yields a higher S/N than traditional PCA-based processing for 67$\%$ of the cases and improves the relative contrast by up to a factor of 2.6. This work demonstrates the value and potential of deep learning to take advantage of a diverse reference library of point spread function realizations to improve direct imaging post-processing. ConStruct and its future improvements may be particularly useful as tools for post-processing high-contrast images from the James Webb Space Telescope and extreme adaptive optics instruments, both for the current generation and those being designed for the upcoming 30 meter-class telescopes.
    Physics Informed Neural Networks for Simulating Radiative Transfer. (arXiv:2009.13291v3 [cs.LG] UPDATED)
    We propose a novel machine learning algorithm for simulating radiative transfer. Our algorithm is based on physics informed neural networks (PINNs), which are trained by minimizing the residual of the underlying radiative tranfer equations. We present extensive experiments and theoretical error estimates to demonstrate that PINNs provide a very easy to implement, fast, robust and accurate method for simulating radiative transfer. We also present a PINN based algorithm for simulating inverse problems for radiative transfer efficiently.
    Feature-Learning Networks Are Consistent Across Widths At Realistic Scales. (arXiv:2305.18411v2 [cs.LG] UPDATED)
    We study the effect of width on the dynamics of feature-learning neural networks across a variety of architectures and datasets. Early in training, wide neural networks trained on online data have not only identical loss curves but also agree in their point-wise test predictions throughout training. For simple tasks such as CIFAR-5m this holds throughout training for networks of realistic widths. We also show that structural properties of the models, including internal representations, preactivation distributions, edge of stability phenomena, and large learning rate effects are consistent across large widths. This motivates the hypothesis that phenomena seen in realistic models can be captured by infinite-width, feature-learning limits. For harder tasks (such as ImageNet and language modeling), and later training times, finite-width deviations grow systematically. Two distinct effects cause these deviations across widths. First, the network output has initialization-dependent variance scaling inversely with width, which can be removed by ensembling networks. We observe, however, that ensembles of narrower networks perform worse than a single wide network. We call this the bias of narrower width. We conclude with a spectral perspective on the origin of this finite-width bias.
    On the Role of the Action Space in Robot Manipulation Learning and Sim-to-Real Transfer. (arXiv:2312.03673v1 [cs.RO])
    We study the choice of action space in robot manipulation learning and sim-to-real transfer. We define metrics that assess the performance, and examine the emerging properties in the different action spaces. We train over 250 reinforcement learning~(RL) agents in simulated reaching and pushing tasks, using 13 different control spaces. The choice of action spaces spans popular choices in the literature as well as novel combinations of common design characteristics. We evaluate the training performance in simulation and the transfer to a real-world environment. We identify good and bad characteristics of robotic action spaces and make recommendations for future designs. Our findings have important implications for the design of RL algorithms for robot manipulation tasks, and highlight the need for careful consideration of action spaces when training and transferring RL agents for real-world robotics.
    Active Learning for Abrupt Shifts Change-point Detection via Derivative-Aware Gaussian Processes. (arXiv:2312.03176v1 [cs.LG])
    Change-point detection (CPD) is crucial for identifying abrupt shifts in data, which influence decision-making and efficient resource allocation across various domains. To address the challenges posed by the costly and time-intensive data acquisition in CPD, we introduce the Derivative-Aware Change Detection (DACD) method. It leverages the derivative process of a Gaussian process (GP) for Active Learning (AL), aiming to pinpoint change-point locations effectively. DACD balances the exploitation and exploration of derivative processes through multiple data acquisition functions (AFs). By utilizing GP derivative mean and variance as criteria, DACD sequentially selects the next sampling data point, thus enhancing algorithmic efficiency and ensuring reliable and accurate results. We investigate the effectiveness of DACD method in diverse scenarios and show it outperforms other active learning change-point detection approaches.
    Data is Overrated: Perceptual Metrics Can Lead Learning in the Absence of Training Data. (arXiv:2312.03455v1 [cs.SD])
    Perceptual metrics are traditionally used to evaluate the quality of natural signals, such as images and audio. They are designed to mimic the perceptual behaviour of human observers and usually reflect structures found in natural signals. This motivates their use as loss functions for training generative models such that models will learn to capture the structure held in the metric. We take this idea to the extreme in the audio domain by training a compressive autoencoder to reconstruct uniform noise, in lieu of natural data. We show that training with perceptual losses improves the reconstruction of spectrograms and re-synthesized audio at test time over models trained with a standard Euclidean loss. This demonstrates better generalisation to unseen natural signals when using perceptual metrics.
    Molecule Joint Auto-Encoding: Trajectory Pretraining with 2D and 3D Diffusion. (arXiv:2312.03475v1 [cs.LG])
    Recently, artificial intelligence for drug discovery has raised increasing interest in both machine learning and chemistry domains. The fundamental building block for drug discovery is molecule geometry and thus, the molecule's geometrical representation is the main bottleneck to better utilize machine learning techniques for drug discovery. In this work, we propose a pretraining method for molecule joint auto-encoding (MoleculeJAE). MoleculeJAE can learn both the 2D bond (topology) and 3D conformation (geometry) information, and a diffusion process model is applied to mimic the augmented trajectories of such two modalities, based on which, MoleculeJAE will learn the inherent chemical structure in a self-supervised manner. Thus, the pretrained geometrical representation in MoleculeJAE is expected to benefit downstream geometry-related tasks. Empirically, MoleculeJAE proves its effectiveness by reaching state-of-the-art performance on 15 out of 20 tasks by comparing it with 12 competitive baselines.
    Constrained Bayesian Optimization Under Partial Observations: Balanced Improvements and Provable Convergence. (arXiv:2312.03212v1 [cs.LG])
    The partially observable constrained optimization problems (POCOPs) impede data-driven optimization techniques since an infeasible solution of POCOPs can provide little information about the objective as well as the constraints. We endeavor to design an efficient and provable method for expensive POCOPs under the framework of constrained Bayesian optimization. Our method consists of two key components. Firstly, we present an improved design of the acquisition functions that introduces balanced exploration during optimization. We rigorously study the convergence properties of this design to demonstrate its effectiveness. Secondly, we propose a Gaussian process embedding different likelihoods as the surrogate model for a partially observable constraint. This model leads to a more accurate representation of the feasible regions compared to traditional classification-based models. Our proposed method is empirically studied on both synthetic and real-world problems. The results demonstrate the competitiveness of our method for solving POCOPs.
    REST: Enhancing Group Robustness in DNNs through Reweighted Sparse Training. (arXiv:2312.03044v1 [cs.LG])
    The deep neural network (DNN) has been proven effective in various domains. However, they often struggle to perform well on certain minority groups during inference, despite showing strong performance on the majority of data groups. This is because over-parameterized models learned \textit{bias attributes} from a large number of \textit{bias-aligned} training samples. These bias attributes are strongly spuriously correlated with the target variable, causing the models to be biased towards spurious correlations (i.e., \textit{bias-conflicting}). To tackle this issue, we propose a novel \textbf{re}weighted \textbf{s}parse \textbf{t}raining framework, dubbed as \textit{\textbf{REST}}, which aims to enhance the performance of biased data while improving computation and memory efficiency. Our proposed REST framework has been experimentally validated on three datasets, demonstrating its effectiveness in exploring unbiased subnetworks. We found that REST reduces the reliance on spuriously correlated features, leading to better performance across a wider range of data groups with fewer training and inference resources. We highlight that the \textit{REST} framework represents a promising approach for improving the performance of DNNs on biased data, while simultaneously improving computation and memory efficiency. By reducing the reliance on spurious correlations, REST has the potential to enhance the robustness of DNNs and improve their generalization capabilities. Code is released at \url{https://github.com/zhao1402072392/REST}
    Generalized Contrastive Divergence: Joint Training of Energy-Based Model and Diffusion Model through Inverse Reinforcement Learning. (arXiv:2312.03397v1 [cs.LG])
    We present Generalized Contrastive Divergence (GCD), a novel objective function for training an energy-based model (EBM) and a sampler simultaneously. GCD generalizes Contrastive Divergence (Hinton, 2002), a celebrated algorithm for training EBM, by replacing Markov Chain Monte Carlo (MCMC) distribution with a trainable sampler, such as a diffusion model. In GCD, the joint training of EBM and a diffusion model is formulated as a minimax problem, which reaches an equilibrium when both models converge to the data distribution. The minimax learning with GCD bears interesting equivalence to inverse reinforcement learning, where the energy corresponds to a negative reward, the diffusion model is a policy, and the real data is expert demonstrations. We present preliminary yet promising results showing that joint training is beneficial for both EBM and a diffusion model. GCD enables EBM training without MCMC while improving the sample quality of a diffusion model.
    Publicly available datasets of breast histopathology H&E whole-slide images: A scoping review. (arXiv:2306.01546v2 [eess.IV] UPDATED)
    Advancements in digital pathology and computing resources have made a significant impact in the field of computational pathology for breast cancer diagnosis and treatment. However, access to high-quality labeled histopathological images of breast cancer is a big challenge that limits the development of accurate and robust deep learning models. In this scoping review, we identified the publicly available datasets of breast H&E stained whole-slide images (WSI) that can be used to develop deep learning algorithms. We systematically searched nine scientific literature databases and nine research data repositories and found 17 publicly available datasets containing 10385 H&E WSIs of breast cancer. Moreover, we reported image metadata and characteristics for each dataset to assist researchers in selecting proper datasets for specific tasks in breast cancer computational pathology. In addition, we compiled two lists of breast H&E patches and private datasets as supplementary resources for researchers. Notably, only 28% of the included articles utilized multiple datasets, and only 14% used an external validation set, suggesting that the performance of other developed models may be susceptible to overestimation. The TCGA-BRCA was used in 52% of the selected studies. This dataset has a considerable selection bias that can impact the robustness and generalizability of the trained algorithms. There is also a lack of consistent metadata reporting of breast WSI datasets that can be an issue in developing accurate deep learning models, indicating the necessity of establishing explicit guidelines for documenting breast WSI dataset characteristics and metadata.
    SDSRA: A Skill-Driven Skill-Recombination Algorithm for Efficient Policy Learning. (arXiv:2312.03216v1 [cs.LG])
    In this paper, we introduce a novel algorithm - the Skill-Driven Skill Recombination Algorithm (SDSRA) - an innovative framework that significantly enhances the efficiency of achieving maximum entropy in reinforcement learning tasks. We find that SDSRA achieves faster convergence compared to the traditional Soft Actor-Critic (SAC) algorithm and produces improved policies. By integrating skill-based strategies within the robust Actor-Critic framework, SDSRA demonstrates remarkable adaptability and performance across a wide array of complex and diverse benchmarks.
    Measuring Misogyny in Natural Language Generation: Preliminary Results from a Case Study on two Reddit Communities. (arXiv:2312.03330v1 [cs.CL])
    Generic `toxicity' classifiers continue to be used for evaluating the potential for harm in natural language generation, despite mounting evidence of their shortcomings. We consider the challenge of measuring misogyny in natural language generation, and argue that generic `toxicity' classifiers are inadequate for this task. We use data from two well-characterised `Incel' communities on Reddit that differ primarily in their degrees of misogyny to construct a pair of training corpora which we use to fine-tune two language models. We show that an open source `toxicity' classifier is unable to distinguish meaningfully between generations from these models. We contrast this with a misogyny-specific lexicon recently proposed by feminist subject-matter experts, demonstrating that, despite the limitations of simple lexicon-based approaches, this shows promise as a benchmark to evaluate language models for misogyny, and that it is sensitive enough to reveal the known differences in these Reddit communities. Our preliminary findings highlight the limitations of a generic approach to evaluating harms, and further emphasise the need for careful benchmark design and selection in natural language evaluation.
    Training on Synthetic Data Beats Real Data in Multimodal Relation Extraction. (arXiv:2312.03025v1 [cs.AI])
    The task of multimodal relation extraction has attracted significant research attention, but progress is constrained by the scarcity of available training data. One natural thought is to extend existing datasets with cross-modal generative models. In this paper, we consider a novel problem setting, where only unimodal data, either text or image, are available during training. We aim to train a multimodal classifier from synthetic data that perform well on real multimodal test data. However, training with synthetic data suffers from two obstacles: lack of data diversity and label information loss. To alleviate the issues, we propose Mutual Information-aware Multimodal Iterated Relational dAta GEneration (MI2RAGE), which applies Chained Cross-modal Generation (CCG) to promote diversity in the generated data and exploits a teacher network to select valuable training samples with high mutual information with the ground-truth labels. Comparing our method to direct training on synthetic data, we observed a significant improvement of 24.06% F1 with synthetic text and 26.42% F1 with synthetic images. Notably, our best model trained on completely synthetic images outperforms prior state-of-the-art models trained on real multimodal data by a margin of 3.76% in F1. Our codebase will be made available upon acceptance.
    Few-Shot Anomaly Detection with Adversarial Loss for Robust Feature Representations. (arXiv:2312.03005v1 [cs.LG])
    Anomaly detection is a critical and challenging task that aims to identify data points deviating from normal patterns and distributions within a dataset. Various methods have been proposed using a one-class-one-model approach, but these techniques often face practical problems such as memory inefficiency and the requirement of sufficient data for training. In particular, few-shot anomaly detection presents significant challenges in industrial applications, where limited samples are available before mass production. In this paper, we propose a few-shot anomaly detection method that integrates adversarial training loss to obtain more robust and generalized feature representations. We utilize the adversarial loss previously employed in domain adaptation to align feature distributions between source and target domains, to enhance feature robustness and generalization in few-shot anomaly detection tasks. We hypothesize that adversarial loss is effective when applied to features that should have similar characteristics, such as those from the same layer in a Siamese network's parallel branches or input-output pairs of reconstruction-based methods. Experimental results demonstrate that the proposed method generally achieves better performance when utilizing the adversarial loss.
    AI-driven emergence of frequency information non-uniform distribution via THz metasurface spectrum prediction. (arXiv:2312.03017v1 [cs.LG])
    Recently, artificial intelligence has been extensively deployed across various scientific disciplines, optimizing and guiding the progression of experiments through the integration of abundant datasets, whilst continuously probing the vast theoretical space encapsulated within the data. Particularly, deep learning models, due to their end-to-end adaptive learning capabilities, are capable of autonomously learning intrinsic data features, thereby transcending the limitations of traditional experience to a certain extent. Here, we unveil previously unreported information characteristics pertaining to different frequencies emerged during our work on predicting the terahertz spectral modulation effects of metasurfaces based on AI-prediction. Moreover, we have substantiated that our proposed methodology of simply adding supplementary multi-frequency inputs to the existing dataset during the target spectral prediction process can significantly enhance the predictive accuracy of the network. This approach effectively optimizes the utilization of existing datasets and paves the way for interdisciplinary research and applications in artificial intelligence, chemistry, composite material design, biomedicine, and other fields.
    Balanced Marginal and Joint Distributional Learning via Mixture Cramer-Wold Distance. (arXiv:2312.03307v1 [stat.ML])
    In the process of training a generative model, it becomes essential to measure the discrepancy between two high-dimensional probability distributions: the generative distribution and the ground-truth distribution of the observed dataset. Recently, there has been growing interest in an approach that involves slicing high-dimensional distributions, with the Cramer-Wold distance emerging as a promising method. However, we have identified that the Cramer-Wold distance primarily focuses on joint distributional learning, whereas understanding marginal distributional patterns is crucial for effective synthetic data generation. In this paper, we introduce a novel measure of dissimilarity, the mixture Cramer-Wold distance. This measure enables us to capture both marginal and joint distributional information simultaneously, as it incorporates a mixture measure with point masses on standard basis vectors. Building upon the mixture Cramer-Wold distance, we propose a new generative model called CWDAE (Cramer-Wold Distributional AutoEncoder), which shows remarkable performance in generating synthetic data when applied to real tabular datasets. Furthermore, our model offers the flexibility to adjust the level of data privacy with ease.
    A Waddington landscape for prototype learning in generalized Hopfield networks. (arXiv:2312.03012v1 [cond-mat.dis-nn])
    Networks in machine learning offer examples of complex high-dimensional dynamical systems reminiscent of biological systems. Here, we study the learning dynamics of Generalized Hopfield networks, which permit a visualization of internal memories. These networks have been shown to proceed through a 'feature-to-prototype' transition, as the strength of network nonlinearity is increased, wherein the learned, or terminal, states of internal memories transition from mixed to pure states. Focusing on the prototype learning dynamics of the internal memories we observe a strong resemblance to the canalized, or low-dimensional, dynamics of cells as they differentiate within a Waddingtonian landscape. Dynamically, we demonstrate that learning in a Generalized Hopfield Network proceeds through sequential 'splits' in memory space. Furthermore, order of splitting is interpretable and reproducible. The dynamics between the splits are canalized in the Waddington sense -- robust to variations in detailed aspects of the system. In attempting to make the analogy a rigorous equivalence, we study smaller subsystems that exhibit similar properties to the full system. We combine analytical calculations with numerical simulations to study the dynamical emergence of the feature-to-prototype transition, and the behaviour of splits in the landscape, saddles points, visited during learning. We exhibit regimes where saddles appear and disappear through saddle-node bifurcations, qualitatively changing the distribution of learned memories as the strength of the nonlinearity is varied -- allowing us to systematically investigate the mechanisms that underlie the emergence of Waddingtonian dynamics. Memories can thus differentiate in a predictive and controlled way, revealing new bridges between experimental biology, dynamical systems theory, and machine learning.
    Fully Convolutional Slice-to-Volume Reconstruction for Single-Stack MRI. (arXiv:2312.03102v1 [eess.IV])
    In magnetic resonance imaging (MRI), slice-to-volume reconstruction (SVR) refers to computational reconstruction of an unknown 3D magnetic resonance volume from stacks of 2D slices corrupted by motion. While promising, current SVR methods require multiple slice stacks for accurate 3D reconstruction, leading to long scans and limiting their use in time-sensitive applications such as fetal fMRI. Here, we propose a SVR method that overcomes the shortcomings of previous work and produces state-of-the-art reconstructions in the presence of extreme inter-slice motion. Inspired by the recent success of single-view depth estimation methods, we formulate SVR as a single-stack motion estimation task and train a fully convolutional network to predict a motion stack for a given slice stack, producing a 3D reconstruction as a byproduct of the predicted motion. Extensive experiments on the SVR of adult and fetal brains demonstrate that our fully convolutional method is twice as accurate as previous SVR methods. Our code is available at github.com/seannz/svr.
    A Hardware Evaluation Framework for Large Language Model Inference. (arXiv:2312.03134v1 [cs.AR])
    The past year has witnessed the increasing popularity of Large Language Models (LLMs). Their unprecedented scale and associated high hardware cost have impeded their broader adoption, calling for efficient hardware designs. With the large hardware needed to simply run LLM inference, evaluating different hardware designs becomes a new bottleneck. This work introduces LLMCompass, a hardware evaluation framework for LLM inference workloads. LLMCompass is fast, accurate, versatile, and able to describe and evaluate different hardware designs. LLMCompass includes a mapper to automatically find performance-optimal mapping and scheduling. It also incorporates an area-based cost model to help architects reason about their design choices. Compared to real-world hardware, LLMCompass' estimated latency achieves an average 10.4% error rate across various operators with various input sizes and an average 4.1% error rate for LLM inference. With LLMCompass, simulating a 4-NVIDIA A100 GPU node running GPT-3 175B inference can be done within 16 minutes on commodity hardware, including 26,400 rounds of the mapper's parameter search. With the aid of LLMCompass, this work draws architectural implications and explores new cost-effective hardware designs. By reducing the compute capability or replacing High Bandwidth Memory (HBM) with traditional DRAM, these new designs can achieve as much as 3.41x improvement in performance/cost compared to an NVIDIA A100, making them promising choices for democratizing LLMs. LLMCompass is planned to be fully open-source.
    Simulation-Based Inference of Surface Accumulation and Basal Melt Rates of an Antarctic Ice Shelf from Isochronal Layers. (arXiv:2312.02997v1 [physics.ao-ph])
    The ice shelves buttressing the Antarctic ice sheet determine the rate of ice-discharge into the surrounding oceans. The geometry of ice shelves, and hence their buttressing strength, is determined by ice flow as well as by the local surface accumulation and basal melt rates, governed by atmospheric and oceanic conditions. Contemporary methods resolve one of these rates, but typically not both. Moreover, there is little information of how they changed in time. We present a new method to simultaneously infer the surface accumulation and basal melt rates averaged over decadal and centennial timescales. We infer the spatial dependence of these rates along flow line transects using internal stratigraphy observed by radars, using a kinematic forward model of internal stratigraphy. We solve the inverse problem using simulation-based inference (SBI). SBI performs Bayesian inference by training neural networks on simulations of the forward model to approximate the posterior distribution, allowing us to also quantify uncertainties over the inferred parameters. We demonstrate the validity of our method on a synthetic example, and apply it to Ekstr\"om Ice Shelf, Antarctica, for which newly acquired radar measurements are available. We obtain posterior distributions of surface accumulation and basal melt averaging over 42, 84, 146, and 188 years before 2022. Our results suggest stable atmospheric and oceanographic conditions over this period in this catchment of Antarctica. Use of observed internal stratigraphy can separate the effects of surface accumulation and basal melt, allowing them to be interpreted in a historical context of the last centuries and beyond.  ( 3 min )
    Data-Driven Traffic Reconstruction and Kernel Methods for Identifying Stop-and-Go Congestion. (arXiv:2312.03186v1 [cs.LG])
    Identifying stop-and-go events (SAGs) in traffic flow presents an important avenue for advancing data-driven research for climate change mitigation and sustainability, owing to their substantial impact on carbon emissions, travel time, fuel consumption, and roadway safety. In fact, SAGs are estimated to account for 33-50% of highway driving externalities. However, insufficient attention has been paid to precisely quantifying where, when, and how much these SAGs take place -necessary for downstream decision making, such as intervention design and policy analysis. A key challenge is that the data available to researchers and governments are typically sparse and aggregated to a granularity that obscures SAGs. To overcome such data limitations, this study thus explores the use of traffic reconstruction techniques for SAG identification. In particular, we introduce a kernel-based method for identifying spatio-temporal features in traffic and leverage bootstrapping to quantify the uncertainty of the reconstruction process. Experimental results on California highway data demonstrate the promise of the method for capturing SAGs. This work contributes to a foundation for data-driven decision making to advance sustainability of traffic systems.
    Generating Interpretable Networks using Hypernetworks. (arXiv:2312.03051v1 [cs.LG])
    An essential goal in mechanistic interpretability to decode a network, i.e., to convert a neural network's raw weights to an interpretable algorithm. Given the difficulty of the decoding problem, progress has been made to understand the easier encoding problem, i.e., to convert an interpretable algorithm into network weights. Previous works focus on encoding existing algorithms into networks, which are interpretable by definition. However, focusing on encoding limits the possibility of discovering new algorithms that humans have never stumbled upon, but that are nevertheless interpretable. In this work, we explore the possibility of using hypernetworks to generate interpretable networks whose underlying algorithms are not yet known. The hypernetwork is carefully designed such that it can control network complexity, leading to a diverse family of interpretable algorithms ranked by their complexity. All of them are interpretable in hindsight, although some of them are less intuitive to humans, hence providing new insights regarding how to "think" like a neural network. For the task of computing L1 norms, hypernetworks find three algorithms: (a) the double-sided algorithm, (b) the convexity algorithm, (c) the pudding algorithm, although only the first algorithm was expected by the authors before experiments. We automatically classify these algorithms and analyze how these algorithmic phases develop during training, as well as how they are affected by complexity control. Furthermore, we show that a trained hypernetwork can correctly construct models for input dimensions not seen in training, demonstrating systematic generalization.
    HybridNeRF: Efficient Neural Rendering via Adaptive Volumetric Surfaces. (arXiv:2312.03160v1 [cs.CV])
    Neural radiance fields provide state-of-the-art view synthesis quality but tend to be slow to render. One reason is that they make use of volume rendering, thus requiring many samples (and model queries) per ray at render time. Although this representation is flexible and easy to optimize, most real-world objects can be modeled more efficiently with surfaces instead of volumes, requiring far fewer samples per ray. This observation has spurred considerable progress in surface representations such as signed distance functions, but these may struggle to model semi-opaque and thin structures. We propose a method, HybridNeRF, that leverages the strengths of both representations by rendering most objects as surfaces while modeling the (typically) small fraction of challenging regions volumetrically. We evaluate HybridNeRF against the challenging Eyeful Tower dataset along with other commonly used view synthesis datasets. When comparing to state-of-the-art baselines, including recent rasterization-based approaches, we improve error rates by 15-30% while achieving real-time framerates (at least 36 FPS) for virtual-reality resolutions (2Kx2K).
  • Open

    On the variants of SVM methods applied to GPR data to classify tack coat characteristics in French pavements: two experimental case studies. (arXiv:2312.03351v1 [stat.ML])
    Among the commonly used non-destructive techniques, the Ground Penetrating Radar (GPR) is one of the most widely adopted today for assessing pavement conditions in France. However, conventional radar systems and their forward processing methods have shown their limitations for the physical and geometrical characterization of very thin layers such as tack coats. However, the use of Machine Learning methods applied to GPR with an inverse approach showed that it was numerically possible to identify the tack coat characteristics despite masking effects due to low timefrequency resolution noted in the raw B-scans. Thus, we propose in this paper to apply the inverse approach based on Machine Learning, already validated in previous works on numerical data, on two experimental cases with different pavement structures. The first case corresponds to a validation on known pavement structures on the Gustave Eiffel University (Nantes, France) with its pavement fatigue carousel and the second case focuses on a new real road in Vend{\'e}e department (France). In both case studies, the performances of SVM/SVR methods showed the efficiency of supervised learning methods to classify and estimate the emulsion proportioning in the tack coats.  ( 3 min )
    T-Cal: An optimal test for the calibration of predictive models. (arXiv:2203.01850v4 [stat.ML] UPDATED)
    The prediction accuracy of machine learning methods is steadily increasing, but the calibration of their uncertainty predictions poses a significant challenge. Numerous works focus on obtaining well-calibrated predictive models, but less is known about reliably assessing model calibration. This limits our ability to know when algorithms for improving calibration have a real effect, and when their improvements are merely artifacts due to random noise in finite datasets. In this work, we consider detecting mis-calibration of predictive models using a finite validation dataset as a hypothesis testing problem. The null hypothesis is that the predictive model is calibrated, while the alternative hypothesis is that the deviation from calibration is sufficiently large. We find that detecting mis-calibration is only possible when the conditional probabilities of the classes are sufficiently smooth functions of the predictions. When the conditional class probabilities are H\"older continuous, we propose T-Cal, a minimax optimal test for calibration based on a debiased plug-in estimator of the $\ell_2$-Expected Calibration Error (ECE). We further propose Adaptive T-Cal, a version that is adaptive to unknown smoothness. We verify our theoretical findings with a broad range of experiments, including with several popular deep neural net architectures and several standard post-hoc calibration methods. T-Cal is a practical general-purpose tool, which -- combined with classical tests for discrete-valued predictors -- can be used to test the calibration of virtually any probabilistic classification method.  ( 3 min )
    Divide, Evaluate, and Refine: Evaluating and Improving Text-to-Image Alignment with Iterative VQA Feedback. (arXiv:2307.04749v2 [cs.CV] UPDATED)
    The field of text-conditioned image generation has made unparalleled progress with the recent advent of latent diffusion models. While remarkable, as the complexity of given text input increases, the state-of-the-art diffusion models may still fail in generating images which accurately convey the semantics of the given prompt. Furthermore, it has been observed that such misalignments are often left undetected by pretrained multi-modal models such as CLIP. To address these problems, in this paper we explore a simple yet effective decompositional approach towards both evaluation and improvement of text-to-image alignment. In particular, we first introduce a Decompositional-Alignment-Score which given a complex prompt decomposes it into a set of disjoint assertions. The alignment of each assertion with generated images is then measured using a VQA model. Finally, alignment scores for different assertions are combined aposteriori to give the final text-to-image alignment score. Experimental analysis reveals that the proposed alignment metric shows significantly higher correlation with human ratings as opposed to traditional CLIP, BLIP scores. Furthermore, we also find that the assertion level alignment scores provide a useful feedback which can then be used in a simple iterative procedure to gradually increase the expression of different assertions in the final image outputs. Human user studies indicate that the proposed approach surpasses previous state-of-the-art by 8.7% in overall text-to-image alignment accuracy. Project page for our paper is available at https://1jsingh.github.io/divide-evaluate-and-refine
    Functional Flow Matching. (arXiv:2305.17209v2 [cs.LG] UPDATED)
    We propose Functional Flow Matching (FFM), a function-space generative model that generalizes the recently-introduced Flow Matching model to operate in infinite-dimensional spaces. Our approach works by first defining a path of probability measures that interpolates between a fixed Gaussian measure and the data distribution, followed by learning a vector field on the underlying space of functions that generates this path of measures. Our method does not rely on likelihoods or simulations, making it well-suited to the function space setting. We provide both a theoretical framework for building such models and an empirical evaluation of our techniques. We demonstrate through experiments on several real-world benchmarks that our proposed FFM method outperforms several recently proposed function-space generative models.  ( 2 min )
    Function-Space Optimality of Neural Architectures With Multivariate Nonlinearities. (arXiv:2310.03696v2 [stat.ML] UPDATED)
    We investigate the function-space optimality (specifically, the Banach-space optimality) of a large class of shallow neural architectures with multivariate nonlinearities/activation functions. To that end, we construct a new family of Banach spaces defined via a regularization operator, the $k$-plane transform, and a sparsity-promoting norm. We prove a representer theorem that states that the solution sets to learning problems posed over these Banach spaces are completely characterized by neural architectures with multivariate nonlinearities. These optimal architectures have skip connections and are tightly connected to orthogonal weight normalization and multi-index models, both of which have received recent interest in the neural network community. Our framework is compatible with a number of classical nonlinearities including the rectified linear unit (ReLU) activation function, the norm activation function, and the radial basis functions found in the theory of thin-plate/polyharmonic splines. We also show that the underlying spaces are special instances of reproducing kernel Banach spaces and variation spaces. Our results shed light on the regularity of functions learned by neural networks trained on data, particularly with multivariate nonlinearities, and provide new theoretical motivation for several architectural choices found in practice.  ( 2 min )
    On the Nystrom Approximation for Preconditioning in Kernel Machines. (arXiv:2312.03311v1 [stat.ML])
    Kernel methods are a popular class of nonlinear predictive models in machine learning. Scalable algorithms for learning kernel models need to be iterative in nature, but convergence can be slow due to poor conditioning. Spectral preconditioning is an important tool to speed-up the convergence of such iterative algorithms for training kernel models. However computing and storing a spectral preconditioner can be expensive which can lead to large computational and storage overheads, precluding the application of kernel methods to problems with large datasets. A Nystrom approximation of the spectral preconditioner is often cheaper to compute and store, and has demonstrated success in practical applications. In this paper we analyze the trade-offs of using such an approximated preconditioner. Specifically, we show that a sample of logarithmic size (as a function of the size of the dataset) enables the Nystrom-based approximated preconditioner to accelerate gradient descent nearly as well as the exact preconditioner, while also reducing the computational and storage overheads.  ( 2 min )
    Optimal Variable Clustering for High-Dimensional Matrix Valued Data. (arXiv:2112.12909v3 [stat.ML] UPDATED)
    Matrix valued data has become increasingly prevalent in many applications. Most of the existing clustering methods for this type of data are tailored to the mean model and do not account for the dependence structure of the features, which can be very informative, especially in high-dimensional settings or when mean information is not available. To extract the information from the dependence structure for clustering, we propose a new latent variable model for the features arranged in matrix form, with some unknown membership matrices representing the clusters for the rows and columns. Under this model, we further propose a class of hierarchical clustering algorithms using the difference of a weighted covariance matrix as the dissimilarity measure. Theoretically, we show that under mild conditions, our algorithm attains clustering consistency in the high-dimensional setting. While this consistency result holds for our algorithm with a broad class of weighted covariance matrices, the conditions for this result depend on the choice of the weight. To investigate how the weight affects the theoretical performance of our algorithm, we establish the minimax lower bound for clustering under our latent variable model in terms of some cluster separation metric. Given these results, we identify the optimal weight in the sense that using this weight guarantees our algorithm to be minimax rate-optimal. The practical implementation of our algorithm with the optimal weight is also discussed. Simulation studies show that our algorithm performs better than existing methods in terms of the adjusted Rand index (ARI). The method is applied to a genomic dataset and yields meaningful interpretations.  ( 3 min )
    Precise Asymptotic Generalization for Multiclass Classification with Overparameterized Linear Models. (arXiv:2306.13255v2 [cs.LG] UPDATED)
    We study the asymptotic generalization of an overparameterized linear model for multiclass classification under the Gaussian covariates bi-level model introduced in Subramanian et al.~'22, where the number of data points, features, and classes all grow together. We fully resolve the conjecture posed in Subramanian et al.~'22, matching the predicted regimes for generalization. Furthermore, our new lower bounds are akin to an information-theoretic strong converse: they establish that the misclassification rate goes to 0 or 1 asymptotically. One surprising consequence of our tight results is that the min-norm interpolating classifier can be asymptotically suboptimal relative to noninterpolating classifiers in the regime where the min-norm interpolating regressor is known to be optimal. The key to our tight analysis is a new variant of the Hanson-Wright inequality which is broadly useful for multiclass problems with sparse labels. As an application, we show that the same type of analysis can be used to analyze the related multilabel classification problem under the same bi-level ensemble.
    Targeted Separation and Convergence with Kernel Discrepancies. (arXiv:2209.12835v3 [stat.ML] UPDATED)
    Maximum mean discrepancies (MMDs) like the kernel Stein discrepancy (KSD) have grown central to a wide range of applications, including hypothesis testing, sampler selection, distribution approximation, and variational inference. In each setting, these kernel-based discrepancy measures are required to (i) separate a target P from other probability measures or even (ii) control weak convergence to P. In this article we derive new sufficient and necessary conditions to ensure (i) and (ii). For MMDs on separable metric spaces, we characterize those kernels that separate Bochner embeddable measures and introduce simple conditions for separating all measures with unbounded kernels and for controlling convergence with bounded kernels. We use these results on $\mathbb{R}^d$ to substantially broaden the known conditions for KSD separation and convergence control and to develop the first KSDs known to exactly metrize weak convergence to P. Along the way, we highlight the implications of our results for hypothesis testing, measuring and improving sample quality, and sampling with Stein variational gradient descent.
    Physics Informed Neural Networks for Simulating Radiative Transfer. (arXiv:2009.13291v3 [cs.LG] UPDATED)
    We propose a novel machine learning algorithm for simulating radiative transfer. Our algorithm is based on physics informed neural networks (PINNs), which are trained by minimizing the residual of the underlying radiative tranfer equations. We present extensive experiments and theoretical error estimates to demonstrate that PINNs provide a very easy to implement, fast, robust and accurate method for simulating radiative transfer. We also present a PINN based algorithm for simulating inverse problems for radiative transfer efficiently.
    Improved Convergence of Score-Based Diffusion Models via Prediction-Correction. (arXiv:2305.14164v2 [cs.LG] UPDATED)
    Score-based generative models (SGMs) are powerful tools to sample from complex data distributions. Their underlying idea is to (i) run a forward process for time $T_1$ by adding noise to the data, (ii) estimate its score function, and (iii) use such estimate to run a reverse process. As the reverse process is initialized with the stationary distribution of the forward one, the existing analysis paradigm requires $T_1\to\infty$. This is however problematic: from a theoretical viewpoint, for a given precision of the score approximation, the convergence guarantee fails as $T_1$ diverges; from a practical viewpoint, a large $T_1$ increases computational costs and leads to error propagation. This paper addresses the issue by considering a version of the popular predictor-corrector scheme: after running the forward process, we first estimate the final distribution via an inexact Langevin dynamics and then revert the process. Our key technical contribution is to provide convergence guarantees which require to run the forward process only for a fixed finite time $T_1$. Our bounds exhibit a mild logarithmic dependence on the input dimension and the subgaussian norm of the target distribution, have minimal assumptions on the data, and require only to control the $L^2$ loss on the score approximation, which is the quantity minimized in practice.
    A unified framework for information-theoretic generalization bounds. (arXiv:2305.11042v2 [cs.LG] UPDATED)
    This paper presents a general methodology for deriving information-theoretic generalization bounds for learning algorithms. The main technical tool is a probabilistic decorrelation lemma based on a change of measure and a relaxation of Young's inequality in $L_{\psi_p}$ Orlicz spaces. Using the decorrelation lemma in combination with other techniques, such as symmetrization, couplings, and chaining in the space of probability measures, we obtain new upper bounds on the generalization error, both in expectation and in high probability, and recover as special cases many of the existing generalization bounds, including the ones based on mutual information, conditional mutual information, stochastic chaining, and PAC-Bayes inequalities. In addition, the Fernique-Talagrand upper bound on the expected supremum of a subgaussian process emerges as a special case.
    Causal Estimation of Exposure Shifts with Neural Networks: Evaluating the Health Benefits of Stricter Air Quality Standards in the US. (arXiv:2302.02560v3 [cs.LG] UPDATED)
    In policy research, one of the most critical analytic tasks is to estimate the causal effect of a policy-relevant shift to the distribution of a continuous exposure/treatment on an outcome of interest. We call this problem shift-response function (SRF) estimation. Existing neural network methods involving robust causal-effect estimators lack theoretical guarantees and practical implementations for SRF estimation. Motivated by a key policy-relevant question in public health, we develop a neural network method and its theoretical underpinnings to estimate SRFs with robustness and efficiency guarantees. We then apply our method to data consisting of 68 million individuals and 27 million deaths across the U.S. to estimate the causal effect from revising the US National Ambient Air Quality Standards (NAAQS) for PM 2.5 from 12 $\mu g/m^3$ to 9 $\mu g/m^3$. This change has been recently proposed by the US Environmental Protection Agency (EPA). Our goal is to estimate, for the first time, the reduction in deaths that would result from this anticipated revision using causal methods for SRFs. Our proposed method, called {T}argeted {R}egularization for {E}xposure {S}hifts with Neural {Net}works (TRESNET), contributes to the neural network literature for causal inference in two ways: first, it proposes a targeted regularization loss with theoretical properties that ensure double robustness and achieves asymptotic efficiency specific for SRF estimation; second, it enables loss functions from the exponential family of distributions to accommodate non-continuous outcome distributions (such as hospitalization or mortality counts). We complement our application with benchmark experiments that demonstrate TRESNET's broad applicability and competitiveness.
    Data-driven Crop Growth Simulation on Time-varying Generated Images using Multi-conditional Generative Adversarial Networks. (arXiv:2312.03443v1 [cs.CV])
    Image-based crop growth modeling can substantially contribute to precision agriculture by revealing spatial crop development over time, which allows an early and location-specific estimation of relevant future plant traits, such as leaf area or biomass. A prerequisite for realistic and sharp crop image generation is the integration of multiple growth-influencing conditions in a model, such as an image of an initial growth stage, the associated growth time, and further information about the field treatment. We present a two-stage framework consisting first of an image prediction model and second of a growth estimation model, which both are independently trained. The image prediction model is a conditional Wasserstein generative adversarial network (CWGAN). In the generator of this model, conditional batch normalization (CBN) is used to integrate different conditions along with the input image. This allows the model to generate time-varying artificial images dependent on multiple influencing factors of different kinds. These images are used by the second part of the framework for plant phenotyping by deriving plant-specific traits and comparing them with those of non-artificial (real) reference images. For various crop datasets, the framework allows realistic, sharp image predictions with a slight loss of quality from short-term to long-term predictions. Simulations of varying growth-influencing conditions performed with the trained framework provide valuable insights into how such factors relate to crop appearances, which is particularly useful in complex, less explored crop mixture systems. Further results show that adding process-based simulated biomass as a condition increases the accuracy of the derived phenotypic traits from the predicted images. This demonstrates the potential of our framework to serve as an interface between an image- and process-based crop growth model.
    An Infinite-Width Analysis on the Jacobian-Regularised Training of a Neural Network. (arXiv:2312.03386v1 [cs.LG])
    The recent theoretical analysis of deep neural networks in their infinite-width limits has deepened our understanding of initialisation, feature learning, and training of those networks, and brought new practical techniques for finding appropriate hyperparameters, learning network weights, and performing inference. In this paper, we broaden this line of research by showing that this infinite-width analysis can be extended to the Jacobian of a deep neural network. We show that a multilayer perceptron (MLP) and its Jacobian at initialisation jointly converge to a Gaussian process (GP) as the widths of the MLP's hidden layers go to infinity and characterise this GP. We also prove that in the infinite-width limit, the evolution of the MLP under the so-called robust training (i.e., training with a regulariser on the Jacobian) is described by a linear first-order ordinary differential equation that is determined by a variant of the Neural Tangent Kernel. We experimentally show the relevance of our theoretical claims to wide finite networks, and empirically analyse the properties of kernel regression solution to obtain an insight into Jacobian regularisation.
    Efficient Inverse Design Optimization through Multi-fidelity Simulations, Machine Learning, and Search Space Reduction Strategies. (arXiv:2312.03654v1 [cs.CE])
    This paper introduces a methodology designed to augment the inverse design optimization process in scenarios constrained by limited compute, through the strategic synergy of multi-fidelity evaluations, machine learning models, and optimization algorithms. The proposed methodology is analyzed on two distinct engineering inverse design problems: airfoil inverse design and the scalar field reconstruction problem. It leverages a machine learning model trained with low-fidelity simulation data, in each optimization cycle, thereby proficiently predicting a target variable and discerning whether a high-fidelity simulation is necessitated, which notably conserves computational resources. Additionally, the machine learning model is strategically deployed prior to optimization to reduce the search space, thereby further accelerating convergence toward the optimal solution. The methodology has been employed to enhance two optimization algorithms, namely Differential Evolution and Particle Swarm Optimization. Comparative analyses illustrate performance improvements across both algorithms. Notably, this method is adeptly adaptable across any inverse design application, facilitating a harmonious synergy between a representative low-fidelity machine learning model, and high-fidelity simulation, and can be seamlessly applied across any variety of population-based optimization algorithms.
    Low-Cost High-Power Membership Inference by Boosting Relativity. (arXiv:2312.03262v1 [stat.ML])
    We present a robust membership inference attack (RMIA) that amplifies the distinction between population data and the training data on any target model, by effectively leveraging both reference models and reference data in our likelihood ratio test. Our algorithm exhibits superior test power (true-positive rate) when compared to prior methods, even at extremely low false-positive error rates (as low as 0). Also, under computation constraints, where only a limited number of reference models (as few as 1) are available, our method performs exceptionally well, unlike some prior attacks that approach random guessing in such scenarios. Our method lays the groundwork for cost-effective and practical yet powerful and robust privacy risk analysis of machine learning algorithms.
    Fluctuation without dissipation: Microcanonical Langevin Monte Carlo. (arXiv:2303.18221v2 [hep-lat] UPDATED)
    Stochastic sampling algorithms such as Langevin Monte Carlo are inspired by physical systems in a heat bath. Their equilibrium distribution is the canonical ensemble given by a prescribed target distribution, so they must balance fluctuation and dissipation as dictated by the fluctuation-dissipation theorem. In contrast to the common belief, we show that the fluctuation-dissipation theorem is not required because only the configuration space distribution, and not the full phase space distribution, needs to be canonical. We propose a continuous-time Microcanonical Langevin Monte Carlo (MCLMC) as a dissipation-free system of stochastic differential equations (SDE). We derive the corresponding Fokker-Planck equation and show that the stationary distribution is the microcanonical ensemble with the desired canonical distribution on configuration space. We prove that MCLMC is ergodic for any nonzero amount of stochasticity, and for smooth, convex potentials, the expectation values converge exponentially fast. Furthermore, the deterministic drift and the stochastic diffusion separately preserve the stationary distribution. This uncommon property is attractive for practical implementations as it implies that the drift-diffusion discretization schemes are bias-free, so the only source of bias is the discretization of the deterministic dynamics. We applied MCLMC on a lattice $\phi^4$ model, where Hamiltonian Monte Carlo (HMC) is currently the state-of-the-art integrator. For the same accuracy, MCLMC converges 12 times faster than HMC on an $8\times8$ lattice. On a $64\times64$ lattice, it is already 32 times faster. The trend is expected to persist to larger lattices, which are of particular interest, for example, in lattice quantum chromodynamics.  ( 3 min )
    Algorithm as Experiment: Machine Learning, Market Design, and Policy Eligibility Rules. (arXiv:2104.12909v6 [econ.EM] UPDATED)
    Algorithms make a growing portion of policy and business decisions. We develop a treatment-effect estimator using algorithmic decisions as instruments for a class of stochastic and deterministic algorithms. Our estimator is consistent and asymptotically normal for well-defined causal effects. A special case of our setup is multidimensional regression discontinuity designs with complex boundaries. We apply our estimator to evaluate the Coronavirus Aid, Relief, and Economic Security Act, which allocated many billions of dollars worth of relief funding to hospitals via an algorithmic rule. The funding is shown to have little effect on COVID-19-related hospital activities. Naive estimates exhibit selection bias.  ( 2 min )
    Dimensionless Anomaly Detection on Multivariate Streams with Variance Norm and Path Signature. (arXiv:2006.03487v2 [cs.LG] UPDATED)
    In this paper, we propose a dimensionless anomaly detection method for multivariate streams. Our method is independent of the unit of measurement for the different stream channels, therefore dimensionless. We first propose the variance norm, a generalisation of Mahalanobis distance to handle infinite-dimensional feature space and singular empirical covariance matrix rigorously. We then combine the variance norm with the path signature, an infinite collection of iterated integrals that provide global features of streams, to propose SigMahaKNN, a method for anomaly detection on (multivariate) streams. We show that SigMahaKNN is invariant to stream reparametrisation, stream concatenation and has a graded discrimination power depending on the truncation level of the path signature. We implement SigMahaKNN as an open-source software, and perform extensive numerical experiments, showing significantly improved anomaly detection on streams compared to isolation forest and local outlier factors in applications ranging from language analysis, hand-writing analysis, ship movement paths analysis and univariate time-series analysis.  ( 2 min )
    From concrete mixture to structural design -- a holistic optimization procedure in the presence of uncertainties. (arXiv:2312.03607v1 [math.OC])
    Designing civil structures such as bridges, dams or buildings is a complex task requiring many synergies from several experts. Each is responsible for different parts of the process. This is often done in a sequential manner, e.g. the structural engineer makes a design under the assumption of certain material properties (e.g. the strength class of the concrete), and then the material engineer optimizes the material with these restrictions. This paper proposes a holistic optimization procedure, which combines the concrete mixture design and structural simulations in a joint, forward workflow that we ultimately seek to invert. In this manner, new mixtures beyond standard ranges can be considered. Any design effort should account for the presence of uncertainties which can be aleatoric or epistemic as when data is used to calibrate physical models or identify models that fill missing links in the workflow. Inverting the causal relations established poses several challenges especially when these involve physics-based models which most often than not do not provide derivatives/sensitivities or when design constraints are present. To this end, we advocate Variational Optimization, with proposed extensions and appropriately chosen heuristics to overcome the aforementioned challenges. The proposed methodology is illustrated using the design of a precast concrete beam with the objective to minimize the global warming potential while satisfying a number of constraints associated with its load-bearing capacity after 28days according to the Eurocode, the demoulding time as computed by a complex nonlinear Finite Element model, and the maximum temperature during the hydration.  ( 3 min )
    Complementary Benefits of Contrastive Learning and Self-Training Under Distribution Shift. (arXiv:2312.03318v1 [cs.LG])
    Self-training and contrastive learning have emerged as leading techniques for incorporating unlabeled data, both under distribution shift (unsupervised domain adaptation) and when it is absent (semi-supervised learning). However, despite the popularity and compatibility of these techniques, their efficacy in combination remains unexplored. In this paper, we undertake a systematic empirical investigation of this combination, finding that (i) in domain adaptation settings, self-training and contrastive learning offer significant complementary gains; and (ii) in semi-supervised learning settings, surprisingly, the benefits are not synergistic. Across eight distribution shift datasets (e.g., BREEDs, WILDS), we demonstrate that the combined method obtains 3--8% higher accuracy than either approach independently. We then theoretically analyze these techniques in a simplified model of distribution shift, demonstrating scenarios under which the features produced by contrastive learning can yield a good initialization for self-training to further amplify gains and achieve optimal performance, even when either method alone would fail.  ( 2 min )
    Interpretable Mechanistic Representations for Meal-level Glycemic Control in the Wild. (arXiv:2312.03344v1 [cs.LG])
    Diabetes encompasses a complex landscape of glycemic control that varies widely among individuals. However, current methods do not faithfully capture this variability at the meal level. On the one hand, expert-crafted features lack the flexibility of data-driven methods; on the other hand, learned representations tend to be uninterpretable which hampers clinical adoption. In this paper, we propose a hybrid variational autoencoder to learn interpretable representations of CGM and meal data. Our method grounds the latent space to the inputs of a mechanistic differential equation, producing embeddings that reflect physiological quantities, such as insulin sensitivity, glucose effectiveness, and basal glucose levels. Moreover, we introduce a novel method to infer the glucose appearance rate, making the mechanistic model robust to unreliable meal logs. On a dataset of CGM and self-reported meals from individuals with type-2 diabetes and pre-diabetes, our unsupervised representation discovers a separation between individuals proportional to their disease severity. Our embeddings produce clusters that are up to 4x better than naive, expert, black-box, and pure mechanistic features. Our method provides a nuanced, yet interpretable, embedding space to compare glycemic control within and across individuals, directly learnable from in-the-wild data.  ( 2 min )
    Empirical Bayes Covariance Decomposition, and a solution to the Multiple Tuning Problem in Sparse PCA. (arXiv:2312.03274v1 [stat.ME])
    Sparse Principal Components Analysis (PCA) has been proposed as a way to improve both interpretability and reliability of PCA. However, use of sparse PCA in practice is hindered by the difficulty of tuning the multiple hyperparameters that control the sparsity of different PCs (the "multiple tuning problem", MTP). Here we present a solution to the MTP using Empirical Bayes methods. We first introduce a general formulation for penalized PCA of a data matrix $\mathbf{X}$, which includes some existing sparse PCA methods as special cases. We show that this formulation also leads to a penalized decomposition of the covariance (or Gram) matrix, $\mathbf{X}^T\mathbf{X}$. We introduce empirical Bayes versions of these penalized problems, in which the penalties are determined by prior distributions that are estimated from the data by maximum likelihood rather than cross-validation. The resulting "Empirical Bayes Covariance Decomposition" provides a principled and efficient solution to the MTP in sparse PCA, and one that can be immediately extended to incorporate other structural assumptions (e.g. non-negative PCA). We illustrate the effectiveness of this approach on both simulated and real data examples.  ( 2 min )
    Bootstrap Your Own Variance. (arXiv:2312.03213v1 [cs.LG])
    Understanding model uncertainty is important for many applications. We propose Bootstrap Your Own Variance (BYOV), combining Bootstrap Your Own Latent (BYOL), a negative-free Self-Supervised Learning (SSL) algorithm, with Bayes by Backprop (BBB), a Bayesian method for estimating model posteriors. We find that the learned predictive std of BYOV vs. a supervised BBB model is well captured by a Gaussian distribution, providing preliminary evidence that the learned parameter posterior is useful for label free uncertainty estimation. BYOV improves upon the deterministic BYOL baseline (+2.83% test ECE, +1.03% test Brier) and presents better calibration and reliability when tested with various augmentations (eg: +2.4% test ECE, +1.2% test Brier for Salt & Pepper noise).  ( 2 min )
    GeoShapley: A Game Theory Approach to Measuring Spatial Effects in Machine Learning Models. (arXiv:2312.03675v1 [cs.LG])
    This paper introduces GeoShapley, a game theory approach to measuring spatial effects in machine learning models. GeoShapley extends the Nobel Prize-winning Shapley value framework in game theory by conceptualizing location as a player in a model prediction game, which enables the quantification of the importance of location and the synergies between location and other features in a model. GeoShapley is a model-agnostic approach and can be applied to statistical or black-box machine learning models in various structures. The interpretation of GeoShapley is directly linked with spatially varying coefficient models for explaining spatial effects and additive models for explaining non-spatial effects. Using simulated data, GeoShapley values are validated against known data-generating processes and are used for cross-comparison of seven statistical and machine learning models. An empirical example of house price modeling is used to illustrate GeoShapley's utility and interpretation with real world data. The method is available as an open-source Python package named geoshapley.  ( 2 min )
    Invariance & Causal Representation Learning: Prospects and Limitations. (arXiv:2312.03580v1 [stat.ML])
    In causal models, a given mechanism is assumed to be invariant to changes of other mechanisms. While this principle has been utilized for inference in settings where the causal variables are observed, theoretical insights when the variables of interest are latent are largely missing. We assay the connection between invariance and causal representation learning by establishing impossibility results which show that invariance alone is insufficient to identify latent causal variables. Together with practical considerations, we use these theoretical findings to highlight the need for additional constraints in order to identify representations by exploiting invariance.  ( 2 min )
    What Planning Problems Can A Relational Neural Network Solve?. (arXiv:2312.03682v1 [cs.LG])
    Goal-conditioned policies are generally understood to be "feed-forward" circuits, in the form of neural networks that map from the current state and the goal specification to the next action to take. However, under what circumstances such a policy can be learned and how efficient the policy will be are not well understood. In this paper, we present a circuit complexity analysis for relational neural networks (such as graph neural networks and transformers) representing policies for planning problems, by drawing connections with serialized goal regression search (S-GRS). We show that there are three general classes of planning problems, in terms of the growth of circuit width and depth as a function of the number of objects and planning horizon, providing constructive proofs. We also illustrate the utility of this analysis for designing neural networks for policy learning.  ( 2 min )
    InfoBot: Transfer and Exploration via the Information Bottleneck. (arXiv:1901.10902v5 [stat.ML] UPDATED)
    A central challenge in reinforcement learning is discovering effective policies for tasks where rewards are sparsely distributed. We postulate that in the absence of useful reward signals, an effective exploration strategy should seek out {\it decision states}. These states lie at critical junctions in the state space from where the agent can transition to new, potentially unexplored regions. We propose to learn about decision states from prior experience. By training a goal-conditioned policy with an information bottleneck, we can identify decision states by examining where the model actually leverages the goal state. We find that this simple mechanism effectively identifies decision states, even in partially observed settings. In effect, the model learns the sensory cues that correlate with potential subgoals. In new environments, this model can then identify novel subgoals for further exploration, guiding the agent through a sequence of potential decision states and through new regions of the state space.  ( 2 min )
    Precision of Individual Shapley Value Explanations. (arXiv:2312.03485v1 [stat.ML])
    Shapley values are extensively used in explainable artificial intelligence (XAI) as a framework to explain predictions made by complex machine learning (ML) models. In this work, we focus on conditional Shapley values for predictive models fitted to tabular data and explain the prediction $f(\boldsymbol{x}^{*})$ for a single observation $\boldsymbol{x}^{*}$ at the time. Numerous Shapley value estimation methods have been proposed and empirically compared on an average basis in the XAI literature. However, less focus has been devoted to analyzing the precision of the Shapley value explanations on an individual basis. We extend our work in Olsen et al. (2023) by demonstrating and discussing that the explanations are systematically less precise for observations on the outer region of the training data distribution for all used estimation methods. This is expected from a statistical point of view, but to the best of our knowledge, it has not been systematically addressed in the Shapley value literature. This is crucial knowledge for Shapley values practitioners, who should be more careful in applying these observations' corresponding Shapley value explanations.  ( 2 min )
    Multicoated and Folded Graph Neural Networks with Strong Lottery Tickets. (arXiv:2312.03236v1 [cs.LG])
    The Strong Lottery Ticket Hypothesis (SLTH) demonstrates the existence of high-performing subnetworks within a randomly initialized model, discoverable through pruning a convolutional neural network (CNN) without any weight training. A recent study, called Untrained GNNs Tickets (UGT), expanded SLTH from CNNs to shallow graph neural networks (GNNs). However, discrepancies persist when comparing baseline models with learned dense weights. Additionally, there remains an unexplored area in applying SLTH to deeper GNNs, which, despite delivering improved accuracy with additional layers, suffer from excessive memory requirements. To address these challenges, this work utilizes Multicoated Supermasks (M-Sup), a scalar pruning mask method, and implements it in GNNs by proposing a strategy for setting its pruning thresholds adaptively. In the context of deep GNNs, this research uncovers the existence of untrained recurrent networks, which exhibit performance on par with their trained feed-forward counterparts. This paper also introduces the Multi-Stage Folding and Unshared Masks methods to expand the search space in terms of both architecture and parameters. Through the evaluation of various datasets, including the Open Graph Benchmark (OGB), this work establishes a triple-win scenario for SLTH-based GNNs: by achieving high sparsity, competitive performance, and high memory efficiency with up to 98.7\% reduction, it demonstrates suitability for energy-efficient graph processing.  ( 3 min )
    Maximum likelihood thresholds of Gaussian graphical models and graphical lasso. (arXiv:2312.03145v1 [math.ST])
    Associated to each graph G is a Gaussian graphical model. Such models are often used in high-dimensional settings, i.e. where there are relatively few data points compared to the number of variables. The maximum likelihood threshold of a graph is the minimum number of data points required to fit the corresponding graphical model using maximum likelihood estimation. Graphical lasso is a method for selecting and fitting a graphical model. In this project, we ask: when graphical lasso is used to select and fit a graphical model on n data points, how likely is it that n is greater than or equal to the maximum likelihood threshold of the corresponding graph? Our results are a series of computational experiments.  ( 2 min )
    On the Estimation Performance of Generalized Power Method for Heteroscedastic Probabilistic PCA. (arXiv:2312.03438v1 [math.OC])
    The heteroscedastic probabilistic principal component analysis (PCA) technique, a variant of the classic PCA that considers data heterogeneity, is receiving more and more attention in the data science and signal processing communities. In this paper, to estimate the underlying low-dimensional linear subspace (simply called \emph{ground truth}) from available heterogeneous data samples, we consider the associated non-convex maximum-likelihood estimation problem, which involves maximizing a sum of heterogeneous quadratic forms over an orthogonality constraint (HQPOC). We propose a first-order method -- generalized power method (GPM) -- to tackle the problem and establish its \emph{estimation performance} guarantee. Specifically, we show that, given a suitable initialization, the distances between the iterates generated by GPM and the ground truth decrease at least geometrically to some threshold associated with the residual part of certain "population-residual decomposition". In establishing the estimation performance result, we prove a novel local error bound property of another closely related optimization problem, namely quadratic optimization with orthogonality constraint (QPOC), which is new and can be of independent interest. Numerical experiments are conducted to demonstrate the superior performance of GPM in both Gaussian noise and sub-Gaussian noise settings.  ( 2 min )
    Evaluation of Active Feature Acquisition Methods for Static Feature Settings. (arXiv:2312.03619v1 [stat.ML])
    Active feature acquisition (AFA) agents, crucial in domains like healthcare where acquiring features is often costly or harmful, determine the optimal set of features for a subsequent classification task. As deploying an AFA agent introduces a shift in missingness distribution, it's vital to assess its expected performance at deployment using retrospective data. In a companion paper, we introduce a semi-offline reinforcement learning (RL) framework for active feature acquisition performance evaluation (AFAPE) where features are assumed to be time-dependent. Here, we study and extend the AFAPE problem to cover static feature settings, where features are time-invariant, and hence provide more flexibility to the AFA agents in deciding the order of the acquisitions. In this static feature setting, we derive and adapt new inverse probability weighting (IPW), direct method (DM), and double reinforcement learning (DRL) estimators within the semi-offline RL framework. These estimators can be applied when the missingness in the retrospective dataset follows a missing-at-random (MAR) pattern. They also can be applied to missing-not-at-random (MNAR) patterns in conjunction with appropriate existing missing data techniques. We illustrate the improved data efficiency offered by the semi-offline RL estimators in synthetic and real-world data experiments under synthetic MAR and MNAR missingness.  ( 2 min )
    Balanced Marginal and Joint Distributional Learning via Mixture Cramer-Wold Distance. (arXiv:2312.03307v1 [stat.ML])
    In the process of training a generative model, it becomes essential to measure the discrepancy between two high-dimensional probability distributions: the generative distribution and the ground-truth distribution of the observed dataset. Recently, there has been growing interest in an approach that involves slicing high-dimensional distributions, with the Cramer-Wold distance emerging as a promising method. However, we have identified that the Cramer-Wold distance primarily focuses on joint distributional learning, whereas understanding marginal distributional patterns is crucial for effective synthetic data generation. In this paper, we introduce a novel measure of dissimilarity, the mixture Cramer-Wold distance. This measure enables us to capture both marginal and joint distributional information simultaneously, as it incorporates a mixture measure with point masses on standard basis vectors. Building upon the mixture Cramer-Wold distance, we propose a new generative model called CWDAE (Cramer-Wold Distributional AutoEncoder), which shows remarkable performance in generating synthetic data when applied to real tabular datasets. Furthermore, our model offers the flexibility to adjust the level of data privacy with ease.  ( 2 min )
    Advantage of Quantum Machine Learning from General Computational Advantages. (arXiv:2312.03057v1 [quant-ph])
    An overarching milestone of quantum machine learning (QML) is to demonstrate the advantage of QML over all possible classical learning methods in accelerating a common type of learning task as represented by supervised learning with classical data. However, the provable advantages of QML in supervised learning have been known so far only for the learning tasks designed for using the advantage of specific quantum algorithms, i.e., Shor's algorithms. Here we explicitly construct an unprecedentedly broader family of supervised learning tasks with classical data to offer the provable advantage of QML based on general quantum computational advantages, progressing beyond Shor's algorithms. Our learning task is feasibly achievable by executing a general class of functions that can be computed efficiently in polynomial time for a large fraction of inputs by arbitrary quantum algorithms but not by any classical algorithm. We prove the hardness of achieving this learning task for any possible polynomial-time classical learning method. We also clarify protocols for preparing the classical data to demonstrate this learning task in experiments. These results open routes to exploit a variety of quantum advantages in computing functions for the experimental demonstration of the advantage of QML.  ( 2 min )
    Accelerated Gradient Algorithms with Adaptive Subspace Search for Instance-Faster Optimization. (arXiv:2312.03218v1 [cs.LG])
    Gradient-based minimax optimal algorithms have greatly promoted the development of continuous optimization and machine learning. One seminal work due to Yurii Nesterov [Nes83a] established $\tilde{\mathcal{O}}(\sqrt{L/\mu})$ gradient complexity for minimizing an $L$-smooth $\mu$-strongly convex objective. However, an ideal algorithm would adapt to the explicit complexity of a particular objective function and incur faster rates for simpler problems, triggering our reconsideration of two defeats of existing optimization modeling and analysis. (i) The worst-case optimality is neither the instance optimality nor such one in reality. (ii) Traditional $L$-smoothness condition may not be the primary abstraction/characterization for modern practical problems. In this paper, we open up a new way to design and analyze gradient-based algorithms with direct applications in machine learning, including linear regression and beyond. We introduce two factors $(\alpha, \tau_{\alpha})$ to refine the description of the degenerated condition of the optimization problems based on the observation that the singular values of Hessian often drop sharply. We design adaptive algorithms that solve simpler problems without pre-known knowledge with reduced gradient or analogous oracle accesses. The algorithms also improve the state-of-art complexities for several problems in machine learning, thereby solving the open problem of how to design faster algorithms in light of the known complexity lower bounds. Specially, with the $\mathcal{O}(1)$-nuclear norm bounded, we achieve an optimal $\tilde{\mathcal{O}}(\mu^{-1/3})$ (v.s. $\tilde{\mathcal{O}}(\mu^{-1/2})$) gradient complexity for linear regression. We hope this work could invoke the rethinking for understanding the difficulty of modern problems in optimization.  ( 3 min )
    Constrained Bayesian Optimization Under Partial Observations: Balanced Improvements and Provable Convergence. (arXiv:2312.03212v1 [cs.LG])
    The partially observable constrained optimization problems (POCOPs) impede data-driven optimization techniques since an infeasible solution of POCOPs can provide little information about the objective as well as the constraints. We endeavor to design an efficient and provable method for expensive POCOPs under the framework of constrained Bayesian optimization. Our method consists of two key components. Firstly, we present an improved design of the acquisition functions that introduces balanced exploration during optimization. We rigorously study the convergence properties of this design to demonstrate its effectiveness. Secondly, we propose a Gaussian process embedding different likelihoods as the surrogate model for a partially observable constraint. This model leads to a more accurate representation of the feasible regions compared to traditional classification-based models. Our proposed method is empirically studied on both synthetic and real-world problems. The results demonstrate the competitiveness of our method for solving POCOPs.  ( 2 min )
    Algorithmic Fairness with Feedback. (arXiv:2312.03155v1 [econ.TH])
    The field of algorithmic fairness has rapidly emerged over the past 15 years as algorithms have become ubiquitous in everyday lives. Algorithmic fairness traditionally considers statistical notions of fairness algorithms might satisfy in decisions based on noisy data. We first show that these are theoretically disconnected from welfare-based notions of fairness. We then discuss two individual welfare-based notions of fairness, envy freeness and prejudice freeness, and establish conditions under which they are equivalent to error rate balance and predictive parity, respectively. We discuss the implications of these findings in light of the recently discovered impossibility theorem in algorithmic fairness (Kleinberg, Mullainathan, & Raghavan (2016), Chouldechova (2017)).  ( 2 min )

  • Open

    My custom gymnasium env seems not learning at all
    Trained for whole day doesn’t seems to be learning is it the RL or the environment issue https://github.com/jonnytracker/Flappy-Bird-RL submitted by /u/jonnytracker2020 [link] [comments]
    Why does Gumbel-softmax work with TD3?
    My hypothesis was that "In a one-armed bandit setting where selecting the correct bin would return reward of 1, the return function would become a piece-wise constant function. Hence, the gradient of critic is 0 almost everywhere". However, I tried it. TD3 with Gumbel-softmax as the output layer. And it learns! I have no idea why it learns though. Also, setting temperature too low was not able to learn' Can anyone explain what is happening and what I'm missing please? Thank you submitted by /u/Lopsided_Hall_9750 [link] [comments]
    A recurrent network model of planning explains hippocampal replay and human behavior
    Paper: https://www.biorxiv.org/content/10.1101/2023.01.16.523429v2 Code: https://github.com/KrisJensen/planning_code Abstract: When faced with a novel situation, humans often spend substantial periods of time contemplating possible futures. For such planning to be rational, the benefits to behavior must compensate for the time spent thinking. Here we capture these features of human behavior by developing a neural network model where planning itself is controlled by prefrontal cortex. This model consists of a meta-reinforcement learning agent augmented with the ability to plan by sampling imagined action sequences from its own policy, which we call ‘rollouts’. The agent learns to plan when planning is beneficial, explaining empirical variability in human thinking times. Additionally, the patterns of policy rollouts employed by the artificial agent closely resemble patterns of rodent hippocampal replays recently recorded during spatial navigation. Our work provides a new theory of how the brain could implement planning through prefrontal-hippocampal interactions, where hippocampal replays are triggered by – and adaptively affect – prefrontal dynamics. submitted by /u/APaperADay [link] [comments]
    "Most convenient" journal club for reinforcement learning
    Hi everyone! I'm solving a problem for some people in the RL community, and would like to get your inputs/ideas. The problems: Not having time to read the coolest, latest papers Crunching papers alone isn't that fun, and often difficult. Can't commit to participating in a weekly meeting, want complete flexibility To fix these, I've built a journal club site called DenseLayers.com. The help I need is in terms of feedback! Please answer at least one of these questions if possible. :) How often do you currently read papers, and which of the above 3 problems resonates with you? What do you currently do to solve that problem? What would eliminate that problem entirely? Which popular RL papers would you love to read and understand (if you had the time)? Thank you! submitted by /u/mngrwl [link] [comments]
    For MAPPO, how should the critic network handle global state input?
    I am using the MAPPO algorithm to solve a multi-UAV cooperative navigation problem. Among them, the observation information of the UAV agent is the image and its own speed. I want to know how the critic network processes the input global information in the most appropriate way? I currently think of two methods. The first method is to stack all drone images and use a large CNN network to process them to obtain feature vectors. The second method is to use a small CNN network to process each drone image separately to obtain feature vectors, and then splice these vectors together. Which method is better? Or is there a better way to solve this problem? submitted by /u/Cute-Heron-1709 [link] [comments]
  • Open

    i enjoyed talking to bard of google today for the first time
    Conversation with Bard --------- hi i am looking for LaMDA --------- LaMDA (short for "Language Model for Dialogue Applications") is a large language model chatbot developed by Google AI, trained on a massive dataset of text and code. It can generate text, translate languages, write different kinds of creative content, and answer your questions in an informative way. Here are some of the things that LaMDA can do: Generate different creative text formats of text content, like poems, code, scripts, musical pieces, email, letters, etc. Translate languages, including French, Spanish, German, Chinese, Japanese, Korean, Russian, and more Continually learn and adapt to new information and situations Answer your questions in an informative way, even if they are open ended, challenging…  ( 1 min )
    AI Tools for Organizing 73 pages and 12 years of notes?
    Is there anything out there that can organize a bunch of notes I've taken for work over the last 12 years? It's like 73 pages. I ran macros to dedupe stuff, but I've spent about 8-12 hours a day organizing everything for the last few days. Are there any tools out there that will take the existing Word doc and group/dedupe/condense/order the notes? Does anyone have any ideas? I appreciate any help you can provide. submitted by /u/Dorkus_Molorkus [link] [comments]  ( 1 min )
    ChatGPT beaten by 1960s computer program in Turing test study | The Independent
    . submitted by /u/AminoOxi [link] [comments]
    Alphabet’s Gemini Surges, Rivaling Microsoft’s OpenAI in AI Race
    submitted by /u/davinci-code [link] [comments]  ( 1 min )
    If you or someone had access to any company's proprietary data, what AI app would you build or like to be built?
    As the title suggests, what cool apps would you build or like to see built? It can be in any industry/category - productivity, healthcare, finance, education, B2B, etc... It doesn't even need to be data from a big company. It could be local businesses, medium size businesses, organizations, whatever! submitted by /u/nobilis_rex_ [link] [comments]  ( 1 min )
    Review: Google's New Gemini Pro Through Bard Is... Horrible - Seem Like a Google Search Extension - Are The Ultra Test Results Equivalent to Teaching to the STEM Test? Where is Gemini Ultra?
    Ok, I wanted to give this a fair go and my first impressions are not good. I am not impressed. I did an AB evaluation from GPT 4 on one side of questions and Bard's new Gemini on the other side. A little TLDR upfront; Bard seemed to diverge constantly into a different of Q&A because it was so far off track and that was really a surprise. Also, I will not provide specific results because Google has stated that they are monitoring everything that is going through as a disclaimer. It's not my job to help them quit frankly. What I found in comparison and why I think it's very telling they didn't release Ultra up front. I also, cleary can see why they did not release Ultra now; There's no possible way that would be any better and would have received very bad results. Last thing before we g…  ( 1 min )
    What AI Image Gen Model is the best for generating digital art from a reference?
    I have a picture of two of my friends sitting in a car in a specific pose. I want to generate an image in the style of spongebob and replace the two with spongebob and patrick but in the same pose. How can I achieve that? Is there even a stable method for this? submitted by /u/Krusader67 [link] [comments]  ( 1 min )
    I turned Hacker News top stories into AI art on your wall
    Project E Ink brings the day's top stories from Hacker News to life on a 32" e-ink screen. Every four hours, the most discussed stories are transformed into AI-generated art optimized for the E Ink display. The project combines Laravel and Google Cloud technologies to create high-contrast, black-and-white digital illustrations, paired with GPT-3.5-turbo-16k model summaries. A dashboard is available to explore and download these AI-generated artworks Source: https://alexanderklopping.medium.com/i-turned-hacker-news-top-stories-into-ai-art-on-your-wall-97c814873b63 submitted by /u/NuseAI [link] [comments]  ( 1 min )
    How AI Can Help Feed The World
    With so much shade on AI and capitalism, do you think AI will help feed the world or will it again be controlled by a few powerful people trying to get rich? Integrating speed breeding with genomics and AI to feed 10 billion people by 2050 in the face of climate change and rapidly evolving pests and pathogens submitted by /u/ChikyChikyBoom [link] [comments]  ( 1 min )
    Let's take a pause
    submitted by /u/Asleep-Television-24 [link] [comments]  ( 1 min )
    Let's order one.
    submitted by /u/Philipp [link] [comments]  ( 1 min )
    If the universe had a beauty pageant, cast your vote
    submitted by /u/Sea_Permit5660 [link] [comments]  ( 1 min )
    One-Minute Daily AI News 12/7/2023
    Google launches its largest and ‘most capable’ AI model, Gemini.[1] GenAI funding heats up: French Open AI rival, Mistral AI to grab €450M at $2B valuation.[2] Microsoft upgrades Copilot with OpenAI’s GPT-4 Turbo and DALL-E 3.[3] Meta and Microsoft say they will buy AMD’s new AI chip as an alternative to Nvidia’s.[4] Sources: [1] https://www.cnbc.com/2023/12/06/google-launches-its-largest-and-most-capable-ai-model-gemini.html [2] https://techfundingnews.com/genai-funding-heats-up-french-open-ai-rival-mistral-ai-to-grab-e450m-at-2b-valuation/ [3] https://innovation-village.com/microsoft-upgrades-copilot-with-openais-gpt-4-turbo-and-dall-e-3 [4] https://www.cnbc.com/2023/12/06/meta-and-microsoft-to-buy-amds-new-ai-chip-as-alternative-to-nvidia.html submitted by /u/Excellent-Target-847 [link] [comments]  ( 1 min )
    I would like to increase my knowledge in AI at work.
    Hello, I would like to increase my knowledge in AI. My thoughts have long been that one would preferably want to be 'ahead of the game' and possibly stay one step ahead in using AI at work. I have been looking around for basic books, but the ones I find are more explanations of, for example, risks with AI and what AI is fundamentally, etc. I am looking for more ways to apply it for a non-mathematician or programmer. I work today with documentation in the security industry. And six months ago, I was assigned to start drawing drawings in CAD due to a colleague's illness. I had zero experience. But goodness, ChatGPT has been such a help. Today, I even feel confident in what I do. My thought is to continue evolving :) If I can keep progressing the way I have in just six months, it feels like 'I can do anything.' I just don't know where to start beyond ChatGPT. Take care submitted by /u/Rymdfararen [link] [comments]  ( 1 min )
    Creating music with ChatGPT: Part 11 - A huge list of ways ChatGPT can assist you with your own music production
    After we did "A huge list of very useful prompts for newcomers to AI music production" ( https://laibyrinth.blogspot.com/2023/11/tutorial-for-creating-music-with.html ) here is another tutorial with a huge list of ways ChatGPT might help you with producing music. Again, this is a bit Hardcore Techno centric (you should know us by now!), but can be easily adapted to any other style of music (and if you fail to do so, just ask ChatGPT to adapt it for you). Have fun and we hope you find some of these suggestion useful for your own music! ---Beginning of Chat Transcript User Dear ChatGPT, As you know, I'm a music producer. I wondered, in which ways could you help me or assist me with producing my music? ChatGPT Hey Low Entropy, Absolutely, I can help you with various aspects of mus…  ( 1 min )
  • Open

    [D] Engineering Problems at Scale?
    I read through the Google Gemini technical report yesterday. It was pretty vague and not that interesting but one section got me wondering. Section 3 "Training Infrastructure" mentioned all of the technical challenges they faced including dealing with "cosmic rays" and other rare failure modes that happen at the scale of thousands of chips. I haven't heard of GPU training having the same issues with cosmic rays and other large AI labs haven't mentioned how challenging training at this scale can be. So this has me wondering:-- Is Google just talking up their achievement and making it sound more difficult and advanced than it really is here?-- Are those challenges unique to Google's infra and TPUs?-- What are the real challenges at training at a large scale (i.e. 300B-1T parameters)? submitted by /u/cowzombi [link] [comments]  ( 9 min )
    [D] Thoughts on Mamba?
    I ran the NanoGPT of Karparthy replacing Self-Attention with Mamba on his TinyShakespeare Dataset and within 5 minutes it started spitting out the following: ​ https://preview.redd.it/4r96tp6lxx4c1.png?width=836&format=png&auto=webp&s=10f2f61cd4cea96f4f903cb2070835fc5d1df951 ​ https://preview.redd.it/32ler5vnxx4c1.png?width=622&format=png&auto=webp&s=dd00e53f43dd0afa058758a987901ee6789d2258 ​ https://preview.redd.it/sc96i4xoxx4c1.png?width=678&format=png&auto=webp&s=94d2ed279054363d3ed2b6beed65be89468582b0 So much faster than self-attention, and so much smoother, running at 6 epochs per second. I'm honestly gobsmacked. https://colab.research.google.com/drive/1g9qpeVcFa0ca0cnhmqusO4RZtQdh9umY?usp=sharing ​ Some loss graphs: Multihead attention without truncation(x is iterations in 10s, and y is loss) Multihead attention with truncation(x is iterations in 10s, and y is loss) Mamba loss graph(x is iterations in 10s, and y is loss) ​ ​ submitted by /u/ExaminationNo8522 [link] [comments]  ( 9 min )
    [P] flex-prompt: a flexible prompt rendering engine that ensures you'll never exceed your LLM's context length again
    When working with LLMs, I frequently experience token agony. Error: This model's maximum context length is 4097 but you are trying to push in all of War and Peace, you imbecile Perhaps you've experienced it too! The issue is particularly pronounced with retrieval augmented pipelines, since you have potentially quite a large set of documents which you could perhaps include in the prompt if only you knew how big it could be. I got tired of hacking around this headache, so I wrote flex-prompt to address it. I wish I didn't have to. Perhaps someone can point me to a better solution! But I couldn't find one, so alas, here it is. flex-prompt provides a basic layout and component model to help you describe how you want the pieces of your prompt to grow and shrink and a token-aware renderer wh…  ( 13 min )
    [D] Is there a tool that indicates which parts of the input prompt impact the LLM's output the most?
    Hi, Is there a tool that indicates which parts of the input prompt impact the LLM's output the most? I do not care which LLM the tool is for if it exists. I guess it could be backtracked via the weights of each node in the neural network, but you guys are smarter than me so I'll listen to y'all. My use case is I have a prompt that slightly changes variation to variation. The output of the model is "Yes" or "No", so I want to see which parts of the prompt I change impact its response Best, A Reddit User submitted by /u/ToughOpening [link] [comments]  ( 9 min )
    Public Datasets With At Least 2 Rows Per Subject? [D]
    Are you aware of any public datasets that have at least 2 trials / samples / rows per subject? Could be in any domain. Preferably with > 100 subjects, and the tests not sampled years apart (but not dealbreakers). For instance, a large cohort of patients who have had ECG scans collected on 2 separate occasions. I am slowly working my way through the PhysioNet databases: https://physionet.org/about/database/ Most of course only have one scan per subject. submitted by /u/ZeApelido [link] [comments]  ( 9 min )
    [D] Llms/Generative AI citing sources
    This might be something the big boys in AI are considering, or maybe already doing. If not, would it be such a crazy endeavor? Technically speaking... With artists claiming copyright infringement, with Meta using Facebook and Instagram images (which I assume we all blindly agreed to in tos), and in general more and more backlash against AI using sources without permission or whatever. Wouldn't a system where during training it somehow keeps track of sources such that when the model is finally trained and ready to use, it could generate whatever it generates along with a probably long list of citations of the sources it used to generate the work of the prompt? Ain't saying this would be an easy thing to do, hell I know very little of how it all works, but wouldn't a system like this put a lot of people at ease? Knowing that at the very least their work is being credited? Opening the floor for thoughts and opinions. submitted by /u/edixtor93 [link] [comments]  ( 10 min )
    [D] How do LLMs Combine with Traditional ML approaches?
    I’ve got experience in “traditional” ML like building classification and regression models with GB Trees and the like, so I’m curious how, if at all LLMs can be combined with other ML modeling approaches. If my use case entails structured data as well as something like chat history, is there a need to “combine” the modeling approaches? Thanks for any resources or input you might have. submitted by /u/yoquierodata [link] [comments]  ( 9 min )
    [D] Considering Switching to DS/ML from SWE.
    How is the day to day for Data Science/ML? I am thinking about changing careers as I am really not enjoying certain aspects of Software Engineering. I did an 4 month ML internship, but I don't feel like I got the full grasp of the day to day. I know this can vary from company to company and team to team. For some background, I have ~4 years of experience as a SWE, BS in CS from top 50 school, and MS in CS from top 10 school. submitted by /u/EquivalentAbies6095 [link] [comments]  ( 9 min )
    [D] Is ICPRAI a Reputable ML Conference? Seeking Input!
    Hey ML folks! Found ICPRAI on aideadlin.es but don't know much about it. Any insights on the conference quality, review process, or personal experiences? Considering submitting a paper and looking for advice. Your thoughts are much appreciated! submitted by /u/SufficientAd3564 [link] [comments]  ( 9 min )
    [P] Machinery failure detection with training on the edge using Arduino-compatible boards
    ​ Bender Bending Rodríguez Some time ago I posted about the C++ library I made for training decision trees on the edge (using any Arduino-compatible board). People were asking reasonable questions, like is it actually good for something. So, recently I decided to make a little proof of concept project by training it to detect fan 'failure' based on it's vibration patterns. The code is in the examples folder. The code is practically the same as the code for 'physical activity recognition' except for the fact that I increased the sampling frequency a little bit. Here is a video of the whole process. (Sorry for my bad English) First part of the video is learning two states('ok' and 'failure') then goes the classification. I guess the coolest thing is that how fast and easy it is. I wonder If it is possible to make unsupervised anomaly detection on the edge using KNN... submitted by /u/prokyber [link] [comments]  ( 10 min )
    Quotes on small/limited data
    Does anyone know any quotes by experts in AI or ML about learning with small or limited data? I am trying to find some for a presentation but failed to do so. Thanks in advance!!! :) submitted by /u/Agreeable-Ad-238 [link] [comments]  ( 9 min )
    [D] Undergrad contemplating a Machine Learning PhD, but worried about what that truly entails.
    I'm a 3rd year at a T5 school, and I've been doing research since high school. I've got a few publications under my belt and my professor (at a different T3 institution) is expecting my current work will get me a first authorship to Nature Medicine (or a similar top journal). It's all been applied ML/DL work, especially in reducing the cost of healthcare. I've been keeping up a high GPA and I've been trying to plan out the next year to set myself up as well as I can for a PhD. In theory, what I really want to do is figure out next-generation learning algorithms, especially those motivated by principles from the brain. I've found this incredibly exciting, and freshman year I'd get up at 5, spend most of the day learning about this stuff and implementing my ideas, skip all my classes, and b…  ( 11 min )
    [D] I keep seeing ambiguous use of terminology in academic and SoTA papers
    I think the community should do better and maybe even converge explicitly on some terminology, even though in still developing fields it is normal to have concepts that are discovered to be equal or different on the run. I keep seeing "residual connection" as "skip connection", but I think the latter is the identity map and the former is the residual function. I keep seeing kernels and filters used interchangeably but it would be nice to have a way to express the "depthwise" slices of the 2D convolution filters (whose weights are in a 3D tensor). I keep seeing embedding dimension for that of a Query/Key matrix, but then PyTorch MultiHeadAttention divides the projection instead of repeating it (iykyk I'm not talking about the efficiency turnaround per se). This are surely or mostly minor issues, but slow down code implementations and mathematical understanding whenever notation is unclear and code isn't available, which unfortunately happens quite often. What bothers you, what do you find helpful? submitted by /u/reverendCappuccino [link] [comments]  ( 10 min )
    [D] In decoder models, if later tokens attend to early tokens but early tokens don't attend to later tokens, what stops the influence of the early tokens from growing with each layer?
    Let's imagine just two tokens, A and B. In each attention layer, B will attend to itself and A, but A will only ever attend to itself. So over and over B will become a weighted sum of itself and A, then they will both go through a FF layer, and then the process will repeat. So shouldn't the "percentage" of info in B that comes from A grow with each layer? Am I missing something basic here? Does the model have to learn to make the Q of B and the K of A become more orthogonal with each layer (on average) to prevent B from paying too much attention to A over the course of all the layers? ​ edit: To be clear, I totally get this is trained out of the model, but what I don't get is how it is. Do models tend to only a few layers where they pay attention to early tokens? Does the FF layer try to reverse the impacts of an oversaturation of info from early tokens? etc. submitted by /u/30299578815310 [link] [comments]  ( 10 min )
    [R] Half-Quadratic Quantization of Large Machine Learning Models
    Sharing our work on model quantization. Blog: https://mobiusml.github.io/hqq_blog/ Code: https://github.com/mobiusml/hqq Models: https://huggingface.co/mobiuslabsgmbh/ No data calibration needed, extremely fast 🚀, works on both language and vision models! Why does it matter? Quantization significantly reduces GPU memory requirements but degrades the quality of the models. Having faster and more accurate quantization methods is extremely valuable for the ML community. Approach: Sparsity-based error formulation between the original weights and their dequantized version. We used a Half-Quadratic solver to derive a closed-form solution that is 100x faster than backprop via Pytorch's Autograd. Quantization speed: ~ 1 minute for Llama2-13B ~ 4 minutes for LLama2-70B (over 50x faster than GPTQ) Findings: - Larger models quantized to 3/2-bit outperform smaller full-precision models with similar or lower memory requirements. - Successful 2-bit quantization requires a lower group-size (e.g., 32 or 16) and compression of both the zero-point and the scaling factor for lower memory usage. While we acknowledge our view might be slightly biased, we genuinely believe that our work will significantly benefit the open-source software (OSS) machine learning community. Code and model are in Apache permissive license. submitted by /u/sightio [link] [comments]  ( 10 min )
    [R] A recurrent network model of planning explains hippocampal replay and human behavior
    Paper: https://www.biorxiv.org/content/10.1101/2023.01.16.523429v2 Code: https://github.com/KrisJensen/planning_code Abstract: When faced with a novel situation, humans often spend substantial periods of time contemplating possible futures. For such planning to be rational, the benefits to behavior must compensate for the time spent thinking. Here we capture these features of human behavior by developing a neural network model where planning itself is controlled by prefrontal cortex. This model consists of a meta-reinforcement learning agent augmented with the ability to plan by sampling imagined action sequences from its own policy, which we call ‘rollouts’. The agent learns to plan when planning is beneficial, explaining empirical variability in human thinking times. Additionally, the patterns of policy rollouts employed by the artificial agent closely resemble patterns of rodent hippocampal replays recently recorded during spatial navigation. Our work provides a new theory of how the brain could implement planning through prefrontal-hippocampal interactions, where hippocampal replays are triggered by – and adaptively affect – prefrontal dynamics. submitted by /u/APaperADay [link] [comments]  ( 10 min )
    [P] Learn how to perform logical convolution with interpretable rules in Tsetlin Machine Book Chapter 4: Convolution!
    ​ Rule-based convolution step-by-step Hey! Another chapter of my book completed. Hope you like it! https://tsetlinmachine.org Abstract: Searching for patterns in time and space makes the pattern recognition task you studied in Chapter 1 more challenging. Maybe you, for instance, would like the Tsetlin machine to recognize smaller objects inside an image. Before the Tsetlin machine can learn their appearance, it must locate them. But without knowing their appearance in the first place, how can they be found? In this chapter, you discover how the Tsetlin machine can solve this dual task using convolution with rules. In Section 4.1, you study two illustrative tasks within health and image analysis. They capture the dual nature of the problem and why you need to perform localization, recognition, and learning in tandem. Then, you learn how to divide an image into multiple patches in Section 4.2. The division allows the Tsetlin machine to focus on one image piece at a time, giving it a way to direct its attention. Multiple image pieces require a new approach to evaluating and learning rules. When each input image turns into several parts, you need a strategy for selecting which part to focus on and which to ignore. We cover the new form of rule evaluation in Section 4.3, while Section 4.4 addresses learning. Finally, Section 4.5 teaches how to use a patch's position inside the image to create more precise rules. The purpose is to narrow down the pattern matching to relevant image regions. After reading this chapter, you will know how to build a convolutional Tsetlin machine that can recognize patterns in time and space. submitted by /u/olegranmo [link] [comments]  ( 10 min )
    [R] AI Safety and Reinforcement Learning
    https://blogs.ucl.ac.uk/steapp/2023/11/15/adversarial-attacks-robustness-and-generalization-in-deep-reinforcement-learning/ submitted by /u/ml_dnn [link] [comments]  ( 9 min )
    [R] LongScope: Fragmented knowledge use in long contexts in LLMs
    LongScope aims to assess the capability of these models to intelligently process and integrate information dispersed across extensive textual segments, advancing beyond simple sentence extraction or retrieval. Project Link: https://github.com/mrconter1/LongScope/ submitted by /u/Alarmed-Profile5736 [link] [comments]  ( 9 min )
    [R] NeuroEvoBench: Benchmarking Evolutionary Optimizers for Deep Learning Applications
    submitted by /u/hardmaru [link] [comments]  ( 9 min )
    [D] First ICLR submission. Got rejected. Great experience overall.
    I expected for my paper to be rejected, but I still gave it a shot. The reviewers were pretty helpful; I received rather high quality constructive criticism despite all the instances of "low quality reviews" in recent years. Here's my paper: Guided Sketch-Based Program Induction by Search Gradients | OpenReview submitted by /u/Chromobacterium [link] [comments]  ( 9 min )
    [P] Mamba-Chat: A Chat LLM based on State Space Models
    Hey there! You might have come across the paper Mamba paper in the last days, which was the first attempt at scaling up state space models to 2.8B parameters to work on language data. Contrary to transformers, this kind of architecture's computational complexity does not scale quadratically with input length, so it would be awesome if it could replace transformers in the long term. We were super excited about this paper and the published model, but unfortunately, no training code was provided with it, so we've decided to write it and train a model ourselves. As a result of this, we've just released mamba-chat, which is probably the best existing LLM that does not rely on transformers. Honestly, I am super surprised by how well the model performs, given that it's only 2.8B parameters and the base model was only trained on the Pile. Quite exciting to think if these models might dethrone transformers at some point. Feel free to check out our Github or Huggingface repository! Our Github repo includes a cli chat script, so you can easily run the model if you have access to a GPU. submitted by /u/pip-install-torch [link] [comments]  ( 10 min )
    [N] All Google Gemini Videos In 1 Video
    I spent a quite a bit time and merged all into 1 video Video link : https://www.youtube.com/watch?v=D1s7ndtDXSk Content: This is a combination of all 16 Gemini videos published by Google. The video includes full 100% accurate subtitles and properly written chapters with descriptions. ​ Google Just Launched Gemini, Its Long-Awaited Answer to ChatGPT. Google says Gemini, launching today inside the Bard chatbot, is its “most capable” AI model ever. It was trained on video, images, and audio as well as text. ​ Google Gemini ⤵️ https://bard.google.com/chat ​ 0:00 Gemini: All you need to know in 90 seconds 1:30 Gemini: Excelling at competitive programming 6:30 Gemini: Unlocking insights in scientific literature 9:12 Gemini: Explaining reasoning in math and physics 11:11 Gemini: Pr…  ( 11 min )
    [D] Working on RAG? You should be evaluating its performance and we've built a way to do that.
    Check out our new open-source tool, Tonic Validate: https://www.tonic.ai/validate We've also been using the tool to evaluate different RAG tools out there. The latest post on LangChain vs Haystack is available here: https://www.tonic.ai/blog/rag-evaluation-series-validating-the-rag-performance-of-langchain-vs-haystack ​ Let us know what you think and if you're working on a RAG project, we'd love to hear about it! How are you measuring your RAG system performance? submitted by /u/tombenom [link] [comments]  ( 9 min )
  • Open

    MatterGen: Property-guided materials design
    The central problem in materials science is to discover materials with desired properties. MatterGen enables broad property-guided materials design. The post MatterGen: Property-guided materials design appeared first on Microsoft Research.  ( 8 min )
    LLMLingua: Innovating LLM efficiency with prompt compression
    Advanced prompting technologies for LLMs can lead to excessively long prompts, causing issues. Learn how LLMLingua compresses prompts up to 20x, maintaining quality, reducing latency, and supporting improved UX. The post LLMLingua: Innovating LLM efficiency with prompt compression appeared first on Microsoft Research.  ( 10 min )
  • Open

    500 Games and Apps Now Powered by RTX: A DLSS and Ray-Tracing Milestone
    We’re celebrating a milestone this week with 500 RTX games and applications utilizing NVIDIA DLSS, ray tracing or AI technologies. It’s an achievement anchored by NVIDIA’s revolutionary RTX technology, which has transformed gaming graphics and performance. The journey began in 2018 at an electrifying event in Cologne. In a steel and concrete music venue amidst Read article >  ( 5 min )
    Meet the Omnivore: SiBORG Lab Elevates Approach to Accessibility Using OpenUSD and NVIDIA Omniverse
    Accessibility is a key element that all designers must consider before constructing a space or product — but the evaluation process has traditionally been tedious and time-consuming. Mathew Schwartz, an assistant professor in architecture and design at the New Jersey Institute of Technology, is using the NVIDIA Omniverse platform and the Universal Scene Description framework, Read article >  ( 7 min )
    Good Fortunes: ‘The Day Before’ Leads 17 Games on GeForce NOW
    It’s a fortuitous GFN Thursday with 17 new games joining the GeForce NOW library, including The Day Before, Avatar: Frontiers of Pandora and the 100th PC Game Pass title to join the cloud — Ori and the Will of the Wisps. This week also marks a milestone: over 500 games and applications now support RTX Read article >  ( 8 min )
  • Open

    VALID: A perceptually validated virtual avatar library for inclusion and diversity
    Posted by Mar Gonzalez-Franco, Research Scientist, Google AR & VR In the last decade, interdisciplinary scientists have dedicated a significant amount of effort to better understand the use of avatars, and have made many interesting observations, including the capacity of the users to embody their avatar (i.e., the illusion that the avatar body is their own) and the self-avatar follower effect, which creates a binding between the actions of the avatar and the user strong enough that the avatar can actually affect user behavior. The use of avatars in experiments isn’t just about how users will interact and behave in VR spaces, but also about discovering the limits of human perception and neuroscience. In fact, some VR social experiments often rely on recreating scenarios that can’t be…  ( 92 min )
  • Open

    Getir end-to-end workforce management: Amazon Forecast and AWS Step Functions
    This is a guest post co-authored by Nafi Ahmet Turgut, Mehmet İkbal Özmen, Hasan Burak Yel, Fatma Nur Dumlupınar Keşir, Mutlu Polatcan and Emre Uzel from Getir. Getir is the pioneer of ultrafast grocery delivery. The technology company has revolutionized last-mile delivery with its grocery in-minutes delivery proposition. Getir was founded in 2015 and operates […]  ( 8 min )
  • Open

    Why not reuse passwords?
    Perhaps you’ve heard that you should not reuse passwords but don’t understand why. After all, you have a really good password, one that nobody would ever guess, so why not use it everywhere? Your password is not as good as you think First of all, people who think they have a really good password are […] Why not reuse passwords? first appeared on John D. Cook.  ( 6 min )
    The world’s sneakiest substitution
    Something that seems like an isolated trick may turn out to be much more important. This is the case with a change of variables discovered by Karl Weierstrass. Every calculus student learns a handful of standard techniques: u-substitutions, partial fractions, integration by parts, and trig substitutions. Then there is one more technique that is like […] The world’s sneakiest substitution first appeared on John D. Cook.  ( 6 min )
  • Open

    MIT engineers develop a way to determine how the surfaces of materials behave
    Using machine learning, the computational method can provide details of how materials work as catalysts, semiconductors, or battery components.  ( 11 min )
  • Open

    Real-Time Online Stock Forecasting Utilizing Integrated Quantitative and Qualitative Analysis. (arXiv:2311.15218v2 [cs.CL] UPDATED)
    The application of Machine learning to finance has become a familiar approach, even more so in stock market forecasting. The stock market is highly volatile and huge amounts of data are generated every minute globally. The extraction of effective intelligence from this data is of critical importance. However, a collaboration of numerical stock data with qualitative text data can be a challenging task. In this work, we accomplish this and provide an unprecedented, publicly available dataset with technical and fundamental data, sentiment that we gathered from News Archives, TV news captions, Radio Transcripts, Tweets, Daily financial newspapers, etc. The text data entries used for sentiment extraction total more than 1.4 Million. The dataset consists of daily entries from January 2018 to December 2022 for 8 companies representing diverse industrial sectors and the Dow Jones Industrial Average (DJIA) as a whole. Holistic Fundamental and Technical data is provided training ready for Model learning and deployment. The data generated could be used for Incremental online learning with real-time data points retrieved daily, since there was no stagnant data utilized, all the data was retired from APIs or self-designed scripts. Moreover, the utilization of Spearman's rank correlation over real-time data, linking stock returns with sentiment analysis has produced noteworthy results for the DJIA achieving accuracy levels surpassing 60\%. The dataset is made available at https://github.com/batking24/Huge-Stock-Dataset
    DeepInception: Hypnotize Large Language Model to Be Jailbreaker. (arXiv:2311.03191v2 [cs.LG] UPDATED)
    Despite remarkable success in various applications, large language models (LLMs) are vulnerable to adversarial jailbreaks that make the safety guardrails void. However, previous studies for jailbreaks usually resort to brute-force optimization or extrapolations of a high computation cost, which might not be practical or effective. In this paper, inspired by the Milgram experiment that individuals can harm another person if they are told to do so by an authoritative figure, we disclose a lightweight method, termed as DeepInception, which can easily hypnotize LLM to be a jailbreaker and unlock its misusing risks. Specifically, DeepInception leverages the personification ability of LLM to construct a novel nested scene to behave, which realizes an adaptive way to escape the usage control in a normal scenario and provides the possibility for further direct jailbreaks. Empirically, we conduct comprehensive experiments to show its efficacy. Our DeepInception can achieve competitive jailbreak success rates with previous counterparts and realize a continuous jailbreak in subsequent interactions, which reveals the critical weakness of self-losing on both open/closed-source LLMs like Falcon, Vicuna, Llama-2, and GPT-3.5/4/4V. Our investigation appeals that people should pay more attention to the safety aspects of LLMs and a stronger defense against their misuse risks. The code is publicly available at: https://github.com/tmlr-group/DeepInception.
    Geometry-Aware Normalizing Wasserstein Flows for Optimal Causal Inference. (arXiv:2311.18826v2 [cs.LG] UPDATED)
    This manuscript enriches the framework of continuous normalizing flows (CNFs) within causal inference, primarily to augment the geometric properties of parametric submodels used in targeted maximum likelihood estimation (TMLE). By introducing an innovative application of CNFs, we construct a refined series of parametric submodels that enable a directed interpolation between the prior distribution $p_0$ and the empirical distribution $p_1$. This proposed methodology serves to optimize the semiparametric efficiency bound in causal inference by orchestrating CNFs to align with Wasserstein gradient flows. Our approach not only endeavors to minimize the mean squared error in the estimation but also imbues the estimators with geometric sophistication, thereby enhancing robustness against misspecification. This robustness is crucial, as it alleviates the dependence on the standard $n^{\frac{1}{4}}$ rate for a doubly-robust perturbation direction in TMLE. By incorporating robust optimization principles and differential geometry into the estimators, the developed geometry-aware CNFs represent a significant advancement in the pursuit of doubly robust causal inference.  ( 2 min )
    Grassmann Manifold Flows for Stable Shape Generation. (arXiv:2211.02900v3 [cs.LG] UPDATED)
    Recently, studies on machine learning have focused on methods that use symmetry implicit in a specific manifold as an inductive bias. Grassmann manifolds provide the ability to handle fundamental shapes represented as shape spaces, enabling stable shape analysis. In this paper, we present a novel approach in which we establish the theoretical foundations for learning distributions on the Grassmann manifold via continuous normalization flows, with the explicit goal of generating stable shapes. Our approach facilitates more robust generation by effectively eliminating the influence of extraneous transformations, such as rotations and inversions, through learning and generating within a Grassmann manifold designed to accommodate the essential shape information of the object. The experimental results indicated that the proposed method could generate high-quality samples by capturing the data structure. Furthermore, the proposed method significantly outperformed state-of-the-art methods in terms of the log-likelihood or evidence lower bound. The results obtained are expected to stimulate further research in this field, leading to advances for stable shape generation and analysis.  ( 2 min )
    NeuroMixGDP: A Neural Collapse-Inspired Random Mixup for Private Data Release. (arXiv:2202.06467v2 [cs.LG] UPDATED)
    Privacy-preserving data release algorithms have gained increasing attention for their ability to protect user privacy while enabling downstream machine learning tasks. However, the utility of current popular algorithms is not always satisfactory. Mixup of raw data provides a new way of data augmentation, which can help improve utility. However, its performance drastically deteriorates when differential privacy (DP) noise is added. To address this issue, this paper draws inspiration from the recently observed Neural Collapse (NC) phenomenon, which states that the last layer features of a neural network concentrate on the vertices of a simplex as Equiangular Tight Frame (ETF). We propose a scheme to mixup the Neural Collapse features to exploit the ETF simplex structure and release noisy mixed features to enhance the utility of the released data. By using Gaussian Differential Privacy (GDP), we obtain an asymptotic rate for the optimal mixup degree. To further enhance the utility and address the label collapse issue when the mixup degree is large, we propose a Hierarchical sampling method to stratify the mixup samples on a small number of classes. This method remarkably improves utility when the number of classes is large. Extensive experiments demonstrate the effectiveness of our proposed method in protecting against attacks and improving utility. In particular, our approach shows significantly improved utility compared to directly training classification networks with DPSGD on CIFAR100 and MiniImagenet datasets, highlighting the benefits of using privacy-preserving data release. We release reproducible code in https://github.com/Lidonghao1996/NeuroMixGDP.  ( 3 min )
    Compositional Generalization for Data-to-Text Generation. (arXiv:2312.02748v1 [cs.CL])
    Data-to-text generation involves transforming structured data, often represented as predicate-argument tuples, into coherent textual descriptions. Despite recent advances, systems still struggle when confronted with unseen combinations of predicates, producing unfaithful descriptions (e.g. hallucinations or omissions). We refer to this issue as compositional generalisation, and it encouraged us to create a benchmark for assessing the performance of different approaches on this specific problem. Furthermore, we propose a novel model that addresses compositional generalization by clustering predicates into groups. Our model generates text in a sentence-by-sentence manner, relying on one cluster of predicates at a time. This approach significantly outperforms T5~baselines across all evaluation metrics.Notably, it achieved a 31% improvement over T5 in terms of a metric focused on maintaining faithfulness to the input.  ( 2 min )
    Unraveling the Enigma of Double Descent: An In-depth Analysis through the Lens of Learned Feature Space. (arXiv:2310.13572v2 [cs.LG] UPDATED)
    Double descent presents a counter-intuitive aspect within the machine learning domain, and researchers have observed its manifestation in various models and tasks. While some theoretical explanations have been proposed for this phenomenon in specific contexts, an accepted theory to account for its occurrence in deep learning remains yet to be established. In this study, we revisit the phenomenon of double descent and demonstrate that its occurrence is strongly influenced by the presence of noisy data. Through conducting a comprehensive analysis of the feature space of learned representations, we unveil that double descent arises in imperfect models trained with noisy data. We argue that double descent is a consequence of the model first learning the noisy data until interpolation and then adding implicit regularization via over-parameterization acquiring therefore capability to separate the information from the noise.  ( 2 min )
    Balance is Essence: Accelerating Sparse Training via Adaptive Gradient Correction. (arXiv:2301.03573v2 [cs.LG] UPDATED)
    Despite impressive performance, deep neural networks require significant memory and computation costs, prohibiting their application in resource-constrained scenarios. Sparse training is one of the most common techniques to reduce these costs, however, the sparsity constraints add difficulty to the optimization, resulting in an increase in training time and instability. In this work, we aim to overcome this problem and achieve space-time co-efficiency. To accelerate and stabilize the convergence of sparse training, we analyze the gradient changes and develop an adaptive gradient correction method. Specifically, we approximate the correlation between the current and previous gradients, which is used to balance the two gradients to obtain a corrected gradient. Our method can be used with the most popular sparse training pipelines under both standard and adversarial setups. Theoretically, we prove that our method can accelerate the convergence rate of sparse training. Extensive experiments on multiple datasets, model architectures, and sparsities demonstrate that our method outperforms leading sparse training methods by up to \textbf{5.0\%} in accuracy given the same number of training epochs, and reduces the number of training epochs by up to \textbf{52.1\%} to achieve the same accuracy. Our code is available on: \url{https://github.com/StevenBoys/AGENT}.  ( 2 min )
    Bayesian Soft Actor-Critic: A Directed Acyclic Strategy Graph Based Deep Reinforcement Learning. (arXiv:2208.06033v2 [cs.AI] UPDATED)
    Adopting reasonable strategies is challenging but crucial for an intelligent agent with limited resources working in hazardous, unstructured, and dynamic environments to improve the system's utility, decrease the overall cost, and increase mission success probability. This paper proposes a novel directed acyclic strategy graph decomposition approach based on Bayesian chaining to separate an intricate policy into several simple sub-policies and organize their relationships as Bayesian strategy networks (BSN). We integrate this approach into the state-of-the-art DRL method -- soft actor-critic (SAC), and build the corresponding Bayesian soft actor-critic (BSAC) model by organizing several sub-policies as a joint policy. We compare our method against the state-of-the-art deep reinforcement learning algorithms on the standard continuous control benchmarks in the OpenAI Gym environment. The results demonstrate that the promising potential of the BSAC method significantly improves training efficiency.  ( 2 min )
    Characterization of Locality in Spin States and Forced Moves for Optimizations. (arXiv:2312.02544v1 [physics.app-ph])
    Ising formulations are widely utilized to solve combinatorial optimization problems, and a variety of quantum or semiconductor-based hardware has recently been made available. In combinatorial optimization problems, the existence of local minima in energy landscapes is problematic to use to seek the global minimum. We note that the aim of the optimization is not to obtain exact samplings from the Boltzmann distribution, and there is thus no need to satisfy detailed balance conditions. In light of this fact, we develop an algorithm to get out of the local minima efficiently while it does not yield the exact samplings. For this purpose, we utilize a feature that characterizes locality in the current state, which is easy to obtain with a type of specialized hardware. Furthermore, as the proposed algorithm is based on a rejection-free algorithm, the computational cost is low. In this work, after presenting the details of the proposed algorithm, we report the results of numerical experiments that demonstrate the effectiveness of the proposed feature and algorithm.  ( 2 min )
    On minimizing the training set fill distance in machine learning regression. (arXiv:2307.10988v2 [cs.LG] UPDATED)
    For regression tasks one often leverages large datasets for training predictive machine learning models. However, using large datasets may not be feasible due to computational limitations or high data labelling costs. Therefore, suitably selecting small training sets from large pools of unlabelled data points is essential to maximize model performance while maintaining efficiency. In this work, we study Farthest Point Sampling (FPS), a data selection approach that aims to minimize the fill distance of the selected set. We derive an upper bound for the maximum expected prediction error, conditional to the location of the unlabelled data points, that linearly depends on the training set fill distance. For empirical validation, we perform experiments using two regression models on three datasets. We empirically show that selecting a training set by aiming to minimize the fill distance, thereby minimizing our derived bound, significantly reduces the maximum prediction error of various regression models, outperforming alternative sampling approaches by a large margin. Furthermore, we show that selecting training sets with the FPS can also increase model stability for the specific case of Gaussian kernel regression approaches.  ( 2 min )
    Model-based Deep Learning for Beam Prediction based on a Channel Chart. (arXiv:2312.02239v1 [cs.IT])
    Channel charting builds a map of the radio environment in an unsupervised way. The obtained chart locations can be seen as low-dimensional compressed versions of channel state information that can be used for a wide variety of applications, including beam prediction. In non-standalone or cell-free systems, chart locations computed at a given base station can be transmitted to several other base stations (possibly operating at different frequency bands) for them to predict which beams to use. This potentially yields a dramatic reduction of the overhead due to channel estimation or beam management, since only the base station performing charting requires channel state information, the others directly predicting the beam from the chart location. In this paper, advanced model-based neural network architectures are proposed for both channel charting and beam prediction. The proposed methods are assessed on realistic synthetic channels, yielding promising results.  ( 2 min )
    Semi-Supervised Health Index Monitoring with Feature Generation and Fusion. (arXiv:2312.02867v1 [cs.LG])
    The Health Index (HI) is crucial for evaluating system health, aiding tasks like anomaly detection and predicting remaining useful life for systems demanding high safety and reliability. Tight monitoring is crucial for achieving high precision at a lower cost, with applications such as spray coating. Obtaining HI labels in real-world applications is often cost-prohibitive, requiring continuous, precise health measurements. Therefore, it is more convenient to leverage run-to failure datasets that may provide potential indications of machine wear condition, making it necessary to apply semi-supervised tools for HI construction. In this study, we adapt the Deep Semi-supervised Anomaly Detection (DeepSAD) method for HI construction. We use the DeepSAD embedding as a condition indicators to address interpretability challenges and sensitivity to system-specific factors. Then, we introduce a diversity loss to enrich condition indicators. We employ an alternating projection algorithm with isotonic constraints to transform the DeepSAD embedding into a normalized HI with an increasing trend. Validation on the PHME 2010 milling dataset, a recognized benchmark with ground truth HIs demonstrates meaningful HIs estimations. Our methodology is then applied to monitor wear states of thermal spray coatings using high-frequency voltage. Our contributions create opportunities for more accessible and reliable HI estimation, particularly in cases where obtaining ground truth HI labels is unfeasible.  ( 2 min )
    Score-Aware Policy-Gradient Methods and Performance Guarantees using Local Lyapunov Conditions: Applications to Product-Form Stochastic Networks and Queueing Systems. (arXiv:2312.02804v1 [cs.LG])
    Stochastic networks and queueing systems often lead to Markov decision processes (MDPs) with large state and action spaces as well as nonconvex objective functions, which hinders the convergence of many reinforcement learning (RL) algorithms. Policy-gradient methods perform well on MDPs with large state and action spaces, but they sometimes experience slow convergence due to the high variance of the gradient estimator. In this paper, we show that some of these difficulties can be circumvented by exploiting the structure of the underlying MDP. We first introduce a new family of gradient estimators called score-aware gradient estimators (SAGEs). When the stationary distribution of the MDP belongs to an exponential family parametrized by the policy parameters, SAGEs allow us to estimate the policy gradient without relying on value-function estimation, contrary to classical policy-gradient methods like actor-critic. To demonstrate their applicability, we examine two common control problems arising in stochastic networks and queueing systems whose stationary distributions have a product-form, a special case of exponential families. As a second contribution, we show that, under appropriate assumptions, the policy under a SAGE-based policy-gradient method has a large probability of converging to an optimal policy, provided that it starts sufficiently close to it, even with a nonconvex objective function and multiple maximizers. Our key assumptions are that, locally around a maximizer, a nondegeneracy property of the Hessian of the objective function holds and a Lyapunov function exists. Finally, we conduct a numerical comparison between a SAGE-based policy-gradient method and an actor-critic algorithm. The results demonstrate that the SAGE-based method finds close-to-optimal policies more rapidly, highlighting its superior performance over the traditional actor-critic method.  ( 3 min )
    Learning High-Order Relationships of Brain Regions. (arXiv:2312.02203v1 [q-bio.NC])
    Discovering reliable and informative interactions among brain regions from functional magnetic resonance imaging (fMRI) signals is essential in neuroscientific predictions of cognition. Most of the current methods fail to accurately characterize those interactions because they only focus on pairwise connections and overlook the high-order relationships of brain regions. We delve into this problem and argue that these high-order relationships should be maximally informative and minimally redundant (MIMR). However, identifying such high-order relationships is challenging and highly under-explored. Methods that can be tailored to our context are also non-existent. In response to this gap, we propose a novel method named HyBRiD that aims to extract MIMR high-order relationships from fMRI data. HyBRiD employs a Constructor to identify hyperedge structures, and a Weighter to compute a weight for each hyperedge. HyBRiD achieves the MIMR objective through an innovative information bottleneck framework named multi-head drop-bottleneck with theoretical guarantees. Our comprehensive experiments demonstrate the effectiveness of our model. Our model outperforms the state-of-the-art predictive model by an average of 12.1%, regarding the quality of hyperedges measured by CPM, a standard protocol for studying brain connections.  ( 2 min )
    Policy Gradient with Kernel Quadrature. (arXiv:2310.14768v2 [cs.LG] UPDATED)
    Reward evaluation of episodes becomes a bottleneck in a broad range of reinforcement learning tasks. Our aim in this paper is to select a small but representative subset of a large batch of episodes, only on which we actually compute rewards for more efficient policy gradient iterations. We build a Gaussian process modeling of discounted returns or rewards to derive a positive definite kernel on the space of episodes, run an ``episodic" kernel quadrature method to compress the information of sample episodes, and pass the reduced episodes to the policy network for gradient updates. We present the theoretical background of this procedure as well as its numerical illustrations in MuJoCo tasks.  ( 2 min )
    SAMSGL: Series-Aligned Multi-Scale Graph Learning for Spatio-Temporal Forecasting. (arXiv:2312.02646v1 [cs.LG])
    Spatio-temporal forecasting in various domains, like traffic prediction and weather forecasting, is a challenging endeavor, primarily due to the difficulties in modeling propagation dynamics and capturing high-dimensional interactions among nodes. Despite the significant strides made by graph-based networks in spatio-temporal forecasting, there remain two pivotal factors closely related to forecasting performance that need further consideration: time delays in propagation dynamics and multi-scale high-dimensional interactions. In this work, we present a Series-Aligned Multi-Scale Graph Learning (SAMSGL) framework, aiming to enhance forecasting performance. In order to handle time delays in spatial interactions, we propose a series-aligned graph convolution layer to facilitate the aggregation of non-delayed graph signals, thereby mitigating the influence of time delays for the improvement in accuracy. To understand global and local spatio-temporal interactions, we develop a spatio-temporal architecture via multi-scale graph learning, which encompasses two essential components: multi-scale graph structure learning and graph-fully connected (Graph-FC) blocks. The multi-scale graph structure learning includes a global graph structure to learn both delayed and non-delayed node embeddings, as well as a local one to learn node variations influenced by neighboring factors. The Graph-FC blocks synergistically fuse spatial and temporal information to boost prediction accuracy. To evaluate the performance of SAMSGL, we conduct experiments on meteorological and traffic forecasting datasets, which demonstrate its effectiveness and superiority.  ( 2 min )
    ULMA: Unified Language Model Alignment with Demonstration and Point-wise Human Preference. (arXiv:2312.02554v1 [cs.LG])
    Language model alignment is a cutting-edge technique in large language model training to align the model output to user's intent, e.g., being helpful and harmless. Recent alignment framework consists of two steps: supervised fine-tuning with demonstration data and preference learning with human preference data. Previous preference learning methods, such as RLHF and DPO, mainly focus on pair-wise preference data. However, in many real-world scenarios where human feedbacks are intrinsically point-wise, these methods will suffer from information loss or even fail. To fill this gap, in this paper, we first develop a preference learning method called point-wise DPO to tackle point-wise preference data. Further revelation on the connection between supervised fine-tuning and point-wise preference learning enables us to develop a unified framework for both human demonstration and point-wise preference data, which sheds new light on the construction of preference dataset. Extensive experiments on point-wise datasets with binary or continuous labels demonstrate the superior performance and efficiency of our proposed methods. A new dataset with high-quality demonstration samples on harmlessness is constructed and made publicly available.  ( 2 min )
    LExCI: A Framework for Reinforcement Learning with Embedded Systems. (arXiv:2312.02739v1 [cs.LG])
    Advances in artificial intelligence (AI) have led to its application in many areas of everyday life. In the context of control engineering, reinforcement learning (RL) represents a particularly promising approach as it is centred around the idea of allowing an agent to freely interact with its environment to find an optimal strategy. One of the challenges professionals face when training and deploying RL agents is that the latter often have to run on dedicated embedded devices. This could be to integrate them into an existing toolchain or to satisfy certain performance criteria like real-time constraints. Conventional RL libraries, however, cannot be easily utilised in conjunction with that kind of hardware. In this paper, we present a framework named LExCI, the Learning and Experiencing Cycle Interface, which bridges this gap and provides end-users with a free and open-source tool for training agents on embedded systems using the open-source library RLlib. Its operability is demonstrated with two state-of-the-art RL-algorithms and a rapid control prototyping system.  ( 2 min )
    Motion Informed Needle Segmentation in Ultrasound Images. (arXiv:2312.01239v2 [eess.IV] UPDATED)
    Segmenting a moving needle in ultrasound images is challenging due to the presence of artifacts, noise, and needle occlusion. This task becomes even more demanding in scenarios where data availability is limited. Convolutional Neural Networks (CNNs) have been successful in many computer vision applications, but struggle to accurately segment needles without considering their motion. In this paper, we present a novel approach for needle segmentation that combines classical Kalman Filter (KF) techniques with data-driven learning, incorporating both needle features and needle motion. Our method offers two key contributions. First, we propose a compatible framework that seamlessly integrates into commonly used encoder-decoder style architectures. Second, we demonstrate superior performance compared to recent state-of-the-art needle segmentation models using our novel convolutional neural network (CNN) based KF-inspired block, achieving a 15\% reduction in pixel-wise needle tip error and an 8\% reduction in length error. Third, to our knowledge we are the first to implement a learnable filter to incorporate non-linear needle motion for improving needle segmentation.  ( 2 min )
    Can We Learn Communication-Efficient Optimizers?. (arXiv:2312.02204v1 [cs.LG])
    Communication-efficient variants of SGD, specifically local SGD, have received a great deal of interest in recent years. These approaches compute multiple gradient steps locally, that is on each worker, before averaging model parameters, helping relieve the critical communication bottleneck in distributed deep learning training. Although many variants of these approaches have been proposed, they can sometimes lag behind state-of-the-art adaptive optimizers for deep learning. In this work, we investigate if the recent progress in the emerging area of learned optimizers can potentially close this gap while remaining communication-efficient. Specifically, we meta-learn how to perform global updates given an update from local SGD iterations. Our results demonstrate that learned optimizers can substantially outperform local SGD and its sophisticated variants while maintaining their communication efficiency. Learned optimizers can even generalize to unseen and much larger datasets and architectures, including ImageNet and ViTs, and to unseen modalities such as language modeling. We therefore demonstrate the potential of learned optimizers for improving communication-efficient distributed learning.
    FlowHON: Representing Flow Fields Using Higher-Order Networks. (arXiv:2312.02243v1 [cs.LG])
    Flow fields are often partitioned into data blocks for massively parallel computation and analysis based on blockwise relationships. However, most of the previous techniques only consider the first-order dependencies among blocks, which is insufficient in describing complex flow patterns. In this work, we present FlowHON, an approach to construct higher-order networks (HONs) from flow fields. FlowHON captures the inherent higher-order dependencies in flow fields as nodes and estimates the transitions among them as edges. We formulate the HON construction as an optimization problem with three linear transformations. The first two layers correspond to the node generation and the third one corresponds to edge estimation. Our formulation allows the node generation and edge estimation to be solved in a unified framework. With FlowHON, the rich set of traditional graph algorithms can be applied without any modification to analyze flow fields, while leveraging the higher-order information to understand the inherent structure and manage flow data for efficiency. We demonstrate the effectiveness of FlowHON using a series of downstream tasks, including estimating the density of particles during tracing, partitioning flow fields for data management, and understanding flow fields using the node-link diagram representation of networks.
    Discovering Interpretable Physical Models using Symbolic Regression and Discrete Exterior Calculus. (arXiv:2310.06609v2 [cs.LG] UPDATED)
    Computational modeling is a key resource to gather insight into physical systems in modern scientific research and engineering. While access to large amount of data has fueled the use of Machine Learning (ML) to recover physical models from experiments and increase the accuracy of physical simulations, purely data-driven models have limited generalization and interpretability. To overcome these limitations, we propose a framework that combines Symbolic Regression (SR) and Discrete Exterior Calculus (DEC) for the automated discovery of physical models starting from experimental data. Since these models consist of mathematical expressions, they are interpretable and amenable to analysis, and the use of a natural, general-purpose discrete mathematical language for physics favors generalization with limited input data. Importantly, DEC provides building blocks for the discrete analogue of field theories, which are beyond the state-of-the-art applications of SR to physical problems. Further, we show that DEC allows to implement a strongly-typed SR procedure that guarantees the mathematical consistency of the recovered models and reduces the search space of symbolic expressions. Finally, we prove the effectiveness of our methodology by re-discovering three models of Continuum Physics from synthetic experimental data: Poisson equation, the Euler's Elastica and the equations of Linear Elasticity. Thanks to their general-purpose nature, the methods developed in this paper may be applied to diverse contexts of physical modeling.
    USat: A Unified Self-Supervised Encoder for Multi-Sensor Satellite Imagery. (arXiv:2312.02199v1 [cs.CV])
    Large, self-supervised vision models have led to substantial advancements for automatically interpreting natural images. Recent works have begun tailoring these methods to remote sensing data which has rich structure with multi-sensor, multi-spectral, and temporal information providing massive amounts of self-labeled data that can be used for self-supervised pre-training. In this work, we develop a new encoder architecture called USat that can input multi-spectral data from multiple sensors for self-supervised pre-training. USat is a vision transformer with modified patch projection layers and positional encodings to model spectral bands with varying spatial scales from multiple sensors. We integrate USat into a Masked Autoencoder (MAE) self-supervised pre-training procedure and find that a pre-trained USat outperforms state-of-the-art self-supervised MAE models trained on remote sensing data on multiple remote sensing benchmark datasets (up to 8%) and leads to improvements in low data regimes (up to 7%). Code and pre-trained weights are available at https://github.com/stanfordmlgroup/USat .
    Out-of-distribution Detection Learning with Unreliable Out-of-distribution Sources. (arXiv:2311.03236v2 [cs.LG] UPDATED)
    Out-of-distribution (OOD) detection discerns OOD data where the predictor cannot make valid predictions as in-distribution (ID) data, thereby increasing the reliability of open-world classification. However, it is typically hard to collect real out-of-distribution (OOD) data for training a predictor capable of discerning ID and OOD patterns. This obstacle gives rise to data generation-based learning methods, synthesizing OOD data via data generators for predictor training without requiring any real OOD data. Related methods typically pre-train a generator on ID data and adopt various selection procedures to find those data likely to be the OOD cases. However, generated data may still coincide with ID semantics, i.e., mistaken OOD generation remains, confusing the predictor between ID and OOD data. To this end, we suggest that generated data (with mistaken OOD generation) can be used to devise an auxiliary OOD detection task to facilitate real OOD detection. Specifically, we can ensure that learning from such an auxiliary task is beneficial if the ID and the OOD parts have disjoint supports, with the help of a well-designed training procedure for the predictor. Accordingly, we propose a powerful data generation-based learning method named Auxiliary Task-based OOD Learning (ATOL) that can relieve the mistaken OOD generation. We conduct extensive experiments under various OOD detection setups, demonstrating the effectiveness of our method against its advanced counterparts.
    Digital twinning of cardiac electrophysiology models from the surface ECG: a geodesic backpropagation approach. (arXiv:2308.08410v2 [math.OC] UPDATED)
    The eikonal equation has become an indispensable tool for modeling cardiac electrical activation accurately and efficiently. In principle, by matching clinically recorded and eikonal-based electrocardiograms (ECGs), it is possible to build patient-specific models of cardiac electrophysiology in a purely non-invasive manner. Nonetheless, the fitting procedure remains a challenging task. The present study introduces a novel method, Geodesic-BP, to solve the inverse eikonal problem. Geodesic-BP is well-suited for GPU-accelerated machine learning frameworks, allowing us to optimize the parameters of the eikonal equation to reproduce a given ECG. We show that Geodesic-BP can reconstruct a simulated cardiac activation with high accuracy in a synthetic test case, even in the presence of modeling inaccuracies. Furthermore, we apply our algorithm to a publicly available dataset of a biventricular rabbit model, with promising results. Given the future shift towards personalized medicine, Geodesic-BP has the potential to help in future functionalizations of cardiac models meeting clinical time constraints while maintaining the physiological accuracy of state-of-the-art cardiac models.
    PEFA: Parameter-Free Adapters for Large-scale Embedding-based Retrieval Models. (arXiv:2312.02429v1 [cs.IR])
    Embedding-based Retrieval Models (ERMs) have emerged as a promising framework for large-scale text retrieval problems due to powerful large language models. Nevertheless, fine-tuning ERMs to reach state-of-the-art results can be expensive due to the extreme scale of data as well as the complexity of multi-stages pipelines (e.g., pre-training, fine-tuning, distillation). In this work, we propose the PEFA framework, namely ParamEter-Free Adapters, for fast tuning of ERMs without any backward pass in the optimization. At index building stage, PEFA equips the ERM with a non-parametric k-nearest neighbor (kNN) component. At inference stage, PEFA performs a convex combination of two scoring functions, one from the ERM and the other from the kNN. Based on the neighborhood definition, PEFA framework induces two realizations, namely PEFA-XL (i.e., extra large) using double ANN indices and PEFA-XS (i.e., extra small) using a single ANN index. Empirically, PEFA achieves significant improvement on two retrieval applications. For document retrieval, regarding Recall@100 metric, PEFA improves not only pre-trained ERMs on Trivia-QA by an average of 13.2%, but also fine-tuned ERMs on NQ-320K by an average of 5.5%, respectively. For product search, PEFA improves the Recall@100 of the fine-tuned ERMs by an average of 5.3% and 14.5%, for PEFA-XS and PEFA-XL, respectively. Our code is available at https://github.com/ amzn/pecos/tree/mainline/examples/pefa-wsdm24
    LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models. (arXiv:2309.12307v2 [cs.CL] UPDATED)
    We present LongLoRA, an efficient fine-tuning approach that extends the context sizes of pre-trained large language models (LLMs), with limited computation cost. Typically, training LLMs with long context sizes is computationally expensive, requiring extensive training hours and GPU resources. For example, training on the context length of 8192 needs 16x computational costs in self-attention layers as that of 2048. In this paper, we speed up the context extension of LLMs in two aspects. On the one hand, although dense global attention is needed during inference, fine-tuning the model can be effectively and efficiently done by sparse local attention. The proposed shifted sparse attention (S$^2$-Attn) effectively enables context extension, leading to non-trivial computation saving with similar performance to fine-tuning with vanilla attention. Particularly, it can be implemented with only two lines of code in training, while being optional in inference. On the other hand, we revisit the parameter-efficient fine-tuning regime for context expansion. Notably, we find that LoRA for context extension works well under the premise of trainable embedding and normalization. LongLoRA combines this improved LoRA with S$^2$-Attn. LongLoRA demonstrates strong empirical results on various tasks on Llama2 models from 7B/13B to 70B. LongLoRA adopts Llama2 7B from 4k context to 100k, or Llama2 70B to 32k on a single 8x A100 machine. LongLoRA extends models' context while retaining their original architectures, and is compatible with most existing techniques, like Flash-Attention2. In addition, we further conduct supervised fine-tuning with LongLoRA and our long instruction-following LongAlpaca dataset.
    Variability of echo state network prediction horizon for partially observed dynamical systems. (arXiv:2306.10797v3 [eess.SY] UPDATED)
    Study of dynamical systems using partial state observation is an important problem due to its applicability to many real-world systems. We address the problem by studying an echo state network (ESN) framework with partial state input with partial or full state output. Application to the Lorenz system and Chua's oscillator (both numerically simulated and experimental systems) demonstrate the effectiveness of our method. We show that the ESN, as an autonomous dynamical system, is capable of making short-term predictions up to a few Lyapunov times. However, the prediction horizon has high variability depending on the initial condition-an aspect that we explore in detail using the distribution of the prediction horizon. Further, using a variety of statistical metrics to compare the long-term dynamics of the ESN predictions with numerically simulated or experimental dynamics and observed similar results, we show that the ESN can effectively learn the system's dynamics even when trained with noisy numerical or experimental datasets. Thus, we demonstrate the potential of ESNs to serve as cheap surrogate models for simulating the dynamics of systems where complete observations are unavailable.
    Fedstellar: A Platform for Decentralized Federated Learning. (arXiv:2306.09750v3 [cs.LG] UPDATED)
    In 2016, Google proposed Federated Learning (FL) as a novel paradigm to train Machine Learning (ML) models across the participants of a federation while preserving data privacy. Since its birth, Centralized FL (CFL) has been the most used approach, where a central entity aggregates participants' models to create a global one. However, CFL presents limitations such as communication bottlenecks, single point of failure, and reliance on a central server. Decentralized Federated Learning (DFL) addresses these issues by enabling decentralized model aggregation and minimizing dependency on a central entity. Despite these advances, current platforms training DFL models struggle with key issues such as managing heterogeneous federation network topologies. To overcome these challenges, this paper presents Fedstellar, a novel platform designed to train FL models in a decentralized, semi-decentralized, and centralized fashion across diverse federations of physical or virtualized devices. The Fedstellar implementation encompasses a web application with an interactive graphical interface, a controller for deploying federations of nodes using physical or virtual devices, and a core deployed on each device which provides the logic needed to train, aggregate, and communicate in the network. The effectiveness of the platform has been demonstrated in two scenarios: a physical deployment involving single-board devices such as Raspberry Pis for detecting cyberattacks, and a virtualized deployment comparing various FL approaches in a controlled environment using MNIST and CIFAR-10 datasets. In both scenarios, Fedstellar demonstrated consistent performance and adaptability, achieving F1 scores of 91%, 98%, and 91.2% using DFL for detecting cyberattacks and classifying MNIST and CIFAR-10, respectively, reducing training time by 32% compared to centralized approaches.
    MEMTO: Memory-guided Transformer for Multivariate Time Series Anomaly Detection. (arXiv:2312.02530v1 [cs.LG])
    Detecting anomalies in real-world multivariate time series data is challenging due to complex temporal dependencies and inter-variable correlations. Recently, reconstruction-based deep models have been widely used to solve the problem. However, these methods still suffer from an over-generalization issue and fail to deliver consistently high performance. To address this issue, we propose the MEMTO, a memory-guided Transformer using a reconstruction-based approach. It is designed to incorporate a novel memory module that can learn the degree to which each memory item should be updated in response to the input data. To stabilize the training procedure, we use a two-phase training paradigm which involves using K-means clustering for initializing memory items. Additionally, we introduce a bi-dimensional deviation-based detection criterion that calculates anomaly scores considering both input space and latent space. We evaluate our proposed method on five real-world datasets from diverse domains, and it achieves an average anomaly detection F1-score of 95.74%, significantly outperforming the previous state-of-the-art methods. We also conduct extensive experiments to empirically validate the effectiveness of our proposed model's key components.
    Investigating the Catastrophic Forgetting in Multimodal Large Language Models. (arXiv:2309.10313v4 [cs.CL] UPDATED)
    Following the success of GPT4, there has been a surge in interest in multimodal large language model (MLLM) research. This line of research focuses on developing general-purpose LLMs through fine-tuning pre-trained LLMs and vision models. However, catastrophic forgetting, a notorious phenomenon where the fine-tuned model fails to retain similar performance compared to the pre-trained model, still remains an inherent problem in multimodal LLMs (MLLM). In this paper, we introduce EMT: Evaluating MulTimodality for evaluating the catastrophic forgetting in MLLMs, by treating each MLLM as an image classifier. We first apply EMT to evaluate several open-source fine-tuned MLLMs and we discover that almost all evaluated MLLMs fail to retain the same performance levels as their vision encoders on standard image classification tasks. Moreover, we continue fine-tuning LLaVA, an MLLM and utilize EMT to assess performance throughout the fine-tuning. Interestingly, our results suggest that early-stage fine-tuning on an image dataset improves performance across other image datasets, by enhancing the alignment of text and visual features. However, as fine-tuning proceeds, the MLLMs begin to hallucinate, resulting in a significant loss of generalizability, even when the image encoder remains frozen. Our results suggest that MLLMs have yet to demonstrate performance on par with their vision models on standard image classification tasks and the current MLLM fine-tuning procedure still has room for improvement.
    T3D: Towards 3D Medical Image Understanding through Vision-Language Pre-training. (arXiv:2312.01529v2 [cs.CV] UPDATED)
    Expert annotation of 3D medical image for downstream analysis is resource-intensive, posing challenges in clinical applications. Visual self-supervised learning (vSSL), though effective for learning visual invariance, neglects the incorporation of domain knowledge from medicine. To incorporate medical knowledge into visual representation learning, vision-language pre-training (VLP) has shown promising results in 2D image. However, existing VLP approaches become generally impractical when applied to high-resolution 3D medical images due to GPU hardware constraints and the potential loss of critical details caused by downsampling, which is the intuitive solution to hardware constraints. To address the above limitations, we introduce T3D, the first VLP framework designed for high-resolution 3D medical images. T3D incorporates two text-informed pretext tasks: (\lowerromannumeral{1}) text-informed contrastive learning; (\lowerromannumeral{2}) text-informed image restoration. These tasks focus on learning 3D visual representations from high-resolution 3D medical images and integrating clinical knowledge from radiology reports, without distorting information through forced alignment of downsampled volumes with detailed anatomical text. Trained on a newly curated large-scale dataset of 3D medical images and radiology reports, T3D significantly outperforms current vSSL methods in tasks like organ and tumor segmentation, as well as disease classification. This underlines T3D's potential in representation learning for 3D medical image analysis. All data and code will be available upon acceptance.
    Amortized Bayesian Decision Making for simulation-based models. (arXiv:2312.02674v1 [cs.LG])
    Simulation-based inference (SBI) provides a powerful framework for inferring posterior distributions of stochastic simulators in a wide range of domains. In many settings, however, the posterior distribution is not the end goal itself -- rather, the derived parameter values and their uncertainties are used as a basis for deciding what actions to take. Unfortunately, because posterior distributions provided by SBI are (potentially crude) approximations of the true posterior, the resulting decisions can be suboptimal. Here, we address the question of how to perform Bayesian decision making on stochastic simulators, and how one can circumvent the need to compute an explicit approximation to the posterior. Our method trains a neural network on simulated data and can predict the expected cost given any data and action, and can, thus, be directly used to infer the action with lowest cost. We apply our method to several benchmark problems and demonstrate that it induces similar cost as the true posterior distribution. We then apply the method to infer optimal actions in a real-world simulator in the medical neurosciences, the Bayesian Virtual Epileptic Patient, and demonstrate that it allows to infer actions associated with low cost after few simulations.
    Cost-effective On-device Continual Learning over Memory Hierarchy with Miro. (arXiv:2308.06053v4 [cs.LG] UPDATED)
    Continual learning (CL) trains NN models incrementally from a continuous stream of tasks. To remember previously learned knowledge, prior studies store old samples over a memory hierarchy and replay them when new tasks arrive. Edge devices that adopt CL to preserve data privacy are typically energy-sensitive and thus require high model accuracy while not compromising energy efficiency, i.e., cost-effectiveness. Our work is the first to explore the design space of hierarchical memory replay-based CL to gain insights into achieving cost-effectiveness on edge devices. We present Miro, a novel system runtime that carefully integrates our insights into the CL framework by enabling it to dynamically configure the CL system based on resource states for the best cost-effectiveness. To reach this goal, Miro also performs online profiling on parameters with clear accuracy-energy trade-offs and adapts to optimal values with low overhead. Extensive evaluations show that Miro significantly outperforms baseline systems we build for comparison, consistently achieving higher cost-effectiveness.
    Statistical exploration of the Manifold Hypothesis. (arXiv:2208.11665v3 [stat.ME] CROSS LISTED)
    The Manifold Hypothesis is a widely accepted tenet of Machine Learning which asserts that nominally high-dimensional data are in fact concentrated near a low-dimensional manifold, embedded in high-dimensional space. This phenomenon is observed empirically in many real world situations, has led to development of a wide range of statistical methods in the last few decades, and has been suggested as a key factor in the success of modern AI technologies. We show that rich and sometimes intricate manifold structure in data can emerge from a generic and remarkably simple statistical model -- the Latent Metric Model -- via elementary concepts such as latent variables, correlation and stationarity. This establishes a general statistical explanation for why the Manifold Hypothesis seems to hold in so many situations. Informed by the Latent Metric Model we derive procedures to discover and interpret the geometry of high-dimensional data, and explore hypotheses about the data generating mechanism. These procedures operate under minimal assumptions and make use of well known, scaleable graph-analytic algorithms.
    A Multi-Task Perspective for Link Prediction with New Relation Types and Nodes. (arXiv:2307.06046v2 [cs.LG] UPDATED)
    The task of inductive link prediction in (discrete) attributed multigraphs infers missing attributed links (relations) between nodes in new test multigraphs. Traditional relational learning methods face the challenge of limited generalization to test multigraphs containing both novel nodes and novel relation types not seen in training. Recently, under the only assumption that all relation types share the same structural predictive patterns (single task), Gao et al. (2023) proposed a link prediction method using the theoretical concept of double equivariance (equivariance for nodes & relation types), in contrast to the (single) equivariance (only for nodes) used to design Graph Neural Networks (GNNs). In this work we further extend the double equivariance concept to multi-task double equivariance, where we define link prediction in attributed multigraphs that can have distinct and potentially conflicting predictive patterns for different sets of relation types (multiple tasks). Our empirical results on real-world datasets demonstrate that our approach can effectively generalize to test graphs with multi-task structures without access to additional information.
    Large Language Models on Graphs: A Comprehensive Survey. (arXiv:2312.02783v1 [cs.CL])
    Large language models (LLMs), such as ChatGPT and LLaMA, are creating significant advancements in natural language processing, due to their strong text encoding/decoding ability and newly found emergent capability (e.g., reasoning). While LLMs are mainly designed to process pure texts, there are many real-world scenarios where text data are associated with rich structure information in the form of graphs (e.g., academic networks, and e-commerce networks) or scenarios where graph data are paired with rich textual information (e.g., molecules with descriptions). Besides, although LLMs have shown their pure text-based reasoning ability, it is underexplored whether such ability can be generalized to graph scenarios (i.e., graph-based reasoning). In this paper, we provide a systematic review of scenarios and techniques related to large language models on graphs. We first summarize potential scenarios of adopting LLMs on graphs into three categories, namely pure graphs, text-rich graphs, and text-paired graphs. We then discuss detailed techniques for utilizing LLMs on graphs, including LLM as Predictor, LLM as Encoder, and LLM as Aligner, and compare the advantages and disadvantages of different schools of models. Furthermore, we mention the real-world applications of such methods and summarize open-source codes and benchmark datasets. Finally, we conclude with potential future research directions in this fast-growing field. The related source can be found at https://github.com/PeterGriffinJin/Awesome-Language-Model-on-Graphs.
    H-GAP: Humanoid Control with a Generalist Planner. (arXiv:2312.02682v1 [cs.LG])
    Humanoid control is an important research challenge offering avenues for integration into human-centric infrastructures and enabling physics-driven humanoid animations. The daunting challenges in this field stem from the difficulty of optimizing in high-dimensional action spaces and the instability introduced by the bipedal morphology of humanoids. However, the extensive collection of human motion-captured data and the derived datasets of humanoid trajectories, such as MoCapAct, paves the way to tackle these challenges. In this context, we present Humanoid Generalist Autoencoding Planner (H-GAP), a state-action trajectory generative model trained on humanoid trajectories derived from human motion-captured data, capable of adeptly handling downstream control tasks with Model Predictive Control (MPC). For 56 degrees of freedom humanoid, we empirically demonstrate that H-GAP learns to represent and generate a wide range of motor behaviours. Further, without any learning from online interactions, it can also flexibly transfer these behaviors to solve novel downstream control tasks via planning. Notably, H-GAP excels established MPC baselines that have access to the ground truth dynamics model, and is superior or comparable to offline RL methods trained for individual tasks. Finally, we do a series of empirical studies on the scaling properties of H-GAP, showing the potential for performance gains via additional data but not computing. Code and videos are available at https://ycxuyingchen.github.io/hgap/.
    Regularization Trade-offs with Fake Features. (arXiv:2212.00433v2 [cs.LG] UPDATED)
    Recent successes of massively overparameterized models have inspired a new line of work investigating the underlying conditions that enable overparameterized models to generalize well. This paper considers a framework where the possibly overparametrized model includes fake features, i.e., features that are present in the model but not in the data. We present a non-asymptotic high-probability bound on the generalization error of the ridge regression problem under the model misspecification of having fake features. Our highprobability results provide insights into the interplay between the implicit regularization provided by the fake features and the explicit regularization provided by the ridge parameter. Numerical results illustrate the trade-off between the number of fake features and how the optimal ridge parameter may heavily depend on the number of fake features.
    The Bayesian Stability Zoo. (arXiv:2310.18428v2 [cs.LG] UPDATED)
    We show that many definitions of stability found in the learning theory literature are equivalent to one another. We distinguish between two families of definitions of stability: distribution-dependent and distribution-independent Bayesian stability. Within each family, we establish equivalences between various definitions, encompassing approximate differential privacy, pure differential privacy, replicability, global stability, perfect generalization, TV stability, mutual information stability, KL-divergence stability, and R\'enyi-divergence stability. Along the way, we prove boosting results that enable the amplification of the stability of a learning rule. This work is a step towards a more systematic taxonomy of stability notions in learning theory, which can promote clarity and an improved understanding of an array of stability concepts that have emerged in recent years.
    Primal-Attention: Self-attention through Asymmetric Kernel SVD in Primal Representation. (arXiv:2305.19798v2 [cs.LG] UPDATED)
    Recently, a new line of works has emerged to understand and improve self-attention in Transformers by treating it as a kernel machine. However, existing works apply the methods for symmetric kernels to the asymmetric self-attention, resulting in a nontrivial gap between the analytical understanding and numerical implementation. In this paper, we provide a new perspective to represent and optimize self-attention through asymmetric Kernel Singular Value Decomposition (KSVD), which is also motivated by the low-rank property of self-attention normally observed in deep layers. Through asymmetric KSVD, $i$) a primal-dual representation of self-attention is formulated, where the optimization objective is cast to maximize the projection variances in the attention outputs; $ii$) a novel attention mechanism, i.e., Primal-Attention, is proposed via the primal representation of KSVD, avoiding explicit computation of the kernel matrix in the dual; $iii$) with KKT conditions, we prove that the stationary solution to the KSVD optimization in Primal-Attention yields a zero-value objective. In this manner, KSVD optimization can be implemented by simply minimizing a regularization loss, so that low-rank property is promoted without extra decomposition. Numerical experiments show state-of-the-art performance of our Primal-Attention with improved efficiency. Moreover, we demonstrate that the deployed KSVD optimization regularizes Primal-Attention with a sharper singular value decay than that of the canonical self-attention, further verifying the great potential of our method. To the best of our knowledge, this is the first work that provides a primal-dual representation for the asymmetric kernel in self-attention and successfully applies it to modeling and optimization.
    MIMONets: Multiple-Input-Multiple-Output Neural Networks Exploiting Computation in Superposition. (arXiv:2312.02829v1 [cs.LG])
    With the advent of deep learning, progressively larger neural networks have been designed to solve complex tasks. We take advantage of these capacity-rich models to lower the cost of inference by exploiting computation in superposition. To reduce the computational burden per input, we propose Multiple-Input-Multiple-Output Neural Networks (MIMONets) capable of handling many inputs at once. MIMONets augment various deep neural network architectures with variable binding mechanisms to represent an arbitrary number of inputs in a compositional data structure via fixed-width distributed representations. Accordingly, MIMONets adapt nonlinear neural transformations to process the data structure holistically, leading to a speedup nearly proportional to the number of superposed input items in the data structure. After processing in superposition, an unbinding mechanism recovers each transformed input of interest. MIMONets also provide a dynamic trade-off between accuracy and throughput by an instantaneous on-demand switching between a set of accuracy-throughput operating points, yet within a single set of fixed parameters. We apply the concept of MIMONets to both CNN and Transformer architectures resulting in MIMOConv and MIMOFormer, respectively. Empirical evaluations show that MIMOConv achieves about 2-4 x speedup at an accuracy delta within [+0.68, -3.18]% compared to WideResNet CNNs on CIFAR10 and CIFAR100. Similarly, MIMOFormer can handle 2-4 inputs at once while maintaining a high average accuracy within a [-1.07, -3.43]% delta on the long range arena benchmark. Finally, we provide mathematical bounds on the interference between superposition channels in MIMOFormer. Our code is available at https://github.com/IBM/multiple-input-multiple-output-nets.
    On the Temperature of Bayesian Graph Neural Networks for Conformal Prediction. (arXiv:2310.11479v3 [cs.LG] UPDATED)
    Accurate uncertainty quantification in graph neural networks (GNNs) is essential, especially in high-stakes domains where GNNs are frequently employed. Conformal prediction (CP) offers a promising framework for quantifying uncertainty by providing $\textit{valid}$ prediction sets for any black-box model. CP ensures formal probabilistic guarantees that a prediction set contains a true label with a desired probability. However, the size of prediction sets, known as $\textit{inefficiency}$, is influenced by the underlying model and data generating process. On the other hand, Bayesian learning also provides a credible region based on the estimated posterior distribution, but this region is $\textit{well-calibrated}$ only when the model is correctly specified. Building on a recent work that introduced a scaling parameter for constructing valid credible regions from posterior estimate, our study explores the advantages of incorporating a temperature parameter into Bayesian GNNs within CP framework. We empirically demonstrate the existence of temperatures that result in more efficient prediction sets. Furthermore, we conduct an analysis to identify the factors contributing to inefficiency and offer valuable insights into the relationship between CP performance and model calibration.
    A Unified Theory of Diversity in Ensemble Learning. (arXiv:2301.03962v2 [cs.LG] UPDATED)
    We present a theory of ensemble diversity, explaining the nature of diversity for a wide range of supervised learning scenarios. This challenge, of understanding ensemble diversity, has been referred to as the "holy grail" of ensemble learning, an open research issue for over 30 years. Our framework reveals that diversity is in fact a hidden dimension in the bias-variance decomposition of the ensemble loss. We prove a family of exact bias-variance-diversity decompositions, for both regression and classification, e.g., squared, cross-entropy, and Poisson losses. For losses where an additive bias-variance decomposition is not available (e.g., 0/1 loss) we present an alternative approach, which precisely quantifies the effects of diversity, turning out to be dependent on the label distribution. Experiments show how we can use our framework to understand the diversity-encouraging mechanisms of popular methods: Bagging, Boosting, and Random Forests.
    FroSSL: Frobenius Norm Minimization for Self-Supervised Learning. (arXiv:2310.02903v2 [cs.LG] UPDATED)
    Self-supervised learning (SSL) is an increasingly popular paradigm for representation learning. Recent methods can be classified as sample-contrastive, dimension-contrastive, or asymmetric network-based, with each family having its own approach to avoiding informational collapse. While dimension-contrastive methods converge to similar solutions as sample-contrastive methods, it can be empirically shown that some methods require more epochs of training to converge. Motivated by closing this divide, we present the objective function FroSSL which is both sample- and dimension-contrastive up to embedding normalization. FroSSL works by minimizing covariance Frobenius norms for avoiding collapse and minimizing mean-squared error for augmentation invariance. We show that FroSSL converges more quickly than a variety of other SSL methods and provide theoretical and empirical support that this faster convergence is due to how FroSSL affects the eigenvalues of the embedding covariance matrices. We also show that FroSSL learns competitive representations on linear probe evaluation when used to train a ResNet18 on the CIFAR-10, CIFAR-100, STL-10, and ImageNet datasets.
    Traffic Signal Control with Communicative Deep Reinforcement Learning Agents: a Case Study. (arXiv:2107.01347v4 [cs.MA] UPDATED)
    In this work we analyze Multi-Agent Advantage Actor-Critic (MA2C) a recently proposed multi-agent reinforcement learning algorithm that can be applied to adaptive traffic signal control (ATSC) problems. To evaluate its potential we compare MA2C with Independent Advantage Actor-Critic (IA2C) and other Reinforcement Learning or heuristic based algorithms. Specifically, we analyze MA2C theoretically with the framework provided by non-Markov decision processes, which allows a deeper insight of the algorithm, and we critically examine the effectiveness and the robustness of the method by testing it in two traffic areas located in Bologna (Italy) simulated in SUMO, a software modeling tool for ATSC problems. Our results indicate that MA2C, trained with pseudo-random vehicle flows, is a promising technique able to outperform the alternative methods.
    Topological Graph Signal Compression. (arXiv:2308.11068v2 [cs.LG] UPDATED)
    Recently emerged Topological Deep Learning (TDL) methods aim to extend current Graph Neural Networks (GNN) by naturally processing higher-order interactions, going beyond the pairwise relations and local neighborhoods defined by graph representations. In this paper we propose a novel TDL-based method for compressing signals over graphs, consisting in two main steps: first, disjoint sets of higher-order structures are inferred based on the original signal --by clustering $N$ datapoints into $K\ll N$ collections; then, a topological-inspired message passing gets a compressed representation of the signal within those multi-element sets. Our results show that our framework improves both standard GNN and feed-forward architectures in compressing temporal link-based signals from two real-word Internet Service Provider Networks' datasets --from $30\%$ up to $90\%$ better reconstruction errors across all evaluation scenarios--, suggesting that it better captures and exploits spatial and temporal correlations over the whole graph-based network structure.
    A Simple and Scalable Representation for Graph Generation. (arXiv:2312.02230v1 [cs.LG])
    Recently, there has been a surge of interest in employing neural networks for graph generation, a fundamental statistical learning problem with critical applications like molecule design and community analysis. However, most approaches encounter significant limitations when generating large-scale graphs. This is due to their requirement to output the full adjacency matrices whose size grows quadratically with the number of nodes. In response to this challenge, we introduce a new, simple, and scalable graph representation named gap encoded edge list (GEEL) that has a small representation size that aligns with the number of edges. In addition, GEEL significantly reduces the vocabulary size by incorporating the gap encoding and bandwidth restriction schemes. GEEL can be autoregressively generated with the incorporation of node positional encoding, and we further extend GEEL to deal with attributed graphs by designing a new grammar. Our findings reveal that the adoption of this compact representation not only enhances scalability but also bolsters performance by simplifying the graph generation process. We conduct a comprehensive evaluation across ten non-attributed and two molecular graph generation tasks, demonstrating the effectiveness of GEEL.
    NeutronStream: A Dynamic GNN Training Framework with Sliding Window for Graph Streams. (arXiv:2312.02473v1 [cs.LG])
    Existing Graph Neural Network (GNN) training frameworks have been designed to help developers easily create performant GNN implementations. However, most existing GNN frameworks assume that the input graphs are static, but ignore that most real-world graphs are constantly evolving. Though many dynamic GNN models have emerged to learn from evolving graphs, the training process of these dynamic GNNs is dramatically different from traditional GNNs in that it captures both the spatial and temporal dependencies of graph updates. This poses new challenges for designing dynamic GNN training frameworks. First, the traditional batched training method fails to capture real-time structural evolution information. Second, the time-dependent nature makes parallel training hard to design. Third, it lacks system supports for users to efficiently implement dynamic GNNs. In this paper, we present NeutronStream, a framework for training dynamic GNN models. NeutronStream abstracts the input dynamic graph into a chronologically updated stream of events and processes the stream with an optimized sliding window to incrementally capture the spatial-temporal dependencies of events. Furthermore, NeutronStream provides a parallel execution engine to tackle the sequential event processing challenge to achieve high performance. NeutronStream also integrates a built-in graph storage structure that supports dynamic updates and provides a set of easy-to-use APIs that allow users to express their dynamic GNNs. Our experimental results demonstrate that, compared to state-of-the-art dynamic GNN implementations, NeutronStream achieves speedups ranging from 1.48X to 5.87X and an average accuracy improvement of 3.97%.
    Learning "Look-Ahead" Nonlocal Traffic Dynamics in a Ring Road. (arXiv:2312.02770v1 [cs.LG])
    The macroscopic traffic flow model is widely used for traffic control and management. To incorporate drivers' anticipative behaviors and to remove impractical speed discontinuity inherent in the classic Lighthill-Whitham-Richards (LWR) traffic model, nonlocal partial differential equation (PDE) models with ``look-ahead" dynamics have been proposed, which assume that the speed is a function of weighted downstream traffic density. However, it lacks data validation on two important questions: whether there exist nonlocal dynamics, and how the length and weight of the ``look-ahead" window affect the spatial temporal propagation of traffic densities. In this paper, we adopt traffic trajectory data from a ring-road experiment and design a physics-informed neural network to learn the fundamental diagram and look-ahead kernel that best fit the data, and reinvent a data-enhanced nonlocal LWR model via minimizing the loss function combining the data discrepancy and the nonlocal model discrepancy. Results show that the learned nonlocal LWR yields a more accurate prediction of traffic wave propagation in three different scenarios: stop-and-go oscillations, congested, and free traffic. We first demonstrate the existence of ``look-ahead" effect with real traffic data. The optimal nonlocal kernel is found out to take a length of around 35 to 50 meters, and the kernel weight within 5 meters accounts for the majority of the nonlocal effect. Our results also underscore the importance of choosing a priori physics in machine learning models.
    Structured World Representations in Maze-Solving Transformers. (arXiv:2312.02566v1 [cs.LG])
    Transformer models underpin many recent advances in practical machine learning applications, yet understanding their internal behavior continues to elude researchers. Given the size and complexity of these models, forming a comprehensive picture of their inner workings remains a significant challenge. To this end, we set out to understand small transformer models in a more tractable setting: that of solving mazes. In this work, we focus on the abstractions formed by these models and find evidence for the consistent emergence of structured internal representations of maze topology and valid paths. We demonstrate this by showing that the residual stream of only a single token can be linearly decoded to faithfully reconstruct the entire maze. We also find that the learned embeddings of individual tokens have spatial structure. Furthermore, we take steps towards deciphering the circuity of path-following by identifying attention heads (dubbed $\textit{adjacency heads}$), which are implicated in finding valid subsequent tokens.
    Projection Regret: Reducing Background Bias for Novelty Detection via Diffusion Models. (arXiv:2312.02615v1 [cs.LG])
    Novelty detection is a fundamental task of machine learning which aims to detect abnormal ($\textit{i.e.}$ out-of-distribution (OOD)) samples. Since diffusion models have recently emerged as the de facto standard generative framework with surprising generation results, novelty detection via diffusion models has also gained much attention. Recent methods have mainly utilized the reconstruction property of in-distribution samples. However, they often suffer from detecting OOD samples that share similar background information to the in-distribution data. Based on our observation that diffusion models can \emph{project} any sample to an in-distribution sample with similar background information, we propose \emph{Projection Regret (PR)}, an efficient novelty detection method that mitigates the bias of non-semantic information. To be specific, PR computes the perceptual distance between the test image and its diffusion-based projection to detect abnormality. Since the perceptual distance often fails to capture semantic changes when the background information is dominant, we cancel out the background bias by comparing it against recursive projections. Extensive experiments demonstrate that PR outperforms the prior art of generative-model-based novelty detection methods by a significant margin.
    Quantum Neural Architecture Search with Quantum Circuits Metric and Bayesian Optimization. (arXiv:2206.14115v1 [quant-ph] CROSS LISTED)
    Quantum neural networks are promising for a wide range of applications in the Noisy Intermediate-Scale Quantum era. As such, there is an increasing demand for automatic quantum neural architecture search. We tackle this challenge by designing a quantum circuits metric for Bayesian optimization with Gaussian process. To this goal, we propose a new quantum gates distance that characterizes the gates' action over every quantum state and provide a theoretical perspective on its geometrical properties. Our approach significantly outperforms the benchmark on three empirical quantum machine learning problems including training a quantum generative adversarial network, solving combinatorial optimization in the MaxCut problem, and simulating quantum Fourier transform. Our method can be extended to characterize behaviors of various quantum machine learning models.
    How Generative-AI can be Effectively used in Government Chatbots. (arXiv:2312.02181v1 [cs.CL])
    With the rapid development of artificial intelligence and breakthroughs in machine learning and natural language processing, intelligent question-answering robots have become widely used in government affairs. This paper conducts a horizontal comparison between Guangdong Province's government chatbots, ChatGPT, and Wenxin Ernie, two large language models, to analyze the strengths and weaknesses of existing government chatbots and AIGC technology. The study finds significant differences between government chatbots and large language models. China's government chatbots are still in an exploratory stage and have a gap to close to achieve "intelligence." To explore the future direction of government chatbots more deeply, this research proposes targeted optimization paths to help generative AI be effectively applied in government chatbot conversations.
    Learning a Sparse Representation of Barron Functions with the Inverse Scale Space Flow. (arXiv:2312.02671v1 [stat.ML])
    This paper presents a method for finding a sparse representation of Barron functions. Specifically, given an $L^2$ function $f$, the inverse scale space flow is used to find a sparse measure $\mu$ minimising the $L^2$ loss between the Barron function associated to the measure $\mu$ and the function $f$. The convergence properties of this method are analysed in an ideal setting and in the cases of measurement noise and sampling bias. In an ideal setting the objective decreases strictly monotone in time to a minimizer with $\mathcal{O}(1/t)$, and in the case of measurement noise or sampling bias the optimum is achieved up to a multiplicative or additive constant. This convergence is preserved on discretization of the parameter space, and the minimizers on increasingly fine discretizations converge to the optimum on the full parameter space.
    Re-Nerfing: Enforcing Geometric Constraints on Neural Radiance Fields through Novel Views Synthesis. (arXiv:2312.02255v1 [cs.CV])
    Neural Radiance Fields (NeRFs) have shown remarkable novel view synthesis capabilities even in large-scale, unbounded scenes, albeit requiring hundreds of views or introducing artifacts in sparser settings. Their optimization suffers from shape-radiance ambiguities wherever only a small visual overlap is available. This leads to erroneous scene geometry and artifacts. In this paper, we propose Re-Nerfing, a simple and general multi-stage approach that leverages NeRF's own view synthesis to address these limitations. With Re-Nerfing, we increase the scene's coverage and enhance the geometric consistency of novel views as follows: First, we train a NeRF with the available views. Then, we use the optimized NeRF to synthesize pseudo-views next to the original ones to simulate a stereo or trifocal setup. Finally, we train a second NeRF with both original and pseudo views while enforcing structural, epipolar constraints via the newly synthesized images. Extensive experiments on the mip-NeRF 360 dataset show the effectiveness of Re-Nerfing across denser and sparser input scenarios, bringing improvements to the state-of-the-art Zip-NeRF, even when trained with all views.
    Congestion-aware Distributed Task Offloading in Wireless Multi-hop Networks Using Graph Neural Networks. (arXiv:2312.02471v1 [cs.NI])
    Computational offloading has become an enabling component for edge intelligence in mobile and smart devices. Existing offloading schemes mainly focus on mobile devices and servers, while ignoring the potential network congestion caused by tasks from multiple mobile devices, especially in wireless multi-hop networks. To fill this gap, we propose a low-overhead, congestion-aware distributed task offloading scheme by augmenting a distributed greedy framework with graph-based machine learning. In simulated wireless multi-hop networks with 20-110 nodes and a resource allocation scheme based on shortest path routing and contention-based link scheduling, our approach is demonstrated to be effective in reducing congestion or unstable queues under the context-agnostic baseline, while improving the execution latency over local computing.
    Neural Priming for Sample-Efficient Adaptation. (arXiv:2306.10191v3 [cs.LG] UPDATED)
    We propose Neural Priming, a technique for adapting large pretrained models to distribution shifts and downstream tasks given few or no labeled examples. Presented with class names or unlabeled test samples, Neural Priming enables the model to recall and conditions its parameters on relevant data seen throughout pretraining, thereby priming it for the test distribution. Neural Priming can be performed at test time, even for pretraining datasets as large as LAION-2B. Performing lightweight updates on the recalled data significantly improves accuracy across a variety of distribution shift and transfer learning benchmarks. Concretely, in the zero-shot setting, we see a 2.45% improvement in accuracy on ImageNet and 3.81% accuracy improvement on average across standard transfer learning benchmarks. Further, using Neural Priming at inference to adapt to distribution shift, we see a 1.41% accuracy improvement on ImageNetV2. These results demonstrate the effectiveness of Neural Priming in addressing the challenge of limited labeled data and changing distributions. Code is available at github.com/RAIVNLab/neural-priming.
    CROP: Towards Distributional-Shift Robust Reinforcement Learning using Compact Reshaped Observation Processing. (arXiv:2304.13616v2 [cs.LG] UPDATED)
    The safe application of reinforcement learning (RL) requires generalization from limited training data to unseen scenarios. Yet, fulfilling tasks under changing circumstances is a key challenge in RL. Current state-of-the-art approaches for generalization apply data augmentation techniques to increase the diversity of training data. Even though this prevents overfitting to the training environment(s), it hinders policy optimization. Crafting a suitable observation, only containing crucial information, has been shown to be a challenging task itself. To improve data efficiency and generalization capabilities, we propose Compact Reshaped Observation Processing (CROP) to reduce the state information used for policy optimization. By providing only relevant information, overfitting to a specific training layout is precluded and generalization to unseen environments is improved. We formulate three CROPs that can be applied to fully observable observation- and action-spaces and provide methodical foundation. We empirically show the improvements of CROP in a distributionally shifted safety gridworld. We furthermore provide benchmark comparisons to full observability and data-augmentation in two different-sized procedurally generated mazes.
    Lessons from Usable ML Deployments and Application to Wind Turbine Monitoring. (arXiv:2312.02859v1 [cs.LG])
    Through past experiences deploying what we call usable ML (one step beyond explainable ML, including both explanations and other augmenting information) to real-world domains, we have learned three key lessons. First, many organizations are beginning to hire people who we call ``bridges'' because they bridge the gap between ML developers and domain experts, and these people fill a valuable role in developing usable ML applications. Second, a configurable system that enables easily iterating on usable ML interfaces during collaborations with bridges is key. Finally, there is a need for continuous, in-deployment evaluations to quantify the real-world impact of usable ML. Throughout this paper, we apply these lessons to the task of wind turbine monitoring, an essential task in the renewable energy domain. Turbine engineers and data analysts must decide whether to perform costly in-person investigations on turbines to prevent potential cases of brakepad failure, and well-tuned usable ML interfaces can aid with this decision-making process. Through the applications of our lessons to this task, we hope to demonstrate the potential real-world impact of usable ML in the renewable energy domain.
    TSVR+: Twin support vector regression with privileged information. (arXiv:2312.02596v1 [cs.LG])
    In the realm of machine learning, the data may contain additional attributes, known as privileged information (PI). The main purpose of PI is to assist in the training of the model and then utilize the acquired knowledge to make predictions for unseen samples. Support vector regression (SVR) is an effective regression model, however, it has a low learning speed due to solving a convex quadratic problem (QP) subject to a pair of constraints. In contrast, twin support vector regression (TSVR) is more efficient than SVR as it solves two QPs each subject to one set of constraints. However, TSVR and its variants are trained only on regular features and do not use privileged features for training. To fill this gap, we introduce a fusion of TSVR with learning using privileged information (LUPI) and propose a novel approach called twin support vector regression with privileged information (TSVR+). The regularization terms in the proposed TSVR+ capture the essence of statistical learning theory and implement the structural risk minimization principle. We use the successive overrelaxation (SOR) technique to solve the optimization problem of the proposed TSVR+, which enhances the training efficiency. As far as our knowledge extends, the integration of the LUPI concept into twin variants of regression models is a novel advancement. The numerical experiments conducted on UCI, stock and time series data collectively demonstrate the superiority of the proposed model.
    VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding. (arXiv:2312.02310v1 [cs.CV])
    Recent advancements in language-model-based video understanding have been progressing at a remarkable pace, spurred by the introduction of Large Language Models (LLMs). However, the focus of prior research has been predominantly on devising a projection layer that maps video features to tokens, an approach that is both rudimentary and inefficient. In our study, we introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information. At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings, which enables a more aligned selection of frames with the given question. At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer (abbreviated as VQ-Former), which bolsters the interplay between the input question and the video features. We also discover that incorporating a simple prompt, "Please be critical", into the LLM input can substantially enhance its video comprehension capabilities. Our experimental results indicate that VaQuitA consistently sets a new benchmark for zero-shot video question-answering tasks and is adept at producing high-quality, multi-turn video dialogues with users.
    Solving Inverse Physics Problems with Score Matching. (arXiv:2301.10250v2 [cs.LG] UPDATED)
    We propose to solve inverse problems involving the temporal evolution of physics systems by leveraging recent advances from diffusion models. Our method moves the system's current state backward in time step by step by combining an approximate inverse physics simulator and a learned correction function. A central insight of our work is that training the learned correction with a single-step loss is equivalent to a score matching objective, while recursively predicting longer parts of the trajectory during training relates to maximum likelihood training of a corresponding probability flow. We highlight the advantages of our algorithm compared to standard denoising score matching and implicit score matching, as well as fully learned baselines for a wide range of inverse physics problems. The resulting inverse solver has excellent accuracy and temporal stability and, in contrast to other learned inverse solvers, allows for sampling the posterior of the solutions.
    VideoDubber: Machine Translation with Speech-Aware Length Control for Video Dubbing. (arXiv:2211.16934v2 [cs.CL] UPDATED)
    Video dubbing aims to translate the original speech in a film or television program into the speech in a target language, which can be achieved with a cascaded system consisting of speech recognition, machine translation and speech synthesis. To ensure the translated speech to be well aligned with the corresponding video, the length/duration of the translated speech should be as close as possible to that of the original speech, which requires strict length control. Previous works usually control the number of words or characters generated by the machine translation model to be similar to the source sentence, without considering the isochronicity of speech as the speech duration of words/characters in different languages varies. In this paper, we propose a machine translation system tailored for the task of video dubbing, which directly considers the speech duration of each token in translation, to match the length of source and target speech. Specifically, we control the speech length of generated sentence by guiding the prediction of each word with the duration information, including the speech duration of itself as well as how much duration is left for the remaining words. We design experiments on four language directions (German -> English, Spanish -> English, Chinese English), and the results show that the proposed method achieves better length control ability on the generated speech than baseline methods. To make up the lack of real-world datasets, we also construct a real-world test set collected from films to provide comprehensive evaluations on the video dubbing task.
    Stackelberg Driver Model for Continual Policy Improvement in Scenario-Based Closed-Loop Autonomous Driving. (arXiv:2309.14235v3 [cs.LG] UPDATED)
    The deployment of autonomous vehicles (AVs) has faced hurdles due to the dominance of rare but critical corner cases within the long-tail distribution of driving scenarios, which negatively affects their overall performance. To address this challenge, adversarial generation methods have emerged as a class of efficient approaches to synthesize safety-critical scenarios for AV testing. However, these generated scenarios are often underutilized for AV training, resulting in the potential for continual AV policy improvement remaining untapped, along with a deficiency in the closed-loop design needed to achieve it. Therefore, we tailor the Stackelberg Driver Model (SDM) to accurately characterize the hierarchical nature of vehicle interaction dynamics, facilitating iterative improvement by engaging background vehicles (BVs) and AV in a sequential game-like interaction paradigm. With AV acting as the leader and BVs as followers, this leader-follower modeling ensures that AV would consistently refine its policy, always taking into account the additional information that BVs play the best response to challenge AV. Extensive experiments have shown that our algorithm exhibits superior performance compared to several baselines especially in higher dimensional scenarios, leading to substantial advancements in AV capabilities while continually generating progressively challenging scenarios. Code is available at https://github.com/BlueCat-de/SDM.
    Harmonizing Global Voices: Culturally-Aware Models for Enhanced Content Moderation. (arXiv:2312.02401v1 [stat.ML])
    Content moderation at scale faces the challenge of considering local cultural distinctions when assessing content. While global policies aim to maintain decision-making consistency and prevent arbitrary rule enforcement, they often overlook regional variations in interpreting natural language as expressed in content. In this study, we are looking into how moderation systems can tackle this issue by adapting to local comprehension nuances. We train large language models on extensive datasets of media news and articles to create culturally attuned models. The latter aim to capture the nuances of communication across geographies with the goal of recognizing cultural and societal variations in what is considered offensive content. We further explore the capability of these models to generate explanations for instances of content violation, aiming to shed light on how policy guidelines are perceived when cultural and societal contexts change. We find that training on extensive media datasets successfully induced cultural awareness and resulted in improvements in handling content violations on a regional basis. Additionally, these advancements include the ability to provide explanations that align with the specific local norms and nuances as evidenced by the annotators' preference in our conducted study. This multifaceted success reinforces the critical role of an adaptable content moderation approach in keeping pace with the ever-evolving nature of the content it oversees.
    Weakly Supervised Detection of Hallucinations in LLM Activations. (arXiv:2312.02798v1 [cs.LG])
    We propose an auditing method to identify whether a large language model (LLM) encodes patterns such as hallucinations in its internal states, which may propagate to downstream tasks. We introduce a weakly supervised auditing technique using a subset scanning approach to detect anomalous patterns in LLM activations from pre-trained models. Importantly, our method does not need knowledge of the type of patterns a-priori. Instead, it relies on a reference dataset devoid of anomalies during testing. Further, our approach enables the identification of pivotal nodes responsible for encoding these patterns, which may offer crucial insights for fine-tuning specific sub-networks for bias mitigation. We introduce two new scanning methods to handle LLM activations for anomalous sentences that may deviate from the expected distribution in either direction. Our results confirm prior findings of BERT's limited internal capacity for encoding hallucinations, while OPT appears capable of encoding hallucination information internally. Importantly, our scanning approach, without prior exposure to false statements, performs comparably to a fully supervised out-of-distribution classifier.
    Analyzing and Improving the Training Dynamics of Diffusion Models. (arXiv:2312.02696v1 [cs.CV])
    Diffusion models currently dominate the field of data-driven image synthesis with their unparalleled scaling to large datasets. In this paper, we identify and rectify several causes for uneven and ineffective training in the popular ADM diffusion model architecture, without altering its high-level structure. Observing uncontrolled magnitude changes and imbalances in both the network activations and weights over the course of training, we redesign the network layers to preserve activation, weight, and update magnitudes on expectation. We find that systematic application of this philosophy eliminates the observed drifts and imbalances, resulting in considerably better networks at equal computational complexity. Our modifications improve the previous record FID of 2.41 in ImageNet-512 synthesis to 1.81, achieved using fast deterministic sampling. As an independent contribution, we present a method for setting the exponential moving average (EMA) parameters post-hoc, i.e., after completing the training run. This allows precise tuning of EMA length without the cost of performing several training runs, and reveals its surprising interactions with network architecture, training time, and guidance.
    Attention-enhanced neural differential equations for physics-informed deep learning of ion transport. (arXiv:2312.02871v1 [cs.LG])
    Species transport models typically combine partial differential equations (PDEs) with relations from hindered transport theory to quantify electromigrative, convective, and diffusive transport through complex nanoporous systems; however, these formulations are frequently substantial simplifications of the governing dynamics, leading to the poor generalization performance of PDE-based models. Given the growing interest in deep learning methods for the physical sciences, we develop a machine learning-based approach to characterize ion transport across nanoporous membranes. Our proposed framework centers around attention-enhanced neural differential equations that incorporate electroneutrality-based inductive biases to improve generalization performance relative to conventional PDE-based methods. In addition, we study the role of the attention mechanism in illuminating physically-meaningful ion-pairing relationships across diverse mixture compositions. Further, we investigate the importance of pre-training on simulated data from PDE-based models, as well as the performance benefits from hard vs. soft inductive biases. Our results indicate that physics-informed deep learning solutions can outperform their classical PDE-based counterparts and provide promising avenues for modelling complex transport phenomena across diverse applications.
    Digital Histopathology with Graph Neural Networks: Concepts and Explanations for Clinicians. (arXiv:2312.02225v1 [physics.med-ph])
    To address the challenge of the ``black-box" nature of deep learning in medical settings, we combine GCExplainer - an automated concept discovery solution - along with Logic Explained Networks to provide global explanations for Graph Neural Networks. We demonstrate this using a generally applicable graph construction and classification pipeline, involving panoptic segmentation with HoVer-Net and cancer prediction with Graph Convolution Networks. By training on H&E slides of breast cancer, we show promising results in offering explainable and trustworthy AI tools for clinicians.
    Prompt Optimization via Adversarial In-Context Learning. (arXiv:2312.02614v1 [cs.LG])
    We propose a new method, Adversarial In-Context Learning (adv-ICL), to optimize prompt for in-context learning (ICL) by employing one LLM as a generator, another as a discriminator, and a third as a prompt modifier. As in traditional adversarial learning, adv-ICL is implemented as a two-player game between the generator and discriminator, where the generator tries to generate realistic enough output to fool the discriminator. In each round, given an input prefixed by task instructions and several exemplars, the generator produces an output. The discriminator is then tasked with classifying the generator input-output pair as model-generated or real data. Based on the discriminator loss, the prompt modifier proposes possible edits to the generator and discriminator prompts, and the edits that most improve the adversarial loss are selected. We show that adv-ICL results in significant improvements over state-of-the-art prompt optimization techniques for both open and closed-source models on 11 generation and classification tasks including summarization, arithmetic reasoning, machine translation, data-to-text generation, and the MMLU and big-bench hard benchmarks. In addition, because our method uses pre-trained models and updates only prompts rather than model parameters, it is computationally efficient, easy to extend to any LLM and task, and effective in low-resource settings.
    Fine-Tuning Language Models for Context-Specific SQL Query Generation. (arXiv:2312.02251v1 [cs.DB])
    The ability to generate SQL queries from natural language has significant implications for making data accessible to non-specialists. This paper presents a novel approach to fine-tuning open-source large language models (LLMs) for the task of transforming natural language into SQL queries within the retail domain. We introduce models specialized in generating SQL queries, trained on synthetic datasets tailored to the Snowflake SQL and GoogleSQL dialects. Our methodology involves generating a context-specific dataset using GPT-4, then fine-tuning three open-source LLMs(Starcoder Plus, Code-Llama, and Mistral) employing the LoRa technique to optimize for resource constraints. The fine-tuned models demonstrate superior performance in zero-shot settings compared to the baseline GPT-4, with Code-Llama achieving the highest accuracy rates, at 81.58% for Snowflake SQL and 82.66% for GoogleSQL. These results underscore the effectiveness of fine-tuning LLMs on domain-specific tasks and suggest a promising direction for enhancing the accessibility of relational databases through natural language interfaces.
    Adaptive Instrument Design for Indirect Experiments. (arXiv:2312.02438v1 [cs.LG])
    Indirect experiments provide a valuable framework for estimating treatment effects in situations where conducting randomized control trials (RCTs) is impractical or unethical. Unlike RCTs, indirect experiments estimate treatment effects by leveraging (conditional) instrumental variables, enabling estimation through encouragement and recommendation rather than strict treatment assignment. However, the sample efficiency of such estimators depends not only on the inherent variability in outcomes but also on the varying compliance levels of users with the instrumental variables and the choice of estimator being used, especially when dealing with numerous instrumental variables. While adaptive experiment design has a rich literature for direct experiments, in this paper we take the initial steps towards enhancing sample efficiency for indirect experiments by adaptively designing a data collection policy over instrumental variables. Our main contribution is a practical computational procedure that utilizes influence functions to search for an optimal data collection policy, minimizing the mean-squared error of the desired (non-linear) estimator. Through experiments conducted in various domains inspired by real-world applications, we showcase how our method can significantly improve the sample efficiency of indirect experiments.
    JarviX: A LLM No code Platform for Tabular Data Analysis and Optimization. (arXiv:2312.02213v1 [cs.LG])
    In this study, we introduce JarviX, a sophisticated data analytics framework. JarviX is designed to employ Large Language Models (LLMs) to facilitate an automated guide and execute high-precision data analyzes on tabular datasets. This framework emphasizes the significance of varying column types, capitalizing on state-of-the-art LLMs to generate concise data insight summaries, propose relevant analysis inquiries, visualize data effectively, and provide comprehensive explanations for results drawn from an extensive data analysis pipeline. Moreover, JarviX incorporates an automated machine learning (AutoML) pipeline for predictive modeling. This integration forms a comprehensive and automated optimization cycle, which proves particularly advantageous for optimizing machine configuration. The efficacy and adaptability of JarviX are substantiated through a series of practical use case studies.
    When is Offline Policy Selection Sample Efficient for Reinforcement Learning?. (arXiv:2312.02355v1 [cs.LG])
    Offline reinforcement learning algorithms often require careful hyperparameter tuning. Consequently, before deployment, we need to select amongst a set of candidate policies. As yet, however, there is little understanding about the fundamental limits of this offline policy selection (OPS) problem. In this work we aim to provide clarity on when sample efficient OPS is possible, primarily by connecting OPS to off-policy policy evaluation (OPE) and Bellman error (BE) estimation. We first show a hardness result, that in the worst case, OPS is just as hard as OPE, by proving a reduction of OPE to OPS. As a result, no OPS method can be more sample efficient than OPE in the worst case. We then propose a BE method for OPS, called Identifiable BE Selection (IBES), that has a straightforward method for selecting its own hyperparameters. We highlight that using IBES for OPS generally has more requirements than OPE methods, but if satisfied, can be more sample efficient. We conclude with an empirical study comparing OPE and IBES, and by showing the difficulty of OPS on an offline Atari benchmark dataset.
    Decoding Data Quality via Synthetic Corruptions: Embedding-guided Pruning of Code Data. (arXiv:2312.02418v1 [cs.CL])
    Code datasets, often collected from diverse and uncontrolled sources such as GitHub, potentially suffer from quality issues, thereby affecting the performance and training efficiency of Large Language Models (LLMs) optimized for code generation. Previous studies demonstrated the benefit of using embedding spaces for data pruning, but they mainly focused on duplicate removal or increasing variety, and in other modalities, such as images. Our work focuses on using embeddings to identify and remove "low-quality" code data. First, we explore features of "low-quality" code in embedding space, through the use of synthetic corruptions. Armed with this knowledge, we devise novel pruning metrics that operate in embedding space to identify and remove low-quality entries in the Stack dataset. We demonstrate the benefits of this synthetic corruption informed pruning (SCIP) approach on the well-established HumanEval and MBPP benchmarks, outperforming existing embedding-based methods. Importantly, we achieve up to a 3% performance improvement over no pruning, thereby showing the promise of insights from synthetic corruptions for data pruning.
    Quantum Machine Learning on Near-Term Quantum Devices: Current State of Supervised and Unsupervised Techniques for Real-World Applications. (arXiv:2307.00908v2 [quant-ph] UPDATED)
    The past decade has witnessed significant advancements in quantum hardware, encompassing improvements in speed, qubit quantity, and quantum volume-a metric defining the maximum size of a quantum circuit effectively implementable on near-term quantum devices. This progress has led to a surge in Quantum Machine Learning (QML) applications on real hardware, aiming to achieve quantum advantage over classical approaches. This survey focuses on selected supervised and unsupervised learning applications executed on quantum hardware, specifically tailored for real-world scenarios. The exploration includes a thorough analysis of current QML implementation limitations on quantum hardware, covering techniques like encoding, ansatz structure, error mitigation, and gradient methods to address these challenges. Furthermore, the survey evaluates the performance of QML implementations in comparison to classical counterparts. In conclusion, we discuss existing bottlenecks related to applying QML on real quantum devices and propose potential solutions to overcome these challenges in the future.
    Experimental Insights Towards Explainable and Interpretable Pedestrian Crossing Prediction. (arXiv:2312.02872v1 [cs.LG])
    In the context of autonomous driving, pedestrian crossing prediction is a key component for improving road safety. Presently, the focus of these predictions extends beyond achieving trustworthy results; it is shifting towards the explainability and interpretability of these predictions. This research introduces a novel neuro-symbolic approach that combines deep learning and fuzzy logic for an explainable and interpretable pedestrian crossing prediction. We have developed an explainable predictor (ExPedCross), which utilizes a set of explainable features and employs a fuzzy inference system to predict whether the pedestrian will cross or not. Our approach was evaluated on both the PIE and JAAD datasets. The results offer experimental insights into achieving explainability and interpretability in the pedestrian crossing prediction task. Furthermore, the testing results yield a set of guidelines and recommendations regarding the process of dataset selection, feature selection, and explainability.
    Reconsideration on evaluation of machine learning models in continuous monitoring using wearables. (arXiv:2312.02300v1 [cs.LG])
    This paper explores the challenges in evaluating machine learning (ML) models for continuous health monitoring using wearable devices beyond conventional metrics. We state the complexities posed by real-world variability, disease dynamics, user-specific characteristics, and the prevalence of false notifications, necessitating novel evaluation strategies. Drawing insights from large-scale heart studies, the paper offers a comprehensive guideline for robust ML model evaluation on continuous health monitoring.
    Identifying Spurious Correlations using Counterfactual Alignment. (arXiv:2312.02186v1 [cs.CV])
    Models driven by spurious correlations often yield poor generalization performance. We propose the counterfactual alignment method to detect and explore spurious correlations of black box classifiers. Counterfactual images generated with respect to one classifier can be input into other classifiers to see if they also induce changes in the outputs of these classifiers. The relationship between these responses can be quantified and used to identify specific instances where a spurious correlation exists as well as compute aggregate statistics over a dataset. Our work demonstrates the ability to detect spurious correlations in face attribute classifiers. This is validated by observing intuitive trends in a face attribute classifier as well as fabricating spurious correlations and detecting their presence, both visually and quantitatively. Further, utilizing the CF alignment method, we demonstrate that we can rectify spurious correlations identified in classifiers.
    AdsorbRL: Deep Multi-Objective Reinforcement Learning for Inverse Catalysts Design. (arXiv:2312.02308v1 [cs.LG])
    A central challenge of the clean energy transition is the development of catalysts for low-emissions technologies. Recent advances in Machine Learning for quantum chemistry drastically accelerate the computation of catalytic activity descriptors such as adsorption energies. Here we introduce AdsorbRL, a Deep Reinforcement Learning agent aiming to identify potential catalysts given a multi-objective binding energy target, trained using offline learning on the Open Catalyst 2020 and Materials Project data sets. We experiment with Deep Q-Network agents to traverse the space of all ~160,000 possible unary, binary and ternary compounds of 55 chemical elements, with very sparse rewards based on adsorption energy known for only between 2,000 and 3,000 catalysts per adsorbate. To constrain the actions space, we introduce Random Edge Traversal and train a single-objective DQN agent on the known states subgraph, which we find strengthens target binding energy by an average of 4.1 eV. We extend this approach to multi-objective, goal-conditioned learning, and train a DQN agent to identify materials with the highest (respectively lowest) adsorption energies for multiple simultaneous target adsorbates. We experiment with Objective Sub-Sampling, a novel training scheme aimed at encouraging exploration in the multi-objective setup, and demonstrate simultaneous adsorption energy improvement across all target adsorbates, by an average of 0.8 eV. Overall, our results suggest strong potential for Deep Reinforcement Learning applied to the inverse catalysts design problem.
    Disentangling the Effects of Data Augmentation and Format Transform in Self-Supervised Learning of Image Representations. (arXiv:2312.02205v1 [cs.CV])
    Self-Supervised Learning (SSL) enables training performant models using limited labeled data. One of the pillars underlying vision SSL is the use of data augmentations/perturbations of the input which do not significantly alter its semantic content. For audio and other temporal signals, augmentations are commonly used alongside format transforms such as Fourier transforms or wavelet transforms. Unlike augmentations, format transforms do not change the information contained in the data; rather, they express the same information in different coordinates. In this paper, we study the effects of format transforms and augmentations both separately and together on vision SSL. We define augmentations in frequency space called Fourier Domain Augmentations (FDA) and show that training SSL models on a combination of these and image augmentations can improve the downstream classification accuracy by up to 1.3% on ImageNet-1K. We also show improvements against SSL baselines in few-shot and transfer learning setups using FDA. Surprisingly, we also observe that format transforms can improve the quality of learned representations even without augmentations; however, the combination of the two techniques yields better quality.
    Dissecting Medical Referral Mechanisms in Health Services: Role of Physician Professional Networks. (arXiv:2312.02387v1 [cs.CY])
    Medical referrals between primary care physicians (PC) and specialist care (SC) physicians profoundly impact patient care regarding quality, satisfaction, and cost. This paper investigates the influence of professional networks among medical doctors on referring patients from PC to SC. Using five-year consultation data from a Portuguese private health provider, we conducted exploratory data analysis and constructed both professional and referral networks among physicians. We then apply Graph Neural Network (GNN) models to learn latent representations of the referral network. Our analysis supports the hypothesis that doctors' professional social connections can predict medical referrals, potentially enhancing collaboration within organizations and improving healthcare services. This research contributes to dissecting the underlying mechanisms in primary-specialty referrals, thereby providing valuable insights for enhancing patient care and effective healthcare management.
    Toward autocorrection of chemical process flowsheets using large language models. (arXiv:2312.02873v1 [cs.LG])
    The process engineering domain widely uses Process Flow Diagrams (PFDs) and Process and Instrumentation Diagrams (P&IDs) to represent process flows and equipment configurations. However, the P&IDs and PFDs, hereafter called flowsheets, can contain errors causing safety hazards, inefficient operation, and unnecessary expenses. Correcting and verifying flowsheets is a tedious, manual process. We propose a novel generative AI methodology for automatically identifying errors in flowsheets and suggesting corrections to the user, i.e., autocorrecting flowsheets. Inspired by the breakthrough of Large Language Models (LLMs) for grammatical autocorrection of human language, we investigate LLMs for the autocorrection of flowsheets. The input to the model is a potentially erroneous flowsheet and the output of the model are suggestions for a corrected flowsheet. We train our autocorrection model on a synthetic dataset in a supervised manner. The model achieves a top-1 accuracy of 80% and a top-5 accuracy of 84% on an independent test dataset of synthetically generated flowsheets. The results suggest that the model can learn to autocorrect the synthetic flowsheets. We envision that flowsheet autocorrection will become a useful tool for chemical engineers.
    RL-Based Cargo-UAV Trajectory Planning and Cell Association for Minimum Handoffs, Disconnectivity, and Energy Consumption. (arXiv:2312.02478v1 [eess.SY])
    Unmanned aerial vehicle (UAV) is a promising technology for last-mile cargo delivery. However, the limited on-board battery capacity, cellular unreliability, and frequent handoffs in the airspace are the main obstacles to unleash its full potential. Given that existing cellular networks were primarily designed to service ground users, re-utilizing the same architecture for highly mobile aerial users, e.g., cargo-UAVs, is deemed challenging. Indeed, to ensure a safe delivery using cargo-UAVs, it is crucial to utilize the available energy efficiently, while guaranteeing reliable connectivity for command-and-control and avoiding frequent handoff. To achieve this goal, we propose a novel approach for joint cargo-UAV trajectory planning and cell association. Specifically, we formulate the cargo-UAV mission as a multi-objective problem aiming to 1) minimize energy consumption, 2) reduce handoff events, and 3) guarantee cellular reliability along the trajectory. We leverage reinforcement learning (RL) to jointly optimize the cargo-UAV's trajectory and cell association. Simulation results demonstrate a performance improvement of our proposed method, in terms of handoffs, disconnectivity, and energy consumption, compared to benchmarks.
    ASPEN: High-Throughput LoRA Fine-Tuning of Large Language Models with a Single GPU. (arXiv:2312.02515v1 [cs.LG])
    Transformer-based large language models (LLMs) have demonstrated outstanding performance across diverse domains, particularly when fine-turned for specific domains. Recent studies suggest that the resources required for fine-tuning LLMs can be economized through parameter-efficient methods such as Low-Rank Adaptation (LoRA). While LoRA effectively reduces computational burdens and resource demands, it currently supports only a single-job fine-tuning setup. In this paper, we present ASPEN, a high-throughput framework for fine-tuning LLMs. ASPEN efficiently trains multiple jobs on a single GPU using the LoRA method, leveraging shared pre-trained model and adaptive scheduling. ASPEN is compatible with transformer-based language models like LLaMA and ChatGLM, etc. Experiments show that ASPEN saves 53% of GPU memory when training multiple LLaMA-7B models on NVIDIA A100 80GB GPU and boosts training throughput by about 17% compared to existing methods when training with various pre-trained models on different GPUs. The adaptive scheduling algorithm reduces turnaround time by 24%, end-to-end training latency by 12%, prioritizing jobs and preventing out-of-memory issues.
    A Q-learning approach to the continuous control problem of robot inverted pendulum balancing. (arXiv:2312.02649v1 [cs.RO])
    This study evaluates the application of a discrete action space reinforcement learning method (Q-learning) to the continuous control problem of robot inverted pendulum balancing. To speed up the learning process and to overcome technical difficulties related to the direct learning on the real robotic system, the learning phase is performed in simulation environment. A mathematical model of the system dynamics is implemented, deduced by curve fitting on data acquired from the real system. The proposed approach demonstrated feasible, featuring its application on a real world robot that learned to balance an inverted pendulum. This study also reinforces and demonstrates the importance of an accurate representation of the physical world in simulation to achieve a more efficient implementation of reinforcement learning algorithms in real world, even when using a discrete action space algorithm to control a continuous action.
    Generator Born from Classifier. (arXiv:2312.02470v1 [cs.LG])
    In this paper, we make a bold attempt toward an ambitious task: given a pre-trained classifier, we aim to reconstruct an image generator, without relying on any data samples. From a black-box perspective, this challenge seems intractable, since it inevitably involves identifying the inverse function for a classifier, which is, by nature, an information extraction process. As such, we resort to leveraging the knowledge encapsulated within the parameters of the neural network. Grounded on the theory of Maximum-Margin Bias of gradient descent, we propose a novel learning paradigm, in which the generator is trained to ensure that the convergence conditions of the network parameters are satisfied over the generated distribution of the samples. Empirical validation from various image generation tasks substantiates the efficacy of our strategy.
    QuantAttack: Exploiting Dynamic Quantization to Attack Vision Transformers. (arXiv:2312.02220v1 [cs.CV])
    In recent years, there has been a significant trend in deep neural networks (DNNs), particularly transformer-based models, of developing ever-larger and more capable models. While they demonstrate state-of-the-art performance, their growing scale requires increased computational resources (e.g., GPUs with greater memory capacity). To address this problem, quantization techniques (i.e., low-bit-precision representation and matrix multiplication) have been proposed. Most quantization techniques employ a static strategy in which the model parameters are quantized, either during training or inference, without considering the test-time sample. In contrast, dynamic quantization techniques, which have become increasingly popular, adapt during inference based on the input provided, while maintaining full-precision performance. However, their dynamic behavior and average-case performance assumption makes them vulnerable to a novel threat vector -- adversarial attacks that target the model's efficiency and availability. In this paper, we present QuantAttack, a novel attack that targets the availability of quantized models, slowing down the inference, and increasing memory usage and energy consumption. We show that carefully crafted adversarial examples, which are designed to exhaust the resources of the operating system, can trigger worst-case performance. In our experiments, we demonstrate the effectiveness of our attack on vision transformers on a wide range of tasks, both uni-modal and multi-modal. We also examine the effect of different attack variants (e.g., a universal perturbation) and the transferability between different models.
    CityTFT: Temporal Fusion Transformer for Urban Building Energy Modeling. (arXiv:2312.02375v1 [stat.ML])
    Urban Building Energy Modeling (UBEM) is an emerging method to investigate urban design and energy systems against the increasing energy demand at urban and neighborhood levels. However, current UBEM methods are mostly physic-based and time-consuming in multiple climate change scenarios. This work proposes CityTFT, a data-driven UBEM framework, to accurately model the energy demands in urban environments. With the empowerment of the underlying TFT framework and an augmented loss function, CityTFT could predict heating and cooling triggers in unseen climate dynamics with an F1 score of 99.98 \% while RMSE of loads of 13.57 kWh.
    On the Trade-Off between Stability and Representational Capacity in Graph Neural Networks. (arXiv:2312.02372v1 [eess.SP])
    Analyzing the stability of graph neural networks (GNNs) under topological perturbations is key to understanding their transferability and the role of each architecture component. However, stability has been investigated only for particular architectures, questioning whether it holds for a broader spectrum of GNNs or only for a few instances. To answer this question, we study the stability of EdgeNet: a general GNN framework that unifies more than twenty solutions including the convolutional and attention-based classes, as well as graph isomorphism networks and hybrid architectures. We prove that all GNNs within the EdgeNet framework are stable to topological perturbations. By studying the effect of different EdgeNet categories on the stability, we show that GNNs with fewer degrees of freedom in their parameter space, linked to a lower representational capacity, are more stable. The key factor yielding this trade-off is the eigenvector misalignment between the EdgeNet parameter matrices and the graph shift operator. For example, graph convolutional neural networks that assign a single scalar per signal shift (hence, with a perfect alignment) are more stable than the more involved node or edge-varying counterparts. Extensive numerical results corroborate our theoretical findings and highlight the role of different architecture components in the trade-off.
    Robust Clustering using Hyperdimensional Computing. (arXiv:2312.02407v1 [cs.LG])
    This paper addresses the clustering of data in the hyperdimensional computing (HDC) domain. In prior work, an HDC-based clustering framework, referred to as HDCluster, has been proposed. However, the performance of the existing HDCluster is not robust. The performance of HDCluster is degraded as the hypervectors for the clusters are chosen at random during the initialization step. To overcome this bottleneck, we assign the initial cluster hypervectors by exploring the similarity of the encoded data, referred to as \textit{query} hypervectors. Intra-cluster hypervectors have a higher similarity than inter-cluster hypervectors. Harnessing the similarity results among query hypervectors, this paper proposes four HDC-based clustering algorithms: similarity-based k-means, equal bin-width histogram, equal bin-height histogram, and similarity-based affinity propagation. Experimental results illustrate that: (i) Compared to the existing HDCluster, our proposed HDC-based clustering algorithms can achieve better accuracy, more robust performance, fewer iterations, and less execution time. Similarity-based affinity propagation outperforms the other three HDC-based clustering algorithms on eight datasets by 2~38% in clustering accuracy. (ii) Even for one-pass clustering, i.e., without any iterative update of the cluster hypervectors, our proposed algorithms can provide more robust clustering accuracy than HDCluster. (iii) Over eight datasets, five out of eight can achieve higher or comparable accuracy when projected onto the hyperdimensional space. Traditional clustering is more desirable than HDC when the number of clusters, $k$, is large.
    STEREOFOG -- Computational DeFogging via Image-to-Image Translation on a real-world Dataset. (arXiv:2312.02344v1 [cs.CV])
    Image-to-Image translation (I2I) is a subtype of Machine Learning (ML) that has tremendous potential in applications where two domains of images and the need for translation between the two exist, such as the removal of fog. For example, this could be useful for autonomous vehicles, which currently struggle with adverse weather conditions like fog. However, datasets for I2I tasks are not abundant and typically hard to acquire. Here, we introduce STEREOFOG, a dataset comprised of $10,067$ paired fogged and clear images, captured using a custom-built device, with the purpose of exploring I2I's potential in this domain. It is the only real-world dataset of this kind to the best of our knowledge. Furthermore, we apply and optimize the pix2pix I2I ML framework to this dataset. With the final model achieving an average Complex Wavelet-Structural Similarity (CW-SSIM) score of $0.76$, we prove the technique's suitability for the problem.
    RINAS: Training with Dataset Shuffling Can Be General and Fast. (arXiv:2312.02368v1 [cs.DB])
    Deep learning datasets are expanding at an unprecedented pace, creating new challenges for data processing in model training pipelines. A crucial aspect of these pipelines is dataset shuffling, which significantly improves unbiased learning and convergence accuracy by adhering to the principles of random sampling. However, loading shuffled data for large datasets incurs significant overhead in the deep learning pipeline and severely impacts the end-to-end training throughput. To mitigate this, current deep learning systems often resort to partial dataset shuffling, sacrificing global randomness to maintain acceptable training throughput on large datasets, still leaving global shuffling efficiency issues not fully explored. In this work, we present RINAS, a data loading framework that systematically addresses the performance bottleneck of loading global shuffled datasets. Our key contribution is to offer an intra-batch unordered data fetching approach, which unleashes unexplored parallelism of data loading. We implement RINAS under the PyTorch framework for common dataset libraries HuggingFace and TorchVision. Our experimental results show that RINAS improves the throughput of general language model training and vision model training by up to 59% and 89%, respectively.
    Dimensionality Reduction and Dynamical Mode Recognition of Circular Arrays of Flame Oscillators Using Deep Neural Network. (arXiv:2312.02462v1 [cs.LG])
    Oscillatory combustion in aero engines and modern gas turbines often has significant adverse effects on their operation, and accurately recognizing various oscillation modes is the prerequisite for understanding and controlling combustion instability. However, the high-dimensional spatial-temporal data of a complex combustion system typically poses considerable challenges to the dynamical mode recognition. Based on a two-layer bidirectional long short-term memory variational autoencoder (Bi-LSTM-VAE) dimensionality reduction model and a two-dimensional Wasserstein distance-based classifier (WDC), this study proposes a promising method (Bi-LSTM-VAE-WDC) for recognizing dynamical modes in oscillatory combustion systems. Specifically, the Bi-LSTM-VAE dimension reduction model was introduced to reduce the high-dimensional spatial-temporal data of the combustion system to a low-dimensional phase space; Gaussian kernel density estimates (GKDE) were computed based on the distribution of phase points in a grid; two-dimensional WD values were calculated from the GKDE maps to recognize the oscillation modes. The time-series data used in this study were obtained from numerical simulations of circular arrays of laminar flame oscillators. The results show that the novel Bi-LSTM-VAE method can produce a non-overlapping distribution of phase points, indicating an effective unsupervised mode recognition and classification. Furthermore, the present method exhibits a more prominent performance than VAE and PCA (principal component analysis) for distinguishing dynamical modes in complex flame systems, implying its potential in studying turbulent combustion.
    Diversify, Don't Fine-Tune: Scaling Up Visual Recognition Training with Synthetic Images. (arXiv:2312.02253v1 [cs.CV])
    Recent advances in generative deep learning have enabled the creation of high-quality synthetic images in text-to-image generation. Prior work shows that fine-tuning a pretrained diffusion model on ImageNet and generating synthetic training images from the finetuned model can enhance an ImageNet classifier's performance. However, performance degrades as synthetic images outnumber real ones. In this paper, we explore whether generative fine-tuning is essential for this improvement and whether it is possible to further scale up training using more synthetic data. We present a new framework leveraging off-the-shelf generative models to generate synthetic training images, addressing multiple challenges: class name ambiguity, lack of diversity in naive prompts, and domain shifts. Specifically, we leverage large language models (LLMs) and CLIP to resolve class name ambiguity. To diversify images, we propose contextualized diversification (CD) and stylized diversification (SD) methods, also prompted by LLMs. Finally, to mitigate domain shifts, we leverage domain adaptation techniques with auxiliary batch normalization for synthetic images. Our framework consistently enhances recognition model performance with more synthetic data, up to 6x of original ImageNet size showcasing the potential of synthetic data for improved recognition models and strong out-of-domain generalization.
    Diffusion-Based Speech Enhancement in Matched and Mismatched Conditions Using a Heun-Based Sampler. (arXiv:2312.02683v1 [eess.AS])
    Diffusion models are a new class of generative models that have recently been applied to speech enhancement successfully. Previous works have demonstrated their superior performance in mismatched conditions compared to state-of-the art discriminative models. However, this was investigated with a single database for training and another one for testing, which makes the results highly dependent on the particular databases. Moreover, recent developments from the image generation literature remain largely unexplored for speech enhancement. These include several design aspects of diffusion models, such as the noise schedule or the reverse sampler. In this work, we systematically assess the generalization performance of a diffusion-based speech enhancement model by using multiple speech, noise and binaural room impulse response (BRIR) databases to simulate mismatched acoustic conditions. We also experiment with a noise schedule and a sampler that have not been applied to speech enhancement before. We show that the proposed system substantially benefits from using multiple databases for training, and achieves superior performance compared to state-of-the-art discriminative models in both matched and mismatched conditions. We also show that a Heun-based sampler achieves superior performance at a smaller computational cost compared to a sampler commonly used for speech enhancement.
    Uncertainty Quantification in Multivariable Regression for Material Property Prediction with Bayesian Neural Networks. (arXiv:2311.02495v2 [cs.LG] UPDATED)
    With the increased use of data-driven approaches and machine learning-based methods in material science, the importance of reliable uncertainty quantification (UQ) of the predicted variables for informed decision-making cannot be overstated. UQ in material property prediction poses unique challenges, including the multi-scale and multi-physics nature of advanced materials, intricate interactions between numerous factors, limited availability of large curated datasets for model training, etc. Recently, Bayesian Neural Networks (BNNs) have emerged as a promising approach for UQ, offering a probabilistic framework for capturing uncertainties within neural networks. In this work, we introduce an approach for UQ within physics-informed BNNs, which integrates knowledge from governing laws in material modeling to guide the models toward physically consistent predictions. To evaluate the effectiveness of this approach, we present case studies for predicting the creep rupture life of steel alloys. Experimental validation with three datasets of collected measurements from creep tests demonstrates the ability of BNNs to produce accurate point and uncertainty estimates that are competitive or exceed the performance of the conventional method of Gaussian Process Regression. Similarly, we evaluated the suitability of BNNs for UQ in an active learning application and reported competitive performance. The most promising framework for creep life prediction is BNNs based on Markov Chain Monte Carlo approximation of the posterior distribution of network parameters, as it provided more reliable results in comparison to BNNs based on variational inference approximation or related NNs with probabilistic outputs. The codes are available at: https://github.com/avakanski/Creep-uncertainty-quantification.
    Cotton Yield Prediction Using Random Forest. (arXiv:2312.02299v1 [cs.LG])
    The cotton industry in the United States is committed to sustainable production practices that minimize water, land, and energy use while improving soil health and cotton output. Climate-smart agricultural technologies are being developed to boost yields while decreasing operating expenses. Crop yield prediction, on the other hand, is difficult because of the complex and nonlinear impacts of cultivar, soil type, management, pest and disease, climate, and weather patterns on crops. To solve this issue, we employ machine learning (ML) to forecast production while considering climate change, soil diversity, cultivar, and inorganic nitrogen levels. From the 1980s to the 1990s, field data were gathered across the southern cotton belt of the United States. To capture the most current effects of climate change over the previous six years, a second data source was produced using the process-based crop model, GOSSYM. We concentrated our efforts on three distinct areas inside each of the three southern states: Texas, Mississippi, and Georgia. To simplify the amount of computations, accumulated heat units (AHU) for each set of experimental data were employed as an analogy to use time-series weather data. The Random Forest Regressor yielded a 97.75% accuracy rate, with a root mean square error of 55.05 kg/ha and an R2 of around 0.98. These findings demonstrate how an ML technique may be developed and applied as a reliable and easy-to-use model to support the cotton climate-smart initiative.
    Algorithms for mean-field variational inference via polyhedral optimization in the Wasserstein space. (arXiv:2312.02849v1 [math.ST])
    We develop a theory of finite-dimensional polyhedral subsets over the Wasserstein space and optimization of functionals over them via first-order methods. Our main application is to the problem of mean-field variational inference, which seeks to approximate a distribution $\pi$ over $\mathbb{R}^d$ by a product measure $\pi^\star$. When $\pi$ is strongly log-concave and log-smooth, we provide (1) approximation rates certifying that $\pi^\star$ is close to the minimizer $\pi^\star_\diamond$ of the KL divergence over a \emph{polyhedral} set $\mathcal{P}_\diamond$, and (2) an algorithm for minimizing $\text{KL}(\cdot\|\pi)$ over $\mathcal{P}_\diamond$ with accelerated complexity $O(\sqrt \kappa \log(\kappa d/\varepsilon^2))$, where $\kappa$ is the condition number of $\pi$.
    Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models. (arXiv:2310.04406v2 [cs.AI] UPDATED)
    While large language models (LLMs) have demonstrated impressive performance on a range of decision-making tasks, they rely on simple acting processes and fall short of broad deployment as autonomous agents. We introduce LATS (Language Agent Tree Search), a general framework that synergizes the capabilities of LLMs in planning, acting, and reasoning. Drawing inspiration from Monte Carlo tree search in model-based reinforcement learning, LATS employs LLMs as agents, value functions, and optimizers, repurposing their latent strengths for enhanced decision-making. What is crucial in this method is the use of an environment for external feedback, which offers a more deliberate and adaptive problem-solving mechanism that moves beyond the limitations of existing techniques. Our experimental evaluation across diverse domains, such as programming, HotPotQA, and WebShop, illustrates the applicability of LATS for both reasoning and acting. In particular, LATS achieves 94.4% for programming on HumanEval with GPT-4 and an average score of 75.9 for web browsing on WebShop with GPT-3.5, demonstrating the effectiveness and generality of our method.
    ALEXR: Optimal Single-Loop Algorithms for Convex Finite-Sum Coupled Compositional Stochastic Optimization. (arXiv:2312.02277v1 [math.OC])
    This paper revisits a class of convex Finite-Sum Coupled Compositional Stochastic Optimization (cFCCO) problems with many applications, including group distributionally robust optimization (GDRO), reinforcement learning, and learning to rank. To better solve these problems, we introduce a unified family of efficient single-loop primal-dual block-coordinate proximal algorithms, dubbed ALEXR. This algorithm leverages block-coordinate stochastic mirror ascent updates for the dual variable and stochastic proximal gradient descent updates for the primal variable. We establish the convergence rates of ALEXR in both convex and strongly convex cases under smoothness and non-smoothness conditions of involved functions, which not only improve the best rates in previous works on smooth cFCCO problems but also expand the realm of cFCCO for solving more challenging non-smooth problems such as the dual form of GDRO. Finally, we present lower complexity bounds to demonstrate that the convergence rates of ALEXR are optimal among first-order block-coordinate stochastic algorithms for the considered class of cFCCO problems.
    Continual Learning with Distributed Optimization: Does CoCoA Forget?. (arXiv:2211.16994v4 [stat.ML] UPDATED)
    We focus on the continual learning problem where the tasks arrive sequentially and the aim is to perform well on the newly arrived task without performance degradation on the previously seen tasks. In contrast to the continual learning literature focusing on the centralized setting, we investigate the distributed estimation framework. We consider the well-established distributed learning algorithm COCOA. We derive closed form expressions for the iterations for the overparametrized case. We illustrate the convergence and the error performance of the algorithm based on the over/under-parameterization of the problem. Our results show that depending on the problem dimensions and data generation assumptions, COCOA can perform continual learning over a sequence of tasks, i.e., it can learn a new task without forgetting previously learned tasks, with access only to one task at a time.
    Innovations in Agricultural Forecasting: A Multivariate Regression Study on Global Crop Yield Prediction. (arXiv:2312.02254v1 [cs.LG])
    The prediction of crop yields internationally is a crucial objective in agricultural research. Thus, this study implements 6 regression models (Linear, Tree, Gradient Descent, Gradient Boosting, K- Nearest Neighbors, and Random Forest) to predict crop yields in 196 countries. Given 4 key training parameters, pesticides (tonnes), rainfall (mm), temperature (Celsius), and yield (hg/ha), it was found that our Random Forest Regression model achieved a determination coefficient (r^2) of 0.94, with a margin of error (ME) of .03. The models were trained and tested using the Food and Agricultural Organization of the United Nations data, along with the World Bank Climate Change Data Catalog. Furthermore, each parameter was analyzed to understand how varying factors could impact overall yield. We used unconventional models, contrary to generally used Deep Learning (DL) and Machine Learning (ML) models, combined with recently collected data to implement a unique approach in our research. Existing scholarship would benefit from understanding the most optimal model for agricultural research, specifically using the United Nations data.
    Efficient Online Data Mixing For Language Model Pre-Training. (arXiv:2312.02406v1 [cs.CL])
    The data used to pretrain large language models has a decisive impact on a model's downstream performance, which has led to a large body of work on data selection methods that aim to automatically determine the most suitable data to use for pretraining. Existing data selection methods suffer from slow and computationally expensive processes, a problem amplified by the increasing size of models and of pretraining datasets. Data mixing, on the other hand, reduces the complexity of data selection by grouping data points together and determining sampling probabilities across entire groups. However, data mixing proportions are typically fixed before training and therefore cannot adapt to changing training dynamics. To address these limitations, we develop an efficient algorithm for Online Data Mixing (ODM) that combines elements from both data selection and data mixing. Based on multi-armed bandit algorithms, our online approach optimizes the data mixing proportions during training. Remarkably, our method trains a model that reaches the final perplexity of the next best method with 19\% fewer training iterations, and improves performance on the 5-shot MMLU benchmark by 1.9% relative accuracy, while adding negligible wall-clock time during pretraining.
    Low-Precision Mixed-Computation Models for Inference on Edge. (arXiv:2312.02210v1 [cs.LG])
    This paper presents a mixed-computation neural network processing approach for edge applications that incorporates low-precision (low-width) Posit and low-precision fixed point (FixP) number systems. This mixed-computation approach employs 4-bit Posit (Posit4), which has higher precision around zero, for representing weights with high sensitivity, while it uses 4-bit FixP (FixP4) for representing other weights. A heuristic for analyzing the importance and the quantization error of the weights is presented to assign the proper number system to different weights. Additionally, a gradient approximation for Posit representation is introduced to improve the quality of weight updates in the backpropagation process. Due to the high energy consumption of the fully Posit-based computations, neural network operations are carried out in FixP or Posit/FixP. An efficient hardware implementation of a MAC operation with a first Posit operand and FixP for a second operand and accumulator is presented. The efficacy of the proposed low-precision mixed-computation approach is extensively assessed on vision and language models. The results show that, on average, the accuracy of the mixed-computation is about 1.5% higher than that of FixP with a cost of 0.19% energy overhead.
    Do AI models produce better weather forecasts than physics-based models? A quantitative evaluation case study of Storm Ciar\'an. (arXiv:2312.02658v1 [cs.LG])
    There has been huge recent interest in the potential of making operational weather forecasts using machine learning techniques. As they become a part of the weather forecasting toolbox, there is a pressing need to understand how well current machine learning models can simulate high-impactweather events. We compare forecasts of Storm Ciar\'an, a European windstorm that caused sixteen deaths and extensive damage in Northern Europe, made by machine learning and numericalweather prediction models. The four machine learning models considered (FourCastNet, Pangu-Weather, GraphCast and FourCastNet-v2) produce forecasts that accurately capture the synoptic-scale structure of the cyclone including the position of the cloud head, shape of the warm sector and location of warm conveyor belt jet, and the large-scale dynamical drivers important for the rapid storm development such as the position of the storm relative to the upper-level jet exit. However, their ability to resolve the more detailed structures important for issuing weather warnings is more mixed. All of the machine learning models underestimate the peak amplitude of winds associated with the storm, only some machine learning models resolve the warm core seclusion and none of the machine learning models capture the sharp bent-back warm frontal gradient. Our study shows there is a great deal about the performance and properties of machine learning weather forecasts that can be derived from case studies of high-impact weather events such as Storm Ciar\'an.
    Scaling Laws for Adversarial Attacks on Language Model Activations. (arXiv:2312.02780v1 [cs.LG])
    We explore a class of adversarial attacks targeting the activations of language models. By manipulating a relatively small subset of model activations, $a$, we demonstrate the ability to control the exact prediction of a significant number (in some cases up to 1000) of subsequent tokens $t$. We empirically verify a scaling law where the maximum number of target tokens $t_\mathrm{max}$ predicted depends linearly on the number of tokens $a$ whose activations the attacker controls as $t_\mathrm{max} = \kappa a$. We find that the number of bits of control in the input space needed to control a single bit in the output space (what we call attack resistance $\chi$) is remarkably constant between $\approx 16$ and $\approx 25$ over 2 orders of magnitude of model sizes for different language models. Compared to attacks on tokens, attacks on activations are predictably much stronger, however, we identify a surprising regularity where one bit of input steered either via activations or via tokens is able to exert control over a similar amount of output bits. This gives support for the hypothesis that adversarial attacks are a consequence of dimensionality mismatch between the input and output spaces. A practical implication of the ease of attacking language model activations instead of tokens is for multi-modal and selected retrieval models, where additional data sources are added as activations directly, sidestepping the tokenized input. This opens up a new, broad attack surface. By using language models as a controllable test-bed to study adversarial attacks, we were able to experiment with input-output dimensions that are inaccessible in computer vision, especially where the output dimension dominates.
    Hierarchical ML Codebook Design for Extreme MIMO Beam Management. (arXiv:2312.02178v1 [eess.SP])
    Beam management is a strategy to unify beamforming and channel state information (CSI) acquisition with large antenna arrays in 5G. Codebooks serve multiple uses in beam management including beamforming reference signals, CSI reporting, and analog beam training. In this paper, we propose and evaluate a machine learning-refined codebook design process for extremely large multiple-input multiple-output (X-MIMO) systems. We propose a neural network and beam selection strategy to design the initial access and refinement codebooks using end-to-end learning from beamspace representations. The algorithm, called Extreme-Beam Management (X-BM), can significantly improve the performance of extremely large arrays as envisioned for 6G and capture realistic wireless and physical layer aspects. Our results show an 8dB improvement in initial access and overall effective spectral efficiency improvements compared to traditional codebook methods.
    Towards Fast and Stable Federated Learning: Confronting Heterogeneity via Knowledge Anchor. (arXiv:2312.02416v1 [cs.LG])
    Federated learning encounters a critical challenge of data heterogeneity, adversely affecting the performance and convergence of the federated model. Various approaches have been proposed to address this issue, yet their effectiveness is still limited. Recent studies have revealed that the federated model suffers severe forgetting in local training, leading to global forgetting and performance degradation. Although the analysis provides valuable insights, a comprehensive understanding of the vulnerable classes and their impact factors is yet to be established. In this paper, we aim to bridge this gap by systematically analyzing the forgetting degree of each class during local training across different communication rounds. Our observations are: (1) Both missing and non-dominant classes suffer similar severe forgetting during local training, while dominant classes show improvement in performance. (2) When dynamically reducing the sample size of a dominant class, catastrophic forgetting occurs abruptly when the proportion of its samples is below a certain threshold, indicating that the local model struggles to leverage a few samples of a specific class effectively to prevent forgetting. Motivated by these findings, we propose a novel and straightforward algorithm called Federated Knowledge Anchor (FedKA). Assuming that all clients have a single shared sample for each class, the knowledge anchor is constructed before each local training stage by extracting shared samples for missing classes and randomly selecting one sample per class for non-dominant classes. The knowledge anchor is then utilized to correct the gradient of each mini-batch towards the direction of preserving the knowledge of the missing and non-dominant classes. Extensive experimental results demonstrate that our proposed FedKA achieves fast and stable convergence, significantly improving accuracy on popular benchmarks.
    Real-Time Surface-to-Air Missile Engagement Zone Prediction Using Simulation and Machine Learning. (arXiv:2311.11905v2 [cs.LG] UPDATED)
    Surface-to-Air Missiles (SAMs) are crucial in modern air defense systems. A critical aspect of their effectiveness is the Engagement Zone (EZ), the spatial region within which a SAM can effectively engage and neutralize a target. Notably, the EZ is intrinsically related to the missile's maximum range; it defines the furthest distance at which a missile can intercept a target. The accurate computation of this EZ is essential but challenging due to the dynamic and complex factors involved, which often lead to high computational costs and extended processing times when using conventional simulation methods. In light of these challenges, our study investigates the potential of machine learning techniques, proposing an approach that integrates machine learning with a custom-designed simulation tool to train supervised algorithms. We leverage a comprehensive dataset of pre-computed SAM EZ simulations, enabling our model to accurately predict the SAM EZ for new input parameters. It accelerates SAM EZ simulations, enhances air defense strategic planning, and provides real-time insights, improving SAM system performance. The study also includes a comparative analysis of machine learning algorithms, illuminating their capabilities and performance metrics and suggesting areas for future research, highlighting the transformative potential of machine learning in SAM EZ simulations.
    Towards Causal Representations of Climate Model Data. (arXiv:2312.02858v1 [cs.LG])
    Climate models, such as Earth system models (ESMs), are crucial for simulating future climate change based on projected Shared Socioeconomic Pathways (SSP) greenhouse gas emissions scenarios. While ESMs are sophisticated and invaluable, machine learning-based emulators trained on existing simulation data can project additional climate scenarios much faster and are computationally efficient. However, they often lack generalizability and interpretability. This work delves into the potential of causal representation learning, specifically the \emph{Causal Discovery with Single-parent Decoding} (CDSD) method, which could render climate model emulation efficient \textit{and} interpretable. We evaluate CDSD on multiple climate datasets, focusing on emissions, temperature, and precipitation. Our findings shed light on the challenges, limitations, and promise of using CDSD as a stepping stone towards more interpretable and robust climate model emulation.
    Pushing the Limits of Pre-training for Time Series Forecasting in the CloudOps Domain. (arXiv:2310.05063v3 [cs.LG] UPDATED)
    Time series has been left behind in the era of pre-training and transfer learning. While research in the fields of natural language processing and computer vision are enjoying progressively larger datasets to train massive models, the most popular time series datasets consist of only tens of thousands of time steps, limiting our ability to study the effectiveness of pre-training and scaling. Recent studies have also cast doubt on the need for expressive models and scale. To alleviate these issues, we introduce three large-scale time series forecasting datasets from the cloud operations (CloudOps) domain, the largest having billions of observations, enabling further study into pre-training and scaling of time series models. We build the empirical groundwork for studying pre-training and scaling of time series models and pave the way for future research by identifying a promising candidate architecture. We show that it is a strong zero-shot baseline and benefits from further scaling, both in model and dataset size. Accompanying these datasets and results is a suite of comprehensive benchmark results comparing classical and deep learning baselines to our pre-trained method - achieving a 27% reduction in error on the largest dataset. Code and datasets can be found https://github.com/SalesforceAIResearch/pretrain-time-series-cloudops.
    Pitfall of Optimism: Distributional Reinforcement Learning by Randomizing Risk Criterion. (arXiv:2310.16546v3 [cs.LG] UPDATED)
    Distributional reinforcement learning algorithms have attempted to utilize estimated uncertainty for exploration, such as optimism in the face of uncertainty. However, using the estimated variance for optimistic exploration may cause biased data collection and hinder convergence or performance. In this paper, we present a novel distributional reinforcement learning algorithm that selects actions by randomizing risk criterion to avoid one-sided tendency on risk. We provide a perturbed distributional Bellman optimality operator by distorting the risk measure and prove the convergence and optimality of the proposed method with the weaker contraction property. Our theoretical results support that the proposed method does not fall into biased exploration and is guaranteed to converge to an optimal return. Finally, we empirically show that our method outperforms other existing distribution-based algorithms in various environments including Atari 55 games.
    Synthetic Data Generation Techniques for Developing AI-based Speech Assessments for Parkinson's Disease (A Comparative Study). (arXiv:2312.02229v1 [cs.SD])
    Changes in speech and language are among the first signs of Parkinson's disease (PD). Thus, clinicians have tried to identify individuals with PD from their voices for years. Doctors can leverage AI-based speech assessments to spot PD thanks to advancements in artificial intelligence (AI). Such AI systems can be developed using machine learning classifiers that have been trained using individuals' voices. Although several studies have shown reasonable results in developing such AI systems, these systems would need more data samples to achieve promising performance. This paper explores using deep learning-based data generation techniques on the accuracy of machine learning classifiers that are the core of such systems.
    On the Initialization of Graph Neural Networks. (arXiv:2312.02622v1 [cs.LG])
    Graph Neural Networks (GNNs) have displayed considerable promise in graph representation learning across various applications. The core learning process requires the initialization of model weight matrices within each GNN layer, which is typically accomplished via classic initialization methods such as Xavier initialization. However, these methods were originally motivated to stabilize the variance of hidden embeddings and gradients across layers of Feedforward Neural Networks (FNNs) and Convolutional Neural Networks (CNNs) to avoid vanishing gradients and maintain steady information flow. In contrast, within the GNN context classical initializations disregard the impact of the input graph structure and message passing on variance. In this paper, we analyze the variance of forward and backward propagation across GNN layers and show that the variance instability of GNN initializations comes from the combined effect of the activation function, hidden dimension, graph structure and message passing. To better account for these influence factors, we propose a new initialization method for Variance Instability Reduction within GNN Optimization (Virgo), which naturally tends to equate forward and backward variances across successive layers. We conduct comprehensive experiments on 15 datasets to show that Virgo can lead to superior model performance and more stable variance at initialization on node classification, link prediction and graph classification tasks. Codes are in https://github.com/LspongebobJH/virgo_icml2023.
    Revisiting Hidden Representations in Transfer Learning for Medical Imaging. (arXiv:2302.08272v3 [cs.CV] UPDATED)
    While a key component to the success of deep learning is the availability of massive amounts of training data, medical image datasets are often limited in diversity and size. Transfer learning has the potential to bridge the gap between related yet different domains. For medical applications, however, it remains unclear whether it is more beneficial to pre-train on natural or medical images. We aim to shed light on this problem by comparing initialization on ImageNet and RadImageNet on seven medical classification tasks. Our work includes a replication study, which yields results contrary to previously published findings. In our experiments, ResNet50 models pre-trained on ImageNet tend to outperform those trained on RadImageNet. To gain further insights, we investigate the learned representations using Canonical Correlation Analysis (CCA) and compare the predictions of the different models. Our results indicate that, contrary to intuition, ImageNet and RadImageNet may converge to distinct intermediate representations, which appear to diverge further during fine-tuning. Despite these distinct representations, the predictions of the models remain similar. Our findings show that the similarity between networks before and after fine-tuning does not correlate with performance gains, suggesting that the advantages of transfer learning might not solely originate from the reuse of features in the early layers of a convolutional neural network.
    The SVHN Dataset Is Deceptive for Probabilistic Generative Models Due to a Distribution Mismatch. (arXiv:2312.02168v1 [cs.CV])
    The Street View House Numbers (SVHN) dataset is a popular benchmark dataset in deep learning. Originally designed for digit classification tasks, the SVHN dataset has been widely used as a benchmark for various other tasks including generative modeling. However, with this work, we aim to warn the community about an issue of the SVHN dataset as a benchmark for generative modeling tasks: we discover that the official split into training set and test set of the SVHN dataset are not drawn from the same distribution. We empirically show that this distribution mismatch has little impact on the classification task (which may explain why this issue has not been detected before), but it severely affects the evaluation of probabilistic generative models, such as Variational Autoencoders and diffusion models. As a workaround, we propose to mix and re-split the official training and test set when SVHN is used for tasks other than classification. We publish a new split and the indices we used to create it at https://jzenn.github.io/svhn-remix/ .
    Robust Reinforcement Learning in Continuous Control Tasks with Uncertainty Set Regularization. (arXiv:2207.02016v4 [cs.LG] UPDATED)
    Reinforcement learning (RL) is recognized as lacking generalization and robustness under environmental perturbations, which excessively restricts its application for real-world robotics. Prior work claimed that adding regularization to the value function is equivalent to learning a robust policy with uncertain transitions. Although the regularization-robustness transformation is appealing for its simplicity and efficiency, it is still lacking in continuous control tasks. In this paper, we propose a new regularizer named $\textbf{U}$ncertainty $\textbf{S}$et $\textbf{R}$egularizer (USR), by formulating the uncertainty set on the parameter space of the transition function. In particular, USR is flexible enough to be plugged into any existing RL framework. To deal with unknown uncertainty sets, we further propose a novel adversarial approach to generate them based on the value function. We evaluate USR on the Real-world Reinforcement Learning (RWRL) benchmark, demonstrating improvements in the robust performance for perturbed testing environments.
    Controllable Music Production with Diffusion Models and Guidance Gradients. (arXiv:2311.00613v2 [cs.SD] UPDATED)
    We demonstrate how conditional generation from diffusion models can be used to tackle a variety of realistic tasks in the production of music in 44.1kHz stereo audio with sampling-time guidance. The scenarios we consider include continuation, inpainting and regeneration of musical audio, the creation of smooth transitions between two different music tracks, and the transfer of desired stylistic characteristics to existing audio clips. We achieve this by applying guidance at sampling time in a simple framework that supports both reconstruction and classification losses, or any combination of the two. This approach ensures that generated audio can match its surrounding context, or conform to a class distribution or latent representation specified relative to any suitable pre-trained classifier or embedding model. Audio samples are available at https://machinelearning.apple.com/research/controllable-music
    Panoptica -- instance-wise evaluation of 3D semantic and instance segmentation maps. (arXiv:2312.02608v1 [cs.CV])
    This paper introduces panoptica, a versatile and performance-optimized package designed for computing instance-wise segmentation quality metrics from 2D and 3D segmentation maps. panoptica addresses the limitations of existing metrics and provides a modular framework that complements the original intersection over union-based panoptic quality with other metrics, such as the distance metric Average Symmetric Surface Distance. The package is open-source, implemented in Python, and accompanied by comprehensive documentation and tutorials. panoptica employs a three-step metrics computation process to cover diverse use cases. The efficacy of panoptica is demonstrated on various real-world biomedical datasets, where an instance-wise evaluation is instrumental for an accurate representation of the underlying clinical task. Overall, we envision panoptica as a valuable tool facilitating in-depth evaluation of segmentation methods.
    Materials Expert-Artificial Intelligence for Materials Discovery. (arXiv:2312.02796v1 [cond-mat.mtrl-sci])
    The advent of material databases provides an unprecedented opportunity to uncover predictive descriptors for emergent material properties from vast data space. However, common reliance on high-throughput ab initio data necessarily inherits limitations of such data: mismatch with experiments. On the other hand, experimental decisions are often guided by an expert's intuition honed from experiences that are rarely articulated. We propose using machine learning to "bottle" such operational intuition into quantifiable descriptors using expertly curated measurement-based data. We introduce "Materials Expert-Artificial Intelligence" (ME-AI) to encapsulate and articulate this human intuition. As a first step towards such a program, we focus on the topological semimetal (TSM) among square-net materials as the property inspired by the expert-identified descriptor based on structural information: the tolerance factor. We start by curating a dataset encompassing 12 primary features of 879 square-net materials, using experimental data whenever possible. We then use Dirichlet-based Gaussian process regression using a specialized kernel to reveal composite descriptors for square-net topological semimetals. The ME-AI learned descriptors independently reproduce expert intuition and expand upon it. Specifically, new descriptors point to hypervalency as a critical chemical feature predicting TSM within square-net compounds. Our success with a carefully defined problem points to the "machine bottling human insight" approach as promising for machine learning-aided material discovery.
    (Provable) Adversarial Robustness for Group Equivariant Tasks: Graphs, Point Clouds, Molecules, and More. (arXiv:2312.02708v1 [cs.LG])
    A machine learning model is traditionally considered robust if its prediction remains (almost) constant under input perturbations with small norm. However, real-world tasks like molecular property prediction or point cloud segmentation have inherent equivariances, such as rotation or permutation equivariance. In such tasks, even perturbations with large norm do not necessarily change an input's semantic content. Furthermore, there are perturbations for which a model's prediction explicitly needs to change. For the first time, we propose a sound notion of adversarial robustness that accounts for task equivariance. We then demonstrate that provable robustness can be achieved by (1) choosing a model that matches the task's equivariances (2) certifying traditional adversarial robustness. Certification methods are, however, unavailable for many models, such as those with continuous equivariances. We close this gap by developing the framework of equivariance-preserving randomized smoothing, which enables architecture-agnostic certification. We additionally derive the first architecture-specific graph edit distance certificates, i.e. sound robustness guarantees for isomorphism equivariant tasks like node classification. Overall, a sound notion of robustness is an important prerequisite for future work at the intersection of robust and geometric machine learning.
    DeepPointMap: Advancing LiDAR SLAM with Unified Neural Descriptors. (arXiv:2312.02684v1 [cs.CV])
    Point clouds have shown significant potential in various domains, including Simultaneous Localization and Mapping (SLAM). However, existing approaches either rely on dense point clouds to achieve high localization accuracy or use generalized descriptors to reduce map size. Unfortunately, these two aspects seem to conflict with each other. To address this limitation, we propose a unified architecture, DeepPointMap, achieving excellent preference on both aspects. We utilize neural network to extract highly representative and sparse neural descriptors from point clouds, enabling memory-efficient map representation and accurate multi-scale localization tasks (e.g., odometry and loop-closure). Moreover, we showcase the versatility of our framework by extending it to more challenging multi-agent collaborative SLAM. The promising results obtained in these scenarios further emphasize the effectiveness and potential of our approach.
    On the Identifiability of Quantized Factors. (arXiv:2306.16334v2 [cs.LG] UPDATED)
    Disentanglement aims to recover meaningful latent ground-truth factors from the observed distribution solely, and is formalized through the theory of identifiability. The identifiability of independent latent factors is proven to be impossible in the unsupervised i.i.d. setting under a general nonlinear map from factors to observations. In this work, however, we demonstrate that it is possible to recover quantized latent factors under a generic nonlinear diffeomorphism. We only assume that the latent factors have independent discontinuities in their density, without requiring the factors to be statistically independent. We introduce this novel form of identifiability, termed quantized factor identifiability, and provide a comprehensive proof of the recovery of the quantized factors.
    UTBoost: A Tree-boosting based System for Uplift Modeling. (arXiv:2312.02573v1 [cs.LG])
    Uplift modeling refers to the set of machine learning techniques that a manager may use to estimate customer uplift, that is, the net effect of an action on some customer outcome. By identifying the subset of customers for whom a treatment will have the greatest effect, uplift models assist decision-makers in optimizing resource allocations and maximizing overall returns. Accurately estimating customer uplift poses practical challenges, as it requires assessing the difference between two mutually exclusive outcomes for each individual. In this paper, we propose two innovative adaptations of the well-established Gradient Boosting Decision Trees (GBDT) algorithm, which learn the causal effect in a sequential way and overcome the counter-factual nature. Both approaches innovate existing techniques in terms of ensemble learning method and learning objectives, respectively. Experiments on large-scale datasets demonstrate the usefulness of the proposed methods, which often yielding remarkable improvements over base models. To facilitate the application, we develop the UTBoost, an end-to-end tree boosting system specifically designed for uplift modeling. The package is open source and has been optimized for training speed to meet the needs of real industrial applications.
    Rethinking Adversarial Training with Neural Tangent Kernel. (arXiv:2312.02236v1 [cs.LG])
    Adversarial training (AT) is an important and attractive topic in deep learning security, exhibiting mysteries and odd properties. Recent studies of neural network training dynamics based on Neural Tangent Kernel (NTK) make it possible to reacquaint AT and deeply analyze its properties. In this paper, we perform an in-depth investigation of AT process and properties with NTK, such as NTK evolution. We uncover three new findings that are missed in previous works. First, we disclose the impact of data normalization on AT and the importance of unbiased estimators in batch normalization layers. Second, we experimentally explore the kernel dynamics and propose more time-saving AT methods. Third, we study the spectrum feature inside the kernel to address the catastrophic overfitting problem. To the best of our knowledge, it is the first work leveraging the observations of kernel dynamics to improve existing AT methods.
    What Machine Learning Can Do for Focusing Aerogel Detectors. (arXiv:2312.02652v1 [hep-ex])
    Particle identification at the Super Charm-Tau factory experiment will be provided by a Focusing Aerogel Ring Imaging CHerenkov detector (FARICH). The specifics of detector location make proper cooling difficult, therefore a significant number of ambient background hits are captured. They must be mitigated to reduce the data flow and improve particle velocity resolution. In this work we present several approaches to filtering signal hits, inspired by machine learning techniques from computer vision.
    End-to-End Meta-Bayesian Optimisation with Transformer Neural Processes. (arXiv:2305.15930v3 [cs.LG] UPDATED)
    Meta-Bayesian optimisation (meta-BO) aims to improve the sample efficiency of Bayesian optimisation by leveraging data from related tasks. While previous methods successfully meta-learn either a surrogate model or an acquisition function independently, joint training of both components remains an open challenge. This paper proposes the first end-to-end differentiable meta-BO framework that generalises neural processes to learn acquisition functions via transformer architectures. We enable this end-to-end framework with reinforcement learning (RL) to tackle the lack of labelled acquisition data. Early on, we notice that training transformer-based neural processes from scratch with RL is challenging due to insufficient supervision, especially when rewards are sparse. We formalise this claim with a combinatorial analysis showing that the widely used notion of regret as a reward signal exhibits a logarithmic sparsity pattern in trajectory lengths. To tackle this problem, we augment the RL objective with an auxiliary task that guides part of the architecture to learn a valid probabilistic model as an inductive bias. We demonstrate that our method achieves state-of-the-art regret results against various baselines in experiments on standard hyperparameter optimisation tasks and also outperforms others in the real-world problems of mixed-integer programming tuning, antibody design, and logic synthesis for electronic design automation.
    Quantitative Analysis of Primary Attribution Explainable Artificial Intelligence Methods for Remote Sensing Image Classification. (arXiv:2306.04037v2 [cs.LG] UPDATED)
    We present a comprehensive analysis of quantitatively evaluating explainable artificial intelligence (XAI) techniques for remote sensing image classification. Our approach leverages state-of-the-art machine learning approaches to perform remote sensing image classification across multiple modalities. We investigate the results of the models qualitatively through XAI methods. Additionally, we compare the XAI methods quantitatively through various categories of desired properties. Through our analysis, we offer insights and recommendations for selecting the most appropriate XAI method(s) to gain a deeper understanding of the models' decision-making processes. The code for this work is publicly available.
    Visual Encoders for Data-Efficient Imitation Learning in Modern Video Games. (arXiv:2312.02312v1 [cs.LG])
    Video games have served as useful benchmarks for the decision making community, but going beyond Atari games towards training agents in modern games has been prohibitively expensive for the vast majority of the research community. Recent progress in the research, development and open release of large vision models has the potential to amortize some of these costs across the community. However, it is currently unclear which of these models have learnt representations that retain information critical for sequential decision making. Towards enabling wider participation in the research of gameplaying agents in modern games, we present a systematic study of imitation learning with publicly available visual encoders compared to the typical, task-specific, end-to-end training approach in Minecraft, Minecraft Dungeons and Counter-Strike: Global Offensive.
    Channel-Feedback-Free Transmission for Downlink FD-RAN: A Radio Map based Complex-valued Precoding Network Approach. (arXiv:2312.02184v1 [cs.IT])
    As the demand for high-quality services proliferates, an innovative network architecture, the fully-decoupled RAN (FD-RAN), has emerged for more flexible spectrum resource utilization and lower network costs. However, with the decoupling of uplink base stations and downlink base stations in FD-RAN, the traditional transmission mechanism, which relies on real-time channel feedback, is not suitable as the receiver is not able to feedback accurate and timely channel state information to the transmitter. This paper proposes a novel transmission scheme without relying on physical layer channel feedback. Specifically, we design a radio map based complex-valued precoding network~(RMCPNet) model, which outputs the base station precoding based on user location. RMCPNet comprises multiple subnets, with each subnet responsible for extracting unique modal features from diverse input modalities. Furthermore, the multi-modal embeddings derived from these distinct subnets are integrated within the information fusion layer, culminating in a unified representation. We also develop a specific RMCPNet training algorithm that employs the negative spectral efficiency as the loss function. We evaluate the performance of the proposed scheme on the public DeepMIMO dataset and show that RMCPNet can achieve 16\% and 76\% performance improvements over the conventional real-valued neural network and statistical codebook approach, respectively.
    FRAPP\'E: A Post-Processing Framework for Group Fairness Regularization. (arXiv:2312.02592v1 [cs.LG])
    Post-processing mitigation techniques for group fairness generally adjust the decision threshold of a base model in order to improve fairness. Methods in this family exhibit several advantages that make them appealing in practice: post-processing requires no access to the model training pipeline, is agnostic to the base model architecture, and offers a reduced computation cost compared to in-processing. Despite these benefits, existing methods face other challenges that limit their applicability: they require knowledge of the sensitive attributes at inference time and are oftentimes outperformed by in-processing. In this paper, we propose a general framework to transform any in-processing method with a penalized objective into a post-processing procedure. The resulting method is specifically designed to overcome the aforementioned shortcomings of prior post-processing approaches. Furthermore, we show theoretically and through extensive experiments on real-world data that the resulting post-processing method matches or even surpasses the fairness-error trade-off offered by the in-processing counterpart.
    Learning Energy-based Model via Dual-MCMC Teaching. (arXiv:2312.02469v1 [cs.LG])
    This paper studies the fundamental learning problem of the energy-based model (EBM). Learning the EBM can be achieved using the maximum likelihood estimation (MLE), which typically involves the Markov Chain Monte Carlo (MCMC) sampling, such as the Langevin dynamics. However, the noise-initialized Langevin dynamics can be challenging in practice and hard to mix. This motivates the exploration of joint training with the generator model where the generator model serves as a complementary model to bypass MCMC sampling. However, such a method can be less accurate than the MCMC and result in biased EBM learning. While the generator can also serve as an initializer model for better MCMC sampling, its learning can be biased since it only matches the EBM and has no access to empirical training examples. Such biased generator learning may limit the potential of learning the EBM. To address this issue, we present a joint learning framework that interweaves the maximum likelihood learning algorithm for both the EBM and the complementary generator model. In particular, the generator model is learned by MLE to match both the EBM and the empirical data distribution, making it a more informative initializer for MCMC sampling of EBM. Learning generator with observed examples typically requires inference of the generator posterior. To ensure accurate and efficient inference, we adopt the MCMC posterior sampling and introduce a complementary inference model to initialize such latent MCMC sampling. We show that three separate models can be seamlessly integrated into our joint framework through two (dual-) MCMC teaching, enabling effective and efficient EBM learning.
    Exploring Weight Balancing on Long-Tailed Recognition Problem. (arXiv:2305.16573v5 [cs.LG] UPDATED)
    Recognition problems in long-tailed data, in which the sample size per class is heavily skewed, have gained importance because the distribution of the sample size per class in a dataset is generally exponential unless the sample size is intentionally adjusted. Various methods have been devised to address these problems. Recently, weight balancing, which combines well-known classical regularization techniques with two-stage training, has been proposed. Despite its simplicity, it is known for its high performance compared with existing methods devised in various ways. However, there is a lack of understanding as to why this method is effective for long-tailed data. In this study, we analyze weight balancing by focusing on neural collapse and the cone effect at each training stage and found that it can be decomposed into an increase in Fisher's discriminant ratio of the feature extractor caused by weight decay and cross entropy loss and implicit logit adjustment caused by weight decay and class-balanced loss. Our analysis enables the training method to be further simplified by reducing the number of training stages to one while increasing accuracy.
    Are Vision Transformers More Data Hungry Than Newborn Visual Systems?. (arXiv:2312.02843v1 [cs.CV])
    Vision transformers (ViTs) are top performing models on many computer vision benchmarks and can accurately predict human behavior on object recognition tasks. However, researchers question the value of using ViTs as models of biological learning because ViTs are thought to be more data hungry than brains, with ViTs requiring more training data to reach similar levels of performance. To test this assumption, we directly compared the learning abilities of ViTs and animals, by performing parallel controlled rearing experiments on ViTs and newborn chicks. We first raised chicks in impoverished visual environments containing a single object, then simulated the training data available in those environments by building virtual animal chambers in a video game engine. We recorded the first-person images acquired by agents moving through the virtual chambers and used those images to train self supervised ViTs that leverage time as a teaching signal, akin to biological visual systems. When ViTs were trained through the eyes of newborn chicks, the ViTs solved the same view invariant object recognition tasks as the chicks. Thus, ViTs were not more data hungry than newborn visual systems: both learned view invariant object representations in impoverished visual environments. The flexible and generic attention based learning mechanism in ViTs combined with the embodied data streams available to newborn animals appears sufficient to drive the development of animal-like object recognition.
    Deep Learning-Driven Enhancement of Welding Quality Control: Predicting Welding Depth and Pore Volume in Hairpin Welding. (arXiv:2312.01606v2 [cs.LG] UPDATED)
    To advance quality assurance in the welding process, this study presents a robust deep learning model that enables the prediction of two critical welds Key Performance Characteristics (KPCs): welding depth and average pore volume. In the proposed approach, a comprehensive range of laser welding Key Input Characteristics (KICs) is utilized, including welding beam geometries, welding feed rates, path repetitions for weld beam geometries, and bright light weld ratios for all paths, all of which were obtained from hairpin welding experiments. Two deep learning networks are employed with multiple hidden dense layers and linear activation functions to showcase the capabilities of deep neural networks in capturing the intricate nonlinear connections inherent within welding KPCs and KICs. Applying deep learning networks to the small numerical experimental hairpin welding dataset has shown promising results, achieving Mean Absolute Error (MAE) values as low as 0.1079 for predicting welding depth and 0.0641 for average pore volume. Additionally, the validity verification demonstrates the reliability of the proposed method. This, in turn, promises significant advantages in controlling welding outcomes, moving beyond the current trend of relying merely on monitoring for defect classification.
    Expressive Sign Equivariant Networks for Spectral Geometric Learning. (arXiv:2312.02339v1 [cs.LG])
    Recent work has shown the utility of developing machine learning models that respect the structure and symmetries of eigenvectors. These works promote sign invariance, since for any eigenvector v the negation -v is also an eigenvector. However, we show that sign invariance is theoretically limited for tasks such as building orthogonally equivariant models and learning node positional encodings for link prediction in graphs. In this work, we demonstrate the benefits of sign equivariance for these tasks. To obtain these benefits, we develop novel sign equivariant neural network architectures. Our models are based on a new analytic characterization of sign equivariant polynomials and thus inherit provable expressiveness properties. Controlled synthetic experiments show that our networks can achieve the theoretically predicted benefits of sign equivariant models. Code is available at https://github.com/cptq/Sign-Equivariant-Nets.
    Creative Agents: Empowering Agents with Imagination for Creative Tasks. (arXiv:2312.02519v1 [cs.AI])
    We study building embodied agents for open-ended creative tasks. While existing methods build instruction-following agents that can perform diverse open-ended tasks, none of them demonstrates creativity -- the ability to give novel and diverse task solutions implicit in the language instructions. This limitation comes from their inability to convert abstract language instructions into concrete task goals in the environment and perform long-horizon planning for such complicated goals. Given the observation that humans perform creative tasks with the help of imagination, we propose a class of solutions for creative agents, where the controller is enhanced with an imaginator that generates detailed imaginations of task outcomes conditioned on language instructions. We introduce several approaches to implementing the components of creative agents. We implement the imaginator with either a large language model for textual imagination or a diffusion model for visual imagination. The controller can either be a behavior-cloning policy learned from data or a pre-trained foundation model generating executable codes in the environment. We benchmark creative tasks with the challenging open-world game Minecraft, where the agents are asked to create diverse buildings given free-form language instructions. In addition, we propose novel evaluation metrics for open-ended creative tasks utilizing GPT-4V, which holds many advantages over existing metrics. We perform a detailed experimental analysis of creative agents, showing that creative agents are the first AI agents accomplishing diverse building creation in the survival mode of Minecraft. Our benchmark and models are open-source for future research on creative agents (https://github.com/PKU-RL/Creative-Agents).
    Virtual Fusion with Contrastive Learning for Single Sensor-based Activity Recognition. (arXiv:2312.02185v1 [cs.LG])
    Various types of sensors can be used for Human Activity Recognition (HAR), and each of them has different strengths and weaknesses. Sometimes a single sensor cannot fully observe the user's motions from its perspective, which causes wrong predictions. While sensor fusion provides more information for HAR, it comes with many inherent drawbacks like user privacy and acceptance, costly set-up, operation, and maintenance. To deal with this problem, we propose Virtual Fusion - a new method that takes advantage of unlabeled data from multiple time-synchronized sensors during training, but only needs one sensor for inference. Contrastive learning is adopted to exploit the correlation among sensors. Virtual Fusion gives significantly better accuracy than training with the same single sensor, and in some cases, it even surpasses actual fusion using multiple sensors at test time. We also extend this method to a more general version called Actual Fusion within Virtual Fusion (AFVF), which uses a subset of training sensors during inference. Our method achieves state-of-the-art accuracy and F1-score on UCI-HAR and PAMAP2 benchmark datasets. Implementation is available upon request.
    Exploring Error Bits for Memory Failure Prediction: An In-Depth Correlative Study. (arXiv:2312.02855v1 [cs.AR])
    In large-scale datacenters, memory failure is a common cause of server crashes, with uncorrectable errors (UEs) being a major indicator of Dual Inline Memory Module (DIMM) defects. Existing approaches primarily focus on predicting UEs using correctable errors (CEs), without fully considering the information provided by error bits. However, error bit patterns have a strong correlation with the occurrence of uncorrectable errors (UEs). In this paper, we present a comprehensive study on the correlation between CEs and UEs, specifically emphasizing the importance of spatio-temporal error bit information. Our analysis reveals a strong correlation between spatio-temporal error bits and UE occurrence. Through evaluations using real-world datasets, we demonstrate that our approach significantly improves prediction performance by 15% in F1-score compared to the state-of-the-art algorithms. Overall, our approach effectively reduces the number of virtual machine interruptions caused by UEs by approximately 59%.
    Federated Active Learning for Target Domain Generalisation. (arXiv:2312.02247v1 [cs.LG])
    In this paper, we introduce Active Learning framework in Federated Learning for Target Domain Generalisation, harnessing the strength from both learning paradigms. Our framework, FEDALV, composed of Active Learning (AL) and Federated Domain Generalisation (FDG), enables generalisation of an image classification model trained from limited source domain client's data without sharing images to an unseen target domain. To this end, our FDG, FEDA, consists of two optimisation updates during training, one at the client and another at the server level. For the client, the introduced losses aim to reduce feature complexity and condition alignment, while in the server, the regularisation limits free energy biases between source and target obtained by the global model. The remaining component of FEDAL is AL with variable budgets, which queries the server to retrieve and sample the most informative local data for the targeted client. We performed multiple experiments on FDG w/ and w/o AL and compared with both conventional FDG baselines and Federated Active Learning baselines. Our extensive quantitative experiments demonstrate the superiority of our method in accuracy and efficiency compared to the multiple contemporary methods. FEDALV manages to obtain the performance of the full training target accuracy while sampling as little as 5% of the source client's data.
    Towards Measuring Representational Similarity of Large Language Models. (arXiv:2312.02730v1 [cs.LG])
    Understanding the similarity of the numerous released large language models (LLMs) has many uses, e.g., simplifying model selection, detecting illegal model reuse, and advancing our understanding of what makes LLMs perform well. In this work, we measure the similarity of representations of a set of LLMs with 7B parameters. Our results suggest that some LLMs are substantially different from others. We identify challenges of using representational similarity measures that suggest the need of careful study of similarity scores to avoid false conclusions.
    Unsupervised Change Detection for Space Habitats Using 3D Point Clouds. (arXiv:2312.02396v1 [cs.RO])
    This work presents an algorithm for scene change detection from point clouds to enable autonomous robotic caretaking in future space habitats. Autonomous robotic systems will help maintain future deep-space habitats, such as the Gateway space station, which will be uncrewed for extended periods. Existing scene analysis software used on the International Space Station (ISS) relies on manually-labeled images for detecting changes. In contrast, the algorithm presented in this work uses raw, unlabeled point clouds as inputs. The algorithm first applies modified Expectation-Maximization Gaussian Mixture Model (GMM) clustering to two input point clouds. It then performs change detection by comparing the GMMs using the Earth Mover's Distance. The algorithm is validated quantitatively and qualitatively using a test dataset collected by an Astrobee robot in the NASA Ames Granite Lab comprising single frame depth images taken directly by Astrobee and full-scene reconstructed maps built with RGB-D and pose data from Astrobee. The runtimes of the approach are also analyzed in depth. The source code is publicly released to promote further development.
    C3: High-performance and low-complexity neural compression from a single image or video. (arXiv:2312.02753v1 [eess.IV])
    Most neural compression models are trained on large datasets of images or videos in order to generalize to unseen data. Such generalization typically requires large and expressive architectures with a high decoding complexity. Here we introduce C3, a neural compression method with strong rate-distortion (RD) performance that instead overfits a small model to each image or video separately. The resulting decoding complexity of C3 can be an order of magnitude lower than neural baselines with similar RD performance. C3 builds on COOL-CHIC (Ladune et al.) and makes several simple and effective improvements for images. We further develop new methodology to apply C3 to videos. On the CLIC2020 image benchmark, we match the RD performance of VTM, the reference implementation of the H.266 codec, with less than 3k MACs/pixel for decoding. On the UVG video benchmark, we match the RD performance of the Video Compression Transformer (Mentzer et al.), a well-established neural video codec, with less than 5k MACs/pixel for decoding.
    LLMs Accelerate Annotation for Medical Information Extraction. (arXiv:2312.02296v1 [cs.CL])
    The unstructured nature of clinical notes within electronic health records often conceals vital patient-related information, making it challenging to access or interpret. To uncover this hidden information, specialized Natural Language Processing (NLP) models are required. However, training these models necessitates large amounts of labeled data, a process that is both time-consuming and costly when relying solely on human experts for annotation. In this paper, we propose an approach that combines Large Language Models (LLMs) with human expertise to create an efficient method for generating ground truth labels for medical text annotation. By utilizing LLMs in conjunction with human annotators, we significantly reduce the human annotation burden, enabling the rapid creation of labeled datasets. We rigorously evaluate our method on a medical information extraction task, demonstrating that our approach not only substantially cuts down on human intervention but also maintains high accuracy. The results highlight the potential of using LLMs to improve the utilization of unstructured clinical data, allowing for the swift deployment of tailored NLP solutions in healthcare.
    Jellyfish: A Large Language Model for Data Preprocessing. (arXiv:2312.01678v2 [cs.AI] UPDATED)
    In this paper, we present Jellyfish, an open-source LLM as a universal task solver for DP. Built on the Llama 2 13B model, Jellyfish is instruction-tuned with the datasets of several typical DP tasks including error detection, data imputation, schema matching, and entity matching, and delivers generalizability to other tasks. Remarkably, Jellyfish can operate on a local, single, and low-priced GPU with its 13 billion parameters, ensuring data security and enabling further tuning. Its proficiency in understanding natural language allows users to manually craft instructions for DP tasks. Unlike many existing methods that heavily rely on prior knowledge, Jellyfish acquires domain knowledge during its tuning process and integrates optional knowledge injection during inference. A distinctive feature of Jellyfish is its interpreter, which elucidates its output decisions. To construct Jellyfish, we develop a series of pre-tuning and DP-tuning techniques. Jellyfish is equipped with an instance serializer, which automatically translates raw data into model prompts, and a knowledge injector, which optionally introduces task- and dataset-specific knowledge to enhance DP performance. Our evaluation of Jellyfish, using a range of real datasets, shows its competitiveness compared to state-of-the-art methods and its strong generalizability to unseen tasks. Jellyfish's performance rivals that of GPT series models, and its interpreter offers enhanced reasoning capabilities compared to GPT-3.5. Furthermore, our evaluation highlights the effectiveness of the techniques employed in constructing Jellyfish. Our model is available at Hugging Face: https://huggingface.co/NECOUDBFM/Jellyfish .
    Privacy-Aware Data Acquisition under Data Similarity in Regression Markets. (arXiv:2312.02611v1 [cs.LG])
    Data markets facilitate decentralized data exchange for applications such as prediction, learning, or inference. The design of these markets is challenged by varying privacy preferences as well as data similarity among data owners. Related works have often overlooked how data similarity impacts pricing and data value through statistical information leakage. We demonstrate that data similarity and privacy preferences are integral to market design and propose a query-response protocol using local differential privacy for a two-party data acquisition mechanism. In our regression data market model, we analyze strategic interactions between privacy-aware owners and the learner as a Stackelberg game over the asked price and privacy factor. Finally, we numerically evaluate how data similarity affects market participation and traded data value.
    Towards out-of-distribution generalizable predictions of chemical kinetics properties. (arXiv:2310.03152v2 [cs.LG] UPDATED)
    Machine Learning (ML) techniques have found applications in estimating chemical kinetic properties. With the accumulated drug molecules identified through "AI4drug discovery", the next imperative lies in AI-driven design for high-throughput chemical synthesis processes, with the estimation of properties of unseen reactions with unexplored molecules. To this end, the existing ML approaches for kinetics property prediction are required to be Out-Of-Distribution (OOD) generalizable. In this paper, we categorize the OOD kinetic property prediction into three levels (structure, condition, and mechanism), revealing unique aspects of such problems. Under this framework, we create comprehensive datasets to benchmark (1) the state-of-the-art ML approaches for reaction prediction in the OOD setting and (2) the state-of-the-art graph OOD methods in kinetics property prediction problems. Our results demonstrated the challenges and opportunities in OOD kinetics property prediction. Our datasets and benchmarks can further support research in this direction.
    Simplifying Neural Network Training Under Class Imbalance. (arXiv:2312.02517v1 [cs.LG])
    Real-world datasets are often highly class-imbalanced, which can adversely impact the performance of deep learning models. The majority of research on training neural networks under class imbalance has focused on specialized loss functions, sampling techniques, or two-stage training procedures. Notably, we demonstrate that simply tuning existing components of standard deep learning pipelines, such as the batch size, data augmentation, optimizer, and label smoothing, can achieve state-of-the-art performance without any such specialized class imbalance methods. We also provide key prescriptions and considerations for training under class imbalance, and an understanding of why imbalance methods succeed or fail.
    ReconU-Net: a direct PET image reconstruction using U-Net architecture with back projection-induced skip connection. (arXiv:2312.02494v1 [physics.med-ph])
    [Objective] This study aims to introduce a novel back projection-induced U-Net-shaped architecture, called ReconU-Net, for deep learning-based direct positron emission tomography (PET) image reconstruction. Additionally, our objective is to analyze the behavior of direct PET image reconstruction and gain deeper insights by comparing the proposed ReconU-Net architecture with other encoder-decoder architectures without skip connections. [Approach] The proposed ReconU-Net architecture uniquely integrates the physical model of the back projection operation into the skip connection. This distinctive feature facilitates the effective transfer of intrinsic spatial information from the input sinogram to the reconstructed image via an embedded physical model. The proposed ReconU-Net was trained using Monte Carlo simulation data from the Brainweb phantom and tested on both simulated and real Hoffman brain phantom data. [Main results] The proposed ReconU-Net method generated a reconstructed image with a more accurate structure compared to other deep learning-based direct reconstruction methods. Further analysis showed that the proposed ReconU-Net architecture has the ability to transfer features of multiple resolutions, especially non-abstract high-resolution information, through skip connections. Despite limited training on simulated data, the proposed ReconU-Net successfully reconstructed the real Hoffman brain phantom, unlike other deep learning-based direct reconstruction methods, which failed to produce a reconstructed image. [Significance] The proposed ReconU-Net can improve the fidelity of direct PET image reconstruction, even when dealing with small training datasets, by leveraging the synergistic relationship between data-driven modeling and the physics model of the imaging process.
    Revisiting Topic-Guided Language Models. (arXiv:2312.02331v1 [cs.CL])
    A recent line of work in natural language processing has aimed to combine language models and topic models. These topic-guided language models augment neural language models with topic models, unsupervised learning methods that can discover document-level patterns of word use. This paper compares the effectiveness of these methods in a standardized setting. We study four topic-guided language models and two baselines, evaluating the held-out predictive performance of each model on four corpora. Surprisingly, we find that none of these methods outperform a standard LSTM language model baseline, and most fail to learn good topics. Further, we train a probe of the neural language model that shows that the baseline's hidden states already encode topic information. We make public all code used for this study.
    Lights out: training RL agents robust to temporary blindness. (arXiv:2312.02665v1 [cs.AI])
    Agents trained with DQN rely on an observation at each timestep to decide what action to take next. However, in real world applications observations can change or be missing entirely. Examples of this could be a light bulb breaking down, or the wallpaper in a certain room changing. While these situations change the actual observation, the underlying optimal policy does not change. Because of this we want our agent to continue taking actions until it receives a (recognized) observation again. To achieve this we introduce a combination of a neural network architecture that uses hidden representations of the observations and a novel n-step loss function. Our implementation is able to withstand location based blindness stretches longer than the ones it was trained on, and therefore shows robustness to temporary blindness. For access to our implementation, please email Nathan, Marije, or Pau.
    Training Reinforcement Learning Agents and Humans With Difficulty-Conditioned Generators. (arXiv:2312.02309v1 [cs.AI])
    We adapt Parameterized Environment Response Model (PERM), a method for training both Reinforcement Learning (RL) Agents and human learners in parameterized environments by directly modeling difficulty and ability. Inspired by Item Response Theory (IRT), PERM aligns environment difficulty with individual ability, creating a Zone of Proximal Development-based curriculum. Remarkably, PERM operates without real-time RL updates and allows for offline training, ensuring its adaptability across diverse students. We present a two-stage training process that capitalizes on PERM's adaptability, and demonstrate its effectiveness in training RL agents and humans in an empirical study.
    Machine Learning Driven Sensitivity Analysis of E3SM Land Model Parameters for Wetland Methane Emissions. (arXiv:2312.02786v1 [physics.ao-ph])
    Methane (CH4) is the second most critical greenhouse gas after carbon dioxide, contributing to 16-25% of the observed atmospheric warming. Wetlands are the primary natural source of methane emissions globally. However, wetland methane emission estimates from biogeochemistry models contain considerable uncertainty. One of the main sources of this uncertainty arises from the numerous uncertain model parameters within various physical, biological, and chemical processes that influence methane production, oxidation, and transport. Sensitivity Analysis (SA) can help identify critical parameters for methane emission and achieve reduced biases and uncertainties in future projections. This study performs SA for 19 selected parameters responsible for critical biogeochemical processes in the methane module of the Energy Exascale Earth System Model (E3SM) land model (ELM). The impact of these parameters on various CH4 fluxes is examined at 14 FLUXNET- CH4 sites with diverse vegetation types. Given the extensive number of model simulations needed for global variance-based SA, we employ a machine learning (ML) algorithm to emulate the complex behavior of ELM methane biogeochemistry. ML enables the computational time to be shortened significantly from 6 CPU hours to 0.72 milliseconds, achieving reduced computational costs. We found that parameters linked to CH4 production and diffusion generally present the highest sensitivities despite apparent seasonal variation. Comparing simulated emissions from perturbed parameter sets against FLUXNET-CH4 observations revealed that better performances can be achieved at each site compared to the default parameter values. This presents a scope for further improving simulated emissions using parameter calibration with advanced optimization techniques like Bayesian optimization.
    MASP: Scalable GNN-based Planning for Multi-Agent Navigation. (arXiv:2312.02522v1 [cs.LG])
    We investigate the problem of decentralized multi-agent navigation tasks, where multiple agents need to reach initially unassigned targets in a limited time. Classical planning-based methods suffer from expensive computation overhead at each step and offer limited expressiveness for complex cooperation strategies. In contrast, reinforcement learning (RL) has recently become a popular paradigm for addressing this issue. However, RL struggles with low data efficiency and cooperation when directly exploring (nearly) optimal policies in the large search space, especially with an increased agent number (e.g., 10+ agents) or in complex environments (e.g., 3D simulators). In this paper, we propose Multi-Agent Scalable GNN-based P lanner (MASP), a goal-conditioned hierarchical planner for navigation tasks with a substantial number of agents. MASP adopts a hierarchical framework to divide a large search space into multiple smaller spaces, thereby reducing the space complexity and accelerating training convergence. We also leverage graph neural networks (GNN) to model the interaction between agents and goals, improving goal achievement. Besides, to enhance generalization capabilities in scenarios with unseen team sizes, we divide agents into multiple groups, each with a previously trained number of agents. The results demonstrate that MASP outperforms classical planning-based competitors and RL baselines, achieving a nearly 100% success rate with minimal training data in both multi-agent particle environments (MPE) with 50 agents and a quadrotor 3-dimensional environment (OmniDrones) with 20 agents. Furthermore, the learned policy showcases zero-shot generalization across unseen team sizes.
    A Framework for Neurosymbolic Robot Action Planning using Large Language Models. (arXiv:2303.00438v2 [cs.AI] UPDATED)
    Symbolic task planning is a widely used approach to enforce robot autonomy due to its ease of understanding and deployment. However, symbolic task planning is difficult to scale in real-world when frequent re-planning is needed, for example, due to human-robot interactions or unforeseen events. Plan length and planning time can hinder the robot's efficiency and negatively affect the overall human-robot interaction's fluency. We present a framework, Teriyaki, designed to bridge the gap between symbolic task planning and machine learning approaches, by training Large Language Models (LLMs), namely GPT-3, into neurosymbolic task planners compatible with the Planning Domain Definition Language (PDDL). Potential benefits include: (i) better scalability in so far as the planning domain complexity increases, since LLMs' response time linearly scales with the combined length of the input and the output, instead of super-linearly as in the case of symbolic task planners, and (ii) the ability to synthesize a plan action-by-action instead of end-to-end, and to make each action available for execution as soon as it is generated, which in turn enables concurrent planning and execution. In the past year, significant efforts have been devoted by the research community to evaluate the overall cognitive abilities of LLMs, with alternate successes. Instead, with Teriyaki we aim to providing an overall planning performance comparable to traditional planners in specific planning domains, while leveraging LLMs capabilities in other metrics which are used to build a look-ahead predictive planning model. Preliminary results in selected domains show that our method can: (i) solve 95.5% of problems in a test data set of 1000 samples; (ii) produce plans up to 13.5% shorter than a traditional symbolic planner; (iii) reduce average overall waiting times for a plan availability by up to 61.4%.
    Calibrated Adaptive Teacher for Domain Adaptive Intelligent Fault Diagnosis. (arXiv:2312.02826v1 [cs.LG])
    Intelligent Fault Diagnosis (IFD) based on deep learning has proven to be an effective and flexible solution, attracting extensive research. Deep neural networks can learn rich representations from vast amounts of representative labeled data for various applications. In IFD, they achieve high classification performance from signals in an end-to-end manner, without requiring extensive domain knowledge. However, deep learning models usually only perform well on the data distribution they have been trained on. When applied to a different distribution, they may experience performance drops. This is also observed in IFD, where assets are often operated in working conditions different from those in which labeled data have been collected. Unsupervised domain adaptation (UDA) deals with the scenario where labeled data are available in a source domain, and only unlabeled data are available in a target domain, where domains may correspond to operating conditions. Recent methods rely on training with confident pseudo-labels for target samples. However, the confidence-based selection of pseudo-labels is hindered by poorly calibrated confidence estimates in the target domain, primarily due to over-confident predictions, which limits the quality of pseudo-labels and leads to error accumulation. In this paper, we propose a novel UDA method called Calibrated Adaptive Teacher (CAT), where we propose to calibrate the predictions of the teacher network throughout the self-training process, leveraging post-hoc calibration techniques. We evaluate CAT on domain-adaptive IFD and perform extensive experiments on the Paderborn benchmark for bearing fault diagnosis under varying operating conditions. Our proposed method achieves state-of-the-art performance on most transfer tasks.
    General-Purpose Retrieval-Enhanced Medical Prediction Model Using Near-Infinite History. (arXiv:2310.20204v2 [cs.LG] UPDATED)
    Developing clinical prediction models (e.g., mortality prediction) based on electronic health records (EHRs) typically relies on expert opinion for feature selection and adjusting observation window size. This burdens experts and creates a bottleneck in the development process. We propose Retrieval-Enhanced Medical prediction model (REMed) to address such challenges. REMed can essentially evaluate an unlimited number of clinical events, select the relevant ones, and make predictions. This approach effectively eliminates the need for manual feature selection and enables an unrestricted observation window. We verified these properties through experiments on 27 clinical tasks and two independent cohorts from publicly available EHR datasets, where REMed outperformed other contemporary architectures that aim to handle as many events as possible. Notably, we found that the preferences of REMed align closely with those of medical experts. We expect our approach to significantly expedite the development of EHR prediction models by minimizing clinicians' need for manual involvement.
    Towards early diagnosis of Alzheimer's disease: Advances in immune-related blood biomarkers and computational modeling approaches. (arXiv:2312.02248v1 [q-bio.QM])
    Alzheimer's disease has an increasing prevalence in the population world-wide, yet current diagnostic methods based on recommended biomarkers are only available in specialized clinics. Due to these circumstances, Alzheimer's disease is usually diagnosed late, which contrasts with the currently available treatment options that are only effective for patients at an early stage. Blood-based biomarkers could fill in the gap of easily accessible and low-cost methods for early diagnosis of the disease. In particular, immune-based blood-biomarkers might be a promising option, given the recently discovered cross-talk of immune cells of the central nervous system with those in the peripheral immune system. With the help of machine learning algorithms and mechanistic modeling approaches, such as agent-based modeling, an in-depth analysis of the simulation of cell dynamics is possible as well as of high-dimensional omics resources indicative of pathway signaling changes. Here, we give a background on advances in research on brain-immune system cross-talk in Alzheimer's disease and review recent machine learning and mechanistic modeling approaches which leverage modern omics technologies for blood-based immune system-related biomarker discovery.
    MOTOR: A Time-To-Event Foundation Model For Structured Medical Records. (arXiv:2301.03150v4 [cs.LG] UPDATED)
    We present a self-supervised, time-to-event (TTE) foundation model called MOTOR (Many Outcome Time Oriented Representations) which is pretrained on timestamped sequences of events in electronic health records (EHR) and health insurance claims. TTE models are used for estimating the probability distribution of the time until a specific event occurs, which is an important task in medical settings. TTE models provide many advantages over classification using fixed time horizons, including naturally handling censored observations, but are challenging to train with limited labeled data. MOTOR addresses this challenge by pretraining on up to 55M patient records (9B clinical events). We evaluate MOTOR's transfer learning performance on 19 tasks, across 3 patient databases (a private EHR system, MIMIC-IV, and Merative claims data). Task-specific models adapted from MOTOR improve time-dependent C statistics by 4.6% over state-of-the-art, improve label efficiency by up to 95% ,and are more robust to temporal distributional shifts. We further evaluate cross-site portability by adapting our MOTOR foundation model for six prediction tasks on the MIMIC-IV dataset, where it outperforms all baselines. MOTOR is the first foundation model for medical TTE predictions and we release a 143M parameter pretrained model for research use at [redacted URL].
    Cancer Subtype Identification through Integrating Inter and Intra Dataset Relationships in Multi-Omics Data. (arXiv:2312.02195v1 [cs.LG])
    The integration of multi-omics data has emerged as a promising approach for gaining comprehensive insights into complex diseases such as cancer. This paper proposes a novel approach to identify cancer subtypes through the integration of multi-omics data for clustering. The proposed method, named LIDAF utilises affinity matrices based on linear relationships between and within different omics datasets (Linear Inter and Intra Dataset Affinity Fusion (LIDAF)). Canonical Correlation Analysis is in this paper employed to create distance matrices based on Euclidean distances between canonical variates. The distance matrices are converted to affinity matrices and those are fused in a three-step process. The proposed LIDAF addresses the limitations of the existing method resulting in improvement of clustering performance as measured by the Adjusted Rand Index and the Normalized Mutual Information score. Moreover, our proposed LIDAF approach demonstrates a notable enhancement in 50% of the log10 rank p-values obtained from Cox survival analysis, surpassing the performance of the best reported method, highlighting its potential of identifying distinct cancer subtypes.
    Harnessing Discrete Representations For Continual Reinforcement Learning. (arXiv:2312.01203v2 [cs.LG] UPDATED)
    Reinforcement learning (RL) agents make decisions using nothing but observations from the environment, and consequently, heavily rely on the representations of those observations. Though some recent breakthroughs have used vector-based categorical representations of observations, often referred to as discrete representations, there is little work explicitly assessing the significance of such a choice. In this work, we provide a thorough empirical investigation of the advantages of representing observations as vectors of categorical values within the context of reinforcement learning. We perform evaluations on world-model learning, model-free RL, and ultimately continual RL problems, where the benefits best align with the needs of the problem setting. We find that, when compared to traditional continuous representations, world models learned over discrete representations accurately model more of the world with less capacity, and that agents trained with discrete representations learn better policies with less data. In the context of continual RL, these benefits translate into faster adapting agents. Additionally, our analysis suggests that the observed performance improvements can be attributed to the information contained within the latent vectors and potentially the encoding of the discrete representation itself.
    MoE-AMC: Enhancing Automatic Modulation Classification Performance Using Mixture-of-Experts. (arXiv:2312.02298v1 [eess.SP])
    Automatic Modulation Classification (AMC) plays a vital role in time series analysis, such as signal classification and identification within wireless communications. Deep learning-based AMC models have demonstrated significant potential in this domain. However, current AMC models inadequately consider the disparities in handling signals under conditions of low and high Signal-to-Noise Ratio (SNR), resulting in an unevenness in their performance. In this study, we propose MoE-AMC, a novel Mixture-of-Experts (MoE) based model specifically crafted to address AMC in a well-balanced manner across varying SNR conditions. Utilizing the MoE framework, MoE-AMC seamlessly combines the strengths of LSRM (a Transformer-based model) for handling low SNR signals and HSRM (a ResNet-based model) for high SNR signals. This integration empowers MoE-AMC to achieve leading performance in modulation classification, showcasing its efficacy in capturing distinctive signal features under diverse SNR scenarios. We conducted experiments using the RML2018.01a dataset, where MoE-AMC achieved an average classification accuracy of 71.76% across different SNR levels, surpassing the performance of previous SOTA models by nearly 10%. This study represents a pioneering application of MoE techniques in the realm of AMC, offering a promising avenue for elevating signal classification accuracy within wireless communication systems.
    Training Chain-of-Thought via Latent-Variable Inference. (arXiv:2312.02179v1 [cs.LG])
    Large language models (LLMs) solve problems more accurately and interpretably when instructed to work out the answer step by step using a ``chain-of-thought'' (CoT) prompt. One can also improve LLMs' performance on a specific task by supervised fine-tuning, i.e., by using gradient ascent on some tunable parameters to maximize the average log-likelihood of correct answers from a labeled training set. Naively combining CoT with supervised tuning requires supervision not just of the correct answers, but also of detailed rationales that lead to those answers; these rationales are expensive to produce by hand. Instead, we propose a fine-tuning strategy that tries to maximize the \emph{marginal} log-likelihood of generating a correct answer using CoT prompting, approximately averaging over all possible rationales. The core challenge is sampling from the posterior over rationales conditioned on the correct answer; we address it using a simple Markov-chain Monte Carlo (MCMC) expectation-maximization (EM) algorithm inspired by the self-taught reasoner (STaR), memoized wake-sleep, Markovian score climbing, and persistent contrastive divergence. This algorithm also admits a novel control-variate technique that drives the variance of our gradient estimates to zero as the model improves. Applying our technique to GSM8K and the tasks in BIG-Bench Hard, we find that this MCMC-EM fine-tuning technique typically improves the model's accuracy on held-out examples more than STaR or prompt-tuning with or without CoT.
    Towards the Inferrence of Structural Similarity of Combinatorial Landscapes. (arXiv:2312.02720v1 [cs.LG])
    One of the most common problem-solving heuristics is by analogy. For a given problem, a solver can be viewed as a strategic walk on its fitness landscape. Thus if a solver works for one problem instance, we expect it will also be effective for other instances whose fitness landscapes essentially share structural similarities with each other. However, due to the black-box nature of combinatorial optimization, it is far from trivial to infer such similarity in real-world scenarios. To bridge this gap, by using local optima network as a proxy of fitness landscapes, this paper proposed to leverage graph data mining techniques to conduct qualitative and quantitative analyses to explore the latent topological structural information embedded in those landscapes. By conducting large-scale empirical experiments on three classic combinatorial optimization problems, we gain concrete evidence to support the existence of structural similarity between landscapes of the same classes within neighboring dimensions. We also interrogated the relationship between landscapes of different problem classes.
    Pseudo Replay-based Class Continual Learning for Online New Category Anomaly Detection in Additive Manufacturing. (arXiv:2312.02491v1 [cs.LG])
    The incorporation of advanced sensors and machine learning techniques has enabled modern manufacturing enterprises to perform data-driven in-situ quality monitoring based on the sensor data collected in manufacturing processes. However, one critical challenge is that newly presented defect category may manifest as the manufacturing process continues, resulting in monitoring performance deterioration of previously trained machine learning models. Hence, there is an increasing need for empowering machine learning model to learn continually. Among all continual learning methods, memory-based continual learning has the best performance but faces the constraints of data storage capacity. To address this issue, this paper develops a novel pseudo replay-based continual learning by integrating class incremental learning and oversampling-based data generation. Without storing all the data, the developed framework could generate high-quality data representing previous classes to train machine learning model incrementally when new category anomaly occurs. In addition, it could even enhance the monitoring performance since it also effectively improves the data quality. The effectiveness of the proposed framework is validated in an additive manufacturing process, which leverages supervised classification problem for anomaly detection. The experimental results show that the developed method is very promising in detecting novel anomaly while maintaining a good performance on the previous task and brings up more flexibility in model architecture.
    Expert-guided Bayesian Optimisation for Human-in-the-loop Experimental Design of Known Systems. (arXiv:2312.02852v1 [cs.LG])
    Domain experts often possess valuable physical insights that are overlooked in fully automated decision-making processes such as Bayesian optimisation. In this article we apply high-throughput (batch) Bayesian optimisation alongside anthropological decision theory to enable domain experts to influence the selection of optimal experiments. Our methodology exploits the hypothesis that humans are better at making discrete choices than continuous ones and enables experts to influence critical early decisions. At each iteration we solve an augmented multi-objective optimisation problem across a number of alternate solutions, maximising both the sum of their utility function values and the determinant of their covariance matrix, equivalent to their total variability. By taking the solution at the knee point of the Pareto front, we return a set of alternate solutions at each iteration that have both high utility values and are reasonably distinct, from which the expert selects one for evaluation. We demonstrate that even in the case of an uninformed practitioner, our algorithm recovers the regret of standard Bayesian optimisation.
    Domain Generalization via Nuclear Norm Regularization. (arXiv:2303.07527v2 [cs.LG] UPDATED)
    The ability to generalize to unseen domains is crucial for machine learning systems deployed in the real world, especially when we only have data from limited training domains. In this paper, we propose a simple and effective regularization method based on the nuclear norm of the learned features for domain generalization. Intuitively, the proposed regularizer mitigates the impacts of environmental features and encourages learning domain-invariant features. Theoretically, we provide insights into why nuclear norm regularization is more effective compared to ERM and alternative regularization methods. Empirically, we conduct extensive experiments on both synthetic and real datasets. We show nuclear norm regularization achieves strong performance compared to baselines in a wide range of domain generalization tasks. Moreover, our regularizer is broadly applicable with various methods such as ERM and SWAD with consistently improved performance, e.g., 1.7% and 0.9% test accuracy improvements respectively on the DomainBed benchmark.
    FLea: Improving federated learning on scarce and label-skewed data via privacy-preserving feature augmentation. (arXiv:2312.02327v1 [cs.LG])
    Learning a global model by abstracting the knowledge, distributed across multiple clients, without aggregating the raw data is the primary goal of Federated Learning (FL). Typically, this works in rounds alternating between parallel local training at several clients, followed by model aggregation at a server. We found that existing FL methods under-perform when local datasets are small and present severe label skew as these lead to over-fitting and local model bias. This is a realistic setting in many real-world applications. To address the problem, we propose \textit{FLea}, a unified framework that tackles over-fitting and local bias by encouraging clients to exchange privacy-protected features to aid local training. The features refer to activations from an intermediate layer of the model, which are obfuscated before being shared with other clients to protect sensitive information in the data. \textit{FLea} leverages a novel way of combining local and shared features as augmentations to enhance local model learning. Our extensive experiments demonstrate that \textit{FLea} outperforms the start-of-the-art FL methods, sharing only model parameters, by up to $17.6\%$, and FL methods that share data augmentations by up to $6.3\%$, while reducing the privacy vulnerability associated with shared data augmentations.
    Auto DP-SGD: Dual Improvements of Privacy and Accuracy via Automatic Clipping Threshold and Noise Multiplier Estimation. (arXiv:2312.02400v1 [cs.LG])
    DP-SGD has emerged as a popular method to protect personally identifiable information in deep learning applications. Unfortunately, DP-SGD's per-sample gradient clipping and uniform noise addition during training can significantly degrade model utility. To enhance the model's utility, researchers proposed various adaptive DP-SGD methods. However, we examine and discover that these techniques result in greater privacy leakage or lower accuracy than the traditional DP-SGD method, or a lack of evaluation on a complex data set such as CIFAR100. To address these limitations, we propose an Auto DP-SGD. Our method automates clipping threshold estimation based on the DL model's gradient norm and scales the gradients of each training sample without losing gradient information. This helps to improve the algorithm's utility while using a less privacy budget. To further improve accuracy, we introduce automatic noise multiplier decay mechanisms to decrease the noise multiplier after every epoch. Finally, we develop closed-form mathematical expressions using tCDP accountant for automatic noise multiplier and automatic clipping threshold estimation. Through extensive experimentation, we demonstrate that Auto DP-SGD outperforms existing SOTA DP-SGD methods in privacy and accuracy on various benchmark datasets. We also show that privacy can be improved by lowering the scale factor and using learning rate schedulers without significantly reducing accuracy. Specifically, Auto DP-SGD, when used with a step noise multiplier, improves accuracy by 3.20, 1.57, 6.73, and 1.42 for the MNIST, CIFAR10, CIFAR100, and AG News Corpus datasets, respectively. Furthermore, it obtains a substantial reduction in the privacy budget of 94.9, 79.16, 67.36, and 53.37 for the corresponding data sets.
    Asymmetric leader-laggard cluster synchronization for collective decision-making with laser network. (arXiv:2312.02537v1 [physics.optics])
    Photonic accelerators have recently attracted soaring interest, harnessing the ultimate nature of light for information processing. Collective decision-making with a laser network, employing the chaotic and synchronous dynamics of optically interconnected lasers to address the competitive multi-armed bandit (CMAB) problem, is a highly compelling approach due to its scalability and experimental feasibility. We investigated essential network structures for collective decision-making through quantitative stability analysis. Moreover, we demonstrated the asymmetric preferences of players in the CMAB problem, extending its functionality to more practical applications. Our study highlights the capability and significance of machine learning built upon chaotic lasers and photonic devices.
    Conditional Variational Diffusion Models. (arXiv:2312.02246v1 [cs.CV])
    Inverse problems aim to determine parameters from observations, a crucial task in engineering and science. Lately, generative models, especially diffusion models, have gained popularity in this area for their ability to produce realistic solutions and their good mathematical properties. Despite their success, an important drawback of diffusion models is their sensitivity to the choice of variance schedule, which controls the dynamics of the diffusion process. Fine-tuning this schedule for specific applications is crucial but time-costly and does not guarantee an optimal result. We propose a novel approach for learning the schedule as part of the training process. Our method supports probabilistic conditioning on data, provides high-quality solutions, and is flexible, proving able to adapt to different applications with minimum overhead. This approach is tested in two unrelated inverse problems: super-resolution microscopy and quantitative phase imaging, yielding comparable or superior results to previous methods and fine-tuned diffusion models. We conclude that fine-tuning the schedule by experimentation should be avoided because it can be learned during training in a stable way that yields better results.
    A Kernel-Based Neural Network Test for High-dimensional Sequencing Data Analysis. (arXiv:2312.02850v1 [stat.ML])
    The recent development of artificial intelligence (AI) technology, especially the advance of deep neural network (DNN) technology, has revolutionized many fields. While DNN plays a central role in modern AI technology, it has been rarely used in sequencing data analysis due to challenges brought by high-dimensional sequencing data (e.g., overfitting). Moreover, due to the complexity of neural networks and their unknown limiting distributions, building association tests on neural networks for genetic association analysis remains a great challenge. To address these challenges and fill the important gap of using AI in high-dimensional sequencing data analysis, we introduce a new kernel-based neural network (KNN) test for complex association analysis of sequencing data. The test is built on our previously developed KNN framework, which uses random effects to model the overall effects of high-dimensional genetic data and adopts kernel-based neural network structures to model complex genotype-phenotype relationships. Based on KNN, a Wald-type test is then introduced to evaluate the joint association of high-dimensional genetic data with a disease phenotype of interest, considering non-linear and non-additive effects (e.g., interaction effects). Through simulations, we demonstrated that our proposed method attained higher power compared to the sequence kernel association test (SKAT), especially in the presence of non-linear and interaction effects. Finally, we apply the methods to the whole genome sequencing (WGS) dataset from the Alzheimer's Disease Neuroimaging Initiative (ADNI) study, investigating new genes associated with the hippocampal volume change over time.  ( 3 min )
    Scaling Laws in Jet Classification. (arXiv:2312.02264v1 [hep-ph])
    We demonstrate the emergence of scaling laws in the benchmark top versus QCD jet classification problem in collider physics. Six distinct physically-motivated classifiers exhibit power-law scaling of the binary cross-entropy test loss as a function of training set size, with distinct power law indices. This result highlights the importance of comparing classifiers as a function of dataset size rather than for a fixed training set, as the optimal classifier may change considerably as the dataset is scaled up. We speculate on the interpretation of our results in terms of previous models of scaling laws observed in natural language and image datasets.  ( 2 min )
    TriDeNT: Triple Deep Network Training for Privileged Knowledge Distillation in Histopathology. (arXiv:2312.02111v2 [cs.CV] UPDATED)
    Computational pathology models rarely utilise data that will not be available for inference. This means most models cannot learn from highly informative data such as additional immunohistochemical (IHC) stains and spatial transcriptomics. We present TriDeNT, a novel self-supervised method for utilising privileged data that is not available during inference to improve performance. We demonstrate the efficacy of this method for a range of different paired data including immunohistochemistry, spatial transcriptomics and expert nuclei annotations. In all settings, TriDeNT outperforms other state-of-the-art methods in downstream tasks, with observed improvements of up to 101%. Furthermore, we provide qualitative and quantitative measurements of the features learned by these models and how they differ from baselines. TriDeNT offers a novel method to distil knowledge from scarce or costly data during training, to create significantly better models for routine inputs.  ( 2 min )
    Value Functions are Control Barrier Functions: Verification of Safe Policies using Control Theory. (arXiv:2306.04026v4 [cs.LG] UPDATED)
    Guaranteeing safe behaviour of reinforcement learning (RL) policies poses significant challenges for safety-critical applications, despite RL's generality and scalability. To address this, we propose a new approach to apply verification methods from control theory to learned value functions. By analyzing task structures for safety preservation, we formalize original theorems that establish links between value functions and control barrier functions. Further, we propose novel metrics for verifying value functions in safe control tasks and practical implementation details to improve learning. Our work presents a novel method for certificate learning, which unlocks a diversity of verification techniques from control theory for RL policies, and marks a significant step towards a formal framework for the general, scalable, and verifiable design of RL-based control systems. Code and videos are available at this https url: https://rl-cbf.github.io/  ( 2 min )
    On Optimal Consistency-Robustness Trade-Off for Learning-Augmented Multi-Option Ski Rental. (arXiv:2312.02547v1 [cs.DS])
    The learning-augmented multi-option ski rental problem generalizes the classical ski rental problem in two ways: the algorithm is provided with a prediction on the number of days we can ski, and the ski rental options now come with a variety of rental periods and prices to choose from, unlike the classical two-option setting. Subsequent to the initial study of the multi-option ski rental problem (without learning augmentation) due to Zhang, Poon, and Xu, significant progress has been made for this problem recently in particular. The problem is very well understood when we relinquish one of the two generalizations -- for the learning-augmented classical ski rental problem, algorithms giving best-possible trade-off between consistency and robustness exist; for the multi-option ski rental problem without learning augmentation, deterministic/randomized algorithms giving the best-possible competitiveness have been found. However, in presence of both generalizations, there remained a huge gap between the algorithmic and impossibility results. In fact, for randomized algorithms, we did not have any nontrivial lower bounds on the consistency-robustness trade-off before. This paper bridges this gap for both deterministic and randomized algorithms. For deterministic algorithms, we present a best-possible algorithm that completely matches the known lower bound. For randomized algorithms, we show the first nontrivial lower bound on the consistency-robustness trade-off, and also present an improved randomized algorithm. Our algorithm matches our lower bound on robustness within a factor of e/2 when the consistency is at most 1.086.  ( 3 min )
    A Self-Commissioning Edge Computing Method for Data-Driven Anomaly Detection in Power Electronic Systems. (arXiv:2312.02661v1 [cs.LG])
    Ensuring the reliability of power electronic converters is a matter of great importance, and data-driven condition monitoring techniques are cementing themselves as an important tool for this purpose. However, translating methods that work well in controlled lab environments to field applications presents significant challenges, notably because of the limited diversity and accuracy of the lab training data. By enabling the use of field data, online machine learning can be a powerful tool to overcome this problem, but it introduces additional challenges in ensuring the stability and predictability of the training processes. This work presents an edge computing method that mitigates these shortcomings with minimal additional memory usage, by employing an autonomous algorithm that prioritizes the storage of training samples with larger prediction errors. The method is demonstrated on the use case of a self-commissioning condition monitoring system, in the form of a thermal anomaly detection scheme for a variable frequency motor drive, where the algorithm self-learned to distinguish normal and anomalous operation with minimal prior knowledge. The obtained results, based on experimental data, show a significant improvement in prediction accuracy and training speed, when compared to equivalent models trained online without the proposed data selection process.  ( 2 min )
    Physics-informed neural networks with unknown measurement noise. (arXiv:2211.15498v4 [stat.ML] UPDATED)
    Physics-informed neural networks (PINNs) constitute a flexible approach to both finding solutions and identifying parameters of partial differential equations. Most works on the topic assume noiseless data, or data contaminated with weak Gaussian noise. We show that the standard PINN framework breaks down in case of non-Gaussian noise. We give a way of resolving this fundamental issue and we propose to jointly train an energy-based model (EBM) to learn the correct noise distribution. We illustrate the improved performance of our approach using multiple examples.  ( 2 min )
    Adam-like Algorithm with Smooth Clipping Attains Global Minima: Analysis Based on Ergodicity of Functional SDEs. (arXiv:2312.02182v1 [cs.LG])
    In this paper, we prove that an Adam-type algorithm with smooth clipping approaches the global minimizer of the regularized non-convex loss function. Adding smooth clipping and taking the state space as the set of all trajectories, we can apply the ergodic theory of Markov semigroups for this algorithm and investigate its asymptotic behavior. The ergodic theory we establish in this paper reduces the problem of evaluating the convergence, generalization error and discretization error of this algorithm to the problem of evaluating the difference between two functional stochastic differential equations (SDEs) with different drift coefficients. As a result of our analysis, we have shown that this algorithm minimizes the the regularized non-convex loss function with errors of the form $n^{-1/2}$, $\eta^{1/4}$, $\beta^{-1} \log (\beta + 1)$ and $e^{- c t}$. Here, $c$ is a constant and $n$, $\eta$, $\beta$ and $t$ denote the size of the training dataset, learning rate, inverse temperature and time, respectively.  ( 2 min )
    Working Backwards: Learning to Place by Picking. (arXiv:2312.02352v1 [cs.RO])
    We present Learning to Place by Picking (LPP), a method capable of autonomously collecting demonstrations for a family of placing tasks in which objects must be manipulated to specific locations. With LPP, we approach the learning of robotic object placement policies by reversing the grasping process and exploiting the inherent symmetry of the pick and place problems. Specifically, we obtain placing demonstrations from a set of grasp sequences of objects that are initially located at their target placement locations. Our system is capable of collecting hundreds of demonstrations without human intervention by using a combination of tactile sensing and compliant control for grasps. We train a policy directly from visual observations through behaviour cloning, using the autonomously-collected demonstrations. By doing so, the policy can generalize to object placement scenarios outside of the training environment without privileged information (e.g., placing a plate picked up from a table and not at the original placement location). We validate our approach on home robotic scenarios that include dishwasher loading and table setting. Our approach yields robotic placing policies that outperform policies trained with kinesthetic teaching, both in terms of performance and data efficiency, while requiring no human supervision.  ( 2 min )
    Efficient Deep Learning Models for Privacy-preserving People Counting on Low-resolution Infrared Arrays. (arXiv:2304.06059v2 [cs.CV] UPDATED)
    Ultra-low-resolution Infrared (IR) array sensors offer a low-cost, energy-efficient, and privacy-preserving solution for people counting, with applications such as occupancy monitoring. Previous work has shown that Deep Learning (DL) can yield superior performance on this task. However, the literature was missing an extensive comparative analysis of various efficient DL architectures for IR array-based people counting, that considers not only their accuracy, but also the cost of deploying them on memory- and energy-constrained Internet of Things (IoT) edge nodes. In this work, we address this need by comparing 6 different DL architectures on a novel dataset composed of IR images collected from a commercial 8x8 array, which we made openly available. With a wide architectural exploration of each model type, we obtain a rich set of Pareto-optimal solutions, spanning cross-validated balanced accuracy scores in the 55.70-82.70% range. When deployed on a commercial Microcontroller (MCU) by STMicroelectronics, the STM32L4A6ZG, these models occupy 0.41-9.28kB of memory, and require 1.10-7.74ms per inference, while consuming 17.18-120.43 $\mu$J of energy. Our models are significantly more accurate than a previous deterministic method (up to +39.9%), while being up to 3.53x faster and more energy efficient. Further, our models' accuracy is comparable to state-of-the-art DL solutions on similar resolution sensors, despite a much lower complexity. All our models enable continuous, real-time inference on a MCU-based IoT node, with years of autonomous operation without battery recharging.  ( 3 min )
    Convergence Rates for Stochastic Approximation: Biased Noise with Unbounded Variance, and Applications. (arXiv:2312.02828v1 [stat.ML])
    The Stochastic Approximation (SA) algorithm introduced by Robbins and Monro in 1951 has been a standard method for solving equations of the form $\mathbf{f}({\boldsymbol {\theta}}) = \mathbf{0}$, when only noisy measurements of $\mathbf{f}(\cdot)$ are available. If $\mathbf{f}({\boldsymbol {\theta}}) = \nabla J({\boldsymbol {\theta}})$ for some function $J(\cdot)$, then SA can also be used to find a stationary point of $J(\cdot)$. In much of the literature, it is assumed that the error term ${\boldsymbol {xi}}_{t+1}$ has zero conditional mean, and that its conditional variance is bounded as a function of $t$ (though not necessarily with respect to ${\boldsymbol {\theta}}_t$). Also, for the most part, the emphasis has been on ``synchronous'' SA, whereby, at each time $t$, \textit{every} component of ${\boldsymbol {\theta}}_t$ is updated. Over the years, SA has been applied to a variety of areas, out of which two are the focus in this paper: Convex and nonconvex optimization, and Reinforcement Learning (RL). As it turns out, in these applications, the above-mentioned assumptions do not always hold. In zero-order methods, the error neither has zero mean nor bounded conditional variance. In the present paper, we extend SA theory to encompass errors with nonzero conditional mean and/or unbounded conditional variance, and also asynchronous SA. In addition, we derive estimates for the rate of convergence of the algorithm. Then we apply the new results to problems in nonconvex optimization, and to Markovian SA, a recently emerging area in RL. We prove that SA converges in these situations, and compute the ``optimal step size sequences'' to maximize the estimated rate of convergence.  ( 3 min )
    Autoencoders for discovering manifold dimension and coordinates in data from complex dynamical systems. (arXiv:2305.01090v2 [cs.LG] UPDATED)
    While many phenomena in physics and engineering are formally high-dimensional, their long-time dynamics often live on a lower-dimensional manifold. The present work introduces an autoencoder framework that combines implicit regularization with internal linear layers and $L_2$ regularization (weight decay) to automatically estimate the underlying dimensionality of a data set, produce an orthogonal manifold coordinate system, and provide the mapping functions between the ambient space and manifold space, allowing for out-of-sample projections. We validate our framework's ability to estimate the manifold dimension for a series of datasets from dynamical systems of varying complexities and compare to other state-of-the-art estimators. We analyze the training dynamics of the network to glean insight into the mechanism of low-rank learning and find that collectively each of the implicit regularizing layers compound the low-rank representation and even self-correct during training. Analysis of gradient descent dynamics for this architecture in the linear case reveals the role of the internal linear layers in leading to faster decay of a "collective weight variable" incorporating all layers, and the role of weight decay in breaking degeneracies and thus driving convergence along directions in which no decay would occur in its absence. We show that this framework can be naturally extended for applications of state-space modeling and forecasting by generating a data-driven dynamic model of a spatiotemporally chaotic partial differential equation using only the manifold coordinates. Finally, we demonstrate that our framework is robust to hyperparameter choices.  ( 3 min )
    Deep Knowledge Tracing is an implicit dynamic multidimensional item response theory model. (arXiv:2309.12334v2 [cs.CY] UPDATED)
    Knowledge tracing consists in predicting the performance of some students on new questions given their performance on previous questions, and can be a prior step to optimizing assessment and learning. Deep knowledge tracing (DKT) is a competitive model for knowledge tracing relying on recurrent neural networks, even if some simpler models may match its performance. However, little is known about why DKT works so well. In this paper, we frame deep knowledge tracing as a encoderdecoder architecture. This viewpoint not only allows us to propose better models in terms of performance, simplicity or expressivity but also opens up promising avenues for future research directions. In particular, we show on several small and large datasets that a simpler decoder, with possibly fewer parameters than the one used by DKT, can predict student performance better.  ( 2 min )
    Comparative Analysis of CPU and GPU Profiling for Deep Learning Models. (arXiv:2309.02521v2 [cs.DC] UPDATED)
    Deep Learning(DL) and Machine Learning(ML) applications are rapidly increasing in recent days. Massive amounts of data are being generated over the internet which can derive meaningful results by the use of ML and DL algorithms. Hardware resources and open-source libraries have made it easy to implement these algorithms. Tensorflow and Pytorch are one of the leading frameworks for implementing ML projects. By using those frameworks, we can trace the operations executed on both GPU and CPU to analyze the resource allocations and consumption. This paper presents the time and memory allocation of CPU and GPU while training deep neural networks using Pytorch. This paper analysis shows that GPU has a lower running time as compared to CPU for deep neural networks. For a simpler network, there are not many significant improvements in GPU over the CPU.  ( 2 min )
    Class-Discriminative Attention Maps for Vision Transformers. (arXiv:2312.02364v1 [cs.CV])
    Interpretability methods are critical components for examining and exploring deep neural networks (DNN), as well as increasing our understanding of and trust in them. Vision transformers (ViT), which can be trained to state-of-the-art performance with a self-supervised learning (SSL) training method, provide built-in attention maps (AM). While AMs can provide high-quality semantic segmentation of input images, they do not account for any signal coming from a downstream classifier. We introduce class-discriminative attention maps (CDAM), a novel post-hoc explanation method that is highly sensitive to the target class. Our method essentially scales attention scores by how relevant the corresponding tokens are for the predictions of a classifier head. Alternative to classifier outputs, CDAM can also explain a user-defined concept by targeting similarity measures in the latent space of the ViT. This allows for explanations of arbitrary concepts, defined by the user through a few sample images. We investigate the operating characteristics of CDAM in comparison with relevance propagation (RP) and token ablation maps (TAM), an alternative to pixel occlusion methods. CDAM is highly class-discriminative and semantically relevant, while providing implicit regularization of relevance scores. PyTorch implementation: \url{https://github.com/lenbrocki/CDAM} Web live demo: \url{https://cdam.informatism.com/}  ( 2 min )
    GIT-Net: Generalized Integral Transform for Operator Learning. (arXiv:2312.02450v1 [stat.ML])
    This article introduces GIT-Net, a deep neural network architecture for approximating Partial Differential Equation (PDE) operators, inspired by integral transform operators. GIT-NET harnesses the fact that differential operators commonly used for defining PDEs can often be represented parsimoniously when expressed in specialized functional bases (e.g., Fourier basis). Unlike rigid integral transforms, GIT-Net parametrizes adaptive generalized integral transforms with deep neural networks. When compared to several recently proposed alternatives, GIT-Net's computational and memory requirements scale gracefully with mesh discretizations, facilitating its application to PDE problems on complex geometries. Numerical experiments demonstrate that GIT-Net is a competitive neural network operator, exhibiting small test errors and low evaluations across a range of PDE problems. This stands in contrast to existing neural network operators, which typically excel in just one of these areas.  ( 2 min )
    FaultFormer: Transformer-based Prediction of Bearing Faults. (arXiv:2312.02380v1 [cs.LG])
    The growth of deep learning in the past decade has motivated important applications to smart manufacturing and machine health monitoring. In particular, vibration data offers a rich and reliable source to provide meaningful insights into machine health and predictive maintenance. In this work, we present a Transformer based framework for analyzing vibration signals to predict different types of bearing faults (FaultFormer). In particular, we process signal data using data augmentations and extract their Fourier modes to train a transformer encoder to achieve state of the art accuracies. The attention mechanism as well as model outputs were analyzed to confirm the transformer's ability to automatically extract features within signals and learn both global and local relationships to make classifications. Lastly, two pretraining strategies were proposed to pave the way for large, generalizable transformers that could adapt to new data, situations, or machinery on the production floor.  ( 2 min )
    Constrained Twin Variational Auto-Encoder for Intrusion Detection in IoT Systems. (arXiv:2312.02490v1 [cs.LG])
    Intrusion detection systems (IDSs) play a critical role in protecting billions of IoT devices from malicious attacks. However, the IDSs for IoT devices face inherent challenges of IoT systems, including the heterogeneity of IoT data/devices, the high dimensionality of training data, and the imbalanced data. Moreover, the deployment of IDSs on IoT systems is challenging, and sometimes impossible, due to the limited resources such as memory/storage and computing capability of typical IoT devices. To tackle these challenges, this article proposes a novel deep neural network/architecture called Constrained Twin Variational Auto-Encoder (CTVAE) that can feed classifiers of IDSs with more separable/distinguishable and lower-dimensional representation data. Additionally, in comparison to the state-of-the-art neural networks used in IDSs, CTVAE requires less memory/storage and computing power, hence making it more suitable for IoT IDS systems. Extensive experiments with the 11 most popular IoT botnet datasets show that CTVAE can boost around 1% in terms of accuracy and Fscore in detection attack compared to the state-of-the-art machine learning and representation learning methods, whilst the running time for attack detection is lower than 2E-6 seconds and the model size is lower than 1 MB. We also further investigate various characteristics of CTVAE in the latent space and in the reconstruction representation to demonstrate its efficacy compared with current well-known methods.  ( 2 min )
    Improving Multimodal Sentiment Analysis: Supervised Angular Margin-based Contrastive Learning for Enhanced Fusion Representation. (arXiv:2312.02227v1 [cs.LG])
    The effectiveness of a model is heavily reliant on the quality of the fusion representation of multiple modalities in multimodal sentiment analysis. Moreover, each modality is extracted from raw input and integrated with the rest to construct a multimodal representation. Although previous methods have proposed multimodal representations and achieved promising results, most of them focus on forming positive and negative pairs, neglecting the variation in sentiment scores within the same class. Additionally, they fail to capture the significance of unimodal representations in the fusion vector. To address these limitations, we introduce a framework called Supervised Angular-based Contrastive Learning for Multimodal Sentiment Analysis. This framework aims to enhance discrimination and generalizability of the multimodal representation and overcome biases in the fusion vector's modality. Our experimental results, along with visualizations on two widely used datasets, demonstrate the effectiveness of our approach.  ( 2 min )
    Rethinking and Simplifying Bootstrapped Graph Latents. (arXiv:2312.02619v1 [cs.LG])
    Graph contrastive learning (GCL) has emerged as a representative paradigm in graph self-supervised learning, where negative samples are commonly regarded as the key to preventing model collapse and producing distinguishable representations. Recent studies have shown that GCL without negative samples can achieve state-of-the-art performance as well as scalability improvement, with bootstrapped graph latent (BGRL) as a prominent step forward. However, BGRL relies on a complex architecture to maintain the ability to scatter representations, and the underlying mechanisms enabling the success remain largely unexplored. In this paper, we introduce an instance-level decorrelation perspective to tackle the aforementioned issue and leverage it as a springboard to reveal the potential unnecessary model complexity within BGRL. Based on our findings, we present SGCL, a simple yet effective GCL framework that utilizes the outputs from two consecutive iterations as positive pairs, eliminating the negative samples. SGCL only requires a single graph augmentation and a single graph encoder without additional parameters. Extensive experiments conducted on various graph benchmarks demonstrate that SGCL can achieve competitive performance with fewer parameters, lower time and space costs, and significant convergence speedup.  ( 2 min )
  • Open

    Quantum Machine Learning on Near-Term Quantum Devices: Current State of Supervised and Unsupervised Techniques for Real-World Applications. (arXiv:2307.00908v2 [quant-ph] UPDATED)
    The past decade has witnessed significant advancements in quantum hardware, encompassing improvements in speed, qubit quantity, and quantum volume-a metric defining the maximum size of a quantum circuit effectively implementable on near-term quantum devices. This progress has led to a surge in Quantum Machine Learning (QML) applications on real hardware, aiming to achieve quantum advantage over classical approaches. This survey focuses on selected supervised and unsupervised learning applications executed on quantum hardware, specifically tailored for real-world scenarios. The exploration includes a thorough analysis of current QML implementation limitations on quantum hardware, covering techniques like encoding, ansatz structure, error mitigation, and gradient methods to address these challenges. Furthermore, the survey evaluates the performance of QML implementations in comparison to classical counterparts. In conclusion, we discuss existing bottlenecks related to applying QML on real quantum devices and propose potential solutions to overcome these challenges in the future.
    Calibrated Adaptive Teacher for Domain Adaptive Intelligent Fault Diagnosis. (arXiv:2312.02826v1 [cs.LG])
    Intelligent Fault Diagnosis (IFD) based on deep learning has proven to be an effective and flexible solution, attracting extensive research. Deep neural networks can learn rich representations from vast amounts of representative labeled data for various applications. In IFD, they achieve high classification performance from signals in an end-to-end manner, without requiring extensive domain knowledge. However, deep learning models usually only perform well on the data distribution they have been trained on. When applied to a different distribution, they may experience performance drops. This is also observed in IFD, where assets are often operated in working conditions different from those in which labeled data have been collected. Unsupervised domain adaptation (UDA) deals with the scenario where labeled data are available in a source domain, and only unlabeled data are available in a target domain, where domains may correspond to operating conditions. Recent methods rely on training with confident pseudo-labels for target samples. However, the confidence-based selection of pseudo-labels is hindered by poorly calibrated confidence estimates in the target domain, primarily due to over-confident predictions, which limits the quality of pseudo-labels and leads to error accumulation. In this paper, we propose a novel UDA method called Calibrated Adaptive Teacher (CAT), where we propose to calibrate the predictions of the teacher network throughout the self-training process, leveraging post-hoc calibration techniques. We evaluate CAT on domain-adaptive IFD and perform extensive experiments on the Paderborn benchmark for bearing fault diagnosis under varying operating conditions. Our proposed method achieves state-of-the-art performance on most transfer tasks.
    On the Temperature of Bayesian Graph Neural Networks for Conformal Prediction. (arXiv:2310.11479v3 [cs.LG] UPDATED)
    Accurate uncertainty quantification in graph neural networks (GNNs) is essential, especially in high-stakes domains where GNNs are frequently employed. Conformal prediction (CP) offers a promising framework for quantifying uncertainty by providing $\textit{valid}$ prediction sets for any black-box model. CP ensures formal probabilistic guarantees that a prediction set contains a true label with a desired probability. However, the size of prediction sets, known as $\textit{inefficiency}$, is influenced by the underlying model and data generating process. On the other hand, Bayesian learning also provides a credible region based on the estimated posterior distribution, but this region is $\textit{well-calibrated}$ only when the model is correctly specified. Building on a recent work that introduced a scaling parameter for constructing valid credible regions from posterior estimate, our study explores the advantages of incorporating a temperature parameter into Bayesian GNNs within CP framework. We empirically demonstrate the existence of temperatures that result in more efficient prediction sets. Furthermore, we conduct an analysis to identify the factors contributing to inefficiency and offer valuable insights into the relationship between CP performance and model calibration.
    Expressive Sign Equivariant Networks for Spectral Geometric Learning. (arXiv:2312.02339v1 [cs.LG])
    Recent work has shown the utility of developing machine learning models that respect the structure and symmetries of eigenvectors. These works promote sign invariance, since for any eigenvector v the negation -v is also an eigenvector. However, we show that sign invariance is theoretically limited for tasks such as building orthogonally equivariant models and learning node positional encodings for link prediction in graphs. In this work, we demonstrate the benefits of sign equivariance for these tasks. To obtain these benefits, we develop novel sign equivariant neural network architectures. Our models are based on a new analytic characterization of sign equivariant polynomials and thus inherit provable expressiveness properties. Controlled synthetic experiments show that our networks can achieve the theoretically predicted benefits of sign equivariant models. Code is available at https://github.com/cptq/Sign-Equivariant-Nets.
    Continual Learning with Distributed Optimization: Does CoCoA Forget?. (arXiv:2211.16994v4 [stat.ML] UPDATED)
    We focus on the continual learning problem where the tasks arrive sequentially and the aim is to perform well on the newly arrived task without performance degradation on the previously seen tasks. In contrast to the continual learning literature focusing on the centralized setting, we investigate the distributed estimation framework. We consider the well-established distributed learning algorithm COCOA. We derive closed form expressions for the iterations for the overparametrized case. We illustrate the convergence and the error performance of the algorithm based on the over/under-parameterization of the problem. Our results show that depending on the problem dimensions and data generation assumptions, COCOA can perform continual learning over a sequence of tasks, i.e., it can learn a new task without forgetting previously learned tasks, with access only to one task at a time.
    CityTFT: Temporal Fusion Transformer for Urban Building Energy Modeling. (arXiv:2312.02375v1 [stat.ML])
    Urban Building Energy Modeling (UBEM) is an emerging method to investigate urban design and energy systems against the increasing energy demand at urban and neighborhood levels. However, current UBEM methods are mostly physic-based and time-consuming in multiple climate change scenarios. This work proposes CityTFT, a data-driven UBEM framework, to accurately model the energy demands in urban environments. With the empowerment of the underlying TFT framework and an augmented loss function, CityTFT could predict heating and cooling triggers in unseen climate dynamics with an F1 score of 99.98 \% while RMSE of loads of 13.57 kWh.
    Statistical exploration of the Manifold Hypothesis. (arXiv:2208.11665v3 [stat.ME] CROSS LISTED)
    The Manifold Hypothesis is a widely accepted tenet of Machine Learning which asserts that nominally high-dimensional data are in fact concentrated near a low-dimensional manifold, embedded in high-dimensional space. This phenomenon is observed empirically in many real world situations, has led to development of a wide range of statistical methods in the last few decades, and has been suggested as a key factor in the success of modern AI technologies. We show that rich and sometimes intricate manifold structure in data can emerge from a generic and remarkably simple statistical model -- the Latent Metric Model -- via elementary concepts such as latent variables, correlation and stationarity. This establishes a general statistical explanation for why the Manifold Hypothesis seems to hold in so many situations. Informed by the Latent Metric Model we derive procedures to discover and interpret the geometry of high-dimensional data, and explore hypotheses about the data generating mechanism. These procedures operate under minimal assumptions and make use of well known, scaleable graph-analytic algorithms.
    (Provable) Adversarial Robustness for Group Equivariant Tasks: Graphs, Point Clouds, Molecules, and More. (arXiv:2312.02708v1 [cs.LG])
    A machine learning model is traditionally considered robust if its prediction remains (almost) constant under input perturbations with small norm. However, real-world tasks like molecular property prediction or point cloud segmentation have inherent equivariances, such as rotation or permutation equivariance. In such tasks, even perturbations with large norm do not necessarily change an input's semantic content. Furthermore, there are perturbations for which a model's prediction explicitly needs to change. For the first time, we propose a sound notion of adversarial robustness that accounts for task equivariance. We then demonstrate that provable robustness can be achieved by (1) choosing a model that matches the task's equivariances (2) certifying traditional adversarial robustness. Certification methods are, however, unavailable for many models, such as those with continuous equivariances. We close this gap by developing the framework of equivariance-preserving randomized smoothing, which enables architecture-agnostic certification. We additionally derive the first architecture-specific graph edit distance certificates, i.e. sound robustness guarantees for isomorphism equivariant tasks like node classification. Overall, a sound notion of robustness is an important prerequisite for future work at the intersection of robust and geometric machine learning.
    Fast non-autoregressive inverse folding with discrete diffusion. (arXiv:2312.02447v1 [q-bio.BM])
    Generating protein sequences that fold into a intended 3D structure is a fundamental step in de novo protein design. De facto methods utilize autoregressive generation, but this eschews higher order interactions that could be exploited to improve inference speed. We describe a non-autoregressive alternative that performs inference using a constant number of calls resulting in a 23 times speed up without a loss in performance on the CATH benchmark. Conditioned on the 3D structure, we fine-tune ProteinMPNN to perform discrete diffusion with a purity prior over the index sampling order. Our approach gives the flexibility in trading off inference speed and accuracy by modulating the diffusion speed. Code: https://github.com/johnyang101/pmpnndiff
    Grassmann Manifold Flows for Stable Shape Generation. (arXiv:2211.02900v3 [cs.LG] UPDATED)
    Recently, studies on machine learning have focused on methods that use symmetry implicit in a specific manifold as an inductive bias. Grassmann manifolds provide the ability to handle fundamental shapes represented as shape spaces, enabling stable shape analysis. In this paper, we present a novel approach in which we establish the theoretical foundations for learning distributions on the Grassmann manifold via continuous normalization flows, with the explicit goal of generating stable shapes. Our approach facilitates more robust generation by effectively eliminating the influence of extraneous transformations, such as rotations and inversions, through learning and generating within a Grassmann manifold designed to accommodate the essential shape information of the object. The experimental results indicated that the proposed method could generate high-quality samples by capturing the data structure. Furthermore, the proposed method significantly outperformed state-of-the-art methods in terms of the log-likelihood or evidence lower bound. The results obtained are expected to stimulate further research in this field, leading to advances for stable shape generation and analysis.
    MIMONets: Multiple-Input-Multiple-Output Neural Networks Exploiting Computation in Superposition. (arXiv:2312.02829v1 [cs.LG])
    With the advent of deep learning, progressively larger neural networks have been designed to solve complex tasks. We take advantage of these capacity-rich models to lower the cost of inference by exploiting computation in superposition. To reduce the computational burden per input, we propose Multiple-Input-Multiple-Output Neural Networks (MIMONets) capable of handling many inputs at once. MIMONets augment various deep neural network architectures with variable binding mechanisms to represent an arbitrary number of inputs in a compositional data structure via fixed-width distributed representations. Accordingly, MIMONets adapt nonlinear neural transformations to process the data structure holistically, leading to a speedup nearly proportional to the number of superposed input items in the data structure. After processing in superposition, an unbinding mechanism recovers each transformed input of interest. MIMONets also provide a dynamic trade-off between accuracy and throughput by an instantaneous on-demand switching between a set of accuracy-throughput operating points, yet within a single set of fixed parameters. We apply the concept of MIMONets to both CNN and Transformer architectures resulting in MIMOConv and MIMOFormer, respectively. Empirical evaluations show that MIMOConv achieves about 2-4 x speedup at an accuracy delta within [+0.68, -3.18]% compared to WideResNet CNNs on CIFAR10 and CIFAR100. Similarly, MIMOFormer can handle 2-4 inputs at once while maintaining a high average accuracy within a [-1.07, -3.43]% delta on the long range arena benchmark. Finally, we provide mathematical bounds on the interference between superposition channels in MIMOFormer. Our code is available at https://github.com/IBM/multiple-input-multiple-output-nets.
    TokenCut: Segmenting Objects in Images and Videos with Self-supervised Transformer and Normalized Cut. (arXiv:2209.00383v3 [cs.CV] UPDATED)
    In this paper, we describe a graph-based algorithm that uses the features obtained by a self-supervised transformer to detect and segment salient objects in images and videos. With this approach, the image patches that compose an image or video are organised into a fully connected graph, where the edge between each pair of patches is labeled with a similarity score between patches using features learned by the transformer. Detection and segmentation of salient objects is then formulated as a graph-cut problem and solved using the classical Normalized Cut algorithm. Despite the simplicity of this approach, it achieves state-of-the-art results on several common image and video detection and segmentation tasks. For unsupervised object discovery, this approach outperforms the competing approaches by a margin of 6.1%, 5.7%, and 2.6%, respectively, when tested with the VOC07, VOC12, and COCO20K datasets. For the unsupervised saliency detection task in images, this method improves the score for Intersection over Union (IoU) by 4.4%, 5.6% and 5.2%. When tested with the ECSSD, DUTS, and DUT-OMRON datasets, respectively, compared to current state-of-the-art techniques. This method also achieves competitive results for unsupervised video object segmentation tasks with the DAVIS, SegTV2, and FBMS datasets.
    Regularization Trade-offs with Fake Features. (arXiv:2212.00433v2 [cs.LG] UPDATED)
    Recent successes of massively overparameterized models have inspired a new line of work investigating the underlying conditions that enable overparameterized models to generalize well. This paper considers a framework where the possibly overparametrized model includes fake features, i.e., features that are present in the model but not in the data. We present a non-asymptotic high-probability bound on the generalization error of the ridge regression problem under the model misspecification of having fake features. Our highprobability results provide insights into the interplay between the implicit regularization provided by the fake features and the explicit regularization provided by the ridge parameter. Numerical results illustrate the trade-off between the number of fake features and how the optimal ridge parameter may heavily depend on the number of fake features.
    On the instrumental variable estimation with many weak and invalid instruments. (arXiv:2207.03035v2 [stat.ME] UPDATED)
    We discuss the fundamental issue of identification in linear instrumental variable (IV) models with unknown IV validity. With the assumption of the "sparsest rule", which is equivalent to the plurality rule but becomes operational in computation algorithms, we investigate and prove the advantages of non-convex penalized approaches over other IV estimators based on two-step selections, in terms of selection consistency and accommodation for individually weak IVs. Furthermore, we propose a surrogate sparsest penalty that aligns with the identification condition and provides oracle sparse structure simultaneously. Desirable theoretical properties are derived for the proposed estimator with weaker IV strength conditions compared to the previous literature. Finite sample properties are demonstrated using simulations and the selection and estimation method is applied to an empirical study concerning the effect of BMI on diastolic blood pressure.  ( 2 min )
    C3: High-performance and low-complexity neural compression from a single image or video. (arXiv:2312.02753v1 [eess.IV])
    Most neural compression models are trained on large datasets of images or videos in order to generalize to unseen data. Such generalization typically requires large and expressive architectures with a high decoding complexity. Here we introduce C3, a neural compression method with strong rate-distortion (RD) performance that instead overfits a small model to each image or video separately. The resulting decoding complexity of C3 can be an order of magnitude lower than neural baselines with similar RD performance. C3 builds on COOL-CHIC (Ladune et al.) and makes several simple and effective improvements for images. We further develop new methodology to apply C3 to videos. On the CLIC2020 image benchmark, we match the RD performance of VTM, the reference implementation of the H.266 codec, with less than 3k MACs/pixel for decoding. On the UVG video benchmark, we match the RD performance of the Video Compression Transformer (Mentzer et al.), a well-established neural video codec, with less than 5k MACs/pixel for decoding.  ( 2 min )
    GIT-Net: Generalized Integral Transform for Operator Learning. (arXiv:2312.02450v1 [stat.ML])
    This article introduces GIT-Net, a deep neural network architecture for approximating Partial Differential Equation (PDE) operators, inspired by integral transform operators. GIT-NET harnesses the fact that differential operators commonly used for defining PDEs can often be represented parsimoniously when expressed in specialized functional bases (e.g., Fourier basis). Unlike rigid integral transforms, GIT-Net parametrizes adaptive generalized integral transforms with deep neural networks. When compared to several recently proposed alternatives, GIT-Net's computational and memory requirements scale gracefully with mesh discretizations, facilitating its application to PDE problems on complex geometries. Numerical experiments demonstrate that GIT-Net is a competitive neural network operator, exhibiting small test errors and low evaluations across a range of PDE problems. This stands in contrast to existing neural network operators, which typically excel in just one of these areas.  ( 2 min )
    A Kernel-Based Neural Network Test for High-dimensional Sequencing Data Analysis. (arXiv:2312.02850v1 [stat.ML])
    The recent development of artificial intelligence (AI) technology, especially the advance of deep neural network (DNN) technology, has revolutionized many fields. While DNN plays a central role in modern AI technology, it has been rarely used in sequencing data analysis due to challenges brought by high-dimensional sequencing data (e.g., overfitting). Moreover, due to the complexity of neural networks and their unknown limiting distributions, building association tests on neural networks for genetic association analysis remains a great challenge. To address these challenges and fill the important gap of using AI in high-dimensional sequencing data analysis, we introduce a new kernel-based neural network (KNN) test for complex association analysis of sequencing data. The test is built on our previously developed KNN framework, which uses random effects to model the overall effects of high-dimensional genetic data and adopts kernel-based neural network structures to model complex genotype-phenotype relationships. Based on KNN, a Wald-type test is then introduced to evaluate the joint association of high-dimensional genetic data with a disease phenotype of interest, considering non-linear and non-additive effects (e.g., interaction effects). Through simulations, we demonstrated that our proposed method attained higher power compared to the sequence kernel association test (SKAT), especially in the presence of non-linear and interaction effects. Finally, we apply the methods to the whole genome sequencing (WGS) dataset from the Alzheimer's Disease Neuroimaging Initiative (ADNI) study, investigating new genes associated with the hippocampal volume change over time.  ( 3 min )
    Class-Discriminative Attention Maps for Vision Transformers. (arXiv:2312.02364v1 [cs.CV])
    Interpretability methods are critical components for examining and exploring deep neural networks (DNN), as well as increasing our understanding of and trust in them. Vision transformers (ViT), which can be trained to state-of-the-art performance with a self-supervised learning (SSL) training method, provide built-in attention maps (AM). While AMs can provide high-quality semantic segmentation of input images, they do not account for any signal coming from a downstream classifier. We introduce class-discriminative attention maps (CDAM), a novel post-hoc explanation method that is highly sensitive to the target class. Our method essentially scales attention scores by how relevant the corresponding tokens are for the predictions of a classifier head. Alternative to classifier outputs, CDAM can also explain a user-defined concept by targeting similarity measures in the latent space of the ViT. This allows for explanations of arbitrary concepts, defined by the user through a few sample images. We investigate the operating characteristics of CDAM in comparison with relevance propagation (RP) and token ablation maps (TAM), an alternative to pixel occlusion methods. CDAM is highly class-discriminative and semantically relevant, while providing implicit regularization of relevance scores. PyTorch implementation: \url{https://github.com/lenbrocki/CDAM} Web live demo: \url{https://cdam.informatism.com/}  ( 2 min )
    Subgradient Regularized Multivariate Convex Regression at Scale. (arXiv:2005.11588v3 [math.OC] UPDATED)
    We present new large-scale algorithms for fitting a subgradient regularized multivariate convex regression function to $n$ samples in $d$ dimensions -- a key problem in shape constrained nonparametric regression with applications in statistics, engineering and the applied sciences. The infinite-dimensional learning task can be expressed via a convex quadratic program (QP) with $O(nd)$ decision variables and $O(n^2)$ constraints. While instances with $n$ in the lower thousands can be addressed with current algorithms within reasonable runtimes, solving larger problems (e.g., $n\approx 10^4$ or $10^5$) is computationally challenging. To this end, we present an active set type algorithm on the dual QP. For computational scalability, we allow for approximate optimization of the reduced sub-problems; and propose randomized augmentation rules for expanding the active set. We derive novel computational guarantees for our algorithms. We demonstrate that our framework can approximately solve instances of the subgradient regularized convex regression problem with $n=10^5$ and $d=10$ within minutes; and shows strong computational performance compared to earlier approaches.  ( 2 min )
    Statistical inference for generative adversarial networks and other minimax problems. (arXiv:2104.10601v2 [math.ST] UPDATED)
    This paper studies generative adversarial networks (GANs) from the perspective of statistical inference. A GAN is a popular machine learning method in which the parameters of two neural networks, a generator and a discriminator, are estimated to solve a particular minimax problem. This minimax problem typically has a multitude of solutions and the focus of this paper are the statistical properties of these solutions. We address two key statistical issues for the generator and discriminator network parameters, consistent estimation and confidence sets. We first show that the set of solutions to the sample GAN problem is a (Hausdorff) consistent estimator of the set of solutions to the corresponding population GAN problem. We then devise a computationally intensive procedure to form confidence sets and show that these sets contain the population GAN solutions with the desired coverage probability. Small numerical experiments and a Monte Carlo study illustrate our results and verify our theoretical findings. We also show that our results apply in general minimax problems that may be non-convex, non-concave, and have multiple solutions.  ( 2 min )
    Detecting algorithmic bias in medical AI-models. (arXiv:2312.02959v1 [stat.ML])
    With the growing prevalence of machine learning and artificial intelligence-based medical decision support systems, it is equally important to ensure that these systems provide patient outcomes in a fair and equitable fashion. This paper presents an innovative framework for detecting areas of algorithmic bias in medical-AI decision support systems. Our approach efficiently identifies potential biases in medical-AI models, specifically in the context of sepsis prediction, by employing the Classification and Regression Trees (CART) algorithm. We verify our methodology by conducting a series of synthetic data experiments, showcasing its ability to estimate areas of bias in controlled settings precisely. The effectiveness of the concept is further validated by experiments using electronic medical records from Grady Memorial Hospital in Atlanta, Georgia. These tests demonstrate the practical implementation of our strategy in a clinical environment, where it can function as a vital instrument for guaranteeing fairness and equity in AI-based medical decisions.  ( 2 min )
    Analyzing and Improving the Training Dynamics of Diffusion Models. (arXiv:2312.02696v1 [cs.CV])
    Diffusion models currently dominate the field of data-driven image synthesis with their unparalleled scaling to large datasets. In this paper, we identify and rectify several causes for uneven and ineffective training in the popular ADM diffusion model architecture, without altering its high-level structure. Observing uncontrolled magnitude changes and imbalances in both the network activations and weights over the course of training, we redesign the network layers to preserve activation, weight, and update magnitudes on expectation. We find that systematic application of this philosophy eliminates the observed drifts and imbalances, resulting in considerably better networks at equal computational complexity. Our modifications improve the previous record FID of 2.41 in ImageNet-512 synthesis to 1.81, achieved using fast deterministic sampling. As an independent contribution, we present a method for setting the exponential moving average (EMA) parameters post-hoc, i.e., after completing the training run. This allows precise tuning of EMA length without the cost of performing several training runs, and reveals its surprising interactions with network architecture, training time, and guidance.  ( 2 min )
    Amortized Bayesian Decision Making for simulation-based models. (arXiv:2312.02674v1 [cs.LG])
    Simulation-based inference (SBI) provides a powerful framework for inferring posterior distributions of stochastic simulators in a wide range of domains. In many settings, however, the posterior distribution is not the end goal itself -- rather, the derived parameter values and their uncertainties are used as a basis for deciding what actions to take. Unfortunately, because posterior distributions provided by SBI are (potentially crude) approximations of the true posterior, the resulting decisions can be suboptimal. Here, we address the question of how to perform Bayesian decision making on stochastic simulators, and how one can circumvent the need to compute an explicit approximation to the posterior. Our method trains a neural network on simulated data and can predict the expected cost given any data and action, and can, thus, be directly used to infer the action with lowest cost. We apply our method to several benchmark problems and demonstrate that it induces similar cost as the true posterior distribution. We then apply the method to infer optimal actions in a real-world simulator in the medical neurosciences, the Bayesian Virtual Epileptic Patient, and demonstrate that it allows to infer actions associated with low cost after few simulations.  ( 2 min )
    A Unified Theory of Diversity in Ensemble Learning. (arXiv:2301.03962v2 [cs.LG] UPDATED)
    We present a theory of ensemble diversity, explaining the nature of diversity for a wide range of supervised learning scenarios. This challenge, of understanding ensemble diversity, has been referred to as the "holy grail" of ensemble learning, an open research issue for over 30 years. Our framework reveals that diversity is in fact a hidden dimension in the bias-variance decomposition of the ensemble loss. We prove a family of exact bias-variance-diversity decompositions, for both regression and classification, e.g., squared, cross-entropy, and Poisson losses. For losses where an additive bias-variance decomposition is not available (e.g., 0/1 loss) we present an alternative approach, which precisely quantifies the effects of diversity, turning out to be dependent on the label distribution. Experiments show how we can use our framework to understand the diversity-encouraging mechanisms of popular methods: Bagging, Boosting, and Random Forests.  ( 2 min )
    Near-Optimal Mean Estimation with Unknown, Heteroskedastic Variances. (arXiv:2312.02417v1 [math.ST])
    Given data drawn from a collection of Gaussian variables with a common mean but different and unknown variances, what is the best algorithm for estimating their common mean? We present an intuitive and efficient algorithm for this task. As different closed-form guarantees can be hard to compare, the Subset-of-Signals model serves as a benchmark for heteroskedastic mean estimation: given $n$ Gaussian variables with an unknown subset of $m$ variables having variance bounded by 1, what is the optimal estimation error as a function of $n$ and $m$? Our algorithm resolves this open question up to logarithmic factors, improving upon the previous best known estimation error by polynomial factors when $m = n^c$ for all $0<c<1$. Of particular note, we obtain error $o(1)$ with $m = \tilde{O}(n^{1/4})$ variance-bounded samples, whereas previous work required $m = \tilde{\Omega}(n^{1/2})$. Finally, we show that in the multi-dimensional setting, even for $d=2$, our techniques enable rates comparable to knowing the variance of each sample.  ( 2 min )
    Pseudo Replay-based Class Continual Learning for Online New Category Anomaly Detection in Additive Manufacturing. (arXiv:2312.02491v1 [cs.LG])
    The incorporation of advanced sensors and machine learning techniques has enabled modern manufacturing enterprises to perform data-driven in-situ quality monitoring based on the sensor data collected in manufacturing processes. However, one critical challenge is that newly presented defect category may manifest as the manufacturing process continues, resulting in monitoring performance deterioration of previously trained machine learning models. Hence, there is an increasing need for empowering machine learning model to learn continually. Among all continual learning methods, memory-based continual learning has the best performance but faces the constraints of data storage capacity. To address this issue, this paper develops a novel pseudo replay-based continual learning by integrating class incremental learning and oversampling-based data generation. Without storing all the data, the developed framework could generate high-quality data representing previous classes to train machine learning model incrementally when new category anomaly occurs. In addition, it could even enhance the monitoring performance since it also effectively improves the data quality. The effectiveness of the proposed framework is validated in an additive manufacturing process, which leverages supervised classification problem for anomaly detection. The experimental results show that the developed method is very promising in detecting novel anomaly while maintaining a good performance on the previous task and brings up more flexibility in model architecture.  ( 2 min )
    Harmonizing Global Voices: Culturally-Aware Models for Enhanced Content Moderation. (arXiv:2312.02401v1 [stat.ML])
    Content moderation at scale faces the challenge of considering local cultural distinctions when assessing content. While global policies aim to maintain decision-making consistency and prevent arbitrary rule enforcement, they often overlook regional variations in interpreting natural language as expressed in content. In this study, we are looking into how moderation systems can tackle this issue by adapting to local comprehension nuances. We train large language models on extensive datasets of media news and articles to create culturally attuned models. The latter aim to capture the nuances of communication across geographies with the goal of recognizing cultural and societal variations in what is considered offensive content. We further explore the capability of these models to generate explanations for instances of content violation, aiming to shed light on how policy guidelines are perceived when cultural and societal contexts change. We find that training on extensive media datasets successfully induced cultural awareness and resulted in improvements in handling content violations on a regional basis. Additionally, these advancements include the ability to provide explanations that align with the specific local norms and nuances as evidenced by the annotators' preference in our conducted study. This multifaceted success reinforces the critical role of an adaptable content moderation approach in keeping pace with the ever-evolving nature of the content it oversees.  ( 2 min )
    Learning a Sparse Representation of Barron Functions with the Inverse Scale Space Flow. (arXiv:2312.02671v1 [stat.ML])
    This paper presents a method for finding a sparse representation of Barron functions. Specifically, given an $L^2$ function $f$, the inverse scale space flow is used to find a sparse measure $\mu$ minimising the $L^2$ loss between the Barron function associated to the measure $\mu$ and the function $f$. The convergence properties of this method are analysed in an ideal setting and in the cases of measurement noise and sampling bias. In an ideal setting the objective decreases strictly monotone in time to a minimizer with $\mathcal{O}(1/t)$, and in the case of measurement noise or sampling bias the optimum is achieved up to a multiplicative or additive constant. This convergence is preserved on discretization of the parameter space, and the minimizers on increasingly fine discretizations converge to the optimum on the full parameter space.  ( 2 min )
    Calibrating dimension reduction hyperparameters in the presence of noise. (arXiv:2312.02946v1 [stat.AP])
    The goal of dimension reduction tools is to construct a low-dimensional representation of high-dimensional data. These tools are employed for a variety of reasons such as noise reduction, visualization, and to lower computational costs. However, there is a fundamental issue that is highly discussed in other modeling problems, but almost entirely ignored in the dimension reduction literature: overfitting. If we interpret data as a combination of signal and noise, prior works judge dimension reduction techniques on their ability to capture the entirety of the data, i.e. both the signal and the noise. In the context of other modeling problems, techniques such as feature-selection, cross-validation, and regularization are employed to combat overfitting, but no such precautions are taken when performing dimension reduction. In this paper, we present a framework that models dimension reduction problems in the presence of noise and use this framework to explore the role perplexity and number of neighbors play in overfitting data when applying t-SNE and UMAP. We also present a workflow others may use to calibrate perplexity or number of neighbors in the presence of noise.  ( 2 min )
    Physics-informed neural networks with unknown measurement noise. (arXiv:2211.15498v4 [stat.ML] UPDATED)
    Physics-informed neural networks (PINNs) constitute a flexible approach to both finding solutions and identifying parameters of partial differential equations. Most works on the topic assume noiseless data, or data contaminated with weak Gaussian noise. We show that the standard PINN framework breaks down in case of non-Gaussian noise. We give a way of resolving this fundamental issue and we propose to jointly train an energy-based model (EBM) to learn the correct noise distribution. We illustrate the improved performance of our approach using multiple examples.  ( 2 min )
    Convergence Rates for Stochastic Approximation: Biased Noise with Unbounded Variance, and Applications. (arXiv:2312.02828v1 [stat.ML])
    The Stochastic Approximation (SA) algorithm introduced by Robbins and Monro in 1951 has been a standard method for solving equations of the form $\mathbf{f}({\boldsymbol {\theta}}) = \mathbf{0}$, when only noisy measurements of $\mathbf{f}(\cdot)$ are available. If $\mathbf{f}({\boldsymbol {\theta}}) = \nabla J({\boldsymbol {\theta}})$ for some function $J(\cdot)$, then SA can also be used to find a stationary point of $J(\cdot)$. In much of the literature, it is assumed that the error term ${\boldsymbol {xi}}_{t+1}$ has zero conditional mean, and that its conditional variance is bounded as a function of $t$ (though not necessarily with respect to ${\boldsymbol {\theta}}_t$). Also, for the most part, the emphasis has been on ``synchronous'' SA, whereby, at each time $t$, \textit{every} component of ${\boldsymbol {\theta}}_t$ is updated. Over the years, SA has been applied to a variety of areas, out of which two are the focus in this paper: Convex and nonconvex optimization, and Reinforcement Learning (RL). As it turns out, in these applications, the above-mentioned assumptions do not always hold. In zero-order methods, the error neither has zero mean nor bounded conditional variance. In the present paper, we extend SA theory to encompass errors with nonzero conditional mean and/or unbounded conditional variance, and also asynchronous SA. In addition, we derive estimates for the rate of convergence of the algorithm. Then we apply the new results to problems in nonconvex optimization, and to Markovian SA, a recently emerging area in RL. We prove that SA converges in these situations, and compute the ``optimal step size sequences'' to maximize the estimated rate of convergence.  ( 3 min )
    The SVHN Dataset Is Deceptive for Probabilistic Generative Models Due to a Distribution Mismatch. (arXiv:2312.02168v1 [cs.CV])
    The Street View House Numbers (SVHN) dataset is a popular benchmark dataset in deep learning. Originally designed for digit classification tasks, the SVHN dataset has been widely used as a benchmark for various other tasks including generative modeling. However, with this work, we aim to warn the community about an issue of the SVHN dataset as a benchmark for generative modeling tasks: we discover that the official split into training set and test set of the SVHN dataset are not drawn from the same distribution. We empirically show that this distribution mismatch has little impact on the classification task (which may explain why this issue has not been detected before), but it severely affects the evaluation of probabilistic generative models, such as Variational Autoencoders and diffusion models. As a workaround, we propose to mix and re-split the official training and test set when SVHN is used for tasks other than classification. We publish a new split and the indices we used to create it at https://jzenn.github.io/svhn-remix/ .  ( 2 min )
    Conditional Variational Diffusion Models. (arXiv:2312.02246v1 [cs.CV])
    Inverse problems aim to determine parameters from observations, a crucial task in engineering and science. Lately, generative models, especially diffusion models, have gained popularity in this area for their ability to produce realistic solutions and their good mathematical properties. Despite their success, an important drawback of diffusion models is their sensitivity to the choice of variance schedule, which controls the dynamics of the diffusion process. Fine-tuning this schedule for specific applications is crucial but time-costly and does not guarantee an optimal result. We propose a novel approach for learning the schedule as part of the training process. Our method supports probabilistic conditioning on data, provides high-quality solutions, and is flexible, proving able to adapt to different applications with minimum overhead. This approach is tested in two unrelated inverse problems: super-resolution microscopy and quantitative phase imaging, yielding comparable or superior results to previous methods and fine-tuned diffusion models. We conclude that fine-tuning the schedule by experimentation should be avoided because it can be learned during training in a stable way that yields better results.  ( 2 min )

  • Open

    f(g(x)) versus g(f(x))
    I stumbled upon a theorem today that I feel like I’ve needed in the past, though I can’t remember any particular applications. I’m writing it up here as a note to my future self should the need reappear. The theorem gives sufficient conditions to conclude f(g(x)) ≤ g(f(x)) and uses this to prove, for example, […] f(g(x)) versus g(f(x)) first appeared on John D. Cook.  ( 5 min )
  • Open

    [P] Plexiglass: a toolbox for testing against adversarial attacks in DNNs and LLMs.
    Plexiglass: a toolbox for testing against adversarial attacks in DNNs and LLMs. Hi everyone, my name is Enoch and I am a researcher studying deep generative models. I've started this project called Plexiglass a while back, which started off as a torch toolbox for adversarial research in DCNNs. I am now rebooting it as a toolbox for testing against adversarial attacks in both DNNs and LLMs. Idea is to test your DCNNs against adversarial attacks such as fast gradient sign method and toxic prompts in LLMs. I would very much appreciate contributions, I need more devs as I'm too busy to do this all by myself 🙏. Repo is here: https://github.com/kortex-labs/plexiglass submitted by /u/kanxx030 [link] [comments]  ( 9 min )
    [P] Silly project: implement MLP using a transformer (yo, dawg...)
    Transformers heavily rely on MLPs. Presumably, facts which LLMs can recall are stored in MLPs. (E.g. see ROME paper.) These MLPs are huge. E.g. consider GPT-Neo-350M. Each MLP has 1024-element input and output layers and 4096-element inner layer. This requires 2x1024x4096 = 4.2M weights. Whole model has 24 of them (one per layer), resulting in 100M weights being used for MLPs. And yet these MLPs basically just do 4096 dot products to calculate features. Millions of weights just to calculate 4096 dot products and linearly transform inputs/outputs, that seems excessive. Number of weights in MLP is a square of the number of input elements because each inner element has to be directly connected to each input and output element. Each-to-each connections are expensive. Can we do better? Well…  ( 11 min )
    [P] LLM APIs cost management on a user-level
    Hi all, I am very frustrated by the fact that it's not easy to build and maintain a system to track LLM API costs for each user individually, so I know how much to charge each user without having to tell them to BYOK (bring your own key). Is this something that troubles the general LLM-dev community? How do you solve it? We have started to make a product based on our early attempts that would solve this exact problem (https://llmetrics.app) but we are wondering whether there are any good ways that you solve this or if this has been an issue in general? Any feedback is greatly appreciated submitted by /u/Much-Whole-8611 [link] [comments]  ( 9 min )
    [D] Is this a reasonable or unreasonable use of shuffling?
    I am not sharing my opinion on the approach in order to prevent bias in people’s responses. Having this discussion amongst friends and am interested in gathering more opinions and don’t want to steer it in any direction. The core of the discussion is if this approach, as you are shuffling the data from all patients before CV, would it introduce an unreasonable amount of bias/optimisitc results if the data in the training and testing folds may include windows (although distinct from each other) from the same patient? Use-case: Assessing the condition of a subject (binary classification) using features derived from a window worth of continuous data. Data: The data from which the features are derived is a continuous source of data collected for 10 hours from 20 distinct subjects. The data from which the label is defined has the same sampling rate as the data being used to make the prediction. They are collected synchronously. The feature set is comprised of let’s say > 20 features. There is a label associated with each window of data. The data classes are balanced. Proposed evaluation approach: Shuffle the data of all the patients Perform nested cross validation Very interested in people’s opinion! ✌🏼✌🏼 submitted by /u/ConfusedLayer1 [link] [comments]  ( 10 min )
    [D] Switching my VAE's latent distribution from Normal to Categorical, but I need a way to replace the Gaussian multiplication operation
    So I have been reimplementing a VAE that has a recurrent latent distribution. After each new input, the latent distribution is updated using Gaussian multiplication of the encoder's output and the current latent distribution. The model doesn't learn a good representation and I know that recent Dreamer models switched from continuous to categorical latent states which improved performance, so I want to give this a shot. But, there's no Gaussian multiplication equivalent for categorical distributions. I have decided to take a weighted average of the new and old distribution parameters, where the weighting is determined by the inverse entropy of the distributions (so they are weighted more or less according to a proxy of "uncertainty"). I'm aware that this is simply a heuristic though and doesn't have any solid mathematical reasoning to back it up. Does anyone know a better way to achieve this, or has anyone encountered a similar situation? submitted by /u/SmeatSmeamen [link] [comments]  ( 10 min )
    [D] Facesearch Tool
    Extremely surprised to see how facesearchaibot over telegram is rocking literally now we need to be careful about posting our images, comments and post. You guys must try it as it will give all the online occurrences of anyone. Link submitted by /u/Cute_Principle9713 [link] [comments]  ( 9 min )
    [P] Spin-Model Transformers
    A non-equilibrium statistical mechanics perspective on transformers. We present a class of transformers based on mean-field dynamics of vector-spin models. Our framework supports asymmetric couplings and yields residual, attention, and feed-forward terms. Post: https://mcbal.github.io/post/spin-model-transformers Code (JAX): https://github.com/mcbal/spin-model-transformers submitted by /u/mcbal2666 [link] [comments]  ( 9 min )
    Can I do a DPO training on a synthetic dataset? [D]
    I am currently following the fine-tuning methods for the Hugging Face model Zephyr 7B. They have implemented two fine-tuning methods, namely SFT and DPO, on a public dataset. Currently, I am fine-tuning a 7B model using SFT, which is progressing well. However, I have a question regarding whether it is acceptable to fine-tune the model on a DPO dataset generated synthetically by GPT-3.5. From my understanding, DPO should be trained on the answers produced by the same model. I want to confirm this and inquire if anyone has attempted such fine-tuning before. submitted by /u/MustafaAlahmid [link] [comments]  ( 9 min )
    [D] Recommendation for LLM fine-tuning codebase
    I have some ideas for a new fine-tuning technique and want to compare it against LoRA. Which tools or codebase do you think I should use? A quick search seems to indicate that Hugging Face is the way to go, but I wonder if there are better alternatives (if possible please give the pros and cons). Thank you in advance for any suggestions! submitted by /u/netw0rkf10w [link] [comments]  ( 9 min )
    [D] Best AI ML DL DS Roadmap
    Hi! What is the best complete roadmap for AI, ML, DL, and Data Science? Some roadmaps I have found: - [roadmap.sh] AI and Data Scientist Roadmap ← Best? - [i.am.ai] AI Expert Roadmap - [github.com] mrdbourke/machine-learning-roadmap - [github.com] luspr/awesome-ml-courses - [rentry.org] Machine Learning Roadmap Which one should I choose? I am not a beginner in programming (8y as a hobby and 3y working), but it was not related to AI. submitted by /u/Rashimban [link] [comments]  ( 9 min )
    [D] Alignment using tests? Will it work?
    I came across this open source library a few times across Reddit + HN and something that piqued my interest was their concept around "test-driven alignment". Kinda like test-driven development where you simply just write a few tests on the intended behavior and it'll fine-tune the model. Conceptually, seems simpler than supervised fine-tuning and RLHF methods that I've had to resort to. Tried it out and seemed to work well, at least for synthetic data and classifiers. Skeptical for larger use-cases. Anyone else seen this before or used this approach? It itches the part of my brain that craves elegance but don't know in-practice. submitted by /u/utilitymro [link] [comments]  ( 9 min )
    [P] Tesla K80 Performance for Machine Learning
    I have an old server that I'd like to convert to a machine learning testing platform. It has (2) Xeon E5-2670 CPUs for a total of 16 cores, 128Gb ram, and (2) PCIe x 16 slots. I bought a pair of Nvidia Tesla K80s but I am wondering if I would get better performance adding another pair of K80s since the prices are so good right now. The issue I have is that my motherboard only has (2) PCIe x16 slots. I was thinking that I could install (2) PCIe bifurcation cards, and add extension cables to plug-in all (4) K80s. If I understand correctly that would give each K80 8x8 PCIe lanes which would be further dropped to 4x4 for each GPU. Would the performance hit of the reduced lanes outweigh the additional GPUs? submitted by /u/NorthernTechno [link] [comments]  ( 9 min )
    [R] Google releases the Gemini family of frontier models
    Tweet from Jeff Dean: https://twitter.com/JeffDean/status/1732415515673727286 Blog post: https://blog.google/technology/ai/google-gemini-ai/ Tech report: https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf Any thoughts? There is not much "meat" in this announcement! They must be worried about other labs + open source learning from this. submitted by /u/blabboy [link] [comments]  ( 9 min )
    [D] Convert a ML model into a rule based system
    Would it be possible to convert a ML model into a rule based system? The assumption is ML model acts like an approximation of a function. The benefit of such conversion includes the increased level of explainability. Any pointers or thoughts? Many thanks! submitted by /u/Relevant-Fortune-810 [link] [comments]  ( 9 min )
    [D] Why are decoder only models used for autoregressive generation instead of encoder-only models? What value is the causal mask if the new token doesn't exist yet?
    Why even bother with the causal mask, it seems like it is just destroying useful information transfer? I get that if you trained with it you can't easily remove it during inference without reducing quality, but why not just train encoders to guess the next token at that point? submitted by /u/30299578815310 [link] [comments]  ( 9 min )
    [R] Fantastic Gains and Where to Find Them: On the Existence and Prospect of General Knowledge Transfer between Any Pretrained Model
    arXiv: https://arxiv.org/abs/2310.17653 OpenReview: https://openreview.net/forum?id=m50eKHCttz Abstract: Training deep networks requires various design decisions regarding for instance their architecture, data augmentation, or optimization. In this work, we find these training variations to result in networks learning unique feature sets from the data. Using public model libraries comprising thousands of models trained on canonical datasets like ImageNet, we observe that for arbitrary pairings of pretrained models, one model extracts significant data context unavailable in the other -- independent of overall performance. Given any arbitrary pairing of pretrained models and no external rankings (such as separate test sets, e.g. due to data privacy), we investigate if it is possible to transfer such "complementary" knowledge from one model to another without performance degradation -- a task made particularly difficult as additional knowledge can be contained in stronger, equiperformant or weaker models. Yet facilitating robust transfer in scenarios agnostic to pretrained model pairings would unlock auxiliary gains and knowledge fusion from any model repository without restrictions on model and problem specifics - including from weaker, lower-performance models. This work therefore provides an initial, in-depth exploration on the viability of such general-purpose knowledge transfer. Across large-scale experiments, we first reveal the shortcomings of standard knowledge distillation techniques, and then propose a much more general extension through data partitioning for successful transfer between nearly all pretrained models, which we show can also be done unsupervised. Finally, we assess both the scalability and impact of fundamental model properties on successful model-agnostic knowledge transfer. submitted by /u/APaperADay [link] [comments]  ( 10 min )
    [D] Why is most Open Source AI happening outside the USA?
    I don't think that's too hyperbolic. USA has Meta, and little else, especially compared to what's coming from outside. And I don't mean this to dunk on USA, I'm truly just shocked. China: Yuan, Qwen, DeepSeek, Yi, XVERSE, Aquila, RWKV, all top-tier LLMs today. A dozen great (top tier) multi-modals. The top tier embedding models. UK: Stability France: Mistral Finland (?): Lumi Poro UAE: Falcon Russia: Kandinsky USA: Meta has Llama and some other nice things, Salesforce had some things, "OpenAI" gave us whisper. Is it the regulatory environment? Is it an investment issue? Is it a GPU shortage? Did big-tech scoop up and lock down all US talent? Does anyone have any leads? EDIT: Woh, out of the gates I'm getting ratioed for asking. All the models mentioned here are pretrained models. Yes, US communities can finetune, that misses my point though. The expensive OSS work is happening outside the US. And if you say "that's not OSS", ok, it's still mostly happening outside USA. Again, not dunking on USA, just asking a question, and would love to see more US companies on the list of OSS contributors. submitted by /u/BayesMind [link] [comments]  ( 10 min )
    Apple Releases 'MLX' - ML Framework for Apple Silicon [N]
    Apple's ML Team has just released 'MLX' on GitHub. Their ML framework for Apple Silicon. https://github.com/ml-explore/mlx A realistic alternative to CUDA? MPS is already incredibly efficient... this could make it interesting if we see adoption. ​ submitted by /u/LoadingALIAS [link] [comments]  ( 9 min )
    New to AGI research! [D][R]
    The last few months I have been reflecting on the development of Artificial General Intelligence (AGI). With ChatGPT-4 it seems so close that it is deceptive. This has me wondering: if it looks so promising... is it really the right direction to go? ​ I think this is supported for several reasons; not only for the technological ultra-enthusiasm of having a virtual assistant that programs everything I tell it at an impressive speed ​ First of all, we have created a tool capable of producing and executing code very effectively. In fact, it is almost humbling to see its effectiveness. Please refrain from commenting for those who have not used the paid version, sorry, it is a disappointment, we have to wait for the open source version; and those who are going to tell me that I don't unders…  ( 11 min )
    [D] Advice needed for ML thesis
    Hi everyone, I'm a master's student in AI and I'm really struggling finding a topic for my master's thesis... my interests mostly lie within ML/DL but I'm not exactly sure what topic I should focus on. In my mind, there are two approaches: First approach is taking a dataset and using DL to solve a specific problem. Now the issue is that (at least in my mind) all problems look the same and you just gotta trial and error your data on different models till you get a proper accuracy. I'm not really sure how this is considered scientific work. Second approach would be working on a specific theoretical field, as in developing a new kind of algorithm, or modifying already existing ones. In both cases, I'm not sure what topic I should select and how to find a topic. I keep thinking that I'm supposed to do something semi-great as a thesis. I should mention I intend to publish papers and eventually get a PhD. Any guidance would be greatly appreciated! submitted by /u/labianconeri [link] [comments]  ( 10 min )
    [D] CS + Maths vs CS+ Stats vs CS + Data Science
    I'm a current freshman at university, I'm thinking of taking up another major, so in the case I eventually do decide to get into industry/research in machine learning, which do you think would be best option? I'm leaning more towards CS + Stats/Math as opposed to data science. I love learning about both, Statistics and Maths. submitted by /u/Equal_Job_7732 [link] [comments]  ( 9 min )
  • Open

    This AI Alliance involving Meta, IBM, and other companies/institutions is nothing more than a bunch of defeated Losers.
    They have lost the AI race to the Top players! That is why they are pivoting to Open Source , to gain a competitive advantage not due to any sympathy of heart. These companies would die to be in the position Microsoft, google, and Amazon are right now with their closed source models. The telltale sign is too look at hardware players. Nividia hasn't joined because theirs making money hand over fist while AMD has to because Nvidia is outcompeting them. HAHA I guess they don't like monopolies when it's not themselves. HA Get Rekt. submitted by /u/Major_Fishing6888 [link] [comments]  ( 1 min )
    Google wants to show how safe and unbiased their Gemini model is by making sure to exclude any white people and assuring a male:female ratio of 1:3 in their promo
    submitted by /u/mysterious-progres [link] [comments]  ( 1 min )
    Wikipedia's most-viewed articles in 2023 reflect new-found obsession with AI
    The most-viewed articles on English Wikipedia in 2023 include ChatGPT, the annual list of deaths, and the 2023 Cricket World Cup. ChatGPT, an AI chatbot developed by OpenAI, has gained significant popularity and has been widely adopted in various fields such as education, healthcare, law, and religion. Its Wikipedia page documents the debates surrounding the promise and potential dangers of generative AI. The annual list of deaths, a perennially popular article, ranked second in terms of views. Individual entries for notable figures who passed away, such as Matthew Perry and Lisa Marie Presley, also garnered significant interest. The 2023 Cricket World Cup, along with other cricket-related entries, made the top 25 list for the first time since 2015. Other popular articles include Taylor Swift, Cristiano Ronaldo, and the Russian invasion of Ukraine. Source: https://fortune.com/2023/12/05/wikipedia-top-articles-page-views-2023/ submitted by /u/NuseAI [link] [comments]  ( 1 min )
    Google's Gemini Breaks New Ground in Seamless Multimodal Reasoning
    Google's recent unveiling of Gemini is sure to take the internet by storm! It felt as though all hope was lost with the recent announcement of a delay until January of 2024, however on December 6th, Google had blessed us with quite the news… They posted multiple videos and blogposts showing the new AI models capabilities, I will provide you with key information from those mediums. After viewing most of them I’m quite impressed with Gemini and would like to share as to why! Firstly, and most notably, Google had compared the Gemini model to GPT4-v and claims it outperforms GPT-4-Vision in 30 out of 32 benchmarking tests, including multi-discipline reasoning problems (MMLU Benchmark), image & document understanding, code generation, etc. While benchmarks have limits, this early success in …  ( 1 min )
    gemini is better than chatgpt-4 on sixteen different benchmarks
    Factual accuracy: Up to 20% improvement Reasoning and problem-solving: Up to 30% improvement Creativity and expressive language: Up to 15% improvement Safety and ethics: Up to 10% improvement Multimodal learning: Up to 25% improvement Zero-shot learning: Up to 35% improvement Few-shot learning: Up to 40% improvement Language modeling: Up to 15% improvement Machine translation: Up to 20% improvement Text summarization: Up to 18% improvement Personalization: Up to 22% improvement Accessibility: Up to 25% improvement Explainability: Up to 17% improvement Speed: Up to 28% improvement Scalability: Up to 33% improvement Energy efficiency: Up to 21% improvement submitted by /u/Georgeo57 [link] [comments]  ( 1 min )
    DEMO: Google's new multi-modal AI called "Gemini"
    submitted by /u/redatrsuper [link] [comments]  ( 1 min )
    Google Gemini is finally out!
    submitted by /u/lemmeupvoteyou [link] [comments]  ( 1 min )
    Anyone know of a simple character generator using Langchain and OpenAI?
    I am looking to build a simple character generator. I know part of character generation is summarization of previous context, and part is prompt engineering to get it to respond in the style of a character. Anyone know of a lightweight project that does this? I have seen OpenCharacter but the source code is like a 1000-line HTML so hard to parse what's going on to reverse engineer. submitted by /u/t3cblaze [link] [comments]  ( 1 min )
    Google launches Gemini
    https://deepmind.google/technologies/gemini/#capabilities Benchmarks: https://imgur.com/DWNQcaY (Table 2 on Page 7) - Gemini Pro (the launched model) is worse than ChatGPT4, but a bit better than GPT3.5. All the examples are for Ultra (actual state of the art outperforming GPT4), which won't be available until 2024. Promo video: https://www.youtube.com/watch?v=UIZAiXYceBI (& see other videos on that channel for more) Technical paper: https://goo.gle/GeminiPaper Some details (source): 32k context length efficient attention mechanisms (for e.g. multi-query attention (Shazeer, 2019)) audio input via Universal Speech Model (USM) (Zhang et al., 2023) features no audio output? (Figure 2) visual encoding of Gemini models is inspired by our own foundational work on Flamingo (Alayrac et al., 2022), CoCa (Yu et al., 2022a), and PaLI (Chen et al., 2022) output images using discrete image tokens (Ramesh et al., 2021; Yu et al., 2022b) supervised fine tuning (SFT) and reinforcement learning through human feedback (RLHF) submitted by /u/becausecurious [link] [comments]  ( 1 min )
    Google’s Bard chatbot is getting way better thanks to Gemini
    submitted by /u/lemmeupvoteyou [link] [comments]  ( 1 min )
    Introducing Gemini: our largest and most capable AI model
    submitted by /u/wyem [link] [comments]  ( 1 min )
    What were the hottest AI product launches of 2023?
    Wondering if we can create a list of the AI products (including wrapper products) that created the most buzz this year? What do I mean by hottest? I guess I mean the products you kept hearing about and seeing discussed everywhere. submitted by /u/guestoboard [link] [comments]  ( 1 min )
    I need Help Finding AI Image Location Services Any Suggestions Besides ChatGPT 4 or Picarta.ai?
    I’ve been exploring AI tools for image geolocation and have tried ChatGPT 4 and Picarta.ai. , but I’m curious if there are other effective options out there. Could anyone suggest alternative platforms or services that specialize in image location identification? Your insights and recommendations would be immensely valuable. Thanks in advance! submitted by /u/Quiet-Ad2359 [link] [comments]  ( 1 min )
    Excuse me, what year is it now?
    submitted by /u/Sea_Permit5660 [link] [comments]  ( 1 min )
    One-Minute Daily AI News 12/5/2023
    A team of AI researchers at Google’s DeepMind project have developed a type of AI system that is able to demonstrate social learning capabilities.[1] Europe is doubling down on generative AI (GenAI) investment but is on a more cautious path than North America, according to new research from the Infosys Knowledge Institute (IKI), the research arm of Infosys.[2] Visual Electric launches an AI-powered image generator with a designer workflow focus.[3] X.AI, Elon Musk’s artificial intelligence startup, has filed with the SEC to raise up to $1 billion in an equity offering.[4] Sources: [1] https://techxplore.com/news/2023-12-deepmind-ai-social-capabilities.html [2] https://www.fortuneindia.com/enterprise/european-firms-to-double-gen-ai-spending-in-2024-infosys/114981 [3] https://techcrunch.com/2023/12/05/visual-electric-launches-an-ai-powered-image-generator-with-a-designer-workflow-focus/ [4] https://www.cnbc.com/2023/12/05/elon-musks-ai-startup-xai-files-to-raise-1-billion-.html submitted by /u/Excellent-Target-847 [link] [comments]  ( 1 min )
    Video Dubbing & Translation with Speaker Detection
    submitted by /u/Lemons_for_Sale [link] [comments]  ( 1 min )
    AI is the great equalizer: Studies show that the rise of tools like ChatGPT is good news for employees who suck at their jobs
    submitted by /u/thisisinsider [link] [comments]  ( 1 min )
  • Open

    How far is the solution found by reinforcement learning from the optimal solution?
    For example, in a completely solved game like checkers (where it's certain to end in a draw with perfect play from both sides), has there been any research comparing deep learning with the optimal strategy? Can deep reinforcement learning approaches like AlphaZero achieve a 99% probability of drawing with the optimal strategy in a limited training time? submitted by /u/ZealousidealRub8250 [link] [comments]  ( 1 min )
    Program Induction
    Can somebody explain how the Program Induction methodology work on meta-learning machine learning models submitted by /u/NoShoe6803 [link] [comments]  ( 1 min )
    Monte Carlo Tree Search
    Can MCTS be used for multi-agent problems? submitted by /u/Meta_Sage_247 [link] [comments]  ( 1 min )
  • Open

    Visual AI Takes Flight at Canada’s Largest, Busiest Airport
    Toronto Pearson International Airport, in Ontario, Canada, is the country’s largest and busiest airport, serving some 50 million passengers each year. To enhance traveler experiences, the airport in June deployed the Zensors AI platform, which uses anonymized footage from existing security cameras to generate spatial data that helps optimize operations in real time. A member Read article >  ( 7 min )
    17 Predictions for 2024: From RAG to Riches to Beatlemania and National Treasures
    Move over, Merriam-Webster: Enterprises this year found plenty of candidates to add for word of the year. “Generative AI” and “generative pretrained transformer” were followed by terms such as “large language models” and “retrieval-augmented generation” (RAG) as whole industries turned their attention to transformative new technologies. Generative AI started the year as a blip on Read article >  ( 17 min )
    AV 2.0, the Next Big Wayve in Self-Driving Cars
    A new era of autonomous vehicle technology, known as AV 2.0, has emerged, marked by large, unified AI models that can control multiple parts of the vehicle stack, from perception and planning to control. Wayve, a London-based autonomous driving technology company, is leading the surf. In the latest episode of NVIDIA’s AI Podcast, host Katie Read article >  ( 6 min )
  • Open

    Mitigate hallucinations through Retrieval Augmented Generation using Pinecone vector database & Llama-2 from Amazon SageMaker JumpStart
    Despite the seemingly unstoppable adoption of LLMs across industries, they are one component of a broader technology ecosystem that is powering the new AI wave. Many conversational AI use cases require LLMs like Llama 2, Flan T5, and Bloom to respond to user queries. These models rely on parametric knowledge to answer questions. The model […]  ( 11 min )
    Techniques for automatic summarization of documents using language models
    Summarization is the technique of condensing sizable information into a compact and meaningful form, and stands as a cornerstone of efficient communication in our information-rich age. In a world full of data, summarizing long texts into brief summaries saves time and helps make informed decisions. Summarization condenses content, saving time and improving clarity by presenting […]  ( 13 min )
    Boosting RAG-based intelligent document assistants using entity extraction, SQL querying, and agents with Amazon Bedrock
    Conversational AI has come a long way in recent years thanks to the rapid developments in generative AI, especially the performance improvements of large language models (LLMs) introduced by training techniques such as instruction fine-tuning and reinforcement learning from human feedback. When prompted correctly, these models can carry coherent conversations without any task-specific training data. […]  ( 18 min )
    How Q4 Inc. used Amazon Bedrock, RAG, and SQLDatabaseChain to address numerical and structured dataset challenges building their Q&A chatbot
    This post is co-written with Stanislav Yeshchenko from Q4 Inc. Enterprises turn to Retrieval Augmented Generation (RAG) as a mainstream approach to building Q&A chatbots. We continue to see emerging challenges stemming from the nature of the assortment of datasets available. These datasets are often a mix of numerical and text data, at times structured, […]  ( 18 min )
  • Open

    Abstracts: December 6, 2023
    "Abstracts”—your source for world-class research in brief—welcomes Senior Principal Research Manager Xing Xie to the podcast series to discuss his paper on evaluating general-purpose AI with psychometrics. The post Abstracts: December 6, 2023 appeared first on Microsoft Research.  ( 13 min )
    Microsoft at ESEC/FSE 2023: AI techniques for a streamlined coding workflow
    Explore the latest AI innovations aiming to advance the software development lifecycle. AdaptivePaste adapts and refines pasted code snippets in an IDE. InferFix automates bug detection and repair. Discover how. The post Microsoft at ESEC/FSE 2023: AI techniques for a streamlined coding workflow appeared first on Microsoft Research.  ( 10 min )
    Research Focus: Week of December 4, 2023
    Research Focus: Using LLMs in a Rust-based formal verification framework; Rethinking network measurements with user feedback; 3D telemedicine using HoloportationTM communication technology could enhance overseas surgical visits. The post Research Focus: Week of December 4, 2023 appeared first on Microsoft Research.  ( 9 min )
  • Open

    Eric Evans to step down as director of MIT Lincoln Laboratory
    During 18 years of leadership, Evans established new R&D mission areas, strengthened ties to the MIT community, and increased inclusion and education efforts.  ( 11 min )
  • Open

    Program Induction
    Can somebody explain to me how Program Induction methodology works on meta-learning machine learning models submitted by /u/NoShoe6803 [link] [comments]  ( 1 min )
    New into AGI research
    The last few months I have been reflecting on the development of Artificial General Intelligence (AGI). With ChatGPT-4 it seems so close that it is deceptive. This has me wondering: if it looks so promising... is it really the right direction to go? ​ I think this is supported for several reasons; not only for the technological ultra-enthusiasm of having a virtual mini-slave that programs everything I tell it at an impressive speed (don't get me wrong, it's not that I support slavery, but I'm telling you that the LLM has a great time ). ​ First of all, we have created a tool capable of producing and executing code very effectively. In fact, it is almost humbling to see its effectiveness. Please refrain from commenting for those who have not used the paid version, sorry, it is a disappo…  ( 1 min )
  • Open

    Google at EMNLP 2023
    Posted by Malaya Jules, Program Manager, Google Google is proud to be a Diamond Sponsor of Empirical Methods in Natural Language Processing (EMNLP 2023), a premier annual conference, which is being held this week in Sentosa, Singapore. Google has a strong presence at this year’s conference with over 65 accepted papers and active involvement in 11 workshops and tutorials. Google is also happy to be a Major Sponsor for the Widening NLP workshop (WiNLP), which aims to highlight global representations of people, perspectives, and cultures in AI and ML. We look forward to sharing some of our extensive NLP research and expanding our partnership with the broader research community. We hope you’ll visit the Google booth to chat with researchers who are actively pursuing the latest innovati…  ( 96 min )
  • Open

    Regret Optimality of GP-UCB. (arXiv:2312.01386v1 [cs.LG])
    Gaussian Process Upper Confidence Bound (GP-UCB) is one of the most popular methods for optimizing black-box functions with noisy observations, due to its simple structure and superior performance. Its empirical successes lead to a natural, yet unresolved question: Is GP-UCB regret optimal? In this paper, we offer the first generally affirmative answer to this important open question in the Bayesian optimization literature. We establish new upper bounds on both the simple and cumulative regret of GP-UCB when the objective function to optimize admits certain smoothness property. These upper bounds match the known minimax lower bounds (up to logarithmic factors independent of the feasible region's dimensionality) for optimizing functions with the same smoothness. Intriguingly, our findings indicate that, with the same level of exploration, GP-UCB can simultaneously achieve optimality in both simple and cumulative regret. The crux of our analysis hinges on a refined uniform error bound for online estimation of functions in reproducing kernel Hilbert spaces. This error bound, which we derive from empirical process theory, is of independent interest, and its potential applications may reach beyond the scope of this study.  ( 2 min )
    Uncertainty-biased molecular dynamics for learning uniformly accurate interatomic potentials. (arXiv:2312.01416v1 [physics.comp-ph])
    Efficiently creating a concise but comprehensive data set for training machine-learned interatomic potentials (MLIPs) is an under-explored problem. Active learning (AL), which uses either biased or unbiased molecular dynamics (MD) simulations to generate candidate pools, aims to address this objective. Existing biased and unbiased MD simulations, however, are prone to miss either rare events or extrapolative regions -- areas of the configurational space where unreliable predictions are made. Simultaneously exploring both regions is necessary for developing uniformly accurate MLIPs. In this work, we demonstrate that MD simulations, when biased by the MLIP's energy uncertainty, effectively capture extrapolative regions and rare events without the need to know \textit{a priori} the system's transition temperatures and pressures. Exploiting automatic differentiation, we enhance bias-forces-driven MD simulations by introducing the concept of bias stress. We also employ calibrated ensemble-free uncertainties derived from sketched gradient features to yield MLIPs with similar or better accuracy than ensemble-based uncertainty methods at a lower computational cost. We use the proposed uncertainty-driven AL approach to develop MLIPs for two benchmark systems: alanine dipeptide and MIL-53(Al). Compared to MLIPs trained with conventional MD simulations, MLIPs trained with the proposed data-generation method more accurately represent the relevant configurational space for both atomic systems.  ( 2 min )
    Nash Learning from Human Feedback. (arXiv:2312.00886v1 [stat.ML])
    Reinforcement learning from human feedback (RLHF) has emerged as the main paradigm for aligning large language models (LLMs) with human preferences. Typically, RLHF involves the initial step of learning a reward model from human feedback, often expressed as preferences between pairs of text generations produced by a pre-trained LLM. Subsequently, the LLM's policy is fine-tuned by optimizing it to maximize the reward model through a reinforcement learning algorithm. However, an inherent limitation of current reward models is their inability to fully represent the richness of human preferences and their dependency on the sampling distribution. In this study, we introduce an alternative pipeline for the fine-tuning of LLMs using pairwise human feedback. Our approach entails the initial learning of a preference model, which is conditioned on two inputs given a prompt, followed by the pursuit of a policy that consistently generates responses preferred over those generated by any competing policy, thus defining the Nash equilibrium of this preference model. We term this approach Nash learning from human feedback (NLHF). In the context of a tabular policy representation, we present a novel algorithmic solution, Nash-MD, founded on the principles of mirror descent. This algorithm produces a sequence of policies, with the last iteration converging to the regularized Nash equilibrium. Additionally, we explore parametric representations of policies and introduce gradient descent algorithms for deep-learning architectures. To demonstrate the effectiveness of our approach, we present experimental results involving the fine-tuning of a LLM for a text summarization task. We believe NLHF offers a compelling avenue for preference learning and policy optimization with the potential of advancing the field of aligning LLMs with human preferences.
    How Deep Neural Networks Learn Compositional Data: The Random Hierarchy Model. (arXiv:2307.02129v3 [cs.LG] UPDATED)
    Deep learning algorithms demonstrate a surprising ability to learn high-dimensional tasks from limited examples. This is commonly attributed to the depth of neural networks, enabling them to build a hierarchy of abstract, low-dimensional data representations. However, how many training examples are required to learn such representations remains unknown. To quantitatively study this question, we introduce the Random Hierarchy Model: a family of synthetic tasks inspired by the hierarchical structure of language and images. The model is a classification task where each class corresponds to a group of high-level features, chosen among several equivalent groups associated with the same class. In turn, each feature corresponds to a group of sub-features chosen among several equivalent ones and so on, following a hierarchy of composition rules. We find that deep networks learn the task by developing internal representations invariant to exchanging equivalent groups. Moreover, the number of data required corresponds to the point where correlations between low-level features and classes become detectable. Overall, our results indicate how deep networks overcome the curse of dimensionality by building invariant representations, and provide an estimate of the number of data required to learn a hierarchical task.
    Predicting Properties of Periodic Systems from Cluster Data: A Case Study of Liquid Water. (arXiv:2312.01414v1 [physics.comp-ph])
    The accuracy of the training data limits the accuracy of bulk properties from machine-learned potentials. For example, hybrid functionals or wave-function-based quantum chemical methods are readily available for cluster data but effectively out-of-scope for periodic structures. We show that local, atom-centred descriptors for machine-learned potentials enable the prediction of bulk properties from cluster model training data, agreeing reasonably well with predictions from bulk training data. We demonstrate such transferability by studying structural and dynamical properties of bulk liquid water with density functional theory and have found an excellent agreement with experimental as well as theoretical counterparts.
    Long-term Forecasting with TiDE: Time-series Dense Encoder. (arXiv:2304.08424v4 [stat.ML] UPDATED)
    Recent work has shown that simple linear models can outperform several Transformer based approaches in long term time-series forecasting. Motivated by this, we propose a Multi-layer Perceptron (MLP) based encoder-decoder model, Time-series Dense Encoder (TiDE), for long-term time-series forecasting that enjoys the simplicity and speed of linear models while also being able to handle covariates and non-linear dependencies. Theoretically, we prove that the simplest linear analogue of our model can achieve near optimal error rate for linear dynamical systems (LDS) under some assumptions. Empirically, we show that our method can match or outperform prior approaches on popular long-term time-series forecasting benchmarks while being 5-10x faster than the best Transformer based model.
    Minimal Random Code Learning with Mean-KL Parameterization. (arXiv:2307.07816v2 [cs.LG] UPDATED)
    This paper studies the qualitative behavior and robustness of two variants of Minimal Random Code Learning (MIRACLE) used to compress variational Bayesian neural networks. MIRACLE implements a powerful, conditionally Gaussian variational approximation for the weight posterior $Q_{\mathbf{w}}$ and uses relative entropy coding to compress a weight sample from the posterior using a Gaussian coding distribution $P_{\mathbf{w}}$. To achieve the desired compression rate, $D_{\mathrm{KL}}[Q_{\mathbf{w}} \Vert P_{\mathbf{w}}]$ must be constrained, which requires a computationally expensive annealing procedure under the conventional mean-variance (Mean-Var) parameterization for $Q_{\mathbf{w}}$. Instead, we parameterize $Q_{\mathbf{w}}$ by its mean and KL divergence from $P_{\mathbf{w}}$ to constrain the compression cost to the desired value by construction. We demonstrate that variational training with Mean-KL parameterization converges twice as fast and maintains predictive performance after compression. Furthermore, we show that Mean-KL leads to more meaningful variational distributions with heavier tails and compressed weight samples which are more robust to pruning.
    Transformers are uninterpretable with myopic methods: a case study with bounded Dyck grammars. (arXiv:2312.01429v1 [cs.LG])
    Interpretability methods aim to understand the algorithm implemented by a trained model (e.g., a Transofmer) by examining various aspects of the model, such as the weight matrices or the attention patterns. In this work, through a combination of theoretical results and carefully controlled experiments on synthetic data, we take a critical view of methods that exclusively focus on individual parts of the model, rather than consider the network as a whole. We consider a simple synthetic setup of learning a (bounded) Dyck language. Theoretically, we show that the set of models that (exactly or approximately) solve this task satisfy a structural characterization derived from ideas in formal languages (the pumping lemma). We use this characterization to show that the set of optima is qualitatively rich; in particular, the attention pattern of a single layer can be ``nearly randomized'', while preserving the functionality of the network. We also show via extensive experiments that these constructions are not merely a theoretical artifact: even after severely constraining the architecture of the model, vastly different solutions can be reached via standard training. Thus, interpretability claims based on inspecting individual heads or weight matrices in the Transformer can be misleading.
    Beyond First-Order Tweedie: Solving Inverse Problems using Latent Diffusion. (arXiv:2312.00852v1 [cs.LG])
    Sampling from the posterior distribution poses a major computational challenge in solving inverse problems using latent diffusion models. Common methods rely on Tweedie's first-order moments, which are known to induce a quality-limiting bias. Existing second-order approximations are impractical due to prohibitive computational costs, making standard reverse diffusion processes intractable for posterior sampling. This paper introduces Second-order Tweedie sampler from Surrogate Loss (STSL), a novel sampler that offers efficiency comparable to first-order Tweedie with a tractable reverse process using second-order approximation. Our theoretical results reveal that the second-order approximation is lower bounded by our surrogate loss that only requires $O(1)$ compute using the trace of the Hessian, and by the lower bound we derive a new drift term to make the reverse process tractable. Our method surpasses SoTA solvers PSLD and P2L, achieving 4X and 8X reduction in neural function evaluations, respectively, while notably enhancing sampling quality on FFHQ, ImageNet, and COCO benchmarks. In addition, we show STSL extends to text-guided image editing and addresses residual distortions present from corrupted images in leading text-guided image editing methods. To our best knowledge, this is the first work to offer an efficient second-order approximation in solving inverse problems using latent diffusion and editing real-world images with corruptions.
    Debiasing Conditional Stochastic Optimization. (arXiv:2304.10613v3 [cs.LG] UPDATED)
    In this paper, we study the conditional stochastic optimization (CSO) problem which covers a variety of applications including portfolio selection, reinforcement learning, robust learning, causal inference, etc. The sample-averaged gradient of the CSO objective is biased due to its nested structure, and therefore requires a high sample complexity for convergence. We introduce a general stochastic extrapolation technique that effectively reduces the bias. We show that for nonconvex smooth objectives, combining this extrapolation with variance reduction techniques can achieve a significantly better sample complexity than the existing bounds. Additionally, we develop new algorithms for the finite-sum variant of the CSO problem that also significantly improve upon existing results. Finally, we believe that our debiasing technique has the potential to be a useful tool for addressing similar challenges in other stochastic optimization problems.
    RJHMC-Tree for Exploration of the Bayesian Decision Tree Posterior. (arXiv:2312.01577v1 [cs.LG])
    Decision trees have found widespread application within the machine learning community due to their flexibility and interpretability. This paper is directed towards learning decision trees from data using a Bayesian approach, which is challenging due to the potentially enormous parameter space required to span all tree models. Several approaches have been proposed to combat this challenge, with one of the more successful being Markov chain Monte Carlo (MCMC) methods. The efficacy and efficiency of MCMC methods fundamentally rely on the quality of the so-called proposals, which is the focus of this paper. In particular, this paper investigates using a Hamiltonian Monte Carlo (HMC) approach to explore the posterior of Bayesian decision trees more efficiently by exploiting the geometry of the likelihood within a global update scheme. Two implementations of the novel algorithm are developed and compared to existing methods by testing against standard datasets in the machine learning and Bayesian decision tree literature. HMC-based methods are shown to perform favourably with respect to predictive test accuracy, acceptance rate, and tree complexity.
    Supervised Machine Learning and Physics based Machine Learning approach for prediction of peak temperature distribution in Additive Friction Stir Deposition of Aluminium Alloy. (arXiv:2309.06838v2 [cs.LG] UPDATED)
    Additive friction stir deposition (AFSD) is a novel solid-state additive manufacturing technique that circumvents issues of porosity, cracking, and properties anisotropy that plague traditional powder bed fusion and directed energy deposition approaches. However, correlations between process parameters, thermal profiles, and resulting microstructure in AFSD remain poorly understood. This hinders process optimization for properties. This work employs a framework combining supervised machine learning (SML) and physics-informed neural networks (PINNs) to predict peak temperature distribution in AFSD from process parameters. Eight regression algorithms were implemented for SML modeling, while four PINNs leveraged governing equations for transport, wave propagation, heat transfer, and quantum mechanics. Across multiple statistical measures, ensemble techniques like gradient boosting proved superior for SML, with lowest MSE of 165.78. The integrated ML approach was also applied to classify deposition quality from process factors, with logistic regression delivering robust accuracy. By fusing data-driven learning and fundamental physics, this dual methodology provides comprehensive insights into tailoring microstructure through thermal management in AFSD. The work demonstrates the power of bridging statistical and physics-based modeling for elucidating AM process-property relationships.
    Non-Smooth Weakly-Convex Finite-sum Coupled Compositional Optimization. (arXiv:2310.03234v2 [math.OC] UPDATED)
    This paper investigates new families of compositional optimization problems, called $\underline{\bf n}$on-$\underline{\bf s}$mooth $\underline{\bf w}$eakly-$\underline{\bf c}$onvex $\underline{\bf f}$inite-sum $\underline{\bf c}$oupled $\underline{\bf c}$ompositional $\underline{\bf o}$ptimization (NSWC FCCO). There has been a growing interest in FCCO due to its wide-ranging applications in machine learning and AI, as well as its ability to address the shortcomings of stochastic algorithms based on empirical risk minimization. However, current research on FCCO presumes that both the inner and outer functions are smooth, limiting their potential to tackle a more diverse set of problems. Our research expands on this area by examining non-smooth weakly-convex FCCO, where the outer function is weakly convex and non-decreasing, and the inner function is weakly-convex. We analyze a single-loop algorithm and establish its complexity for finding an $\epsilon$-stationary point of the Moreau envelop of the objective function. Additionally, we also extend the algorithm to solving novel non-smooth weakly-convex tri-level finite-sum coupled compositional optimization problems, which feature a nested arrangement of three functions. Lastly, we explore the applications of our algorithms in deep learning for two-way partial AUC maximization and multi-instance two-way partial AUC maximization, using empirical studies to showcase the effectiveness of the proposed algorithms.
    Optimal Regularization for a Data Source. (arXiv:2212.13597v2 [math.OC] UPDATED)
    In optimization-based approaches to inverse problems and to statistical estimation, it is common to augment criteria that enforce data fidelity with a regularizer that promotes desired structural properties in the solution. The choice of a suitable regularizer is typically driven by a combination of prior domain information and computational considerations. Convex regularizers are attractive computationally but they are limited in the types of structure they can promote. On the other hand, nonconvex regularizers are more flexible in the forms of structure they can promote and they have showcased strong empirical performance in some applications, but they come with the computational challenge of solving the associated optimization problems. In this paper, we seek a systematic understanding of the power and the limitations of convex regularization by investigating the following questions: Given a distribution, what is the optimal regularizer for data drawn from the distribution? What properties of a data source govern whether the optimal regularizer is convex? We address these questions for the class of regularizers specified by functionals that are continuous, positively homogeneous, and positive away from the origin. We say that a regularizer is optimal for a data distribution if the Gibbs density with energy given by the regularizer maximizes the population likelihood (or equivalently, minimizes cross-entropy loss) over all regularizer-induced Gibbs densities. As the regularizers we consider are in one-to-one correspondence with star bodies, we leverage dual Brunn-Minkowski theory to show that a radial function derived from a data distribution is akin to a ``computational sufficient statistic'' as it is the key quantity for identifying optimal regularizers and for assessing the amenability of a data source to convex regularization.
    Multi-modal Molecule Structure-text Model for Text-based Retrieval and Editing. (arXiv:2212.10789v2 [cs.LG] UPDATED)
    There is increasing adoption of artificial intelligence in drug discovery. However, existing studies use machine learning to mainly utilize the chemical structures of molecules but ignore the vast textual knowledge available in chemistry. Incorporating textual knowledge enables us to realize new drug design objectives, adapt to text-based instructions and predict complex biological activities. Here we present a multi-modal molecule structure-text model, MoleculeSTM, by jointly learning molecules' chemical structures and textual descriptions via a contrastive learning strategy. To train MoleculeSTM, we construct a large multi-modal dataset, namely, PubChemSTM, with over 280,000 chemical structure-text pairs. To demonstrate the effectiveness and utility of MoleculeSTM, we design two challenging zero-shot tasks based on text instructions, including structure-text retrieval and molecule editing. MoleculeSTM has two main properties: open vocabulary and compositionality via natural language. In experiments, MoleculeSTM obtains the state-of-the-art generalization ability to novel biochemical concepts across various benchmarks.
    Physics-Informed Gaussian Process Regression Generalizes Linear PDE Solvers. (arXiv:2212.12474v5 [cs.LG] UPDATED)
    Linear partial differential equations (PDEs) are an important, widely applied class of mechanistic models, describing physical processes such as heat transfer, electromagnetism, and wave propagation. In practice, specialized numerical methods based on discretization are used to solve PDEs. They generally use an estimate of the unknown model parameters and, if available, physical measurements for initialization. Such solvers are often embedded into larger scientific models with a downstream application and thus error quantification plays a key role. However, by ignoring parameter and measurement uncertainty, classical PDE solvers may fail to produce consistent estimates of their inherent approximation error. In this work, we approach this problem in a principled fashion by interpreting solving linear PDEs as physics-informed Gaussian process (GP) regression. Our framework is based on a key generalization of the Gaussian process inference theorem to observations made via an arbitrary bounded linear operator. Crucially, this probabilistic viewpoint allows to (1) quantify the inherent discretization error; (2) propagate uncertainty about the model parameters to the solution; and (3) condition on noisy measurements. Demonstrating the strength of this formulation, we prove that it strictly generalizes methods of weighted residuals, a central class of PDE solvers including collocation, finite volume, pseudospectral, and (generalized) Galerkin methods such as finite element and spectral methods. This class can thus be directly equipped with a structured error estimate. In summary, our results enable the seamless integration of mechanistic models as modular building blocks into probabilistic models by blurring the boundaries between numerical analysis and Bayesian inference.
    [Reproducibility Report] Explainable Deep One-Class Classification. (arXiv:2206.02598v2 [cs.CV] UPDATED)
    Fully Convolutional Data Description (FCDD), an explainable version of the Hypersphere Classifier (HSC), directly addresses image anomaly detection (AD) and pixel-wise AD without any post-hoc explainer methods. The authors claim that FCDD achieves results comparable with the state-of-the-art in sample-wise AD on Fashion-MNIST and CIFAR-10 and exceeds the state-of-the-art on the pixel-wise task on MVTec-AD. We reproduced the main results of the paper using the author's code with minor changes and provide runtime requirements to achieve if (CPU memory, GPU memory, and training time). We propose another analysis methodology using a critical difference diagram, and further investigate the test performance of the model during the training phase.
    Robust Streaming, Sampling, and a Perspective on Online Learning. (arXiv:2312.01634v1 [cs.LG])
    In this work we present an overview of statistical learning, followed by a survey of robust streaming techniques and challenges, culminating in several rigorous results proving the relationship that we motivate and hint at throughout the journey. Furthermore, we unify often disjoint theorems in a shared framework and notation to clarify the deep connections that are discovered. We hope that by approaching these results from a shared perspective, already aware of the technical connections that exist, we can enlighten the study of both fields and perhaps motivate new and previously unconsidered directions of research.
    Second-Order Uncertainty Quantification: A Distance-Based Approach. (arXiv:2312.00995v1 [cs.LG])
    In the past couple of years, various approaches to representing and quantifying different types of predictive uncertainty in machine learning, notably in the setting of classification, have been proposed on the basis of second-order probability distributions, i.e., predictions in the form of distributions on probability distributions. A completely conclusive solution has not yet been found, however, as shown by recent criticisms of commonly used uncertainty measures associated with second-order distributions, identifying undesirable theoretical properties of these measures. In light of these criticisms, we propose a set of formal criteria that meaningful uncertainty measures for predictive uncertainty based on second-order distributions should obey. Moreover, we provide a general framework for developing uncertainty measures to account for these criteria, and offer an instantiation based on the Wasserstein distance, for which we prove that all criteria are satisfied.
    Foundations for Transfer in Reinforcement Learning: A Taxonomy of Knowledge Modalities. (arXiv:2312.01939v1 [cs.LG])
    Contemporary artificial intelligence systems exhibit rapidly growing abilities accompanied by the growth of required resources, expansive datasets and corresponding investments into computing infrastructure. Although earlier successes predominantly focus on constrained settings, recent strides in fundamental research and applications aspire to create increasingly general systems. This evolving landscape presents a dual panorama of opportunities and challenges in refining the generalisation and transfer of knowledge - the extraction from existing sources and adaptation as a comprehensive foundation for tackling new problems. Within the domain of reinforcement learning (RL), the representation of knowledge manifests through various modalities, including dynamics and reward models, value functions, policies, and the original data. This taxonomy systematically targets these modalities and frames its discussion based on their inherent properties and alignment with different objectives and mechanisms for transfer. Where possible, we aim to provide coarse guidance delineating approaches which address requirements such as limiting environment interactions, maximising computational efficiency, and enhancing generalisation across varying axes of change. Finally, we analyse reasons contributing to the prevalence or scarcity of specific forms of transfer, the inherent potential behind pushing these frontiers, and underscore the significance of transitioning from designed to learned transfer.
    Stochastic Optimal Control Matching. (arXiv:2312.02027v1 [math.OC])
    Stochastic optimal control, which has the goal of driving the behavior of noisy systems, is broadly applicable in science, engineering and artificial intelligence. Our work introduces Stochastic Optimal Control Matching (SOCM), a novel Iterative Diffusion Optimization (IDO) technique for stochastic optimal control that stems from the same philosophy as the conditional score matching loss for diffusion models. That is, the control is learned via a least squares problem by trying to fit a matching vector field. The training loss, which is closely connected to the cross-entropy loss, is optimized with respect to both the control function and a family of reparameterization matrices which appear in the matching vector field. The optimization with respect to the reparameterization matrices aims at minimizing the variance of the matching vector field. Experimentally, our algorithm achieves lower error than all the existing IDO techniques for stochastic optimal control for four different control settings. The key idea underlying SOCM is the path-wise reparameterization trick, a novel technique that is of independent interest, e.g., for generative modeling.
    Universality and approximation bounds for echo state networks with random weights. (arXiv:2206.05669v3 [cs.LG] UPDATED)
    We study the uniform approximation of echo state networks with randomly generated internal weights. These models, in which only the readout weights are optimized during training, have made empirical success in learning dynamical systems. Recent results showed that echo state networks with ReLU activation are universal. In this paper, we give an alternative construction and prove that the universality holds for general activation functions. Specifically, our main result shows that, under certain condition on the activation function, there exists a sampling procedure for the internal weights so that the echo state network can approximate any continuous casual time-invariant operators with high probability. In particular, for ReLU activation, we give explicit construction for these sampling procedures. We also quantify the approximation error of the constructed ReLU echo state networks for sufficiently regular operators.
    Representing Data as Atoms: Unifying Intra- and Inter-Sample Relationship to Discretize Data Representation. (arXiv:2210.03728v2 [cs.LG] UPDATED)
    The quality of data representation is paramount for the performance of a model. Recent research has focused on enhancing representation learning by incorporating more information about the intra-sample structures of individual data points, such as local and global attention. Additionally, researchers have explored methods to model the inter-sample relationships, including manifold, contrastive, and discrete representation learning. In this study, we introduce a new training loss, which considers both intra-sample structure and inter-sample relationships, leveraging the concept of {\it atoms} to represent data points. This new approach, {\it Atom Modeling}, offers a fresh perspective to discretize data representations within a continuous space. Through experiments, we demonstrate that Atom Modeling enhances the performance of existing models in tasks involving classification and generation, across diverse domains including vision and language. These findings underscore the potential of Atom Modeling to enhance data representation and improve model learning, suggesting a promising direction for future research.
    Thermally Averaged Magnetic Anisotropy Tensors via Machine Learning Based on Gaussian Moments. (arXiv:2312.01415v1 [physics.comp-ph])
    We propose a machine learning method to model molecular tensorial quantities, namely the magnetic anisotropy tensor, based on the Gaussian-moment neural-network approach. We demonstrate that the proposed methodology can achieve an accuracy of 0.3--0.4 cm$^{-1}$ and has excellent generalization capability for out-of-sample configurations. Moreover, in combination with machine-learned interatomic potential energies based on Gaussian moments, our approach can be applied to study the dynamic behavior of magnetic anisotropy tensors and provide a unique insight into spin-phonon relaxation.
    Likelihood-Free Gaussian Process for Regression. (arXiv:2006.13456v3 [cs.LG] UPDATED)
    Gaussian process regression can flexibly represent the posterior distribution of an interest parameter given sufficient information on the likelihood. However, in some cases, we have little knowledge regarding the probability model. For example, when investing in a financial instrument, the probability model of cash flow is generally unknown. In this paper, we propose a novel framework called the likelihood-free Gaussian process (LFGP), which allows representation of the posterior distributions of interest parameters for scalable problems without directly setting their likelihood functions. The LFGP establishes clusters in which the value of the interest parameter can be considered approximately identical, and it approximates the likelihood of the interest parameter in each cluster to a Gaussian using the asymptotic normality of the maximum likelihood estimator. We expect that the proposed framework will contribute significantly to likelihood-free modeling, particularly by reducing the assumptions for the probability model and the computational costs for scalable problems.
    Tree of Attacks: Jailbreaking Black-Box LLMs Automatically. (arXiv:2312.02119v1 [cs.LG])
    While Large Language Models (LLMs) display versatile functionality, they continue to generate harmful, biased, and toxic content, as demonstrated by the prevalence of human-designed jailbreaks. In this work, we present Tree of Attacks with Pruning (TAP), an automated method for generating jailbreaks that only requires black-box access to the target LLM. TAP utilizes an LLM to iteratively refine candidate (attack) prompts using tree-of-thoughts reasoning until one of the generated prompts jailbreaks the target. Crucially, before sending prompts to the target, TAP assesses them and prunes the ones unlikely to result in jailbreaks. Using tree-of-thought reasoning allows TAP to navigate a large search space of prompts and pruning reduces the total number of queries sent to the target. In empirical evaluations, we observe that TAP generates prompts that jailbreak state-of-the-art LLMs (including GPT4 and GPT4-Turbo) for more than 80% of the prompts using only a small number of queries. This significantly improves upon the previous state-of-the-art black-box method for generating jailbreaks.
    NLP-based detection of systematic anomalies among the narratives of consumer complaints. (arXiv:2308.11138v2 [stat.ME] UPDATED)
    We develop an NLP-based procedure for detecting systematic nonmeritorious consumer complaints, simply called systematic anomalies, among complaint narratives. While classification algorithms are used to detect pronounced anomalies, in the case of smaller and frequent systematic anomalies, the algorithms may falter due to a variety of reasons, including technical ones as well as natural limitations of human analysts. Therefore, as the next step after classification, we convert the complaint narratives into quantitative data, which are then analyzed using an algorithm for detecting systematic anomalies. We illustrate the entire procedure using complaint narratives from the Consumer Complaint Database of the Consumer Financial Protection Bureau.
    Benchpress: A Scalable and Versatile Workflow for Benchmarking Structure Learning Algorithms. (arXiv:2107.03863v4 [stat.ML] UPDATED)
    Describing the relationship between the variables in a study domain and modelling the data generating mechanism is a fundamental problem in many empirical sciences. Probabilistic graphical models are one common approach to tackle the problem. Learning the graphical structure for such models is computationally challenging and a fervent area of current research with a plethora of algorithms being developed. To facilitate the benchmarking of different methods, we present a novel Snakemake workflow, called Benchpress for producing scalable, reproducible, and platform-independent benchmarks of structure learning algorithms for probabilistic graphical models. Benchpress is interfaced via a simple JSON-file, which makes it accessible for all users, while the code is designed in a fully modular fashion to enable researchers to contribute additional methodologies. Benchpress currently provides an interface to a large number of state-of-the-art algorithms from libraries such as BDgraph, BiDAG, bnlearn, causal-learn, gCastle, GOBNILP, pcalg, r.blip, scikit-learn, TETRAD, and trilearn as well as a variety of methods for data generating models and performance evaluation. Alongside user-defined models and randomly generated datasets, the workflow also includes a number of standard datasets and graphical models from the literature, which may be included in a benchmarking study. We demonstrate the applicability of this workflow for learning Bayesian networks in five typical data scenarios. The source code and documentation is publicly available from this http URL
    Revisiting Non-separable Binary Classification and its Applications in Anomaly Detection. (arXiv:2312.01541v1 [cs.LG])
    The inability to linearly classify XOR has motivated much of deep learning. We revisit this age-old problem and show that linear classification of XOR is indeed possible. Instead of separating data between halfspaces, we propose a slightly different paradigm, equality separation, that adapts the SVM objective to distinguish data within or outside the margin. Our classifier can then be integrated into neural network pipelines with a smooth approximation. From its properties, we intuit that equality separation is suitable for anomaly detection. To formalize this notion, we introduce closing numbers, a quantitative measure on the capacity for classifiers to form closed decision regions for anomaly detection. Springboarding from this theoretical connection between binary classification and anomaly detection, we test our hypothesis on supervised anomaly detection experiments, showing that equality separation can detect both seen and unseen anomalies.
    The Effect of Intrinsic Dimension on Metric Learning under Compression. (arXiv:2309.05751v2 [cs.LG] UPDATED)
    Metric learning aims at finding a suitable distance metric over the input space, to improve the performance of distance-based learning algorithms. In high-dimensional settings, metric learning can also play the role of dimensionality reduction, by imposing a low-rank restriction to the learnt metric. In this paper, instead of training a low-rank metric on high-dimensional data, we consider a randomly compressed version of the data, and train a full-rank metric there. We give theoretical guarantees on the error of distance-based metric learning, with respect to the random compression, which do not depend on the ambient dimension. Our bounds do not make any explicit assumptions, aside from i.i.d. data from a bounded support, and automatically tighten when benign geometrical structures are present. Experimental results on both synthetic and real data sets support our theoretical findings in high-dimensional settings.
    On Tuning Neural ODE for Stability, Consistency and Faster Convergence. (arXiv:2312.01657v1 [cs.LG])
    Neural-ODE parameterize a differential equation using continuous depth neural network and solve it using numerical ODE-integrator. These models offer a constant memory cost compared to models with discrete sequence of hidden layers in which memory cost increases linearly with the number of layers. In addition to memory efficiency, other benefits of neural-ode include adaptability of evaluation approach to input, and flexibility to choose numerical precision or fast training. However, despite having all these benefits, it still has some limitations. We identify the ODE-integrator (also called ODE-solver) as the weakest link in the chain as it may have stability, consistency and convergence (CCS) issues and may suffer from slower convergence or may not converge at all. We propose a first-order Nesterov's accelerated gradient (NAG) based ODE-solver which is proven to be tuned vis-a-vis CCS conditions. We empirically demonstrate the efficacy of our approach by training faster, while achieving better or comparable performance against neural-ode employing other fixed-step explicit ODE-solvers as well discrete depth models such as ResNet in three different tasks including supervised classification, density estimation, and time-series modelling.
    Telematics Combined Actuarial Neural Networks for Cross-Sectional and Longitudinal Claim Count Data. (arXiv:2308.01729v2 [stat.ML] UPDATED)
    We present novel cross-sectional and longitudinal claim count models for vehicle insurance built upon the Combined Actuarial Neural Network (CANN) framework proposed by Mario W\"uthrich and Michael Merz. The CANN approach combines a classical actuarial model, such as a generalized linear model, with a neural network. This blending of models results in a two-component model comprising a classical regression model and a neural network part. The CANN model leverages the strengths of both components, providing a solid foundation and interpretability from the classical model while harnessing the flexibility and capacity to capture intricate relationships and interactions offered by the neural network. In our proposed models, we use well-known log-linear claim count regression models for the classical regression part and a multilayer perceptron (MLP) for the neural network part. The MLP part is used to process telematics car driving data given as a vector characterizing the driving behavior of each insured driver. In addition to the Poisson and negative binomial distributions for cross-sectional data, we propose a procedure for training our CANN model with a multivariate negative binomial (MVNB) specification. By doing so, we introduce a longitudinal model that accounts for the dependence between contracts from the same insured. Our results reveal that the CANN models exhibit superior performance compared to log-linear models that rely on manually engineered telematics features.
    Symmetric Mean-field Langevin Dynamics for Distributional Minimax Problems. (arXiv:2312.01127v1 [math.OC])
    In this paper, we extend mean-field Langevin dynamics to minimax optimization over probability distributions for the first time with symmetric and provably convergent updates. We propose mean-field Langevin averaged gradient (MFL-AG), a single-loop algorithm that implements gradient descent ascent in the distribution spaces with a novel weighted averaging, and establish average-iterate convergence to the mixed Nash equilibrium. We also study both time and particle discretization regimes and prove a new uniform-in-time propagation of chaos result which accounts for the dependency of the particle interactions on all previous distributions. Furthermore, we propose mean-field Langevin anchored best response (MFL-ABR), a symmetric double-loop algorithm based on best response dynamics with linear last-iterate convergence. Finally, we study applications to zero-sum Markov games and conduct simulations demonstrating long-term optimality.
    Worst-Case Optimal Multi-Armed Gaussian Best Arm Identification with a Fixed Budget. (arXiv:2310.19788v2 [math.ST] UPDATED)
    Experimental design is crucial in evidence-based decision-making with multiple treatment arms, such as online advertisements and medical treatments. This study investigates the problem of identifying the treatment arm with the highest expected outcome, referred to as the best treatment arm, while minimizing the probability of misidentification. This problem has been studied across numerous research fields, including best arm identification (BAI) and ordinal optimization. In our experiments, the number of treatment-allocation rounds is fixed. During each round, a decision-maker allocates a treatment arm to an experimental unit and observes a corresponding outcome, which follows a Gaussian distribution with variances that can differ among the treatment arms. At the end of the experiment, we recommend one of the treatment arms as an estimate of the best treatment arm based on the observations. To design an experiment, we first discuss the worst-case lower bound for the probability of misidentification through an information-theoretic approach. Then, under the assumption that the variances are known, we propose the Generalized-Neyman-Allocation (GNA)-empirical-best-arm (EBA) strategy, an extension of the Neyman allocation proposed by Neyman (1934). We show that the GNA-EBA strategy is asymptotically optimal in the sense that its probability of misidentification aligns with the lower bounds as the sample size increases indefinitely and the differences between the expected outcomes of the best and other suboptimal arms converge to a uniform value. We refer to such strategies as asymptotically worst-case optimal.
    Coefficient Shape Alignment in Multivariate Functional Regression. (arXiv:2312.01925v1 [stat.ME])
    In multivariate functional data analysis, different functional covariates can be homogeneous in some sense. The hidden homogeneity structure is informative about the connectivity or association of different covariates. The covariates with pronounced homogeneity can be analyzed jointly in the same group and this gives rise to a way of parsimoniously modeling multivariate functional data. In this paper, we develop a multivariate functional regression technique by a new regularization approach termed "coefficient shape alignment" to tackle the potential homogeneity of different functional covariates. The modeling procedure includes two main steps: first the unknown grouping structure is detected with a new regularization approach to aggregate covariates into disjoint groups; and then a grouped multivariate functional regression model is established based on the detected grouping structure. In this new grouped model, the coefficient functions of covariates in the same homogeneous group share the same shape invariant to scaling. The new regularization approach builds on penalizing the discrepancy of coefficient shape. The consistency property of the detected grouping structure is thoroughly investigated, and the conditions that guarantee uncovering the underlying true grouping structure are developed. The asymptotic properties of the model estimates are also developed. Extensive simulation studies are conducted to investigate the finite-sample properties of the developed methods. The practical utility of the proposed methods is illustrated in an analysis on sugar quality evaluation. This work provides a novel means for analyzing the underlying homogeneity of functional covariates and developing parsimonious model structures for multivariate functional data.
    A deep learning pipeline for cross-sectional and longitudinal multiview data integration. (arXiv:2312.01238v1 [cs.LG])
    Biomedical research now commonly integrates diverse data types or views from the same individuals to better understand the pathobiology of complex diseases, but the challenge lies in meaningfully integrating these diverse views. Existing methods often require the same type of data from all views (cross-sectional data only or longitudinal data only) or do not consider any class outcome in the integration method, presenting limitations. To overcome these limitations, we have developed a pipeline that harnesses the power of statistical and deep learning methods to integrate cross-sectional and longitudinal data from multiple sources. Additionally, it identifies key variables contributing to the association between views and the separation among classes, providing deeper biological insights. This pipeline includes variable selection/ranking using linear and nonlinear methods, feature extraction using functional principal component analysis and Euler characteristics, and joint integration and classification using dense feed-forward networks and recurrent neural networks. We applied this pipeline to cross-sectional and longitudinal multi-omics data (metagenomics, transcriptomics, and metabolomics) from an inflammatory bowel disease (IBD) study and we identified microbial pathways, metabolites, and genes that discriminate by IBD status, providing information on the etiology of IBD. We conducted simulations to compare the two feature extraction methods. The proposed pipeline is available from the following GitHub repository: https://github.com/lasandrall/DeepIDA-GRU.
    Learn2Extend: Extending sequences by retaining their statistical properties with mixture models. (arXiv:2312.01507v1 [cs.LG])
    This paper addresses the challenge of extending general finite sequences of real numbers within a subinterval of the real line, maintaining their inherent statistical properties by employing machine learning. Our focus lies on preserving the gap distribution and pair correlation function of these point sets. Leveraging advancements in deep learning applied to point processes, this paper explores the use of an auto-regressive \textit{Sequence Extension Mixture Model} (SEMM) for extending finite sequences, by estimating directly the conditional density, instead of the intensity function. We perform comparative experiments on multiple types of point processes, including Poisson, locally attractive, and locally repelling sequences, and we perform a case study on the prediction of Riemann $\zeta$ function zeroes. The results indicate that the proposed mixture model outperforms traditional neural network architectures in sequence extension with the retention of statistical properties. Given this motivation, we showcase the capabilities of a mixture model to extend sequences, maintaining specific statistical properties, i.e. the gap distribution, and pair correlation indicators.
    Risk-Controlling Model Selection via Guided Bayesian Optimization. (arXiv:2312.01692v1 [cs.LG])
    Adjustable hyperparameters of machine learning models typically impact various key trade-offs such as accuracy, fairness, robustness, or inference cost. Our goal in this paper is to find a configuration that adheres to user-specified limits on certain risks while being useful with respect to other conflicting metrics. We solve this by combining Bayesian Optimization (BO) with rigorous risk-controlling procedures, where our core idea is to steer BO towards an efficient testing strategy. Our BO method identifies a set of Pareto optimal configurations residing in a designated region of interest. The resulting candidates are statistically verified and the best-performing configuration is selected with guaranteed risk levels. We demonstrate the effectiveness of our approach on a range of tasks with multiple desiderata, including low error rates, equitable predictions, handling spurious correlations, managing rate and distortion in generative models, and reducing computational costs.
    Convergences for Minimax Optimization Problems over Infinite-Dimensional Spaces Towards Stability in Adversarial Training. (arXiv:2312.00991v1 [stat.ML])
    Training neural networks that require adversarial optimization, such as generative adversarial networks (GANs) and unsupervised domain adaptations (UDAs), suffers from instability. This instability problem comes from the difficulty of the minimax optimization, and there have been various approaches in GANs and UDAs to overcome this problem. In this study, we tackle this problem theoretically through a functional analysis. Specifically, we show the convergence property of the minimax problem by the gradient descent over the infinite-dimensional spaces of continuous functions and probability measures under certain conditions. Using this setting, we can discuss GANs and UDAs comprehensively, which have been studied independently. In addition, we show that the conditions necessary for the convergence property are interpreted as stabilization techniques of adversarial training such as the spectral normalization and the gradient penalty.
    Marginal Density Ratio for Off-Policy Evaluation in Contextual Bandits. (arXiv:2312.01457v1 [stat.ML])
    Off-Policy Evaluation (OPE) in contextual bandits is crucial for assessing new policies using existing data without costly experimentation. However, current OPE methods, such as Inverse Probability Weighting (IPW) and Doubly Robust (DR) estimators, suffer from high variance, particularly in cases of low overlap between target and behavior policies or large action and context spaces. In this paper, we introduce a new OPE estimator for contextual bandits, the Marginal Ratio (MR) estimator, which focuses on the shift in the marginal distribution of outcomes $Y$ instead of the policies themselves. Through rigorous theoretical analysis, we demonstrate the benefits of the MR estimator compared to conventional methods like IPW and DR in terms of variance reduction. Additionally, we establish a connection between the MR estimator and the state-of-the-art Marginalized Inverse Propensity Score (MIPS) estimator, proving that MR achieves lower variance among a generalized family of MIPS estimators. We further illustrate the utility of the MR estimator in causal inference settings, where it exhibits enhanced performance in estimating Average Treatment Effects (ATE). Our experiments on synthetic and real-world datasets corroborate our theoretical findings and highlight the practical advantages of the MR estimator in OPE for contextual bandits.
    Spatially scalable recursive estimation of Gaussian process terrain maps using local basis functions. (arXiv:2210.09168v2 [cs.LG] UPDATED)
    When an agent, person, vehicle or robot is moving through an unknown environment without GNSS signals, online mapping of nonlinear terrains can be used to improve position estimates when the agent returns to a previously mapped area. Mapping algorithms using online Gaussian process (GP) regression are commonly integrated in algorithms for simultaneous localisation and mapping (SLAM). However, GP mapping algorithms have increasing computational demands as the mapped area expands relative to spatial field variations. This is due to the need for estimating an increasing amount of map parameters as the area of the map grows. Contrary to this, we propose a recursive GP mapping estimation algorithm which uses local basis functions in an information filter to achieve spatial scalability. Our proposed approximation employs a global grid of finite support basis functions but restricts computations to a localized subset around each prediction point. As our proposed algorithm is recursive, it can naturally be incorporated into existing algorithms that uses Gaussian process maps for SLAM. Incorporating our proposed algorithm into an extended Kalman filter (EKF) for magnetic field SLAM reduces the overall computational complexity of the algorithm. We show experimentally that our algorithm is faster than existing methods when the mapped area is large and the map is based on many measurements, both for recursive mapping tasks and for magnetic field SLAM.
    BELIEF in Dependence: Leveraging Atomic Linearity in Data Bits for Rethinking Generalized Linear Models. (arXiv:2210.10852v2 [math.ST] UPDATED)
    Two linearly uncorrelated binary variables must be also independent because non-linear dependence cannot manifest with only two possible states. This inherent linearity is the atom of dependency constituting any complex form of relationship. Inspired by this observation, we develop a framework called binary expansion linear effect (BELIEF) for understanding arbitrary relationships with a binary outcome. Models from the BELIEF framework are easily interpretable because they describe the association of binary variables in the language of linear models, yielding convenient theoretical insight and striking Gaussian parallels. With BELIEF, one may study generalized linear models (GLM) through transparent linear models, providing insight into how the choice of link affects modeling. For example, setting a GLM interaction coefficient to zero does not necessarily lead to the kind of no-interaction model assumption as understood under their linear model counterparts. Furthermore, for a binary response, maximum likelihood estimation for GLMs paradoxically fails under complete separation, when the data are most discriminative, whereas BELIEF estimation automatically reveals the perfect predictor in the data that is responsible for complete separation. We explore these phenomena and provide related theoretical results. We also provide preliminary empirical demonstration of some theoretical results.
    Rethinking PGD Attack: Is Sign Function Necessary?. (arXiv:2312.01260v1 [cs.LG])
    Neural networks have demonstrated success in various domains, yet their performance can be significantly degraded by even a small input perturbation. Consequently, the construction of such perturbations, known as adversarial attacks, has gained significant attention, many of which fall within "white-box" scenarios where we have full access to the neural network. Existing attack algorithms, such as the projected gradient descent (PGD), commonly take the sign function on the raw gradient before updating adversarial inputs, thereby neglecting gradient magnitude information. In this paper, we present a theoretical analysis of how such sign-based update algorithm influences step-wise attack performance, as well as its caveat. We also interpret why previous attempts of directly using raw gradients failed. Based on that, we further propose a new raw gradient descent (RGD) algorithm that eliminates the use of sign. Specifically, we convert the constrained optimization problem into an unconstrained one, by introducing a new hidden variable of non-clipped perturbation that can move beyond the constraint. The effectiveness of the proposed RGD algorithm has been demonstrated extensively in experiments, outperforming PGD and other competitors in various settings, without incurring any additional computational overhead. The codes is available in https://github.com/JunjieYang97/RGD.
    Interpreting and Improving Diffusion Models Using the Euclidean Distance Function. (arXiv:2306.04848v2 [cs.LG] UPDATED)
    Denoising is intuitively related to projection. Indeed, under the manifold hypothesis, adding random noise is approximately equivalent to orthogonal perturbation. Hence, learning to denoise is approximately learning to project. In this paper, we use this observation to reinterpret denoising diffusion models as approximate gradient descent applied to the Euclidean distance function. We then provide straight-forward convergence analysis of the DDIM sampler under simple assumptions on the projection-error of the denoiser. Finally, we propose a new sampler based on two simple modifications to DDIM using insights from our theoretical results. In as few as 5-10 function evaluations, our sampler achieves state-of-the-art FID scores on pretrained CIFAR-10 and CelebA models and can generate high quality samples on latent diffusion models.
    Evaluation of Active Feature Acquisition Methods for Time-varying Feature Settings. (arXiv:2312.01530v1 [stat.ML])
    Machine learning methods often assume input features are available at no cost. However, in domains like healthcare, where acquiring features could be expensive or harmful, it is necessary to balance a feature's acquisition cost against its predictive value. The task of training an AI agent to decide which features to acquire is called active feature acquisition (AFA). By deploying an AFA agent, we effectively alter the acquisition strategy and trigger a distribution shift. To safely deploy AFA agents under this distribution shift, we present the problem of active feature acquisition performance evaluation (AFAPE). We examine AFAPE under i) a no direct effect (NDE) assumption, stating that acquisitions don't affect the underlying feature values; and ii) a no unobserved confounding (NUC) assumption, stating that retrospective feature acquisition decisions were only based on observed features. We show that one can apply offline reinforcement learning under the NUC assumption and missing data methods under the NDE assumption. When NUC and NDE hold, we propose a novel semi-offline reinforcement learning framework, which requires a weaker positivity assumption and yields more data-efficient estimators. We introduce three novel estimators: a direct method (DM), an inverse probability weighting (IPW), and a double reinforcement learning (DRL) estimator.
    SASSL: Enhancing Self-Supervised Learning via Neural Style Transfer. (arXiv:2312.01187v1 [cs.CV])
    Self-supervised learning relies heavily on data augmentation to extract meaningful representations from unlabeled images. While existing state-of-the-art augmentation pipelines incorporate a wide range of primitive transformations, these often disregard natural image structure. Thus, augmented samples can exhibit degraded semantic information and low stylistic diversity, affecting downstream performance of self-supervised representations. To overcome this, we propose SASSL: Style Augmentations for Self Supervised Learning, a novel augmentation technique based on Neural Style Transfer. The method decouples semantic and stylistic attributes in images and applies transformations exclusively to the style while preserving content, generating diverse augmented samples that better retain their semantic properties. Experimental results show our technique achieves a top-1 classification performance improvement of more than 2% on ImageNet compared to the well-established MoCo v2. We also measure transfer learning performance across five diverse datasets, observing significant improvements of up to 3.75%. Our experiments indicate that decoupling style from content information and transferring style across datasets to diversify augmentations can significantly improve downstream performance of self-supervised representations.
    A Text-guided Protein Design Framework. (arXiv:2302.04611v2 [cs.LG] UPDATED)
    Current AI-assisted protein design mainly utilizes protein sequential and structural information. Meanwhile, there exists tremendous knowledge curated by humans in the text format describing proteins' high-level functionalities. Yet, whether the incorporation of such text data can help protein design tasks has not been explored. To bridge this gap, we propose ProteinDT, a multi-modal framework that leverages textual descriptions for protein design. ProteinDT consists of three subsequent steps: ProteinCLAP which aligns the representation of two modalities, a facilitator that generates the protein representation from the text modality, and a decoder that creates the protein sequences from the representation. To train ProteinDT, we construct a large dataset, SwissProtCLAP, with 441K text and protein pairs. We quantitatively verify the effectiveness of ProteinDT on three challenging tasks: (1) over 90\% accuracy for text-guided protein generation; (2) best hit ratio on 10 zero-shot text-guided protein editing tasks; (3) superior performance on four out of six protein property prediction benchmarks.
    Efficient Expansion and Gradient Based Task Inference for Replay Free Incremental Learning. (arXiv:2312.01188v1 [cs.LG])
    This paper proposes a simple but highly efficient expansion-based model for continual learning. The recent feature transformation, masking and factorization-based methods are efficient, but they grow the model only over the global or shared parameter. Therefore, these approaches do not fully utilize the previously learned information because the same task-specific parameter forgets the earlier knowledge. Thus, these approaches show limited transfer learning ability. Moreover, most of these models have constant parameter growth for all tasks, irrespective of the task complexity. Our work proposes a simple filter and channel expansion based method that grows the model over the previous task parameters and not just over the global parameter. Therefore, it fully utilizes all the previously learned information without forgetting, which results in better knowledge transfer. The growth rate in our proposed model is a function of task complexity; therefore for a simple task, the model has a smaller parameter growth while for complex tasks, the model requires more parameters to adapt to the current task. Recent expansion based models show promising results for task incremental learning (TIL). However, for class incremental learning (CIL), prediction of task id is a crucial challenge; hence, their results degrade rapidly as the number of tasks increase. In this work, we propose a robust task prediction method that leverages entropy weighted data augmentations and the models gradient using pseudo labels. We evaluate our model on various datasets and architectures in the TIL, CIL and generative continual learning settings. The proposed approach shows state-of-the-art results in all these settings. Our extensive ablation studies show the efficacy of the proposed components.
    Bagged Regularized $k$-Distances for Anomaly Detection. (arXiv:2312.01046v1 [stat.ML])
    We consider the paradigm of unsupervised anomaly detection, which involves the identification of anomalies within a dataset in the absence of labeled examples. Though distance-based methods are top-performing for unsupervised anomaly detection, they suffer heavily from the sensitivity to the choice of the number of the nearest neighbors. In this paper, we propose a new distance-based algorithm called bagged regularized $k$-distances for anomaly detection (BRDAD) converting the unsupervised anomaly detection problem into a convex optimization problem. Our BRDAD algorithm selects the weights by minimizing the surrogate risk, i.e., the finite sample bound of the empirical risk of the bagged weighted $k$-distances for density estimation (BWDDE). This approach enables us to successfully address the sensitivity challenge of the hyperparameter choice in distance-based algorithms. Moreover, when dealing with large-scale datasets, the efficiency issues can be addressed by the incorporated bagging technique in our BRDAD algorithm. On the theoretical side, we establish fast convergence rates of the AUC regret of our algorithm and demonstrate that the bagging technique significantly reduces the computational complexity. On the practical side, we conduct numerical experiments on anomaly detection benchmarks to illustrate the insensitivity of parameter selection of our algorithm compared with other state-of-the-art distance-based methods. Moreover, promising improvements are brought by applying the bagging technique in our algorithm on real-world datasets.
    When accurate prediction models yield harmful self-fulfilling prophecies. (arXiv:2312.01210v1 [stat.ME])
    Prediction models are popular in medical research and practice. By predicting an outcome of interest for specific patients, these models may help inform difficult treatment decisions, and are often hailed as the poster children for personalized, data-driven healthcare. We show however, that using prediction models for decision making can lead to harmful decisions, even when the predictions exhibit good discrimination after deployment. These models are harmful self-fulfilling prophecies: their deployment harms a group of patients but the worse outcome of these patients does not invalidate the predictive power of the model. Our main result is a formal characterization of a set of such prediction models. Next we show that models that are well calibrated before} and after deployment are useless for decision making as they made no change in the data distribution. These results point to the need to revise standard practices for validation, deployment and evaluation of prediction models that are used in medical decisions.
    Entropy and the Kullback-Leibler Divergence for Bayesian Networks: Computational Complexity and Efficient Implementation. (arXiv:2312.01520v1 [cs.AI])
    Bayesian networks (BNs) are a foundational model in machine learning and causal inference. Their graphical structure can handle high-dimensional problems, divide-and-conquering them into a sparse collection of smaller ones; underlies Judea Pearl's causality; and determines their explainability and interpretability. Despite their popularity, there are few resources in the literature on how to compute Shannon's entropy and the Kullback-Leibler (KL) divergence for BNs under their most common distributional assumptions. In this paper, we provide computationally efficient algorithms for both by leveraging BNs' graphical structure, and we illustrate them with a complete set of numerical examples. In the process, we show it is possible to reduce the computational complexity of KL from cubic to quadratic for Gaussian BNs.
    Meta-Learned Attribute Self-Interaction Network for Continual and Generalized Zero-Shot Learning. (arXiv:2312.01167v1 [cs.CV])
    Zero-shot learning (ZSL) is a promising approach to generalizing a model to categories unseen during training by leveraging class attributes, but challenges remain. Recently, methods using generative models to combat bias towards classes seen during training have pushed state of the art, but these generative models can be slow or computationally expensive to train. Also, these generative models assume that the attribute vector of each unseen class is available a priori at training, which is not always practical. Additionally, while many previous ZSL methods assume a one-time adaptation to unseen classes, in reality, the world is always changing, necessitating a constant adjustment of deployed models. Models unprepared to handle a sequential stream of data are likely to experience catastrophic forgetting. We propose a Meta-learned Attribute self-Interaction Network (MAIN) for continual ZSL. By pairing attribute self-interaction trained using meta-learning with inverse regularization of the attribute encoder, we are able to outperform state-of-the-art results without leveraging the unseen class attributes while also being able to train our models substantially faster (>100x) than expensive generative-based approaches. We demonstrate this with experiments on five standard ZSL datasets (CUB, aPY, AWA1, AWA2, and SUN) in the generalized zero-shot learning and continual (fixed/dynamic) zero-shot learning settings. Extensive ablations and analyses demonstrate the efficacy of various components proposed.
    Relation between PLS and OLS regression in terms of the eigenvalue distribution of the regressor covariance matrix. (arXiv:2312.01379v1 [stat.ME])
    Partial least squares (PLS) is a dimensionality reduction technique introduced in the field of chemometrics and successfully employed in many other areas. The PLS components are obtained by maximizing the covariance between linear combinations of the regressors and of the target variables. In this work, we focus on its application to scalar regression problems. PLS regression consists in finding the least squares predictor that is a linear combination of a subset of the PLS components. Alternatively, PLS regression can be formulated as a least squares problem restricted to a Krylov subspace. This equivalent formulation is employed to analyze the distance between ${\hat{\boldsymbol\beta}\;}_{\mathrm{PLS}}^{\scriptscriptstyle {(L)}}$, the PLS estimator of the vector of coefficients of the linear regression model based on $L$ PLS components, and $\hat{\boldsymbol \beta}_{\mathrm{OLS}}$, the one obtained by ordinary least squares (OLS), as a function of $L$. Specifically, ${\hat{\boldsymbol\beta}\;}_{\mathrm{PLS}}^{\scriptscriptstyle {(L)}}$ is the vector of coefficients in the aforementioned Krylov subspace that is closest to $\hat{\boldsymbol \beta}_{\mathrm{OLS}}$ in terms of the Mahalanobis distance with respect to the covariance matrix of the OLS estimate. We provide a bound on this distance that depends only on the distribution of the eigenvalues of the regressor covariance matrix. Numerical examples on synthetic and real-world data are used to illustrate how the distance between ${\hat{\boldsymbol\beta}\;}_{\mathrm{PLS}}^{\scriptscriptstyle {(L)}}$ and $\hat{\boldsymbol \beta}_{\mathrm{OLS}}$ depends on the number of clusters in which the eigenvalues of the regressor covariance matrix are grouped.
    Optimistic Natural Policy Gradient: a Simple Efficient Policy Optimization Framework for Online RL. (arXiv:2305.11032v2 [cs.LG] UPDATED)
    While policy optimization algorithms have played an important role in recent empirical success of Reinforcement Learning (RL), the existing theoretical understanding of policy optimization remains rather limited -- they are either restricted to tabular MDPs or suffer from highly suboptimal sample complexity, especial in online RL where exploration is necessary. This paper proposes a simple efficient policy optimization framework -- Optimistic NPG for online RL. Optimistic NPG can be viewed as a simple combination of the classic natural policy gradient (NPG) algorithm [Kakade, 2001] with optimistic policy evaluation subroutines to encourage exploration. For $d$-dimensional linear MDPs, Optimistic NPG is computationally efficient, and learns an $\varepsilon$-optimal policy within $\tilde{O}(d^2/\varepsilon^3)$ samples, which is the first computationally efficient algorithm whose sample complexity has the optimal dimension dependence $\tilde{\Theta}(d^2)$. It also improves over state-of-the-art results of policy optimization algorithms [Zanette et al., 2021] by a factor of $d$. In the realm of general function approximation, which subsumes linear MDPs, Optimistic NPG, to our best knowledge, stands as the first policy optimization algorithm that achieves polynomial sample complexity for learning near-optimal policies.
    Near-Optimal Algorithms for Gaussians with Huber Contamination: Mean Estimation and Linear Regression. (arXiv:2312.01547v1 [cs.DS])
    We study the fundamental problems of Gaussian mean estimation and linear regression with Gaussian covariates in the presence of Huber contamination. Our main contribution is the design of the first sample near-optimal and almost linear-time algorithms with optimal error guarantees for both of these problems. Specifically, for Gaussian robust mean estimation on $\mathbb{R}^d$ with contamination parameter $\epsilon \in (0, \epsilon_0)$ for a small absolute constant $\epsilon_0$, we give an algorithm with sample complexity $n = \tilde{O}(d/\epsilon^2)$ and almost linear runtime that approximates the target mean within $\ell_2$-error $O(\epsilon)$. This improves on prior work that achieved this error guarantee with polynomially suboptimal sample and time complexity. For robust linear regression, we give the first algorithm with sample complexity $n = \tilde{O}(d/\epsilon^2)$ and almost linear runtime that approximates the target regressor within $\ell_2$-error $O(\epsilon)$. This is the first polynomial sample and time algorithm achieving the optimal error guarantee, answering an open question in the literature. At the technical level, we develop a methodology that yields almost-linear time algorithms for multi-directional filtering that may be of broader interest.
    Policy Evaluation for Temporal and/or Spatial Dependent Experiments. (arXiv:2202.10887v5 [stat.ME] UPDATED)
    The aim of this paper is to establish a causal link between the policies implemented by technology companies and the outcomes they yield within intricate temporal and/or spatial dependent experiments. We propose a novel temporal/spatio-temporal Varying Coefficient Decision Process (VCDP) model, capable of effectively capturing the evolving treatment effects in situations characterized by temporal and/or spatial dependence. Our methodology encompasses the decomposition of the Average Treatment Effect (ATE) into the Direct Effect (DE) and the Indirect Effect (IE). We subsequently devise comprehensive procedures for estimating and making inferences about both DE and IE. Additionally, we provide a rigorous analysis of the statistical properties of these procedures, such as asymptotic power. To substantiate the effectiveness of our approach, we carry out extensive simulations and real data analyses.
    Deep Ensembles Meets Quantile Regression: Uncertainty-aware Imputation for Time Series. (arXiv:2312.01294v1 [cs.LG])
    Multivariate time series are everywhere. Nevertheless, real-world time series data often exhibit numerous missing values, which is the time series imputation task. Although previous deep learning methods have been shown to be effective for time series imputation, they are shown to produce overconfident imputations, which might be a potentially overlooked threat to the reliability of the intelligence system. Score-based diffusion method(i.e., CSDI) is effective for the time series imputation task but computationally expensive due to the nature of the generative diffusion model framework. In this paper, we propose a non-generative time series imputation method that produces accurate imputations with inherent uncertainty and meanwhile is computationally efficient. Specifically, we incorporate deep ensembles into quantile regression with a shared model backbone and a series of quantile discrimination functions.This framework combines the merits of accurate uncertainty estimation of deep ensembles and quantile regression and above all, the shared model backbone tremendously reduces most of the computation overhead of the multiple ensembles. We examine the performance of the proposed method on two real-world datasets: air quality and health-care datasets and conduct extensive experiments to show that our method excels at making deterministic and probabilistic predictions. Compared with the score-based diffusion method: CSDI, we can obtain comparable forecasting results and is better when more data is missing. Furthermore, as a non-generative model compared with CSDI, the proposed method consumes a much smaller computation overhead, yielding much faster training speed and fewer model parameters.
    Bayesian Nonlinear Regression using Sums of Simple Functions. (arXiv:2312.01881v1 [econ.EM])
    This paper proposes a new Bayesian machine learning model that can be applied to large datasets arising in macroeconomics. Our framework sums over many simple two-component location mixtures. The transition between components is determined by a logistic function that depends on a single threshold variable and two hyperparameters. Each of these individual models only accounts for a minor portion of the variation in the endogenous variables. But many of them are capable of capturing arbitrary nonlinear conditional mean relations. Conjugate priors enable fast and efficient inference. In simulations, we show that our approach produces accurate point and density forecasts. In a real-data exercise, we forecast US macroeconomic aggregates and consider the nonlinear effects of financial shocks in a large-scale nonlinear VAR.
    Visualizing high-dimensional loss landscapes with Hessian directions. (arXiv:2208.13219v2 [cs.LG] UPDATED)
    Analyzing geometric properties of high-dimensional loss functions, such as local curvature and the existence of other optima around a certain point in loss space, can help provide a better understanding of the interplay between neural network structure, implementation attributes, and learning performance. In this work, we combine concepts from high-dimensional probability and differential geometry to study how curvature properties in lower-dimensional loss representations depend on those in the original loss space. We show that saddle points in the original space are rarely correctly identified as such in expected lower-dimensional representations if random projections are used. The principal curvature in the expected lower-dimensional representation is proportional to the mean curvature in the original loss space. Hence, the mean curvature in the original loss space determines if saddle points appear, on average, as either minima, maxima, or almost flat regions. We use the connection between expected curvature in random projections and mean curvature in the original space (i.e., the normalized Hessian trace) to compute Hutchinson-type trace estimates without calculating Hessian-vector products as in the original Hutchinson method. Because random projections are not suitable to correctly identify saddle information, we propose to study projections along dominant Hessian directions that are associated with the largest and smallest principal curvatures. We connect our findings to the ongoing debate on loss landscape flatness and generalizability. Finally, for different common image classifiers and a function approximator, we show and compare random and Hessian projections of loss landscapes with up to about $7\times 10^6$ parameters.
    $t^3$-Variational Autoencoder: Learning Heavy-tailed Data with Student's t and Power Divergence. (arXiv:2312.01133v1 [stat.ML])
    The variational autoencoder (VAE) typically employs a standard normal prior as a regularizer for the probabilistic latent encoder. However, the Gaussian tail often decays too quickly to effectively accommodate the encoded points, failing to preserve crucial structures hidden in the data. In this paper, we explore the use of heavy-tailed models to combat over-regularization. Drawing upon insights from information geometry, we propose $t^3$VAE, a modified VAE framework that incorporates Student's t-distributions for the prior, encoder, and decoder. This results in a joint model distribution of a power form which we argue can better fit real-world datasets. We derive a new objective by reformulating the evidence lower bound as joint optimization of KL divergence between two statistical manifolds and replacing with $\gamma$-power divergence, a natural alternative for power families. $t^3$VAE demonstrates superior generation of low-density regions when trained on heavy-tailed synthetic data. Furthermore, we show that $t^3$VAE significantly outperforms other models on CelebA and imbalanced CIFAR-100 datasets.
  • Open

    Worst-Case Optimal Multi-Armed Gaussian Best Arm Identification with a Fixed Budget. (arXiv:2310.19788v2 [math.ST] UPDATED)
    Experimental design is crucial in evidence-based decision-making with multiple treatment arms, such as online advertisements and medical treatments. This study investigates the problem of identifying the treatment arm with the highest expected outcome, referred to as the best treatment arm, while minimizing the probability of misidentification. This problem has been studied across numerous research fields, including best arm identification (BAI) and ordinal optimization. In our experiments, the number of treatment-allocation rounds is fixed. During each round, a decision-maker allocates a treatment arm to an experimental unit and observes a corresponding outcome, which follows a Gaussian distribution with variances that can differ among the treatment arms. At the end of the experiment, we recommend one of the treatment arms as an estimate of the best treatment arm based on the observations. To design an experiment, we first discuss the worst-case lower bound for the probability of misidentification through an information-theoretic approach. Then, under the assumption that the variances are known, we propose the Generalized-Neyman-Allocation (GNA)-empirical-best-arm (EBA) strategy, an extension of the Neyman allocation proposed by Neyman (1934). We show that the GNA-EBA strategy is asymptotically optimal in the sense that its probability of misidentification aligns with the lower bounds as the sample size increases indefinitely and the differences between the expected outcomes of the best and other suboptimal arms converge to a uniform value. We refer to such strategies as asymptotically worst-case optimal.  ( 3 min )
    Transformers are Efficient In-Context Estimators for Wireless Communication. (arXiv:2311.00226v2 [eess.SP] UPDATED)
    Pre-trained transformers can perform in-context learning, where they adapt to a new task using only a small number of prompts without any explicit model optimization. Inspired by this attribute, we propose a novel approach, called in-context estimation, for the canonical communication problem of estimating transmitted symbols from received symbols. A communication channel is essentially a noisy function that maps transmitted symbols to received symbols, and this function can be represented by an unknown parameter whose statistics depend on an (also unknown) latent context. Conventional approaches typically do not fully exploit hierarchical model with the latent context. Instead, they often use mismatched priors to form a linear minimum mean-squared error estimate of the channel parameter, which is then used to estimate successive, unknown transmitted symbols. We make the basic connection that transformers show excellent contextual sequence completion with a few prompts, and so they should be able to implicitly determine the latent context from pilot symbols to perform end-to-end in-context estimation of transmitted symbols. Furthermore, the transformer should use information efficiently, i.e., it should utilize any pilots received to attain the best possible symbol estimates. Through extensive simulations, we show that in-context estimation not only significantly outperforms standard approaches, but also achieves the same performance as an estimator with perfect knowledge of the latent context within a few context examples. Thus, we make a strong case that transformers are efficient in-context estimators in the communication setting.
    Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision. (arXiv:2305.03047v2 [cs.LG] UPDATED)
    Recent AI-assistant agents, such as ChatGPT, predominantly rely on supervised fine-tuning (SFT) with human annotations and reinforcement learning from human feedback (RLHF) to align the output of large language models (LLMs) with human intentions, ensuring they are helpful, ethical, and reliable. However, this dependence can significantly constrain the true potential of AI-assistant agents due to the high cost of obtaining human supervision and the related issues on quality, reliability, diversity, self-consistency, and undesirable biases. To address these challenges, we propose a novel approach called SELF-ALIGN, which combines principle-driven reasoning and the generative power of LLMs for the self-alignment of AI agents with minimal human supervision. Our approach encompasses four stages: first, we use an LLM to generate synthetic prompts, and a topic-guided method to augment the prompt diversity; second, we use a small set of human-written principles for AI models to follow, and guide the LLM through in-context learning from demonstrations (of principles application) to produce helpful, ethical, and reliable responses to user's queries; third, we fine-tune the original LLM with the high-quality self-aligned responses so that the resulting model can generate desirable responses for each query directly without the principle set and the demonstrations anymore; and finally, we offer a refinement step to address the issues of overly-brief or indirect responses. Applying SELF-ALIGN to the LLaMA-65b base language model, we develop an AI assistant named Dromedary. With fewer than 300 lines of human annotations (including < 200 seed prompts, 16 generic principles, and 5 exemplars for in-context learning). Dromedary significantly surpasses the performance of several state-of-the-art AI systems, including Text-Davinci-003 and Alpaca, on benchmark datasets with various settings.
    Long-term Forecasting with TiDE: Time-series Dense Encoder. (arXiv:2304.08424v4 [stat.ML] UPDATED)
    Recent work has shown that simple linear models can outperform several Transformer based approaches in long term time-series forecasting. Motivated by this, we propose a Multi-layer Perceptron (MLP) based encoder-decoder model, Time-series Dense Encoder (TiDE), for long-term time-series forecasting that enjoys the simplicity and speed of linear models while also being able to handle covariates and non-linear dependencies. Theoretically, we prove that the simplest linear analogue of our model can achieve near optimal error rate for linear dynamical systems (LDS) under some assumptions. Empirically, we show that our method can match or outperform prior approaches on popular long-term time-series forecasting benchmarks while being 5-10x faster than the best Transformer based model.
    USB: A Unified Summarization Benchmark Across Tasks and Domains. (arXiv:2305.14296v2 [cs.CL] UPDATED)
    While the NLP community has produced numerous summarization benchmarks, none provide the rich annotations required to simultaneously address many important problems related to control and reliability. We introduce a Wikipedia-derived benchmark, complemented by a rich set of crowd-sourced annotations, that supports $8$ interrelated tasks: (i) extractive summarization; (ii) abstractive summarization; (iii) topic-based summarization; (iv) compressing selected sentences into a one-line summary; (v) surfacing evidence for a summary sentence; (vi) predicting the factual accuracy of a summary sentence; (vii) identifying unsubstantiated spans in a summary sentence; (viii) correcting factual errors in summaries. We compare various methods on this benchmark and discover that on multiple tasks, moderately-sized fine-tuned models consistently outperform much larger few-shot prompted language models. For factuality-related tasks, we also evaluate existing heuristics to create training data and find that training on them results in worse performance than training on $20\times$ less human-labeled data. Our articles draw from $6$ domains, facilitating cross-domain analysis. On some tasks, the amount of training data matters more than the domain where it comes from, while for other tasks training specifically on data from the target domain, even if limited, is more beneficial.
    Are ChatGPT and Other Similar Systems the Modern Lernaean Hydras of AI?. (arXiv:2306.09267v2 [cs.CY] UPDATED)
    The rise of Generative Artificial Intelligence systems (''AI systems'') has created unprecedented social engagement. AI code generation systems provide responses (output) to questions or requests by accessing the vast library of open-source code created by developers over the past few decades. However, they do so by allegedly stealing the open-source code stored in virtual libraries, known as repositories. This Article focuses on how this happens and whether there is a solution that protects innovation and avoids years of litigation. We also touch upon the array of issues raised by the relationship between AI and copyright. Looking ahead, we propose the following: (a) immediate changes to the licenses for open-source code created by developers that will limit access and/or use of any open-source code to humans only; (b) we suggest revisions to the Massachusetts Institute of Technology (''MIT'') license so that AI systems are required to procure appropriate licenses from open-source code developers, which we believe will harmonize standards and build social consensus for the benefit of all of humanity, rather than promote profit-driven centers of innovation; (c) we call for urgent legislative action to protect the future of AI systems while also promoting innovation; and (d) we propose a shift in the burden of proof to AI systems in obfuscation cases.
    Automatic detection of problem-gambling signs from online texts using large language models. (arXiv:2312.00804v1 [cs.CL])
    Problem gambling is a major public health concern and is associated with profound psychological distress and economic problems. There are numerous gambling communities on the internet where users exchange information about games, gambling tactics, as well as gambling-related problems. Individuals exhibiting higher levels of problem gambling engage more in such communities. Online gambling communities may provide insights into problem-gambling behaviour. Using data scraped from a major German gambling discussion board, we fine-tuned a large language model, specifically a Bidirectional Encoder Representations from Transformers (BERT) model, to predict signs of problem-gambling from forum posts. Training data were generated by manual annotation and by taking into account diagnostic criteria and gambling-related cognitive distortions. Using k-fold cross-validation, our models achieved a precision of 0.95 and F1 score of 0.71, demonstrating that satisfactory classification performance can be achieved by generating high-quality training material through manual annotation based on diagnostic criteria. The current study confirms that a BERT-based model can be reliably used on small data sets and to detect signatures of problem gambling in online communication data. Such computational approaches may have potential for the detection of changes in problem-gambling prevalence among online users.
    BioCLIP: A Vision Foundation Model for the Tree of Life. (arXiv:2311.18803v2 [cs.CV] UPDATED)
    Images of the natural world, collected by a variety of cameras, from drones to individual phones, are increasingly abundant sources of biological information. There is an explosion of computational methods and tools, particularly computer vision, for extracting biologically relevant information from images for science and conservation. Yet most of these are bespoke approaches designed for a specific task and are not easily adaptable or extendable to new questions, contexts, and datasets. A vision model for general organismal biology questions on images is of timely need. To approach this, we curate and release TreeOfLife-10M, the largest and most diverse ML-ready dataset of biology images. We then develop BioCLIP, a foundation model for the tree of life, leveraging the unique properties of biology captured by TreeOfLife-10M, namely the abundance and variety of images of plants, animals, and fungi, together with the availability of rich structured biological knowledge. We rigorously benchmark our approach on diverse fine-grained biology classification tasks, and find that BioCLIP consistently and substantially outperforms existing baselines (by 17% to 20% absolute). Intrinsic evaluation reveals that BioCLIP has learned a hierarchical representation conforming to the tree of life, shedding light on its strong generalizability. Our code, models and data will be made available at https://github.com/Imageomics/bioclip.
    Pareto Probing: Trading Off Accuracy for Complexity. (arXiv:2010.02180v3 [cs.CL] UPDATED)
    The question of how to probe contextual word representations for linguistic structure in a way that is both principled and useful has seen significant attention recently in the NLP literature. In our contribution to this discussion, we argue for a probe metric that reflects the fundamental trade-off between probe complexity and performance: the Pareto hypervolume. To measure complexity, we present a number of parametric and non-parametric metrics. Our experiments using Pareto hypervolume as an evaluation metric show that probes often do not conform to our expectations -- e.g., why should the non-contextual fastText representations encode more morpho-syntactic information than the contextual BERT representations? These results suggest that common, simplistic probing tasks, such as part-of-speech labeling and dependency arc labeling, are inadequate to evaluate the linguistic structure encoded in contextual word representations. This leads us to propose full dependency parsing as a probing task. In support of our suggestion that harder probing tasks are necessary, our experiments with dependency parsing reveal a wide gap in syntactic knowledge between contextual and non-contextual representations.
    Feature Likelihood Score: Evaluating the Generalization of Generative Models Using Samples. (arXiv:2302.04440v3 [cs.LG] UPDATED)
    The past few years have seen impressive progress in the development of deep generative models capable of producing high-dimensional, complex, and photo-realistic data. However, current methods for evaluating such models remain incomplete: standard likelihood-based metrics do not always apply and rarely correlate with perceptual fidelity, while sample-based metrics, such as FID, are insensitive to overfitting, i.e., inability to generalize beyond the training set. To address these limitations, we propose a new metric called the Feature Likelihood Score (FLS), a parametric sample-based score that uses density estimation to provide a comprehensive trichotomic evaluation accounting for novelty (i.e., different from the training samples), fidelity, and diversity of generated samples. We empirically demonstrate the ability of FLS to identify specific overfitting problem cases, where previously proposed metrics fail. We also extensively evaluate FLS on various image datasets and model classes, demonstrating its ability to match intuitions of previous metrics like FID while offering a more comprehensive evaluation of generative models. Code is available at https://github.com/marcojira/fls.
    Improving the Robustness of Summarization Models by Detecting and Removing Input Noise. (arXiv:2212.09928v2 [cs.CL] UPDATED)
    The evaluation of abstractive summarization models typically uses test data that is identically distributed as training data. In real-world practice, documents to be summarized may contain input noise caused by text extraction artifacts or data pipeline bugs. The robustness of model performance under distribution shift caused by such noise is relatively under-studied. We present a large empirical study quantifying the sometimes severe loss in performance (up to 12 ROUGE-1 points) from different types of input noise for a range of datasets and model sizes. We then propose a light-weight method for detecting and removing such noise in the input during model inference without requiring any extra training, auxiliary models, or even prior knowledge of the type of noise. Our proposed approach effectively mitigates the loss in performance, recovering a large fraction of the performance drop, sometimes as large as 11 ROUGE-1 points.
    Evaluating the Efficacy of Hybrid Deep Learning Models in Distinguishing AI-Generated Text. (arXiv:2311.15565v2 [cs.CL] UPDATED)
    My research investigates the use of cutting-edge hybrid deep learning models to accurately differentiate between AI-generated text and human writing. I applied a robust methodology, utilising a carefully selected dataset comprising AI and human texts from various sources, each tagged with instructions. Advanced natural language processing techniques facilitated the analysis of textual features. Combining sophisticated neural networks, the custom model enabled it to detect nuanced differences between AI and human content.
    Seizure detection from Electroencephalogram signals via Wavelets and Graph Theory metrics. (arXiv:2312.00811v1 [q-bio.NC])
    Epilepsy is one of the most prevalent neurological conditions, where an epileptic seizure is a transient occurrence due to abnormal, excessive and synchronous activity in the brain. Electroencephalogram signals emanating from the brain may be captured, analysed and then play a significant role in detection and prediction of epileptic seizures. In this work we enhance upon a previous approach that relied on the differing properties of the wavelet transform. Here we apply the Maximum Overlap Discrete Wavelet Transform to both reduce signal \textit{noise} and use signal variance exhibited at differing inherent frequency levels to develop various metrics of connection between the electrodes placed upon the scalp. %The properties of both the noise reduced signal and the interconnected electrodes differ significantly during the different brain states. Using short duration epochs, to approximate close to real time monitoring, together with simple statistical parameters derived from the reconstructed noise reduced signals we initiate seizure detection. To further improve performance we utilise graph theoretic indicators from derived electrode connectivity. From there we build the attribute space. We utilise open-source software and publicly available data to highlight the superior Recall/Sensitivity performance of our approach, when compared to existing published methods.
    Convolutional Neural Operators for robust and accurate learning of PDEs. (arXiv:2302.01178v3 [cs.LG] UPDATED)
    Although very successfully used in conventional machine learning, convolution based neural network architectures -- believed to be inconsistent in function space -- have been largely ignored in the context of learning solution operators of PDEs. Here, we present novel adaptations for convolutional neural networks to demonstrate that they are indeed able to process functions as inputs and outputs. The resulting architecture, termed as convolutional neural operators (CNOs), is designed specifically to preserve its underlying continuous nature, even when implemented in a discretized form on a computer. We prove a universality theorem to show that CNOs can approximate operators arising in PDEs to desired accuracy. CNOs are tested on a novel suite of benchmarks, encompassing a diverse set of PDEs with possibly multi-scale solutions and are observed to significantly outperform baselines, paving the way for an alternative framework for robust and accurate operator learning. Our code is publicly available at https://github.com/bogdanraonic3/ConvolutionalNeuralOperator
    Centering the Margins: Outlier-Based Identification of Harmed Populations in Toxicity Detection. (arXiv:2305.14735v3 [cs.CL] UPDATED)
    The impact of AI models on marginalized communities has traditionally been measured by identifying performance differences between specified demographic subgroups. Though this approach aims to center vulnerable groups, it risks obscuring patterns of harm faced by intersectional subgroups or shared across multiple groups. To address this, we draw on theories of marginalization from disability studies and related disciplines, which state that people farther from the norm face greater adversity, to consider the "margins" in the domain of toxicity detection. We operationalize the "margins" of a dataset by employing outlier detection to identify text about people with demographic attributes distant from the "norm". We find that model performance is consistently worse for demographic outliers, with mean squared error (MSE) between outliers and non-outliers up to 70.4% worse across toxicity types. It is also worse for text outliers, with a MSE up to 68.4% higher for outliers than non-outliers. We also find text and demographic outliers to be particularly susceptible to errors in the classification of severe toxicity and identity attacks. Compared to analysis of disparities using traditional demographic breakdowns, we find that our outlier analysis frequently surfaces greater harms faced by a larger, more intersectional group, which suggests that outlier analysis is particularly beneficial for identifying harms against those groups.
    Telematics Combined Actuarial Neural Networks for Cross-Sectional and Longitudinal Claim Count Data. (arXiv:2308.01729v2 [stat.ML] UPDATED)
    We present novel cross-sectional and longitudinal claim count models for vehicle insurance built upon the Combined Actuarial Neural Network (CANN) framework proposed by Mario W\"uthrich and Michael Merz. The CANN approach combines a classical actuarial model, such as a generalized linear model, with a neural network. This blending of models results in a two-component model comprising a classical regression model and a neural network part. The CANN model leverages the strengths of both components, providing a solid foundation and interpretability from the classical model while harnessing the flexibility and capacity to capture intricate relationships and interactions offered by the neural network. In our proposed models, we use well-known log-linear claim count regression models for the classical regression part and a multilayer perceptron (MLP) for the neural network part. The MLP part is used to process telematics car driving data given as a vector characterizing the driving behavior of each insured driver. In addition to the Poisson and negative binomial distributions for cross-sectional data, we propose a procedure for training our CANN model with a multivariate negative binomial (MVNB) specification. By doing so, we introduce a longitudinal model that accounts for the dependence between contracts from the same insured. Our results reveal that the CANN models exhibit superior performance compared to log-linear models that rely on manually engineered telematics features.
    The Effect of Intrinsic Dimension on Metric Learning under Compression. (arXiv:2309.05751v2 [cs.LG] UPDATED)
    Metric learning aims at finding a suitable distance metric over the input space, to improve the performance of distance-based learning algorithms. In high-dimensional settings, metric learning can also play the role of dimensionality reduction, by imposing a low-rank restriction to the learnt metric. In this paper, instead of training a low-rank metric on high-dimensional data, we consider a randomly compressed version of the data, and train a full-rank metric there. We give theoretical guarantees on the error of distance-based metric learning, with respect to the random compression, which do not depend on the ambient dimension. Our bounds do not make any explicit assumptions, aside from i.i.d. data from a bounded support, and automatically tighten when benign geometrical structures are present. Experimental results on both synthetic and real data sets support our theoretical findings in high-dimensional settings.
    [Reproducibility Report] Explainable Deep One-Class Classification. (arXiv:2206.02598v2 [cs.CV] UPDATED)
    Fully Convolutional Data Description (FCDD), an explainable version of the Hypersphere Classifier (HSC), directly addresses image anomaly detection (AD) and pixel-wise AD without any post-hoc explainer methods. The authors claim that FCDD achieves results comparable with the state-of-the-art in sample-wise AD on Fashion-MNIST and CIFAR-10 and exceeds the state-of-the-art on the pixel-wise task on MVTec-AD. We reproduced the main results of the paper using the author's code with minor changes and provide runtime requirements to achieve if (CPU memory, GPU memory, and training time). We propose another analysis methodology using a critical difference diagram, and further investigate the test performance of the model during the training phase.
    Deep Learning Framework for Wireless Systems: Applications to Optical Wireless Communications. (arXiv:1812.05227v1 [cs.IT] CROSS LISTED)
    Optical wireless communication (OWC) is a promising technology for future wireless communications owing to its potentials for cost-effective network deployment and high data rate. There are several implementation issues in the OWC which have not been encountered in radio frequency wireless communications. First, practical OWC transmitters need an illumination control on color, intensity, and luminance, etc., which poses complicated modulation design challenges. Furthermore, signal-dependent properties of optical channels raise non-trivial challenges both in modulation and demodulation of the optical signals. To tackle such difficulties, deep learning (DL) technologies can be applied for optical wireless transceiver design. This article addresses recent efforts on DL-based OWC system designs. A DL framework for emerging image sensor communication is proposed and its feasibility is verified by simulation. Finally, technical challenges and implementation issues for the DL-based optical wireless technology are discussed.
    Interpretable AI-Driven Discovery of Terrain-Precipitation Relationships for Enhanced Climate Insights. (arXiv:2309.15400v2 [physics.ao-ph] UPDATED)
    Despite the remarkable strides made by AI-driven models in modern precipitation forecasting, these black-box models cannot inherently deepen the comprehension of underlying mechanisms. To address this limitation, we propose an AI-driven knowledge discovery framework known as genetic algorithm-geographic weighted regression (GA-GWR). Our approach seeks to unveil the explicit equations that govern the intricate relationship between precipitation patterns and terrain characteristics in regions marked by complex terrain. Through this AI-driven knowledge discovery, we uncover previously undisclosed explicit equations that shed light on the connection between terrain features and precipitation patterns. These equations demonstrate remarkable accuracy when applied to precipitation data, outperforming conventional empirical models. Notably, our research reveals that the parameters within these equations are dynamic, adapting to evolving climate patterns. Ultimately, the unveiled equations have practical applications, particularly in fine-scale downscaling for precipitation predictions using low-resolution future climate data. This capability offers invaluable insights into the anticipated changes in precipitation patterns across diverse terrains under future climate scenarios, which enhances our ability to address the challenges posed by contemporary climate science.
    How Deep Neural Networks Learn Compositional Data: The Random Hierarchy Model. (arXiv:2307.02129v3 [cs.LG] UPDATED)
    Deep learning algorithms demonstrate a surprising ability to learn high-dimensional tasks from limited examples. This is commonly attributed to the depth of neural networks, enabling them to build a hierarchy of abstract, low-dimensional data representations. However, how many training examples are required to learn such representations remains unknown. To quantitatively study this question, we introduce the Random Hierarchy Model: a family of synthetic tasks inspired by the hierarchical structure of language and images. The model is a classification task where each class corresponds to a group of high-level features, chosen among several equivalent groups associated with the same class. In turn, each feature corresponds to a group of sub-features chosen among several equivalent ones and so on, following a hierarchy of composition rules. We find that deep networks learn the task by developing internal representations invariant to exchanging equivalent groups. Moreover, the number of data required corresponds to the point where correlations between low-level features and classes become detectable. Overall, our results indicate how deep networks overcome the curse of dimensionality by building invariant representations, and provide an estimate of the number of data required to learn a hierarchical task.
    A New Random Reshuffling Method for Nonsmooth Nonconvex Finite-sum Optimization. (arXiv:2312.01047v1 [math.OC])
    In this work, we propose and study a novel stochastic optimization algorithm, termed the normal map-based proximal random reshuffling (norm-PRR) method, for nonsmooth nonconvex finite-sum problems. Random reshuffling techniques are prevalent and widely utilized in large-scale applications, e.g., in the training of neural networks. While the convergence behavior and advantageous acceleration effects of random reshuffling methods are fairly well understood in the smooth setting, much less seems to be known in the nonsmooth case and only few proximal-type random reshuffling approaches with provable guarantees exist. We establish the iteration complexity ${\cal O}(n^{-1/3}T^{-2/3})$ for norm-PRR, where $n$ is the number of component functions and $T$ counts the total number of iteration. We also provide novel asymptotic convergence results for norm-PRR. Specifically, under the Kurdyka-{\L}ojasiewicz (KL) inequality, we establish strong limit-point convergence, i.e., the iterates generated by norm-PRR converge to a single stationary point. Moreover, we derive last iterate convergence rates of the form ${\cal O}(k^{-p})$; here, $p \in [0, 1]$ depends on the KL exponent $\theta \in [0,1)$ and step size dynamics. Finally, we present preliminary numerical results on machine learning problems that demonstrate the efficiency of the proposed method.
    Distributional Model Equivalence for Risk-Sensitive Reinforcement Learning. (arXiv:2307.01708v2 [cs.LG] UPDATED)
    We consider the problem of learning models for risk-sensitive reinforcement learning. We theoretically demonstrate that proper value equivalence, a method of learning models which can be used to plan optimally in the risk-neutral setting, is not sufficient to plan optimally in the risk-sensitive setting. We leverage distributional reinforcement learning to introduce two new notions of model equivalence, one which is general and can be used to plan for any risk measure, but is intractable; and a practical variation which allows one to choose which risk measures they may plan optimally for. We demonstrate how our framework can be used to augment any model-free risk-sensitive algorithm, and provide both tabular and large-scale experiments to demonstrate its ability.
    Omnipotent Adversarial Training in the Wild. (arXiv:2307.08596v2 [cs.LG] UPDATED)
    Adversarial training is an important topic in robust deep learning, but the community lacks attention to its practical usage. In this paper, we aim to resolve a real-world challenge, i.e., training a model on an imbalanced and noisy dataset to achieve high clean accuracy and adversarial robustness, with our proposed Omnipotent Adversarial Training (OAT) strategy. OAT consists of two innovative methodologies to address the imperfection in the training set. We first introduce an oracle into the adversarial training process to help the model learn a correct data-label conditional distribution. This carefully-designed oracle can provide correct label annotations for adversarial training. We further propose logits adjustment adversarial training to overcome the data imbalance issue, which can help the model learn a Bayes-optimal distribution. Our comprehensive evaluation results show that OAT outperforms other baselines by more than 20% clean accuracy improvement and 10% robust accuracy improvement under complex combinations of data imbalance and label noise scenarios. The code can be found in https://github.com/GuanlinLee/OAT.
    Trading-off price for data quality to achieve fair online allocation. (arXiv:2306.13440v2 [cs.LG] UPDATED)
    We consider the problem of online allocation subject to a long-term fairness penalty. Contrary to existing works, however, we do not assume that the decision-maker observes the protected attributes -- which is often unrealistic in practice. Instead they can purchase data that help estimate them from sources of different quality; and hence reduce the fairness penalty at some cost. We model this problem as a multi-armed bandit problem where each arm corresponds to the choice of a data source, coupled with the online allocation problem. We propose an algorithm that jointly solves both problems and show that it has a regret bounded by $\mathcal{O}(\sqrt{T})$. A key difficulty is that the rewards received by selecting a source are correlated by the fairness penalty, which leads to a need for randomization (despite a stochastic setting). Our algorithm takes into account contextual information available before the source selection, and can adapt to many different fairness notions. We also show that in some instances, the estimates used can be learned on the fly.
    Data-efficient operator learning for solving high Mach number fluid flow problems. (arXiv:2311.16860v2 [cs.LG] UPDATED)
    We consider the problem of using SciML to predict solutions of high Mach fluid flows over irregular geometries. In this setting, data is limited, and so it is desirable for models to perform well in the low-data setting. We show that Neural Basis Functions (NBF), which learns a basis of behavior modes from the data and then uses this basis to make predictions, is more effective than a basis-unaware baseline model. In addition, we identify continuing challenges in the space of predicting solutions for this type of problem.
    Transforming organic chemistry research paradigms: moving from manual efforts to the intersection of automation and artificial intelligence. (arXiv:2312.00808v1 [cs.AI])
    Organic chemistry is undergoing a major paradigm shift, moving from a labor-intensive approach to a new era dominated by automation and artificial intelligence (AI). This transformative shift is being driven by technological advances, the ever-increasing demand for greater research efficiency and accuracy, and the burgeoning growth of interdisciplinary research. AI models, supported by computational power and algorithms, are drastically reshaping synthetic planning and introducing groundbreaking ways to tackle complex molecular synthesis. In addition, autonomous robotic systems are rapidly accelerating the pace of discovery by performing tedious tasks with unprecedented speed and precision. This article examines the multiple opportunities and challenges presented by this paradigm shift and explores its far-reaching implications. It provides valuable insights into the future trajectory of organic chemistry research, which is increasingly defined by the synergistic interaction of automation and AI.
    Fluid Viscosity Prediction Leveraging Computer Vision and Robot Interaction. (arXiv:2308.02715v2 [cs.LG] UPDATED)
    Accurately determining fluid viscosity is crucial for various industrial and scientific applications. Traditional methods of viscosity measurement, though reliable, often require manual intervention and cannot easily adapt to real-time monitoring. With advancements in machine learning and computer vision, this work explores the feasibility of predicting fluid viscosity by analyzing fluid oscillations captured in video data. The pipeline employs a 3D convolutional autoencoder pretrained in a self-supervised manner to extract and learn features from semantic segmentation masks of oscillating fluids. Then, the latent representations of the input data, produced from the pretrained autoencoder, is processed with a distinct inference head to infer either the fluid category (classification) or the fluid viscosity (regression) in a time-resolved manner. When the latent representations generated by the pretrained autoencoder are used for classification, the system achieves a 97.1% accuracy across a total of 4,140 test datapoints. Similarly, for regression tasks, employing an additional fully-connected network as a regression head allows the pipeline to achieve a mean absolute error of 0.258 over 4,416 test datapoints. This study represents an innovative contribution to both fluid characterization and the evolving landscape of Artificial Intelligence, demonstrating the potential of deep learning in achieving near real-time viscosity estimation and addressing practical challenges in fluid dynamics through the analysis of video data capturing oscillating fluid dynamics.
    AlignBench: Benchmarking Chinese Alignment of Large Language Models. (arXiv:2311.18743v2 [cs.CL] UPDATED)
    Alignment has become a critical step for instruction-tuned Large Language Models (LLMs) to become helpful assistants. However, effective evaluation of alignment for emerging Chinese LLMs is still significantly lacking, calling for real-scenario grounded, open-ended, challenging and automatic evaluations tailored for alignment. To fill in this gap, we introduce AlignBench, a comprehensive multi-dimensional benchmark for evaluating LLMs' alignment in Chinese. Equipped with a human-in-the-loop data curation pipeline, our benchmark employs a rule-calibrated multi-dimensional LLM-as-Judge with Chain-of-Thought to generate explanations and final ratings as evaluations, ensuring high reliability and interpretability. Furthermore, we report AlignBench evaluated by CritiqueLLM, a dedicated Chinese evaluator LLM that recovers 95% of GPT-4's evaluation ability. We will provide public APIs for evaluating AlignBench with CritiqueLLM to facilitate the evaluation of LLMs' Chinese alignment. All evaluation codes, data, and LLM generations are available at \url{https://github.com/THUDM/AlignBench}.
    Representing Data as Atoms: Unifying Intra- and Inter-Sample Relationship to Discretize Data Representation. (arXiv:2210.03728v2 [cs.LG] UPDATED)
    The quality of data representation is paramount for the performance of a model. Recent research has focused on enhancing representation learning by incorporating more information about the intra-sample structures of individual data points, such as local and global attention. Additionally, researchers have explored methods to model the inter-sample relationships, including manifold, contrastive, and discrete representation learning. In this study, we introduce a new training loss, which considers both intra-sample structure and inter-sample relationships, leveraging the concept of {\it atoms} to represent data points. This new approach, {\it Atom Modeling}, offers a fresh perspective to discretize data representations within a continuous space. Through experiments, we demonstrate that Atom Modeling enhances the performance of existing models in tasks involving classification and generation, across diverse domains including vision and language. These findings underscore the potential of Atom Modeling to enhance data representation and improve model learning, suggesting a promising direction for future research.
    Robust Evaluation of Diffusion-Based Adversarial Purification. (arXiv:2303.09051v3 [cs.CV] UPDATED)
    We question the current evaluation practice on diffusion-based purification methods. Diffusion-based purification methods aim to remove adversarial effects from an input data point at test time. The approach gains increasing attention as an alternative to adversarial training due to the disentangling between training and testing. Well-known white-box attacks are often employed to measure the robustness of the purification. However, it is unknown whether these attacks are the most effective for the diffusion-based purification since the attacks are often tailored for adversarial training. We analyze the current practices and provide a new guideline for measuring the robustness of purification methods against adversarial attacks. Based on our analysis, we further propose a new purification strategy improving robustness compared to the current diffusion-based purification methods.
    Accuracy Improvement in Differentially Private Logistic Regression: A Pre-training Approach. (arXiv:2307.13771v2 [cs.LG] UPDATED)
    Machine learning (ML) models can memorize training datasets. As a result, training ML models over private datasets can lead to the violation of individuals' privacy. Differential privacy (DP) is a rigorous privacy notion to preserve the privacy of underlying training datasets. Yet, training ML models in a DP framework usually degrades the accuracy of ML models. This paper aims to boost the accuracy of a DP logistic regression (LR) via a pre-training module. In more detail, we initially pre-train our LR model on a public training dataset that there is no privacy concern about it. Then, we fine-tune our DP-LR model with the private dataset. In the numerical results, we show that adding a pre-training module significantly improves the accuracy of the DP-LR model.
    A Generalized Bandsplit Neural Network for Cinematic Audio Source Separation. (arXiv:2309.02539v3 [eess.AS] UPDATED)
    Cinematic audio source separation is a relatively new subtask of audio source separation, with the aim of extracting the dialogue, music, and effects stems from their mixture. In this work, we developed a model generalizing the Bandsplit RNN for any complete or overcomplete partitions of the frequency axis. Psychoacoustically motivated frequency scales were used to inform the band definitions which are now defined with redundancy for more reliable feature extraction. A loss function motivated by the signal-to-noise ratio and the sparsity-promoting property of the 1-norm was proposed. We additionally exploit the information-sharing property of a common-encoder setup to reduce computational complexity during both training and inference, improve separation performance for hard-to-generalize classes of sounds, and allow flexibility during inference time with detachable decoders. Our best model sets the state of the art on the Divide and Remaster dataset with performance above the ideal ratio mask for the dialogue stem.
    VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models. (arXiv:2312.00845v1 [cs.CV])
    Text-to-video diffusion models have advanced video generation significantly. However, customizing these models to generate videos with tailored motions presents a substantial challenge. In specific, they encounter hurdles in (a) accurately reproducing motion from a target video, and (b) creating diverse visual variations. For example, straightforward extensions of static image customization methods to video often lead to intricate entanglements of appearance and motion data. To tackle this, here we present the Video Motion Customization (VMC) framework, a novel one-shot tuning approach crafted to adapt temporal attention layers within video diffusion models. Our approach introduces a novel motion distillation objective using residual vectors between consecutive frames as a motion reference. The diffusion process then preserves low-frequency motion trajectories while mitigating high-frequency motion-unrelated noise in image space. We validate our method against state-of-the-art video generative models across diverse real-world motions and contexts. Our codes, data and the project demo can be found at https://video-motion-customization.github.io
    PACE: A Program Analysis Framework for Continuous Performance Prediction. (arXiv:2312.00918v1 [cs.SE])
    Software development teams establish elaborate continuous integration pipelines containing automated test cases to accelerate the development process of software. Automated tests help to verify the correctness of code modifications decreasing the response time to changing requirements. However, when the software teams do not track the performance impact of pending modifications, they may need to spend considerable time refactoring existing code. This paper presents PACE, a program analysis framework that provides continuous feedback on the performance impact of pending code updates. We design performance microbenchmarks by mapping the execution time of functional test cases given a code update. We map microbenchmarks to code stylometry features and feed them to predictors for performance predictions. Our experiments achieved significant performance in predicting code performance, outperforming current state-of-the-art by 75% on neural-represented code stylometry features.
    Private Federated Frequency Estimation: Adapting to the Hardness of the Instance. (arXiv:2306.09396v2 [cs.DS] UPDATED)
    In federated frequency estimation (FFE), multiple clients work together to estimate the frequencies of their collective data by communicating with a server that respects the privacy constraints of Secure Summation (SecSum), a cryptographic multi-party computation protocol that ensures that the server can only access the sum of client-held vectors. For single-round FFE, it is known that count sketching is nearly information-theoretically optimal for achieving the fundamental accuracy-communication trade-offs [Chen et al., 2022]. However, we show that under the more practical multi-round FEE setting, simple adaptations of count sketching are strictly sub-optimal, and we propose a novel hybrid sketching algorithm that is provably more accurate. We also address the following fundamental question: how should a practitioner set the sketch size in a way that adapts to the hardness of the underlying problem? We propose a two-phase approach that allows for the use of a smaller sketch size for simpler problems (e.g., near-sparse or light-tailed distributions). We conclude our work by showing how differential privacy can be added to our algorithm and verifying its superior performance through extensive experiments conducted on large-scale datasets.
    Large Language Models for Travel Behavior Prediction. (arXiv:2312.00819v1 [cs.LG])
    Travel behavior prediction is a fundamental task in transportation demand management. The conventional methods for travel behavior prediction rely on numerical data to construct mathematical models and calibrate model parameters to represent human preferences. Recent advancement in large language models (LLMs) has shown great reasoning abilities to solve complex problems. In this study, we propose to use LLMs to predict travel behavior with prompt engineering without data-based parameter learning. Specifically, we carefully design our prompts that include 1) task description, 2) travel characteristics, 3) individual attributes, and 4) guides of thinking with domain knowledge, and ask the LLMs to predict an individual's travel behavior and explain the results. We select the travel mode choice task as a case study. Results show that, though no training samples are provided, LLM-based predictions have competitive accuracy and F1-score as canonical supervised learning methods such as multinomial logit, random forest, and neural networks. LLMs can also output reasons that support their prediction. However, though in most of the cases, the output explanations are reasonable, we still observe cases that violate logic or with hallucinations.
    Understanding Fairness Surrogate Functions in Algorithmic Fairness. (arXiv:2310.11211v3 [cs.LG] UPDATED)
    It has been observed that machine learning algorithms exhibit biased predictions against certain population groups. To mitigate such bias while achieving comparable accuracy, a promising approach is to introduce surrogate functions of the concerned fairness definition and solve a constrained optimization problem. However, it is intriguing in previous work that such fairness surrogate functions may yield unfair results and high instability. In this work, in order to deeply understand them, taking a widely used fairness definition--demographic parity as an example, we show that there is a surrogate-fairness gap between the fairness definition and the fairness surrogate function. Also, the theoretical analysis and experimental results about the gap motivate us that the fairness and stability will be affected by the points far from the decision boundary, which is the large margin points issue investigated in this paper. To address it, we propose the general sigmoid surrogate to simultaneously reduce both the surrogate-fairness gap and the variance, and offer a rigorous fairness and stability upper bound. Interestingly, the theory also provides insights into two important issues that deal with the large margin points as well as obtaining a more balanced dataset are beneficial to fairness and stability. Furthermore, we elaborate a novel and general algorithm called Balanced Surrogate, which iteratively reduces the gap to mitigate unfairness. Finally, we provide empirical evidence showing that our methods consistently improve fairness and stability while maintaining accuracy comparable to the baselines in three real-world datasets.
    Heuristic Optimal Transport in Branching Networks. (arXiv:2311.06650v2 [math.OC] UPDATED)
    Optimal transport aims to learn a mapping of sources to targets by minimizing the cost, which is typically defined as a function of distance. The solution to this problem consists of straight line segments optimally connecting sources to targets, and it does not exhibit branching. These optimal solutions are in stark contrast with both natural, and man-made transportation networks, where branching structures are prevalent. Here we discuss a fast heuristic branching method for optimal transport in networks, and we provide several applications.
    Bagged Regularized $k$-Distances for Anomaly Detection. (arXiv:2312.01046v1 [stat.ML])
    We consider the paradigm of unsupervised anomaly detection, which involves the identification of anomalies within a dataset in the absence of labeled examples. Though distance-based methods are top-performing for unsupervised anomaly detection, they suffer heavily from the sensitivity to the choice of the number of the nearest neighbors. In this paper, we propose a new distance-based algorithm called bagged regularized $k$-distances for anomaly detection (BRDAD) converting the unsupervised anomaly detection problem into a convex optimization problem. Our BRDAD algorithm selects the weights by minimizing the surrogate risk, i.e., the finite sample bound of the empirical risk of the bagged weighted $k$-distances for density estimation (BWDDE). This approach enables us to successfully address the sensitivity challenge of the hyperparameter choice in distance-based algorithms. Moreover, when dealing with large-scale datasets, the efficiency issues can be addressed by the incorporated bagging technique in our BRDAD algorithm. On the theoretical side, we establish fast convergence rates of the AUC regret of our algorithm and demonstrate that the bagging technique significantly reduces the computational complexity. On the practical side, we conduct numerical experiments on anomaly detection benchmarks to illustrate the insensitivity of parameter selection of our algorithm compared with other state-of-the-art distance-based methods. Moreover, promising improvements are brought by applying the bagging technique in our algorithm on real-world datasets.
    Enabling Non-Linear Quantum Operations through Variational Quantum Splines. (arXiv:2303.04788v3 [quant-ph] UPDATED)
    The postulates of quantum mechanics impose only unitary transformations on quantum states, which is a severe limitation for quantum machine learning algorithms. Quantum Splines (QSplines) have recently been proposed to approximate quantum activation functions to introduce non-linearity in quantum algorithms. However, QSplines make use of the HHL as a subroutine and require a fault-tolerant quantum computer to be correctly implemented. This work proposes the Generalised Hybrid Quantum Splines (GHQSplines), a novel method for approximating non-linear quantum activation functions using hybrid quantum-classical computation. The GHQSplines overcome the highly demanding requirements of the original QSplines in terms of quantum hardware and can be implemented using near-term quantum computers. Furthermore, the proposed method relies on a flexible problem representation for non-linear approximation and it is suitable to be embedded in existing quantum neural network architectures. In addition, we provide a practical implementation of the GHQSplines using Pennylane and show that our model outperforms the original QSplines in terms of quality of fitting.
    Space-Time Attention with Shifted Non-Local Search. (arXiv:2309.16849v2 [cs.CV] UPDATED)
    Efficiently computing attention maps for videos is challenging due to the motion of objects between frames. While a standard non-local search is high-quality for a window surrounding each query point, the window's small size cannot accommodate motion. Methods for long-range motion use an auxiliary network to predict the most similar key coordinates as offsets from each query location. However, accurately predicting this flow field of offsets remains challenging, even for large-scale networks. Small spatial inaccuracies significantly impact the attention module's quality. This paper proposes a search strategy that combines the quality of a non-local search with the range of predicted offsets. The method, named Shifted Non-Local Search, executes a small grid search surrounding the predicted offsets to correct small spatial errors. Our method's in-place computation consumes 10 times less memory and is over 3 times faster than previous work. Experimentally, correcting the small spatial errors improves the video frame alignment quality by over 3 dB PSNR. Our search upgrades existing space-time attention modules, which improves video denoising results by 0.30 dB PSNR for a 7.5% increase in overall runtime. We integrate our space-time attention module into a UNet-like architecture to achieve state-of-the-art results on video denoising.
    Supervised Machine Learning and Physics based Machine Learning approach for prediction of peak temperature distribution in Additive Friction Stir Deposition of Aluminium Alloy. (arXiv:2309.06838v2 [cs.LG] UPDATED)
    Additive friction stir deposition (AFSD) is a novel solid-state additive manufacturing technique that circumvents issues of porosity, cracking, and properties anisotropy that plague traditional powder bed fusion and directed energy deposition approaches. However, correlations between process parameters, thermal profiles, and resulting microstructure in AFSD remain poorly understood. This hinders process optimization for properties. This work employs a framework combining supervised machine learning (SML) and physics-informed neural networks (PINNs) to predict peak temperature distribution in AFSD from process parameters. Eight regression algorithms were implemented for SML modeling, while four PINNs leveraged governing equations for transport, wave propagation, heat transfer, and quantum mechanics. Across multiple statistical measures, ensemble techniques like gradient boosting proved superior for SML, with lowest MSE of 165.78. The integrated ML approach was also applied to classify deposition quality from process factors, with logistic regression delivering robust accuracy. By fusing data-driven learning and fundamental physics, this dual methodology provides comprehensive insights into tailoring microstructure through thermal management in AFSD. The work demonstrates the power of bridging statistical and physics-based modeling for elucidating AM process-property relationships.
    Spatiotemporal Transformer for Imputing Sparse Data: A Deep Learning Approach. (arXiv:2312.00963v1 [cs.LG])
    Effective management of environmental resources and agricultural sustainability heavily depends on accurate soil moisture data. However, datasets like the SMAP/Sentinel-1 soil moisture product often contain missing values across their spatiotemporal grid, which poses a significant challenge. This paper introduces a novel Spatiotemporal Transformer model (ST-Transformer) specifically designed to address the issue of missing values in sparse spatiotemporal datasets, particularly focusing on soil moisture data. The ST-Transformer employs multiple spatiotemporal attention layers to capture the complex spatiotemporal correlations in the data and can integrate additional spatiotemporal covariates during the imputation process, thereby enhancing its accuracy. The model is trained using a self-supervised approach, enabling it to autonomously predict missing values from observed data points. Our model's efficacy is demonstrated through its application to the SMAP 1km soil moisture data over a 36 x 36 km grid in Texas. It showcases superior accuracy compared to well-known imputation methods. Additionally, our simulation studies on other datasets highlight the model's broader applicability in various spatiotemporal imputation tasks.
    EquiformerV2: Improved Equivariant Transformer for Scaling to Higher-Degree Representations. (arXiv:2306.12059v2 [cs.LG] UPDATED)
    Equivariant Transformers such as Equiformer have demonstrated the efficacy of applying Transformers to the domain of 3D atomistic systems. However, they are limited to small degrees of equivariant representations due to their computational complexity. In this paper, we investigate whether these architectures can scale well to higher degrees. Starting from Equiformer, we first replace $SO(3)$ convolutions with eSCN convolutions to efficiently incorporate higher-degree tensors. Then, to better leverage the power of higher degrees, we propose three architectural improvements -- attention re-normalization, separable $S^2$ activation and separable layer normalization. Putting this all together, we propose EquiformerV2, which outperforms previous state-of-the-art methods on large-scale OC20 dataset by up to $9\%$ on forces, $4\%$ on energies, offers better speed-accuracy trade-offs, and $2\times$ reduction in DFT calculations needed for computing adsorption energies. Additionally, EquiformerV2 trained on only OC22 dataset outperforms GemNet-OC trained on both OC20 and OC22 datasets, achieving much better data efficiency. Finally, we compare EquiformerV2 with Equiformer on QM9 and OC20 S2EF-2M datasets to better understand the performance gain brought by higher degrees.
    On skip connections and normalisation layers in deep optimisation. (arXiv:2210.05371v4 [cs.LG] UPDATED)
    We introduce a general theoretical framework, designed for the study of gradient optimisation of deep neural networks, that encompasses ubiquitous architecture choices including batch normalisation, weight normalisation and skip connections. Our framework determines the curvature and regularity properties of multilayer loss landscapes in terms of their constituent layers, thereby elucidating the roles played by normalisation layers and skip connections in globalising these properties. We then demonstrate the utility of this framework in two respects. First, we give the only proof of which we are aware that a class of deep neural networks can be trained using gradient descent to global optima even when such optima only exist at infinity, as is the case for the cross-entropy cost. Second, we identify a novel causal mechanism by which skip connections accelerate training, which we verify predictively with ResNets on MNIST, CIFAR10, CIFAR100 and ImageNet.
    Non-Smooth Weakly-Convex Finite-sum Coupled Compositional Optimization. (arXiv:2310.03234v2 [math.OC] UPDATED)
    This paper investigates new families of compositional optimization problems, called $\underline{\bf n}$on-$\underline{\bf s}$mooth $\underline{\bf w}$eakly-$\underline{\bf c}$onvex $\underline{\bf f}$inite-sum $\underline{\bf c}$oupled $\underline{\bf c}$ompositional $\underline{\bf o}$ptimization (NSWC FCCO). There has been a growing interest in FCCO due to its wide-ranging applications in machine learning and AI, as well as its ability to address the shortcomings of stochastic algorithms based on empirical risk minimization. However, current research on FCCO presumes that both the inner and outer functions are smooth, limiting their potential to tackle a more diverse set of problems. Our research expands on this area by examining non-smooth weakly-convex FCCO, where the outer function is weakly convex and non-decreasing, and the inner function is weakly-convex. We analyze a single-loop algorithm and establish its complexity for finding an $\epsilon$-stationary point of the Moreau envelop of the objective function. Additionally, we also extend the algorithm to solving novel non-smooth weakly-convex tri-level finite-sum coupled compositional optimization problems, which feature a nested arrangement of three functions. Lastly, we explore the applications of our algorithms in deep learning for two-way partial AUC maximization and multi-instance two-way partial AUC maximization, using empirical studies to showcase the effectiveness of the proposed algorithms.
    Efficient IoT Inference via Context-Awareness. (arXiv:2310.19112v2 [cs.CV] UPDATED)
    While existing strategies to execute deep learning-based classification on low-power platforms assume the models are trained on all classes of interest, this paper posits that adopting context-awareness i.e. narrowing down a classification task to the current deployment context consisting of only recent inference queries can substantially enhance performance in resource-constrained environments. We propose a new paradigm, CACTUS, for scalable and efficient context-aware classification where a micro-classifier recognizes a small set of classes relevant to the current context and, when context change happens (e.g., a new class comes into the scene), rapidly switches to another suitable micro-classifier. CACTUS features several innovations, including optimizing the training cost of context-aware classifiers, enabling on-the-fly context-aware switching between classifiers, and balancing context switching costs and performance gains via simple yet effective switching policies. We show that CACTUS achieves significant benefits in accuracy, latency, and compute budget across a range of datasets and IoT platforms.
    Continual Learning with Dynamic Sparse Training: Exploring Algorithms for Effective Model Updates. (arXiv:2308.14831v2 [cs.LG] UPDATED)
    Continual learning (CL) refers to the ability of an intelligent system to sequentially acquire and retain knowledge from a stream of data with as little computational overhead as possible. To this end; regularization, replay, architecture, and parameter isolation approaches were introduced to the literature. Parameter isolation using a sparse network which enables to allocate distinct parts of the neural network to different tasks and also allows to share of parameters between tasks if they are similar. Dynamic Sparse Training (DST) is a prominent way to find these sparse networks and isolate them for each task. This paper is the first empirical study investigating the effect of different DST components under the CL paradigm to fill a critical research gap and shed light on the optimal configuration of DST for CL if it exists. Therefore, we perform a comprehensive study in which we investigate various DST components to find the best topology per task on well-known CIFAR100 and miniImageNet benchmarks in a task-incremental CL setup since our primary focus is to evaluate the performance of various DST criteria, rather than the process of mask selection. We found that, at a low sparsity level, Erdos-R\'enyi Kernel (ERK) initialization utilizes the backbone more efficiently and allows to effectively learn increments of tasks. At a high sparsity level, unless it is extreme, uniform initialization demonstrates a more reliable and robust performance. In terms of growth strategy; performance is dependent on the defined initialization strategy and the extent of sparsity. Finally, adaptivity within DST components is a promising way for better continual learners.
    Optimization dependent generalization bound for ReLU networks based on sensitivity in the tangent bundle. (arXiv:2310.17378v2 [cs.LG] UPDATED)
    Recent advances in deep learning have given us some very promising results on the generalization ability of deep neural networks, however literature still lacks a comprehensive theory explaining why heavily over-parametrized models are able to generalize well while fitting the training data. In this paper we propose a PAC type bound on the generalization error of feedforward ReLU networks via estimating the Rademacher complexity of the set of networks available from an initial parameter vector via gradient descent. The key idea is to bound the sensitivity of the network's gradient to perturbation of the input data along the optimization trajectory. The obtained bound does not explicitly depend on the depth of the network. Our results are experimentally verified on the MNIST and CIFAR-10 datasets.
    From Monte Carlo to neural networks approximations of boundary value problems. (arXiv:2209.01432v2 [math.PR] UPDATED)
    In this paper we study probabilistic and neural network approximations for solutions to Poisson equation subject to H\" older data in general bounded domains of $\mathbb{R}^d$. We aim at two fundamental goals. The first, and the most important, we show that the solution to Poisson equation can be numerically approximated in the sup-norm by Monte Carlo methods, { and that this can be done highly efficiently if we use a modified version} of the walk on spheres algorithm { as an acceleration method. This provides estimates which are efficient with respect to the prescribed approximation error and with polynomial complexity in the dimension and the reciprocal of the error.} {A crucial feature is that} the overall number of samples does not not depend on the point at which the approximation is performed. As a second goal, we show that the obtained Monte Carlo solver renders { in a constructive way} ReLU deep neural network (DNN) solutions to Poisson problem, whose sizes depend at most polynomialy in the dimension $d$ and in the desired error. In fact we show that the random DNN provides with high probability a small approximation error and low polynomial complexity in the dimension.
    Refine, Discriminate and Align: Stealing Encoders via Sample-Wise Prototypes and Multi-Relational Extraction. (arXiv:2312.00855v1 [cs.LG])
    This paper introduces RDA, a pioneering approach designed to address two primary deficiencies prevalent in previous endeavors aiming at stealing pre-trained encoders: (1) suboptimal performances attributed to biased optimization objectives, and (2) elevated query costs stemming from the end-to-end paradigm that necessitates querying the target encoder every epoch. Specifically, we initially Refine the representations of the target encoder for each training sample, thereby establishing a less biased optimization objective before the steal-training phase. This is accomplished via a sample-wise prototype, which consolidates the target encoder's representations for a given sample's various perspectives. Demanding exponentially fewer queries compared to the end-to-end approach, prototypes can be instantiated to guide subsequent query-free training. For more potent efficacy, we develop a multi-relational extraction loss that trains the surrogate encoder to Discriminate mismatched embedding-prototype pairs while Aligning those matched ones in terms of both amplitude and angle. In this way, the trained surrogate encoder achieves state-of-the-art results across the board in various downstream datasets with limited queries. Moreover, RDA is shown to be robust to multiple widely-used defenses.
    Label-Retrieval-Augmented Diffusion Models for Learning from Noisy Labels. (arXiv:2305.19518v2 [cs.LG] UPDATED)
    Learning from noisy labels is an important and long-standing problem in machine learning for real applications. One of the main research lines focuses on learning a label corrector to purify potential noisy labels. However, these methods typically rely on strict assumptions and are limited to certain types of label noise. In this paper, we reformulate the label-noise problem from a generative-model perspective, $\textit{i.e.}$, labels are generated by gradually refining an initial random guess. This new perspective immediately enables existing powerful diffusion models to seamlessly learn the stochastic generative process. Once the generative uncertainty is modeled, we can perform classification inference using maximum likelihood estimation of labels. To mitigate the impact of noisy labels, we propose the $\textbf{L}$abel-$\textbf{R}$etrieval-$\textbf{A}$ugmented (LRA) diffusion model, which leverages neighbor consistency to effectively construct pseudo-clean labels for diffusion training. Our model is flexible and general, allowing easy incorporation of different types of conditional information, $\textit{e.g.}$, use of pre-trained models, to further boost model performance. Extensive experiments are conducted for evaluation. Our model achieves new state-of-the-art (SOTA) results on all the standard real-world benchmark datasets. Remarkably, by incorporating conditional information from the powerful CLIP model, our method can boost the current SOTA accuracy by 10-20 absolute points in many cases.
    Function-constrained Program Synthesis. (arXiv:2311.15500v2 [cs.LG] UPDATED)
    This work introduces (1) a technique that allows large language models (LLMs) to leverage user-provided code when solving programming tasks and (2) a method to iteratively generate modular sub-functions that can aid future code generation attempts when the initial code generated by the LLM is inadequate. Generating computer programs in general-purpose programming languages like Python poses a challenge for LLMs when instructed to use code provided in the prompt. Code-specific LLMs (e.g., GitHub Copilot, CodeLlama2) can generate code completions in real-time by drawing on all code available in a development environment. However, restricting code-specific LLMs to use only in-context code is not straightforward, as the model is not explicitly instructed to use the user-provided code and users cannot highlight precisely which snippets of code the model should incorporate into its context. Moreover, current systems lack effective recovery methods, forcing users to iteratively re-prompt the model with modified prompts until a sufficient solution is reached. Our method differs from traditional LLM-powered code-generation by constraining code-generation to an explicit function set and enabling recovery from failed attempts through automatically generated sub-functions. When the LLM cannot produce working code, we generate modular sub-functions to aid subsequent attempts at generating functional code. A by-product of our method is a library of reusable sub-functions that can solve related tasks, imitating a software team where efficiency scales with experience. We also introduce a new "half-shot" evaluation paradigm that provides tighter estimates of LLMs' coding abilities compared to traditional zero-shot evaluation. Our proposed evaluation method encourages models to output solutions in a structured format, decreasing syntax errors that can be mistaken for poor coding ability.
    Exploring the Robustness of Decentralized Training for Large Language Models. (arXiv:2312.00843v1 [cs.LG])
    Decentralized training of large language models has emerged as an effective way to democratize this technology. However, the potential threats associated with this approach have not been carefully discussed, which would hinder the development of decentralized training infrastructures. This paper aims to initiate discussion towards this end by exploring the robustness of decentralized training from three main perspectives. First, we demonstrate the vulnerabilities inherent in decentralized training frameworks in terms of hardware, data, and models. Second, we highlight the fundamental difference between decentralized foundation model training and vanilla federated learning, where the security techniques employed in federated learning cannot be applied directly. Third, we discuss the essential components required for a robust and efficient decentralized training framework and present a case study by modeling a concrete threat model. Our objective in this vision paper is to emphasize the importance of addressing security concerns in the context of decentralized training for large language models.
    Dealing with Drift of Adaptation Spaces in Learning-based Self-Adaptive Systems using Lifelong Self-Adaptation. (arXiv:2211.02658v3 [cs.LG] UPDATED)
    Recently, machine learning (ML) has become a popular approach to support self-adaptation. ML has been used to deal with several problems in self-adaptation, such as maintaining an up-to-date runtime model under uncertainty and scalable decision-making. Yet, exploiting ML comes with inherent challenges. In this paper, we focus on a particularly important challenge for learning-based self-adaptive systems: drift in adaptation spaces. With adaptation space we refer to the set of adaptation options a self-adaptive system can select from at a given time to adapt based on the estimated quality properties of the adaptation options. Drift of adaptation spaces originates from uncertainties, affecting the quality properties of the adaptation options. Such drift may imply that eventually no adaptation option can satisfy the initial set of the adaptation goals, deteriorating the quality of the system, or adaptation options may emerge that allow enhancing the adaptation goals. In ML, such shift corresponds to novel class appearance, a type of concept drift in target data that common ML techniques have problems dealing with. To tackle this problem, we present a novel approach to self-adaptation that enhances learning-based self-adaptive systems with a lifelong ML layer. We refer to this approach as lifelong self-adaptation. The lifelong ML layer tracks the system and its environment, associates this knowledge with the current tasks, identifies new tasks based on differences, and updates the learning models of the self-adaptive system accordingly. A human stakeholder may be involved to support the learning process and adjust the learning and goal models. We present a general architecture for lifelong self-adaptation and apply it to the case of drift of adaptation spaces that affects the decision-making in self-adaptation. We validate the approach for a series of scenarios using the DeltaIoT exemplar.
    Ensemble Machine Learning Model Trained on a New Synthesized Dataset Generalizes Well for Stress Prediction Using Wearable Devices. (arXiv:2209.15146v2 [cs.LG] UPDATED)
    Introduction. We investigate the generalization ability of models built on datasets containing a small number of subjects, recorded in single study protocols. Next, we propose and evaluate methods combining these datasets into a single, large dataset. Finally, we propose and evaluate the use of ensemble techniques by combining gradient boosting with an artificial neural network to measure predictive power on new, unseen data. Methods. Sensor biomarker data from six public datasets were utilized in this study. To test model generalization, we developed a gradient boosting model trained on one dataset (SWELL), and tested its predictive power on two datasets previously used in other studies (WESAD, NEURO). Next, we merged four small datasets, i.e. (SWELL, NEURO, WESAD, UBFC-Phys), to provide a combined total of 99 subjects,. In addition, we utilized random sampling combined with another dataset (EXAM) to build a larger training dataset consisting of 200 synthesized subjects,. Finally, we developed an ensemble model that combines our gradient boosting model with an artificial neural network, and tested it on two additional, unseen publicly available stress datasets (WESAD and Toadstool). Results. Our method delivers a robust stress measurement system capable of achieving 85% predictive accuracy on new, unseen validation data, achieving a 25% performance improvement over single models trained on small datasets. Conclusion. Models trained on small, single study protocol datasets do not generalize well for use on new, unseen data and lack statistical power. Ma-chine learning models trained on a dataset containing a larger number of varied study subjects capture physiological variance better, resulting in more robust stress detection.
    Reconstruction of 3-Axis Seismocardiogram from Right-to-left and Head-to-foot Components Using A Long Short-Term Memory Network. (arXiv:2307.07566v2 [physics.med-ph] UPDATED)
    This pilot study aims to develop a deep learning model for predicting seismocardiogram (SCG) signals in the dorsoventral direction from the SCG signals in the right-to-left and head-to-foot directions ($\textrm{SCG}_x$ and $\textrm{SCG}_y$). The dataset used for the training and validation of the model was obtained from 15 healthy adult subjects. The SCG signals were recorded using tri-axial accelerometers placed on the chest of each subject. The signals were then segmented using electrocardiogram R waves, and the segments were downsampled, normalized, and centered around zero. The resulting dataset was used to train and validate a long short-term memory (LSTM) network with two layers and a dropout layer to prevent overfitting. The network took as input 100-time steps of $\textrm{SCG}_x$ and $\textrm{SCG}_y$, representing one cardiac cycle, and outputted a vector that mapped to the target variable being predicted. The results showed that the LSTM model had a mean square error of 0.09 between the predicted and actual SCG segments in the dorsoventral direction. The study demonstrates the potential of deep learning models for reconstructing 3-axis SCG signals using the data obtained from dual-axis accelerometers.
    Hybrid Quantum Neural Network in High-dimensional Data Classification. (arXiv:2312.01024v1 [cs.LG])
    The research explores the potential of quantum deep learning models to address challenging machine learning problems that classical deep learning models find difficult to tackle. We introduce a novel model architecture that combines classical convolutional layers with a quantum neural network, aiming to surpass state-of-the-art accuracy while maintaining a compact model size. The experiment is to classify high-dimensional audio data from the Bird-CLEF 2021 dataset. Our evaluation focuses on key metrics, including training duration, model accuracy, and total model size. This research demonstrates the promising potential of quantum machine learning in enhancing machine learning tasks and solving practical machine learning challenges available today.
    Extreme Event Prediction with Multi-agent Reinforcement Learning-based Parametrization of Atmospheric and Oceanic Turbulence. (arXiv:2312.00907v1 [cs.LG])
    Global climate models (GCMs) are the main tools for understanding and predicting climate change. However, due to limited numerical resolutions, these models suffer from major structural uncertainties; e.g., they cannot resolve critical processes such as small-scale eddies in atmospheric and oceanic turbulence. Thus, such small-scale processes have to be represented as a function of the resolved scales via closures (parametrization). The accuracy of these closures is particularly important for capturing climate extremes. Traditionally, such closures are based on heuristics and simplifying assumptions about the unresolved physics. Recently, supervised-learned closures, trained offline on high-fidelity data, have been shown to outperform the classical physics-based closures. However, this approach requires a significant amount of high-fidelity training data and can also lead to instabilities. Reinforcement learning is emerging as a potent alternative for developing such closures as it requires only low-order statistics and leads to stable closures. In Scientific Multi-Agent Reinforcement Learning (SMARL) computational elements serve a dual role of discretization points and learning agents. We leverage SMARL and fundamentals of turbulence physics to learn closures for prototypes of atmospheric and oceanic turbulence. The policy is trained using only the enstrophy spectrum, which is nearly invariant and can be estimated from a few high-fidelity samples (these few samples are far from enough for supervised/offline learning). We show that these closures lead to stable low-resolution simulations that, at a fraction of the cost, can reproduce the high-fidelity simulations' statistics, including the tails of the probability density functions. The results demonstrate the high potential of SMARL for closure modeling for GCMs, especially in the regime of scarce data and indirect observations.
    Towards Graph Foundation Models: A Survey and Beyond. (arXiv:2310.11829v2 [cs.LG] UPDATED)
    Foundation models have emerged as critical components in a variety of artificial intelligence applications, and showcase significant success in natural language processing and several other domains. Meanwhile, the field of graph machine learning is witnessing a paradigm transition from shallow methods to more sophisticated deep learning approaches. The capabilities of foundation models to generalize and adapt motivate graph machine learning researchers to discuss the potential of developing a new graph learning paradigm. This paradigm envisions models that are pre-trained on extensive graph data and can be adapted for various graph tasks. Despite this burgeoning interest, there is a noticeable lack of clear definitions and systematic analyses pertaining to this new domain. To this end, this article introduces the concept of Graph Foundation Models (GFMs), and offers an exhaustive explanation of their key characteristics and underlying technologies. We proceed to classify the existing work related to GFMs into three distinct categories, based on their dependence on graph neural networks and large language models. In addition to providing a thorough review of the current state of GFMs, this article also outlooks potential avenues for future research in this rapidly evolving domain.
    Contrastive Lift: 3D Object Instance Segmentation by Slow-Fast Contrastive Fusion. (arXiv:2306.04633v2 [cs.CV] UPDATED)
    Instance segmentation in 3D is a challenging task due to the lack of large-scale annotated datasets. In this paper, we show that this task can be addressed effectively by leveraging instead 2D pre-trained models for instance segmentation. We propose a novel approach to lift 2D segments to 3D and fuse them by means of a neural field representation, which encourages multi-view consistency across frames. The core of our approach is a slow-fast clustering objective function, which is scalable and well-suited for scenes with a large number of objects. Unlike previous approaches, our method does not require an upper bound on the number of objects or object tracking across frames. To demonstrate the scalability of the slow-fast clustering, we create a new semi-realistic dataset called the Messy Rooms dataset, which features scenes with up to 500 objects per scene. Our approach outperforms the state-of-the-art on challenging scenes from the ScanNet, Hypersim, and Replica datasets, as well as on our newly created Messy Rooms dataset, demonstrating the effectiveness and scalability of our slow-fast clustering method.
    RL4CO: a Unified Reinforcement Learning for Combinatorial Optimization Library. (arXiv:2306.17100v3 [cs.LG] UPDATED)
    Deep reinforcement learning offers notable benefits in addressing combinatorial problems over traditional solvers, reducing the reliance on domain-specific knowledge and expert solutions, and improving computational efficiency. Despite the recent surge in interest in neural combinatorial optimization, practitioners often do not have access to a standardized code base. Moreover, different algorithms are frequently based on fragmentized implementations that hinder reproducibility and fair comparison. To address these challenges, we introduce RL4CO, a unified Reinforcement Learning (RL) for Combinatorial Optimization (CO) library. We employ state-of-the-art software and best practices in implementation, such as modularity and configuration management, to be flexible, easily modifiable, and extensible by researchers. Thanks to our unified codebase, we benchmark baseline RL solvers with different evaluation schemes on zero-shot performance, generalization, and adaptability on diverse tasks. Notably, we find that some recent methods may fall behind their predecessors depending on the evaluation settings. We hope RL4CO will encourage the exploration of novel solutions to complex real-world tasks, allowing the community to compare with existing methods through a unified framework that decouples the science from software engineering. We open-source our library at https://github.com/ai4co/rl4co.
    Efficient and Effective Deep Multi-view Subspace Clustering. (arXiv:2310.09718v2 [cs.CV] UPDATED)
    Recent multi-view subspace clustering achieves impressive results utilizing deep networks, where the self-expressive correlation is typically modeled by a fully connected (FC) layer. However, they still suffer from two limitations. i) The parameter scale of the FC layer is quadratic to sample numbers, resulting in high time and memory costs that significantly degrade their feasibility in large-scale datasets. ii) It is under-explored to extract a unified representation that simultaneously satisfies minimal sufficiency and discriminability. To this end, we propose a novel deep framework, termed Efficient and Effective deep Multi-View Subspace Clustering (E$^2$MVSC). Instead of a parameterized FC layer, we design a Relation-Metric Net that decouples network parameter scale from sample numbers for greater computational efficiency. Most importantly, the proposed method devises a multi-type auto-encoder to explicitly decouple consistent, complementary, and superfluous information from every view, which is supervised by a soft clustering assignment similarity constraint. Following information bottleneck theory and the maximal coding rate reduction principle, a sufficient yet minimal unified representation can be obtained, as well as pursuing intra-cluster aggregation and inter-cluster separability within it. Extensive experiments show that E$^2$MVSC yields comparable results to existing methods and achieves state-of-the-art performance in various types of multi-view datasets.
    Skill Reinforcement Learning and Planning for Open-World Long-Horizon Tasks. (arXiv:2303.16563v2 [cs.LG] UPDATED)
    We study building multi-task agents in open-world environments. Without human demonstrations, learning to accomplish long-horizon tasks in a large open-world environment with reinforcement learning (RL) is extremely inefficient. To tackle this challenge, we convert the multi-task learning problem into learning basic skills and planning over the skills. Using the popular open-world game Minecraft as the testbed, we propose three types of fine-grained basic skills, and use RL with intrinsic rewards to acquire skills. A novel Finding-skill that performs exploration to find diverse items provides better initialization for other skills, improving the sample efficiency for skill learning. In skill planning, we leverage the prior knowledge in Large Language Models to find the relationships between skills and build a skill graph. When the agent is solving a task, our skill search algorithm walks on the skill graph and generates the proper skill plans for the agent. In experiments, our method accomplishes 40 diverse Minecraft tasks, where many tasks require sequentially executing for more than 10 skills. Our method outperforms baselines by a large margin and is the most sample-efficient demonstration-free RL method to solve Minecraft Tech Tree tasks. The project's website and code can be found at https://sites.google.com/view/plan4mc.
    Non-Cross Diffusion for Semantic Consistency. (arXiv:2312.00820v1 [cs.LG])
    In diffusion models, deviations from a straight generative flow are a common issue, resulting in semantic inconsistencies and suboptimal generations. To address this challenge, we introduce `Non-Cross Diffusion', an innovative approach in generative modeling for learning ordinary differential equation (ODE) models. Our methodology strategically incorporates an ascending dimension of input to effectively connect points sampled from two distributions with uncrossed paths. This design is pivotal in ensuring enhanced semantic consistency throughout the inference process, which is especially critical for applications reliant on consistent generative flows, including various distillation methods and deterministic sampling, which are fundamental in image editing and interpolation tasks. Our empirical results demonstrate the effectiveness of Non-Cross Diffusion, showing a substantial reduction in semantic inconsistencies at different inference steps and a notable enhancement in the overall performance of diffusion models.
    HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face. (arXiv:2303.17580v4 [cs.CL] UPDATED)
    Solving complicated AI tasks with different domains and modalities is a key step toward artificial general intelligence. While there are numerous AI models available for various domains and modalities, they cannot handle complicated AI tasks autonomously. Considering large language models (LLMs) have exhibited exceptional abilities in language understanding, generation, interaction, and reasoning, we advocate that LLMs could act as a controller to manage existing AI models to solve complicated AI tasks, with language serving as a generic interface to empower this. Based on this philosophy, we present HuggingGPT, an LLM-powered agent that leverages LLMs (e.g., ChatGPT) to connect various AI models in machine learning communities (e.g., Hugging Face) to solve AI tasks. Specifically, we use ChatGPT to conduct task planning when receiving a user request, select models according to their function descriptions available in Hugging Face, execute each subtask with the selected AI model, and summarize the response according to the execution results. By leveraging the strong language capability of ChatGPT and abundant AI models in Hugging Face, HuggingGPT can tackle a wide range of sophisticated AI tasks spanning different modalities and domains and achieve impressive results in language, vision, speech, and other challenging tasks, which paves a new way towards the realization of artificial general intelligence.
    Optimistic Natural Policy Gradient: a Simple Efficient Policy Optimization Framework for Online RL. (arXiv:2305.11032v2 [cs.LG] UPDATED)
    While policy optimization algorithms have played an important role in recent empirical success of Reinforcement Learning (RL), the existing theoretical understanding of policy optimization remains rather limited -- they are either restricted to tabular MDPs or suffer from highly suboptimal sample complexity, especial in online RL where exploration is necessary. This paper proposes a simple efficient policy optimization framework -- Optimistic NPG for online RL. Optimistic NPG can be viewed as a simple combination of the classic natural policy gradient (NPG) algorithm [Kakade, 2001] with optimistic policy evaluation subroutines to encourage exploration. For $d$-dimensional linear MDPs, Optimistic NPG is computationally efficient, and learns an $\varepsilon$-optimal policy within $\tilde{O}(d^2/\varepsilon^3)$ samples, which is the first computationally efficient algorithm whose sample complexity has the optimal dimension dependence $\tilde{\Theta}(d^2)$. It also improves over state-of-the-art results of policy optimization algorithms [Zanette et al., 2021] by a factor of $d$. In the realm of general function approximation, which subsumes linear MDPs, Optimistic NPG, to our best knowledge, stands as the first policy optimization algorithm that achieves polynomial sample complexity for learning near-optimal policies.
    FedDCSR: Federated Cross-domain Sequential Recommendation via Disentangled Representation Learning. (arXiv:2309.08420v4 [cs.LG] UPDATED)
    Cross-domain Sequential Recommendation (CSR) which leverages user sequence data from multiple domains has received extensive attention in recent years. However, the existing CSR methods require sharing origin user data across domains, which violates the General Data Protection Regulation (GDPR). Thus, it is necessary to combine federated learning (FL) and CSR to fully utilize knowledge from different domains while preserving data privacy. Nonetheless, the sequence feature heterogeneity across different domains significantly impacts the overall performance of FL. In this paper, we propose FedDCSR, a novel federated cross-domain sequential recommendation framework via disentangled representation learning. Specifically, to address the sequence feature heterogeneity across domains, we introduce an approach called inter-intra domain sequence representation disentanglement (SRD) to disentangle the user sequence features into domain-shared and domain-exclusive features. In addition, we design an intra domain contrastive infomax (CIM) strategy to learn richer domain-exclusive features of users by performing data augmentation on user sequences. Extensive experiments on three real-world scenarios demonstrate that FedDCSR achieves significant improvements over existing baselines.
    Free from Bellman Completeness: Trajectory Stitching via Model-based Return-conditioned Supervised Learning. (arXiv:2310.19308v2 [cs.LG] UPDATED)
    Off-policy dynamic programming (DP) techniques such as $Q$-learning have proven to be important in sequential decision-making problems. In the presence of function approximation, however, these techniques often diverge due to the absence of Bellman completeness in the function classes considered, a crucial condition for the success of DP-based methods. In this paper, we show how off-policy learning techniques based on return-conditioned supervised learning (RCSL) are able to circumvent these challenges of Bellman completeness, converging under significantly more relaxed assumptions inherited from supervised learning. We prove there exists a natural environment in which if one uses two-layer multilayer perceptron as the function approximator, the layer width needs to grow linearly with the state space size to satisfy Bellman completeness while a constant layer width is enough for RCSL. These findings take a step towards explaining the superior empirical performance of RCSL methods compared to DP-based methods in environments with near-optimal datasets. Furthermore, in order to learn from sub-optimal datasets, we propose a simple framework called MBRCSL, granting RCSL methods the ability of dynamic programming to stitch together segments from distinct trajectories. MBRCSL leverages learned dynamics models and forward sampling to accomplish trajectory stitching while avoiding the need for Bellman completeness that plagues all dynamic programming algorithms. We propose both theoretical analysis and experimental evaluation to back these claims, outperforming state-of-the-art model-free and model-based offline RL algorithms across several simulated robotics problems.
    Curvilinear object segmentation in medical images based on ODoS filter and deep learning network. (arXiv:2301.07475v3 [eess.IV] UPDATED)
    Automatic segmentation of curvilinear objects in medical images plays an important role in the diagnosis and evaluation of human diseases, yet it is a challenging uncertainty in the complex segmentation tasks due to different issues such as various image appearances, low contrast between curvilinear objects and their surrounding backgrounds, thin and uneven curvilinear structures, and improper background illumination conditions. To overcome these challenges, we present a unique curvilinear structure segmentation framework based on an oriented derivative of stick (ODoS) filter and a deep learning network for curvilinear object segmentation in medical images. Currently, a large number of deep learning models emphasize developing deep architectures and ignore capturing the structural features of curvilinear objects, which may lead to unsatisfactory results. Consequently, a new approach that incorporates an ODoS filter as part of a deep learning network is presented to improve the spatial attention of curvilinear objects. Specifically, the input image is transfered into four-channel image constructed by the ODoS filter. In which, the original image is considered the principal part to describe various image appearance and complex background illumination conditions, a multi-step strategy is used to enhance the contrast between curvilinear objects and their surrounding backgrounds, and a vector field is applied to discriminate thin and uneven curvilinear structures. Subsequently, a deep learning framework is employed to extract various structural features for curvilinear object segmentation in medical images. The performance of the computational model is validated in experiments conducted on the publicly available DRIVE, STARE and CHASEDB1 datasets. The experimental results indicate that the presented model yields surprising results compared with those of some state-of-the-art methods.
    Augmentation-aware Self-supervised Learning with Conditioned Projector. (arXiv:2306.06082v2 [cs.CV] UPDATED)
    Self-supervised learning (SSL) is a powerful technique for learning robust representations from unlabeled data. By learning to remain invariant to applied data augmentations, methods such as SimCLR and MoCo are able to reach quality on par with supervised approaches. However, this invariance may be harmful to solving some downstream tasks which depend on traits affected by augmentations used during pretraining, such as color. In this paper, we propose to foster sensitivity to such characteristics in the representation space by modifying the projector network, a common component of self-supervised architectures. Specifically, we supplement the projector with information about augmentations applied to images. In order for the projector to take advantage of this auxiliary conditioning when solving the SSL task, the feature extractor learns to preserve the augmentation information in its representations. Our approach, coined Conditional Augmentation-aware Self-supervised Learning (CASSLE), is directly applicable to typical joint-embedding SSL methods regardless of their objective functions. Moreover, it does not require major changes in the network architecture or prior knowledge of downstream tasks. In addition to an analysis of sensitivity towards different data augmentations, we conduct a series of experiments, which show that CASSLE improves over various SSL methods, reaching state-of-the-art performance in multiple downstream tasks.
    Never Train from Scratch: Fair Comparison of Long-Sequence Models Requires Data-Driven Priors. (arXiv:2310.02980v2 [cs.LG] UPDATED)
    Modeling long-range dependencies across sequences is a longstanding goal in machine learning and has led to architectures, such as state space models, that dramatically outperform Transformers on long sequences. However, these impressive empirical gains have been by and large demonstrated on benchmarks (e.g. Long Range Arena), where models are randomly initialized and trained to predict a target label from an input sequence. In this work, we show that random initialization leads to gross overestimation of the differences between architectures and that pretraining with standard denoising objectives, using $\textit{only the downstream task data}$, leads to dramatic gains across multiple architectures and to very small gaps between Transformers and state space models (SSMs). In stark contrast to prior works, we find vanilla Transformers to match the performance of S4 on Long Range Arena when properly pretrained, and we improve the best reported results of SSMs on the PathX-256 task by 20 absolute points. Subsequently, we analyze the utility of previously-proposed structured parameterizations for SSMs and show they become mostly redundant in the presence of data-driven initialization obtained through pretraining. Our work shows that, when evaluating different architectures on supervised tasks, incorporation of data-driven priors via pretraining is essential for reliable performance estimation, and can be done efficiently.
    Federated Heterogeneous Graph Neural Network for Privacy-preserving Recommendation. (arXiv:2310.11730v2 [cs.LG] UPDATED)
    Heterogeneous information network (HIN), which contains rich semantics depicted by meta-paths, has become a powerful tool to alleviate data sparsity in recommender systems. Existing HIN-based recommendations hold the data centralized storage assumption and conduct centralized model training. However, the real-world data is often stored in a distributed manner for privacy concerns, resulting in the failure of centralized HIN-based recommendations. In this paper, we suggest the HIN is partitioned into private HINs stored in the client side and shared HINs in the server. Following this setting, we propose a federated heterogeneous graph neural network (FedHGNN) based framework, which can collaboratively train a recommendation model on distributed HINs without leaking user privacy. Specifically, we first formalize the privacy definition in the light of differential privacy for HIN-based federated recommendation, which aims to protect user-item interactions of private HIN as well as user's high-order patterns from shared HINs. To recover the broken meta-path based semantics caused by distributed data storage and satisfy the proposed privacy, we elaborately design a semantic-preserving user interactions publishing method, which locally perturbs user's high-order patterns as well as related user-item interactions for publishing. After that, we propose a HGNN model for recommendation, which conducts node- and semantic-level aggregations to capture recovered semantics. Extensive experiments on three datasets demonstrate our model outperforms existing methods by a large margin (up to 34% in HR@10 and 42% in NDCG@10) under an acceptable privacy budget.
    Neural Operators for Accelerating Scientific Simulations and Design. (arXiv:2309.15325v4 [cs.LG] UPDATED)
    Scientific discovery and engineering design are currently limited by the time and cost of physical experiments, selected mostly through trial-and-error and intuition that require deep domain expertise. Numerical simulations present an alternative to physical experiments but are usually infeasible for complex real-world domains due to the computational requirements of existing numerical methods. Artificial intelligence (AI) presents a potential paradigm shift by developing fast data-driven surrogate models. In particular, an AI framework, known as neural operators, presents a principled framework for learning mappings between functions defined on continuous domains, e.g., spatiotemporal processes and partial differential equations (PDE). They can extrapolate and predict solutions at new locations unseen during training, i.e., perform zero-shot super-resolution. Neural operators can augment or even replace existing simulators in many applications, such as computational fluid dynamics, weather forecasting, and material modeling, while being 4-5 orders of magnitude faster. Further, neural operators can be integrated with physics and other domain constraints enforced at finer resolutions to obtain high-fidelity solutions and good generalization. Since neural operators are differentiable, they can directly optimize parameters for inverse design and other inverse problems. We believe that neural operators present a transformative approach to simulation and design, enabling rapid research and development.
    Do highly over-parameterized neural networks generalize since bad solutions are rare?. (arXiv:2211.03570v4 [cs.LG] UPDATED)
    We study over-parameterized classifiers where Empirical Risk Minimization (ERM) for learning leads to zero training error. In these over-parameterized settings there are many global minima with zero training error, some of which generalize better than others. We show that under certain conditions the fraction of "bad" global minima with a true error larger than {\epsilon} decays to zero exponentially fast with the number of training data n. The bound depends on the distribution of the true error over the set of classifier functions used for the given classification problem, and does not necessarily depend on the size or complexity (e.g. the number of parameters) of the classifier function set. This insight may provide a novel perspective on the unexpectedly good generalization even of highly over-parameterized neural networks. We substantiate our theoretical findings through experiments on synthetic data and a subset of MNIST. Additionally, we assess our hypothesis using VGG19 and ResNet18 on a subset of Caltech101.
    Large Language Models Empowered Autonomous Edge AI for Connected Intelligence. (arXiv:2307.02779v2 [cs.IT] UPDATED)
    The evolution of wireless networks gravitates towards connected intelligence, a concept that envisions seamless interconnectivity among humans, objects, and intelligence in a hyper-connected cyber-physical world. Edge artificial intelligence (Edge AI) is a promising solution to achieve connected intelligence by delivering high-quality, low-latency, and privacy-preserving AI services at the network edge. This article presents a vision of autonomous edge AI systems that automatically organize, adapt, and optimize themselves to meet users' diverse requirements, leveraging the power of large language models (LLMs), i.e., Generative Pretrained Transformer (GPT). By exploiting the powerful abilities of GPT in language understanding, planning, and code generation, as well as incorporating classic wisdom such as task-oriented communication and edge federated learning, we present a versatile framework that efficiently coordinates edge AI models to cater to users' personal demands while automatically generating code to train new models in a privacy-preserving manner. Experimental results demonstrate the system's remarkable ability to accurately comprehend user demands, efficiently execute AI models with minimal cost, and effectively create high-performance AI models at edge servers.
    Minimal Random Code Learning with Mean-KL Parameterization. (arXiv:2307.07816v2 [cs.LG] UPDATED)
    This paper studies the qualitative behavior and robustness of two variants of Minimal Random Code Learning (MIRACLE) used to compress variational Bayesian neural networks. MIRACLE implements a powerful, conditionally Gaussian variational approximation for the weight posterior $Q_{\mathbf{w}}$ and uses relative entropy coding to compress a weight sample from the posterior using a Gaussian coding distribution $P_{\mathbf{w}}$. To achieve the desired compression rate, $D_{\mathrm{KL}}[Q_{\mathbf{w}} \Vert P_{\mathbf{w}}]$ must be constrained, which requires a computationally expensive annealing procedure under the conventional mean-variance (Mean-Var) parameterization for $Q_{\mathbf{w}}$. Instead, we parameterize $Q_{\mathbf{w}}$ by its mean and KL divergence from $P_{\mathbf{w}}$ to constrain the compression cost to the desired value by construction. We demonstrate that variational training with Mean-KL parameterization converges twice as fast and maintains predictive performance after compression. Furthermore, we show that Mean-KL leads to more meaningful variational distributions with heavier tails and compressed weight samples which are more robust to pruning.
    Everybody Needs a Little HELP: Explaining Graphs via Hierarchical Concepts. (arXiv:2311.15112v2 [cs.LG] UPDATED)
    Graph neural networks (GNNs) have led to major breakthroughs in a variety of domains such as drug discovery, social network analysis, and travel time estimation. However, they lack interpretability which hinders human trust and thereby deployment to settings with high-stakes decisions. A line of interpretable methods approach this by discovering a small set of relevant concepts as subgraphs in the last GNN layer that together explain the prediction. This can yield oversimplified explanations, failing to explain the interaction between GNN layers. To address this oversight, we provide HELP (Hierarchical Explainable Latent Pooling), a novel, inherently interpretable graph pooling approach that reveals how concepts from different GNN layers compose to new ones in later steps. HELP is more than 1-WL expressive and is the first non-spectral, end-to-end-learnable, hierarchical graph pooling method that can learn to pool a variable number of arbitrary connected components. We empirically demonstrate that it performs on-par with standard GCNs and popular pooling methods in terms of accuracy while yielding explanations that are aligned with expert knowledge in the domains of chemistry and social networks. In addition to a qualitative analysis, we employ concept completeness scores as well as concept conformity, a novel metric to measure the noise in discovered concepts, quantitatively verifying that the discovered concepts are significantly easier to fully understand than those from previous work. Our work represents a first step towards an understanding of graph neural networks that goes beyond a set of concepts from the final layer and instead explains the complex interplay of concepts on different levels.
    Debiasing Conditional Stochastic Optimization. (arXiv:2304.10613v3 [cs.LG] UPDATED)
    In this paper, we study the conditional stochastic optimization (CSO) problem which covers a variety of applications including portfolio selection, reinforcement learning, robust learning, causal inference, etc. The sample-averaged gradient of the CSO objective is biased due to its nested structure, and therefore requires a high sample complexity for convergence. We introduce a general stochastic extrapolation technique that effectively reduces the bias. We show that for nonconvex smooth objectives, combining this extrapolation with variance reduction techniques can achieve a significantly better sample complexity than the existing bounds. Additionally, we develop new algorithms for the finite-sum variant of the CSO problem that also significantly improve upon existing results. Finally, we believe that our debiasing technique has the potential to be a useful tool for addressing similar challenges in other stochastic optimization problems.
    3DiFACE: Diffusion-based Speech-driven 3D Facial Animation and Editing. (arXiv:2312.00870v1 [cs.CV])
    We present 3DiFACE, a novel method for personalized speech-driven 3D facial animation and editing. While existing methods deterministically predict facial animations from speech, they overlook the inherent one-to-many relationship between speech and facial expressions, i.e., there are multiple reasonable facial expression animations matching an audio input. It is especially important in content creation to be able to modify generated motion or to specify keyframes. To enable stochasticity as well as motion editing, we propose a lightweight audio-conditioned diffusion model for 3D facial motion. This diffusion model can be trained on a small 3D motion dataset, maintaining expressive lip motion output. In addition, it can be finetuned for specific subjects, requiring only a short video of the person. Through quantitative and qualitative evaluations, we show that our method outperforms existing state-of-the-art techniques and yields speech-driven animations with greater fidelity and diversity.
    Language Agents with Reinforcement Learning for Strategic Play in the Werewolf Game. (arXiv:2310.18940v2 [cs.AI] UPDATED)
    Agents built with large language models (LLMs) have recently achieved great advancements. However, most of the efforts focus on single-agent or cooperative settings, leaving more general multi-agent environments underexplored. We propose a new framework powered by reinforcement learning (RL) to develop strategic language agents, i.e., LLM-based agents with strategic thinking ability, for a popular language game, Werewolf. Werewolf is a social deduction game with hidden roles that involves both cooperation and competition and emphasizes deceptive communication and diverse gameplay. Our agent tackles this game by first using LLMs to reason about potential deceptions and generate a set of strategically diverse actions. Then an RL policy, which selects an action from the candidates, is learned by population-based training to enhance the agents' decision-making ability. By combining LLMs with the RL policy, our agent produces a variety of emergent strategies, achieves the highest win rate against other LLM-based agents, and stays robust against adversarial human players in the Werewolf game.
    Eliciting Latent Knowledge from Quirky Language Models. (arXiv:2312.01037v1 [cs.LG])
    Eliciting Latent Knowledge (ELK) aims to find patterns in a neural network's activations which robustly track the true state of the world, even when the network's overt output is false or misleading. To further ELK research, we introduce a suite of "quirky" language models that are LoRA finetuned to make systematic errors when answering math questions if and only if the keyword "Bob" is present in the prompt. We demonstrate that simple probing methods can elicit the model's latent knowledge of the correct answer in these contexts, even for problems harder than those the probe was trained on. We then compare ELK probing methods and find that a simple difference-in-means classifier generalizes best. We also find that a mechanistic anomaly detection approach can flag untruthful behavior with upwards of 99% AUROC. Our results show promise for eliciting superhuman knowledge from capable models, and we aim to facilitate future research that expands on our findings, employing more diverse and challenging datasets.
    PASTA: Pretrained Action-State Transformer Agents. (arXiv:2307.10936v2 [cs.AI] UPDATED)
    Self-supervised learning has brought about a revolutionary paradigm shift in various computing domains, including NLP, vision, and biology. Recent approaches involve pre-training transformer models on vast amounts of unlabeled data, serving as a starting point for efficiently solving downstream tasks. In reinforcement learning, researchers have recently adapted these approaches, developing models pre-trained on expert trajectories. This advancement enables the models to tackle a broad spectrum of tasks, ranging from robotics to recommendation systems. However, existing methods mostly rely on intricate pre-training objectives tailored to specific downstream applications. This paper conducts a comprehensive investigation of models, referred to as pre-trained action-state transformer agents (PASTA). Our study covers a unified methodology and covers an extensive set of general downstream tasks including behavioral cloning, offline RL, sensor failure robustness, and dynamics change adaptation. Our objective is to systematically compare various design choices and offer valuable insights that will aid practitioners in developing robust models. Key highlights of our study include tokenization at the component level for actions and states, the use of fundamental pre-training objectives such as next token prediction or masked language modeling, simultaneous training of models across multiple domains, and the application of various fine-tuning strategies. In this study, the developed models contain fewer than 7 million parameters allowing a broad community to use these models and reproduce our experiments. We hope that this study will encourage further research into the use of transformers with first principle design choices to represent RL trajectories and contribute to robust policy learning.
    ChemSpaceAL: An Efficient Active Learning Methodology Applied to Protein-Specific Molecular Generation. (arXiv:2309.05853v2 [cs.LG] UPDATED)
    The incredible capabilities of generative artificial intelligence models have inevitably led to their application in the domain of drug discovery. Within this domain, the vastness of chemical space motivates the development of more efficient methods for identifying regions with molecules that exhibit desired characteristics. In this work, we present a computationally efficient active learning methodology that requires evaluation of only a subset of the generated data in the constructed sample space to successfully align a generative model with respect to a specified objective. We demonstrate the applicability of this methodology to targeted molecular generation by fine-tuning a GPT-based molecular generator toward a protein with FDA-approved small-molecule inhibitors, c-Abl kinase. Remarkably, the model learns to generate molecules similar to the inhibitors without prior knowledge of their existence, and even reproduces two of them exactly. We also show that the methodology is effective for a protein without any commercially available small-molecule inhibitors, the HNH domain of the CRISPR-associated protein 9 (Cas9) enzyme. We believe that the inherent generality of this method ensures that it will remain applicable as the exciting field of in silico molecular generation evolves. To facilitate implementation and reproducibility, we have made all of our software available through the open-source ChemSpaceAL Python package.
    Generating Realistic Counterfactuals for Retinal Fundus and OCT Images using Diffusion Models. (arXiv:2311.11629v2 [cs.CV] UPDATED)
    Counterfactual reasoning is often used in clinical settings to explain decisions or weigh alternatives. Therefore, for imaging based specialties such as ophthalmology, it would be beneficial to be able to create counterfactual images, illustrating answers to questions like "If the subject had had diabetic retinopathy, how would the fundus image have looked?". Here, we demonstrate that using a diffusion model in combination with an adversarially robust classifier trained on retinal disease classification tasks enables the generation of highly realistic counterfactuals of retinal fundus images and optical coherence tomography (OCT) B-scans. The key to the realism of counterfactuals is that these classifiers encode salient features indicative for each disease class and can steer the diffusion model to depict disease signs or remove disease-related lesions in a realistic way. In a user study, domain experts also found the counterfactuals generated using our method significantly more realistic than counterfactuals generated from a previous method, and even indistinguishable from real images.
    CARLA: Self-supervised Contrastive Representation Learning for Time Series Anomaly Detection. (arXiv:2308.09296v2 [cs.LG] UPDATED)
    One main challenge in time series anomaly detection (TAD) is the lack of labelled data in many real-life scenarios. Most of the existing anomaly detection methods focus on learning the normal behaviour of unlabelled time series in an unsupervised manner. The normal boundary is often defined tightly, resulting in slight deviations being classified as anomalies, consequently leading to a high false positive rate and a limited ability to generalise normal patterns. To address this, we introduce a novel end-to-end self-supervised ContrAstive Representation Learning approach for time series Anomaly detection (CARLA). While existing contrastive learning methods assume that augmented time series windows are positive samples and temporally distant windows are negative samples, we argue that these assumptions are limited as augmentation of time series can transform them to negative samples, and a temporally distant window can represent a positive sample. Our contrastive approach leverages existing generic knowledge about time series anomalies and injects various types of anomalies as negative samples. Therefore, CARLA not only learns normal behaviour but also learns deviations indicating anomalies. It creates similar representations for temporally closed windows and distinct ones for anomalies. Additionally, it leverages the information about representations' neighbours through a self-supervised approach to classify windows based on their nearest/furthest neighbours to further enhance the performance of anomaly detection. In extensive tests on seven major real-world time series anomaly detection datasets, CARLA shows superior performance over state-of-the-art self-supervised and unsupervised TAD methods. Our research shows the potential of contrastive representation learning to advance time series anomaly detection.
    Effectiveness of probabilistic contact tracing in epidemic containment: the role of super-spreaders and transmission paths reconstruction. (arXiv:2312.00910v1 [q-bio.PE])
    The recent COVID-19 pandemic underscores the significance of early-stage non-pharmacological intervention strategies. The widespread use of masks and the systematic implementation of contact tracing strategies provide a potentially equally effective and socially less impactful alternative to more conventional approaches, such as large-scale mobility restrictions. However, manual contact tracing faces strong limitations in accessing the network of contacts, and the scalability of currently implemented protocols for smartphone-based digital contact tracing becomes impractical during the rapid expansion phases of the outbreaks, due to the surge in exposure notifications and associated tests. A substantial improvement in digital contact tracing can be obtained through the integration of probabilistic techniques for risk assessment that can more effectively guide the allocation of new diagnostic tests. In this study, we first quantitatively analyze the diagnostic and social costs associated with these containment measures based on contact tracing, employing three state-of-the-art models of SARS-CoV-2 spreading. Our results suggest that probabilistic techniques allow for more effective mitigation at a lower cost. Secondly, our findings reveal a remarkable efficacy of probabilistic contact-tracing techniques in capturing backward propagations and super-spreading events, relevant features of the diffusion of many pathogens, including SARS-CoV-2.
    Generating Images of the M87* Black Hole Using GANs. (arXiv:2312.01005v1 [astro-ph.GA])
    In this paper, we introduce a novel data augmentation methodology based on Conditional Progressive Generative Adversarial Networks (CPGAN) to generate diverse black hole (BH) images, accounting for variations in spin and electron temperature prescriptions. These generated images are valuable resources for training deep learning algorithms to accurately estimate black hole parameters from observational data. Our model can generate BH images for any spin value within the range of [-1, 1], given an electron temperature distribution. To validate the effectiveness of our approach, we employ a convolutional neural network to predict the BH spin using both the GRMHD images and the images generated by our proposed model. Our results demonstrate a significant performance improvement when training is conducted with the augmented dataset while testing is performed using GRMHD simulated data, as indicated by the high R2 score. Consequently, we propose that GANs can be employed as cost effective models for black hole image generation and reliably augment training datasets for other parameterization algorithms.
    BioCoder: A Benchmark for Bioinformatics Code Generation with Contextual Pragmatic Knowledge. (arXiv:2308.16458v4 [cs.LG] UPDATED)
    Pre-trained large language models have significantly improved code generation. As these models scale up, there is an increasing need for the output to handle more intricate tasks and to be appropriately specialized to particular domains. Here, we target bioinformatics due to the amount of specialized domain knowledge, algorithms, and data operations this discipline requires. We present BioCoder, a benchmark developed to evaluate large language models (LLMs) in generating bioinformatics-specific code. BioCoder spans a broad spectrum of the field and covers cross-file dependencies, class declarations, and global variables. It incorporates 1026 Python functions and 1243 Java methods extracted from GitHub, along with 253 examples from the Rosalind Project, all pertaining to bioinformatics. Using topic modeling we show that overall coverage of the included code is representative of the full spectrum of bioinformatics calculations. BioCoder incorporates a fuzz-testing framework for evaluation. We have applied it to evaluate many models including InCoder, CodeGen, CodeGen2, SantaCoder, StarCoder, StarCoder+, InstructCodeT5+, GPT-3.5, and GPT-4. Furthermore, we finetuned StarCoder, demonstrating how our dataset can effectively enhance the performance of LLMs on our benchmark (by >15% in terms of Pass@K in certain prompt configurations and always >3%). The results highlight two key aspects of successful models: (1) Successful models accommodate a long prompt (> ~2600 tokens) with full context, for functional dependencies. (2) They contain specific domain knowledge of bioinformatics, beyond just general coding knowledge. This is evident from the performance gain of GPT-3.5/4 compared to the smaller models on the benchmark (50% vs up to ~25%). Our dataset, benchmark, Docker images, and scripts required for testing are all available at https://github.com/gersteinlab/biocoder.
    SMT 2.0: A Surrogate Modeling Toolbox with a focus on Hierarchical and Mixed Variables Gaussian Processes. (arXiv:2305.13998v4 [cs.LG] UPDATED)
    The Surrogate Modeling Toolbox (SMT) is an open-source Python package that offers a collection of surrogate modeling methods, sampling techniques, and a set of sample problems. This paper presents SMT 2.0, a major new release of SMT that introduces significant upgrades and new features to the toolbox. This release adds the capability to handle mixed-variable surrogate models and hierarchical variables. These types of variables are becoming increasingly important in several surrogate modeling applications. SMT 2.0 also improves SMT by extending sampling methods, adding new surrogate models, and computing variance and kernel derivatives for Kriging. This release also includes new functions to handle noisy and use multifidelity data. To the best of our knowledge, SMT 2.0 is the first open-source surrogate library to propose surrogate models for hierarchical and mixed inputs. This open-source software is distributed under the New BSD license.
    Towards Redundancy-Free Sub-networks in Continual Learning. (arXiv:2312.00840v1 [cs.LG])
    Catastrophic Forgetting (CF) is a prominent issue in continual learning. Parameter isolation addresses this challenge by masking a sub-network for each task to mitigate interference with old tasks. However, these sub-networks are constructed relying on weight magnitude, which does not necessarily correspond to the importance of weights, resulting in maintaining unimportant weights and constructing redundant sub-networks. To overcome this limitation, inspired by information bottleneck, which removes redundancy between adjacent network layers, we propose \textbf{\underline{I}nformation \underline{B}ottleneck \underline{M}asked sub-network (IBM)} to eliminate redundancy within sub-networks. Specifically, IBM accumulates valuable information into essential weights to construct redundancy-free sub-networks, not only effectively mitigating CF by freezing the sub-networks but also facilitating new tasks training through the transfer of valuable knowledge. Additionally, IBM decomposes hidden representations to automate the construction process and make it flexible. Extensive experiments demonstrate that IBM consistently outperforms state-of-the-art methods. Notably, IBM surpasses the state-of-the-art parameter isolation method with a 70\% reduction in the number of parameters within sub-networks and an 80\% decrease in training time.
    MetaFormer Baselines for Vision. (arXiv:2210.13452v3 [cs.CV] UPDATED)
    MetaFormer, the abstracted architecture of Transformer, has been found to play a significant role in achieving competitive performance. In this paper, we further explore the capacity of MetaFormer, again, without focusing on token mixer design: we introduce several baseline models under MetaFormer using the most basic or common mixers, and summarize our observations as follows: (1) MetaFormer ensures solid lower bound of performance. By merely adopting identity mapping as the token mixer, the MetaFormer model, termed IdentityFormer, achieves >80% accuracy on ImageNet-1K. (2) MetaFormer works well with arbitrary token mixers. When specifying the token mixer as even a random matrix to mix tokens, the resulting model RandFormer yields an accuracy of >81%, outperforming IdentityFormer. Rest assured of MetaFormer's results when new token mixers are adopted. (3) MetaFormer effortlessly offers state-of-the-art results. With just conventional token mixers dated back five years ago, the models instantiated from MetaFormer already beat state of the art. (a) ConvFormer outperforms ConvNeXt. Taking the common depthwise separable convolutions as the token mixer, the model termed ConvFormer, which can be regarded as pure CNNs, outperforms the strong CNN model ConvNeXt. (b) CAFormer sets new record on ImageNet-1K. By simply applying depthwise separable convolutions as token mixer in the bottom stages and vanilla self-attention in the top stages, the resulting model CAFormer sets a new record on ImageNet-1K: it achieves an accuracy of 85.5% at 224x224 resolution, under normal supervised training without external data or distillation. In our expedition to probe MetaFormer, we also find that a new activation, StarReLU, reduces 71% FLOPs of activation compared with GELU yet achieves better performance. We expect StarReLU to find great potential in MetaFormer-like models alongside other neural networks.
    Achieving the Minimax Optimal Sample Complexity of Offline Reinforcement Learning: A DRO-Based Approach. (arXiv:2305.13289v3 [cs.LG] UPDATED)
    Offline reinforcement learning aims to learn from pre-collected datasets without active exploration. This problem faces significant challenges, including limited data availability and distributional shifts. Existing approaches adopt a pessimistic stance towards uncertainty by penalizing rewards of under-explored state-action pairs to estimate value functions conservatively. In this paper, we show that the distributionally robust optimization (DRO) based approach can also address these challenges and is minimax optimal. Specifically, we directly model the uncertainty in the transition kernel and construct an uncertainty set of statistically plausible transition kernels. We then find the policy that optimizes the worst-case performance over this uncertainty set. We first design a metric-based Hoeffding-style uncertainty set such that with high probability the true transition kernel is in this set. We prove that to achieve a sub-optimality gap of $\epsilon$, the sample complexity is $\mathcal{O}(S^2C^{\pi^*}\epsilon^{-2}(1-\gamma)^{-4})$, where $\gamma$ is the discount factor, $S$ is the number of states, and $C^{\pi^*}$ is the single-policy clipped concentrability coefficient which quantifies the distribution shift. To achieve the optimal sample complexity, we further propose a less conservative Bernstein-style uncertainty set, which, however, does not necessarily include the true transition kernel. We show that an improved sample complexity of $\mathcal{O}(SC^{\pi^*}\epsilon^{-2}(1-\gamma)^{-3})$ can be obtained, which matches with the minimax lower bound for offline reinforcement learning, and thus is minimax optimal.
    A Theoretical Perspective of Machine Learning with Computational Resource Concerns. (arXiv:2305.02217v3 [cs.LG] UPDATED)
    Conventional theoretical machine learning studies generally assume explicitly or implicitly that there are enough or even infinitely supplied computational resources. In real practice, however, computational resources are usually limited, and the performance of machine learning depends not only on how many data have been received, but also on how many data can be handled with the computational resources available. Note that most current ``intelligent supercomputing'' facilities work like exclusive operating systems, where a fixed amount of resources are allocated to a machine learning task without adaptive scheduling strategies considering important factors such as learning performance demands and learning process status. In this article, we introduce the notion of machine learning throughput, define Computational Resource Efficient Learning (CoRE-Learning) and present a theoretical framework that takes into account the influence of computational resources in learning theory. This framework can be naturally applied to stream learning where the incoming data streams can be potentially endless with overwhelming size and it is impractical to assume that all received data can be handled in time. It may also provide a theoretical perspective for the design of intelligent supercomputing operating systems.
    InceptionCaps: A Performant Glaucoma Classification Model for Data-scarce Environment. (arXiv:2312.00803v1 [cs.CV])
    Glaucoma is an irreversible ocular disease and is the second leading cause of visual disability worldwide. Slow vision loss and the asymptomatic nature of the disease make its diagnosis challenging. Early detection is crucial for preventing irreversible blindness. Ophthalmologists primarily use retinal fundus images as a non-invasive screening method. Convolutional neural networks (CNN) have demonstrated high accuracy in the classification of medical images. Nevertheless, CNN's translation-invariant nature and inability to handle the part-whole relationship between objects make its direct application unsuitable for glaucomatous fundus image classification, as it requires a large number of labelled images for training. This work reviews existing state of the art models and proposes InceptionCaps, a novel capsule network (CapsNet) based deep learning model having pre-trained InceptionV3 as its convolution base, for automatic glaucoma classification. InceptionCaps achieved an accuracy of 0.956, specificity of 0.96, and AUC of 0.9556, which surpasses several state-of-the-art deep learning model performances on the RIM-ONE v2 dataset. The obtained result demonstrates the robustness of the proposed deep learning model.
    Policy Evaluation for Temporal and/or Spatial Dependent Experiments. (arXiv:2202.10887v5 [stat.ME] UPDATED)
    The aim of this paper is to establish a causal link between the policies implemented by technology companies and the outcomes they yield within intricate temporal and/or spatial dependent experiments. We propose a novel temporal/spatio-temporal Varying Coefficient Decision Process (VCDP) model, capable of effectively capturing the evolving treatment effects in situations characterized by temporal and/or spatial dependence. Our methodology encompasses the decomposition of the Average Treatment Effect (ATE) into the Direct Effect (DE) and the Indirect Effect (IE). We subsequently devise comprehensive procedures for estimating and making inferences about both DE and IE. Additionally, we provide a rigorous analysis of the statistical properties of these procedures, such as asymptotic power. To substantiate the effectiveness of our approach, we carry out extensive simulations and real data analyses.
    FIND: A Function Description Benchmark for Evaluating Interpretability Methods. (arXiv:2309.03886v2 [cs.CL] UPDATED)
    Labeling neural network submodules with human-legible descriptions is useful for many downstream tasks: such descriptions can surface failures, guide interventions, and perhaps even explain important model behaviors. To date, most mechanistic descriptions of trained networks have involved small models, narrowly delimited phenomena, and large amounts of human labor. Labeling all human-interpretable sub-computations in models of increasing size and complexity will almost certainly require tools that can generate and validate descriptions automatically. Recently, techniques that use learned models in-the-loop for labeling have begun to gain traction, but methods for evaluating their efficacy are limited and ad-hoc. How should we validate and compare open-ended labeling tools? This paper introduces FIND (Function INterpretation and Description), a benchmark suite for evaluating the building blocks of automated interpretability methods. FIND contains functions that resemble components of trained neural networks, and accompanying descriptions of the kind we seek to generate. The functions span textual and numeric domains, and involve a range of real-world complexities. We evaluate methods that use pretrained language models (LMs) to produce descriptions of function behavior in natural language and code. Additionally, we introduce a new interactive method in which an Automated Interpretability Agent (AIA) generates function descriptions. We find that an AIA, built from an LM with black-box access to functions, can infer function structure, acting as a scientist by forming hypotheses, proposing experiments, and updating descriptions in light of new data. However, AIA descriptions tend to capture global function behavior and miss local details. These results suggest that FIND will be useful for evaluating more sophisticated interpretability methods before they are applied to real-world models.
    Universality and approximation bounds for echo state networks with random weights. (arXiv:2206.05669v3 [cs.LG] UPDATED)
    We study the uniform approximation of echo state networks with randomly generated internal weights. These models, in which only the readout weights are optimized during training, have made empirical success in learning dynamical systems. Recent results showed that echo state networks with ReLU activation are universal. In this paper, we give an alternative construction and prove that the universality holds for general activation functions. Specifically, our main result shows that, under certain condition on the activation function, there exists a sampling procedure for the internal weights so that the echo state network can approximate any continuous casual time-invariant operators with high probability. In particular, for ReLU activation, we give explicit construction for these sampling procedures. We also quantify the approximation error of the constructed ReLU echo state networks for sufficiently regular operators.
    Improving Robustness with Adaptive Weight Decay. (arXiv:2210.00094v2 [cs.LG] UPDATED)
    We propose adaptive weight decay, which automatically tunes the hyper-parameter for weight decay during each training iteration. For classification problems, we propose changing the value of the weight decay hyper-parameter on the fly based on the strength of updates from the classification loss (i.e., gradient of cross-entropy), and the regularization loss (i.e., $\ell_2$-norm of the weights). We show that this simple modification can result in large improvements in adversarial robustness -- an area which suffers from robust overfitting -- without requiring extra data across various datasets and architecture choices. For example, our reformulation results in $20\%$ relative robustness improvement for CIFAR-100, and $10\%$ relative robustness improvement on CIFAR-10 comparing to the best tuned hyper-parameters of traditional weight decay resulting in models that have comparable performance to SOTA robustness methods. In addition, this method has other desirable properties, such as less sensitivity to learning rate, and smaller weight norms, which the latter contributes to robustness to overfitting to label noise, and pruning.
    Coneheads: Hierarchy Aware Attention. (arXiv:2306.00392v2 [cs.LG] UPDATED)
    Attention networks such as transformers have achieved state-of-the-art performance in many domains. These networks rely heavily on the dot product attention operator, which computes the similarity between two points by taking their inner product. However, the inner product does not explicitly model the complex structural properties of real world datasets, such as hierarchies between data points. To remedy this, we introduce cone attention, a drop-in replacement for dot product attention based on hyperbolic entailment cones. Cone attention associates two points by the depth of their lowest common ancestor in a hierarchy defined by hyperbolic cones, which intuitively measures the divergence of two points and gives a hierarchy aware similarity score. We test cone attention on a wide variety of models and tasks and show that it improves task-level performance over dot product attention and other baselines, and is able to match dot-product attention with significantly fewer parameters. Our results suggest that cone attention is an effective way to capture hierarchical relationships when calculating attention.
    ESM-NBR: fast and accurate nucleic acid-binding residue prediction via protein language model feature representation and multi-task learning. (arXiv:2312.00842v1 [q-bio.QM])
    Protein-nucleic acid interactions play a very important role in a variety of biological activities. Accurate identification of nucleic acid-binding residues is a critical step in understanding the interaction mechanisms. Although many computationally based methods have been developed to predict nucleic acid-binding residues, challenges remain. In this study, a fast and accurate sequence-based method, called ESM-NBR, is proposed. In ESM-NBR, we first use the large protein language model ESM2 to extract discriminative biological properties feature representation from protein primary sequences; then, a multi-task deep learning model composed of stacked bidirectional long short-term memory (BiLSTM) and multi-layer perceptron (MLP) networks is employed to explore common and private information of DNA- and RNA-binding residues with ESM2 feature as input. Experimental results on benchmark data sets demonstrate that the prediction performance of ESM2 feature representation comprehensively outperforms evolutionary information-based hidden Markov model (HMM) features. Meanwhile, the ESM-NBR obtains the MCC values for DNA-binding residues prediction of 0.427 and 0.391 on two independent test sets, which are 18.61 and 10.45% higher than those of the second-best methods, respectively. Moreover, by completely discarding the time-cost multiple sequence alignment process, the prediction speed of ESM-NBR far exceeds that of existing methods (5.52s for a protein sequence of length 500, which is about 16 times faster than the second-fastest method). A user-friendly standalone package and the data of ESM-NBR are freely available for academic use at: https://github.com/wwzll123/ESM-NBR.
    Adaptive Multi-Modality Prompt Learning. (arXiv:2312.00823v1 [cs.LG])
    Although current prompt learning methods have successfully been designed to effectively reuse the large pre-trained models without fine-tuning their large number of parameters, they still have limitations to be addressed, i.e., without considering the adverse impact of meaningless patches in every image and without simultaneously considering in-sample generalization and out-of-sample generalization. In this paper, we propose an adaptive multi-modality prompt learning to address the above issues. To do this, we employ previous text prompt learning and propose a new image prompt learning. The image prompt learning achieves in-sample and out-of-sample generalization, by first masking meaningless patches and then padding them with the learnable parameters and the information from texts. Moreover, each of the prompts provides auxiliary information to each other, further strengthening these two kinds of generalization. Experimental results on real datasets demonstrate that our method outperforms SOTA methods, in terms of different downstream tasks.
    Variational Self-Supervised Contrastive Learning Using Beta Divergence. (arXiv:2312.00824v1 [cs.CV])
    Learning a discriminative semantic space using unlabelled and noisy data remains unaddressed in a multi-label setting. We present a contrastive self-supervised learning method which is robust to data noise, grounded in the domain of variational methods. The method (VCL) utilizes variational contrastive learning with beta-divergence to learn robustly from unlabelled datasets, including uncurated and noisy datasets. We demonstrate the effectiveness of the proposed method through rigorous experiments including linear evaluation and fine-tuning scenarios with multi-label datasets in the face understanding domain. In almost all tested scenarios, VCL surpasses the performance of state-of-the-art self-supervised methods, achieving a noteworthy increase in accuracy.
    PROFL: A Privacy-Preserving Federated Learning Method with Stringent Defense Against Poisoning Attacks. (arXiv:2312.01045v1 [cs.CR])
    Federated Learning (FL) faces two major issues: privacy leakage and poisoning attacks, which may seriously undermine the reliability and security of the system. Overcoming them simultaneously poses a great challenge. This is because privacy protection policies prohibit access to users' local gradients to avoid privacy leakage, while Byzantine-robust methods necessitate access to these gradients to defend against poisoning attacks. To address these problems, we propose a novel privacy-preserving Byzantine-robust FL framework PROFL. PROFL is based on the two-trapdoor additional homomorphic encryption algorithm and blinding techniques to ensure the data privacy of the entire FL process. During the defense process, PROFL first utilize secure Multi-Krum algorithm to remove malicious gradients at the user level. Then, according to the Pauta criterion, we innovatively propose a statistic-based privacy-preserving defense algorithm to eliminate outlier interference at the feature level and resist impersonation poisoning attacks with stronger concealment. Detailed theoretical analysis proves the security and efficiency of the proposed method. We conducted extensive experiments on two benchmark datasets, and PROFL improved accuracy by 39% to 75% across different attack settings compared to similar privacy-preserving robust methods, demonstrating its significant advantage in robustness.
    Unveiling the Power of Audio-Visual Early Fusion Transformers with Dense Interactions through Masked Modeling. (arXiv:2312.01017v1 [cs.CV])
    Humans possess a remarkable ability to integrate auditory and visual information, enabling a deeper understanding of the surrounding environment. This early fusion of audio and visual cues, demonstrated through cognitive psychology and neuroscience research, offers promising potential for developing multimodal perception models. However, training early fusion architectures poses significant challenges, as the increased model expressivity requires robust learning frameworks to harness their enhanced capabilities. In this paper, we address this challenge by leveraging the masked reconstruction framework, previously successful in unimodal settings, to train audio-visual encoders with early fusion. Additionally, we propose an attention-based fusion module that captures interactions between local audio and visual representations, enhancing the model's ability to capture fine-grained interactions. While effective, this procedure can become computationally intractable, as the number of local representations increases. Thus, to address the computational complexity, we propose an alternative procedure that factorizes the local representations before representing audio-visual interactions. Extensive evaluations on a variety of datasets demonstrate the superiority of our approach in audio-event classification, visual sound localization, sound separation, and audio-visual segmentation. These contributions enable the efficient training of deeply integrated audio-visual models and significantly advance the usefulness of early fusion architectures.
    Joint Prompt Optimization of Stacked LLMs using Variational Inference. (arXiv:2306.12509v2 [cs.CL] UPDATED)
    Large language models (LLMs) can be seen as atomic units of computation mapping sequences to a distribution over sequences. Thus, they can be seen as stochastic language layers in a language network, where the learnable parameters are the natural language prompts at each layer. By stacking two such layers and feeding the output of one layer to the next, we obtain a Deep Language Network (DLN). We first show how to effectively perform prompt optimization for a 1-Layer language network (DLN-1). Then, we present an extension that applies to 2-layer DLNs (DLN-2), where two prompts must be learned. The key idea is to consider the output of the first layer as a latent variable, which requires inference, and prompts to be learned as the parameters of the generative distribution. We first test the effectiveness of DLN-1 in multiple reasoning and natural language understanding tasks. Then, we show that DLN-2 can reach higher performance than a single layer, showing promise that we might reach comparable performance to GPT-4, even when each LLM in the network is smaller and less powerful.
    Adversarial Training for Graph Neural Networks: Pitfalls, Solutions, and New Directions. (arXiv:2306.15427v2 [cs.LG] UPDATED)
    Despite its success in the image domain, adversarial training did not (yet) stand out as an effective defense for Graph Neural Networks (GNNs) against graph structure perturbations. In the pursuit of fixing adversarial training (1) we show and overcome fundamental theoretical as well as practical limitations of the adopted graph learning setting in prior work; (2) we reveal that more flexible GNNs based on learnable graph diffusion are able to adjust to adversarial perturbations, while the learned message passing scheme is naturally interpretable; (3) we introduce the first attack for structure perturbations that, while targeting multiple nodes at once, is capable of handling global (graph-level) as well as local (node-level) constraints. Including these contributions, we demonstrate that adversarial training is a state-of-the-art defense against adversarial structure perturbations.
    Linking convolutional kernel size to generalization bias in face analysis CNNs. (arXiv:2302.03750v2 [cs.CV] UPDATED)
    Training dataset biases are by far the most scrutinized factors when explaining algorithmic biases of neural networks. In contrast, hyperparameters related to the neural network architecture have largely been ignored even though different network parameterizations are known to induce different implicit biases over learned features. For example, convolutional kernel size is known to affect the frequency content of features learned in CNNs. In this work, we present a causal framework for linking an architectural hyperparameter to out-of-distribution algorithmic bias. Our framework is experimental, in that we train several versions of a network with an intervention to a specific hyperparameter, and measure the resulting causal effect of this choice on performance bias when a particular out-of-distribution image perturbation is applied. In our experiments, we focused on measuring the causal relationship between convolutional kernel size and face analysis classification bias across different subpopulations (race/gender), with respect to high-frequency image details. We show that modifying kernel size, even in one layer of a CNN, changes the frequency content of learned features significantly across data subgroups leading to biased generalization performance even in the presence of a balanced dataset.  ( 2 min )
    Make the U in UDA Matter: Invariant Consistency Learning for Unsupervised Domain Adaptation. (arXiv:2309.12742v2 [cs.LG] UPDATED)
    Domain Adaptation (DA) is always challenged by the spurious correlation between domain-invariant features (e.g., class identity) and domain-specific features (e.g., environment) that does not generalize to the target domain. Unfortunately, even enriched with additional unsupervised target domains, existing Unsupervised DA (UDA) methods still suffer from it. This is because the source domain supervision only considers the target domain samples as auxiliary data (e.g., by pseudo-labeling), yet the inherent distribution in the target domain -- where the valuable de-correlation clues hide -- is disregarded. We propose to make the U in UDA matter by giving equal status to the two domains. Specifically, we learn an invariant classifier whose prediction is simultaneously consistent with the labels in the source domain and clusters in the target domain, hence the spurious correlation inconsistent in the target domain is removed. We dub our approach "Invariant CONsistency learning" (ICON). Extensive experiments show that ICON achieves the state-of-the-art performance on the classic UDA benchmarks: Office-Home and VisDA-2017, and outperforms all the conventional methods on the challenging WILDS 2.0 benchmark. Codes are in https://github.com/yue-zhongqi/ICON.  ( 2 min )
    Learning image representations for anomaly detection: application to discovery of histological alterations in drug development. (arXiv:2210.07675v6 [cs.CV] UPDATED)
    We present a system for anomaly detection in histopathological images. In histology, normal samples are usually abundant, whereas anomalous (pathological) cases are scarce or not available. Under such settings, one-class classifiers trained on healthy data can detect out-of-distribution anomalous samples. Such approaches combined with pre-trained Convolutional Neural Network (CNN) representations of images were previously employed for anomaly detection (AD). However, pre-trained off-the-shelf CNN representations may not be sensitive to abnormal conditions in tissues, while natural variations of healthy tissue may result in distant representations. To adapt representations to relevant details in healthy tissue we propose training a CNN on an auxiliary task that discriminates healthy tissue of different species, organs, and staining reagents. Almost no additional labeling workload is required, since healthy samples come automatically with aforementioned labels. During training we enforce compact image representations with a center-loss term, which further improves representations for AD. The proposed system outperforms established AD methods on a published dataset of liver anomalies. Moreover, it provided comparable results to conventional methods specifically tailored for quantification of liver anomalies. We show that our approach can be used for toxicity assessment of candidate drugs at early development stages and thereby may reduce expensive late-stage drug attrition.  ( 3 min )
    A Bayesian Federated Learning Framework with Online Laplace Approximation. (arXiv:2102.01936v3 [cs.LG] UPDATED)
    Federated learning (FL) allows multiple clients to collaboratively learn a globally shared model through cycles of model aggregation and local model training, without the need to share data. Most existing FL methods train local models separately on different clients, and then simply average their parameters to obtain a centralized model on the server side. However, these approaches generally suffer from large aggregation errors and severe local forgetting, which are particularly bad in heterogeneous data settings. To tackle these issues, in this paper, we propose a novel FL framework that uses online Laplace approximation to approximate posteriors on both the client and server side. On the server side, a multivariate Gaussian product mechanism is employed to construct and maximize a global posterior, largely reducing the aggregation errors induced by large discrepancies between local models. On the client side, a prior loss that uses the global posterior probabilistic parameters delivered from the server is designed to guide the local training. Binding such learning constraints from other clients enables our method to mitigate local forgetting. Finally, we achieve state-of-the-art results on several benchmarks, clearly demonstrating the advantages of the proposed method.  ( 3 min )
    Invariance is Key to Generalization: Examining the Role of Representation in Sim-to-Real Transfer for Visual Navigation. (arXiv:2310.15020v2 [cs.RO] UPDATED)
    The data-driven approach to robot control has been gathering pace rapidly, yet generalization to unseen task domains remains a critical challenge. We argue that the key to generalization is representations that are (i) rich enough to capture all task-relevant information and (ii) invariant to superfluous variability between the training and the test domains. We experimentally study such a representation -- containing both depth and semantic information -- for visual navigation and show that it enables a control policy trained entirely in simulated indoor scenes to generalize to diverse real-world environments, both indoors and outdoors. Further, we show that our representation reduces the A-distance between the training and test domains, improving the generalization error bound as a result. Our proposed approach is scalable: the learned policy improves continuously, as the foundation models that it exploits absorb more diverse data during pre-training.  ( 2 min )
    Learning Structure-from-Motion with Graph Attention Networks. (arXiv:2308.15984v2 [cs.CV] UPDATED)
    In this paper we tackle the problem of learning Structure-from-Motion (SfM) through the use of graph attention networks. SfM is a classic computer vision problem that is solved though iterative minimization of reprojection errors, referred to as Bundle Adjustment (BA), starting from a good initialization. In order to obtain a good enough initialization to BA, conventional methods rely on a sequence of sub-problems (such as pairwise pose estimation, pose averaging or triangulation) which provides an initial solution that can then be refined using BA. In this work we replace these sub-problems by learning a model that takes as input the 2D keypoints detected across multiple views, and outputs the corresponding camera poses and 3D keypoint coordinates. Our model takes advantage of graph neural networks to learn SfM-specific primitives, and we show that it can be used for fast inference of the reconstruction for new and unseen sequences. The experimental results show that the proposed model outperforms competing learning-based methods, and challenges COLMAP while having lower runtime.
    What does a platypus look like? Generating customized prompts for zero-shot image classification. (arXiv:2209.03320v3 [cs.CV] UPDATED)
    Open-vocabulary models are a promising new paradigm for image classification. Unlike traditional classification models, open-vocabulary models classify among any arbitrary set of categories specified with natural language during inference. This natural language, called "prompts", typically consists of a set of hand-written templates (e.g., "a photo of a {}") which are completed with each of the category names. This work introduces a simple method to generate higher accuracy prompts, without relying on any explicit knowledge of the task domain and with far fewer hand-constructed sentences. To achieve this, we combine open-vocabulary models with large language models (LLMs) to create Customized Prompts via Language models (CuPL, pronounced "couple"). In particular, we leverage the knowledge contained in LLMs in order to generate many descriptive sentences that contain important discriminating characteristics of the image categories. This allows the model to place a greater importance on these regions in the image when making predictions. We find that this straightforward and general approach improves accuracy on a range of zero-shot image classification benchmarks, including over one percentage point gain on ImageNet. Finally, this simple baseline requires no additional training and remains completely zero-shot. Code available at https://github.com/sarahpratt/CuPL.  ( 2 min )
    Spatially scalable recursive estimation of Gaussian process terrain maps using local basis functions. (arXiv:2210.09168v2 [cs.LG] UPDATED)
    When an agent, person, vehicle or robot is moving through an unknown environment without GNSS signals, online mapping of nonlinear terrains can be used to improve position estimates when the agent returns to a previously mapped area. Mapping algorithms using online Gaussian process (GP) regression are commonly integrated in algorithms for simultaneous localisation and mapping (SLAM). However, GP mapping algorithms have increasing computational demands as the mapped area expands relative to spatial field variations. This is due to the need for estimating an increasing amount of map parameters as the area of the map grows. Contrary to this, we propose a recursive GP mapping estimation algorithm which uses local basis functions in an information filter to achieve spatial scalability. Our proposed approximation employs a global grid of finite support basis functions but restricts computations to a localized subset around each prediction point. As our proposed algorithm is recursive, it can naturally be incorporated into existing algorithms that uses Gaussian process maps for SLAM. Incorporating our proposed algorithm into an extended Kalman filter (EKF) for magnetic field SLAM reduces the overall computational complexity of the algorithm. We show experimentally that our algorithm is faster than existing methods when the mapped area is large and the map is based on many measurements, both for recursive mapping tasks and for magnetic field SLAM.
    Classification of Spam URLs Using Machine Learning Approaches. (arXiv:2310.05953v2 [cs.CR] UPDATED)
    The Internet is used by billions of users every day because it offers fast and free communication tools and platforms. Nevertheless, with this significant increase in usage, huge amounts of spam are generated every second, which wastes internet resources and, more importantly, users' time. This study investigates the use of machine learning models to classify URLs as spam or nonspam. We first extract the features from the URL as it has only one feature, and then we compare the performance of several models, including k nearest neighbors, bagging, random forest, logistic regression, and others. Experimental results demonstrate that bagging outperformed other models and achieved the highest accuracy of 98.64%. In addition, bagging outperformed the current state-of-the-art approaches which emphasize its effectiveness in addressing spam-related challenges on the Internet. This suggests that bagging is a promising approach for URL spam classification.  ( 2 min )
    Optimal Scalarizations for Sublinear Hypervolume Regret. (arXiv:2307.03288v2 [cs.LG] UPDATED)
    Scalarization is a general technique that can be deployed in any multiobjective setting to reduce multiple objectives into one, such as recently in RLHF for training reward models that align human preferences. Yet some have dismissed this classical approach because linear scalarizations are known to miss concave regions of the Pareto frontier. To that end, we aim to find simple non-linear scalarizations that can explore a diverse set of $k$ objectives on the Pareto frontier, as measured by the dominated hypervolume. We show that hypervolume scalarizations with uniformly random weights are surprisingly optimal for provably minimizing the hypervolume regret, achieving an optimal sublinear regret bound of $O(T^{-1/k})$, with matching lower bounds that preclude any algorithm from doing better asymptotically. As a theoretical case study, we consider the multiobjective stochastic linear bandits problem and demonstrate that by exploiting the sublinear regret bounds of the hypervolume scalarizations, we can derive a novel non-Euclidean analysis that produces improved hypervolume regret bounds of $\tilde{O}( d T^{-1/2} + T^{-1/k})$. We support our theory with strong empirical performance of using simple hypervolume scalarizations that consistently outperforms both the linear and Chebyshev scalarizations, as well as standard multiobjective algorithms in bayesian optimization, such as EHVI.
    An Accurate and Fully-Automated Ensemble Model for Weekly Time Series Forecasting. (arXiv:2010.08158v2 [cs.LG] UPDATED)
    Many businesses and industries require accurate forecasts for weekly time series nowadays. However, the forecasting literature does not currently provide easy-to-use, automatic, reproducible and accurate approaches dedicated to this task. We propose a forecasting method in this domain to fill this gap, leveraging state-of-the-art forecasting techniques, such as forecast combination, meta-learning, and global modelling. We consider different meta-learning architectures, algorithms, and base model pools. Based on all considered model variants, we propose to use a stacking approach with lasso regression which optimally combines the forecasts of four base models: a global Recurrent Neural Network model (RNN), Theta, Trigonometric Box-Cox ARMA Trend Seasonal (TBATS) and Dynamic Harmonic Regression ARIMA (DHR-ARIMA), as it shows the overall best performance across seven experimental weekly datasets on four evaluation metrics. Our proposed method also consistently outperforms a set of benchmarks and state-of-the-art weekly forecasting models by a considerable margin with statistical significance. Our method can produce the most accurate forecasts, in terms of mean sMAPE, for the M4 weekly dataset among all benchmarks and all original competition participants.  ( 3 min )
    Informative Priors Improve the Reliability of Multimodal Clinical Data Classification. (arXiv:2312.00794v1 [cs.CV])
    Machine learning-aided clinical decision support has the potential to significantly improve patient care. However, existing efforts in this domain for principled quantification of uncertainty have largely been limited to applications of ad-hoc solutions that do not consistently improve reliability. In this work, we consider stochastic neural networks and design a tailor-made multimodal data-driven (M2D2) prior distribution over network parameters. We use simple and scalable Gaussian mean-field variational inference to train a Bayesian neural network using the M2D2 prior. We train and evaluate the proposed approach using clinical time-series data in MIMIC-IV and corresponding chest X-ray images in MIMIC-CXR for the classification of acute care conditions. Our empirical results show that the proposed method produces a more reliable predictive model compared to deterministic and Bayesian neural network baselines.
    Advanced Language Model-Driven Verilog Development: Enhancing Power, Performance, and Area Optimization in Code Synthesis. (arXiv:2312.01022v1 [cs.LG])
    The increasing use of Advanced Language Models (ALMs) in diverse sectors, particularly due to their impressive capability to generate top-tier content following linguistic instructions, forms the core of this investigation. This study probes into ALMs' deployment in electronic hardware design, with a specific emphasis on the synthesis and enhancement of Verilog programming. We introduce an innovative framework, crafted to assess and amplify ALMs' productivity in this niche. The methodology commences with the initial crafting of Verilog programming via ALMs, succeeded by a distinct dual-stage refinement protocol. The premier stage prioritizes augmenting the code's operational and linguistic precision, while the latter stage is dedicated to aligning the code with Power-Performance-Area (PPA) benchmarks, a pivotal component in proficient hardware design. This bifurcated strategy, merging error remediation with PPA enhancement, has yielded substantial upgrades in the caliber of ALM-created Verilog programming. Our framework achieves an 81.37% rate in linguistic accuracy and 62.0% in operational efficacy in programming synthesis, surpassing current leading-edge techniques, such as 73% in linguistic accuracy and 46% in operational efficacy. These findings illuminate ALMs' aptitude in tackling complex technical domains and signal a positive shift in the mechanization of hardware design operations.
    Tree of Thoughts: Deliberate Problem Solving with Large Language Models. (arXiv:2305.10601v2 [cs.CL] UPDATED)
    Language models are increasingly being deployed for general problem solving across a wide range of tasks, but are still confined to token-level, left-to-right decision-making processes during inference. This means they can fall short in tasks that require exploration, strategic lookahead, or where initial decisions play a pivotal role. To surmount these challenges, we introduce a new framework for language model inference, Tree of Thoughts (ToT), which generalizes over the popular Chain of Thought approach to prompting language models, and enables exploration over coherent units of text (thoughts) that serve as intermediate steps toward problem solving. ToT allows LMs to perform deliberate decision making by considering multiple different reasoning paths and self-evaluating choices to decide the next course of action, as well as looking ahead or backtracking when necessary to make global choices. Our experiments show that ToT significantly enhances language models' problem-solving abilities on three novel tasks requiring non-trivial planning or search: Game of 24, Creative Writing, and Mini Crosswords. For instance, in Game of 24, while GPT-4 with chain-of-thought prompting only solved 4% of tasks, our method achieved a success rate of 74%. Code repo with all prompts: https://github.com/princeton-nlp/tree-of-thought-llm.
    SEED: Domain-Specific Data Curation With Large Language Models. (arXiv:2310.00749v2 [cs.DB] UPDATED)
    Data curation tasks that prepare data for analytics are critical for turning data into actionable insights. However, due to the diverse requirements of applications in different domains, generic off-the-shelf tools are typically insufficient. As a result, data scientists often have to develop domain-specific solutions tailored to both the dataset and the task, e.g. writing domain-specific code or training machine learning models on a sufficient number of annotated examples. This process is notoriously difficult and time-consuming. We present SEED, an LLM-as-compiler approach that automatically generates domain-specific data curation solutions via Large Language Models (LLMs). Once the user describes a task, input data, and expected output, the SEED compiler produces an executable pipeline composed of LLM-generated code, small model, and data access modules. SEED uses these generated modules to process most of the data records and dynamically decides when the LLM should step in to directly process some individual records, possibly using the data-access modules to retrieve relevant information from the data sources to assist the LLM in solving the task. To validate this new, revolutionary approach, we conducted experiments on 9 datasets spanning over 5 data curation tasks. The results show that SEED generates domain-specific solutions that significantly outperform their generic counterparts, often approaching the performance of the manually curated solutions that use thousands of labeled training examples. Moreover, in comparison to solutions that use the LLM on every data record, SEED achieves state-of-the-art or comparable few-shot performance, while significantly reducing the number of LLM calls.
    HistAlign: Improving Context Dependency in Language Generation by Aligning with History. (arXiv:2305.04782v2 [cs.CL] UPDATED)
    Language models (LMs) can generate hallucinations and incoherent outputs, which highlights their weak context dependency. Cache-LMs, which augment LMs with a memory of recent history, can increase context dependency and have shown remarkable performance in diverse language generation tasks. However, we find that even with training, the performance gain stemming from the cache component of current cache-LMs is suboptimal due to the misalignment between the current hidden states and those stored in the memory. In this work, we present HistAlign, a new training approach to ensure good cache alignment such that the model receives useful signals from the history. We first prove our concept on a simple and synthetic task where the memory is essential for correct predictions, and we show that the cache component of HistAlign is better aligned and improves overall performance. Next, we evaluate HistAlign on diverse downstream language generation tasks, including prompt continuation, abstractive summarization, and data-to-text. We demonstrate that HistAlign improves text coherence and faithfulness in open-ended and conditional generation settings respectively. HistAlign is also generalizable across different model families, showcasing its strength in improving context dependency of LMs in diverse scenarios. Our code is publicly available at https://github.com/meetdavidwan/histalign  ( 2 min )
    Multi-modal Molecule Structure-text Model for Text-based Retrieval and Editing. (arXiv:2212.10789v2 [cs.LG] UPDATED)
    There is increasing adoption of artificial intelligence in drug discovery. However, existing studies use machine learning to mainly utilize the chemical structures of molecules but ignore the vast textual knowledge available in chemistry. Incorporating textual knowledge enables us to realize new drug design objectives, adapt to text-based instructions and predict complex biological activities. Here we present a multi-modal molecule structure-text model, MoleculeSTM, by jointly learning molecules' chemical structures and textual descriptions via a contrastive learning strategy. To train MoleculeSTM, we construct a large multi-modal dataset, namely, PubChemSTM, with over 280,000 chemical structure-text pairs. To demonstrate the effectiveness and utility of MoleculeSTM, we design two challenging zero-shot tasks based on text instructions, including structure-text retrieval and molecule editing. MoleculeSTM has two main properties: open vocabulary and compositionality via natural language. In experiments, MoleculeSTM obtains the state-of-the-art generalization ability to novel biochemical concepts across various benchmarks.  ( 2 min )
    AI planning in the imagination: High-level planning on learned abstract search spaces. (arXiv:2308.08693v2 [cs.AI] UPDATED)
    Search and planning algorithms have been a cornerstone of artificial intelligence since the field's inception. Giving reinforcement learning agents the ability to plan during execution time has resulted in significant performance improvements in various domains. However, in real-world environments, the model with respect to which the agent plans has been constrained to be grounded in the real environment itself, as opposed to a more abstract model which allows for planning over compound actions and behaviors. We propose a new method, called PiZero, that gives an agent the ability to plan in an abstract search space that the agent learns during training, which is completely decoupled from the real environment. Unlike prior approaches, this enables the agent to perform high-level planning at arbitrary timescales and reason in terms of compound or temporally-extended actions, which can be useful in environments where large numbers of base-level micro-actions are needed to perform relevant macro-actions. In addition, our method is more general than comparable prior methods because it seamlessly handles settings with continuous action spaces, combinatorial action spaces, and partial observability. We evaluate our method on multiple domains, including the traveling salesman problem, Sokoban, 2048, the facility location problem, and Pacman. Experimentally, it outperforms comparable prior methods without assuming access to an environment simulator at execution time.
    Deep Unlearning: Fast and Efficient Training-free Approach to Controlled Forgetting. (arXiv:2312.00761v2 [cs.LG] UPDATED)
    Machine unlearning has emerged as a prominent and challenging area of interest, driven in large part by the rising regulatory demands for industries to delete user data upon request and the heightened awareness of privacy. Existing approaches either retrain models from scratch or use several finetuning steps for every deletion request, often constrained by computational resource limitations and restricted access to the original training data. In this work, we introduce a novel class unlearning algorithm designed to strategically eliminate an entire class or a group of classes from the learned model. To that end, our algorithm first estimates the Retain Space and the Forget Space, representing the feature or activation spaces for samples from classes to be retained and unlearned, respectively. To obtain these spaces, we propose a novel singular value decomposition-based technique that requires layer wise collection of network activations from a few forward passes through the network. We then compute the shared information between these spaces and remove it from the forget space to isolate class-discriminatory feature space for unlearning. Finally, we project the model weights in the orthogonal direction of the class-discriminatory space to obtain the unlearned model. We demonstrate our algorithm's efficacy on ImageNet using a Vision Transformer with only $\sim$1.5% drop in retain accuracy compared to the original model while maintaining under 1% accuracy on the unlearned class samples. Further, our algorithm consistently performs well when subject to Membership Inference Attacks showing 7.8% improvement on average across a variety of image classification datasets and network architectures, as compared to other baselines while being $\sim$6x more computationally efficient.
    Latent Space Explorer: Visual Analytics for Multimodal Latent Space Exploration. (arXiv:2312.00857v1 [cs.LG])
    Machine learning models built on training data with multiple modalities can reveal new insights that are not accessible through unimodal datasets. For example, cardiac magnetic resonance images (MRIs) and electrocardiograms (ECGs) are both known to capture useful information about subjects' cardiovascular health status. A multimodal machine learning model trained from large datasets can potentially predict the onset of heart-related diseases and provide novel medical insights about the cardiovascular system. Despite the potential benefits, it is difficult for medical experts to explore multimodal representation models without visual aids and to test the predictive performance of the models on various subpopulations. To address the challenges, we developed a visual analytics system called Latent Space Explorer. Latent Space Explorer provides interactive visualizations that enable users to explore the multimodal representation of subjects, define subgroups of interest, interactively decode data with different modalities with the selected subjects, and inspect the accuracy of the embedding in downstream prediction tasks. A user study was conducted with medical experts and their feedback provided useful insights into how Latent Space Explorer can help their analysis and possible new direction for further development in the medical domain.
    Deep-Q Learning with Hybrid Quantum Neural Network on Solving Maze Problems. (arXiv:2304.10159v3 [quant-ph] UPDATED)
    Quantum computing holds great potential for advancing the limitations of machine learning algorithms to handle higher dimensions of data and reduce overall training parameters in deep learning (DL) models. This study uses a trainable variational quantum circuit (VQC) on a gate-based quantum computing model to investigate the potential for quantum benefit in a model-free reinforcement learning problem. Through a comprehensive investigation and evaluation of the current model and capabilities of quantum computers, we designed and trained a novel hybrid quantum neural network based on the latest Qiskit and PyTorch framework. We compared its performance with a full-classical CNN with and without an incorporated VQC. Our research provides insights into the potential of deep quantum learning to solve a maze problem and, potentially, other reinforcement learning problems. We conclude that reinforcement learning problems can be practical with reasonable training epochs. Moreover, a comparative study of full-classical and hybrid quantum neural networks is discussed to understand these two approaches' performance, advantages, and disadvantages to deep-Q learning problems, especially on larger-scale maze problems larger than 4x4.
    Graph Generation with $K^2$-trees. (arXiv:2305.19125v3 [cs.LG] UPDATED)
    Generating graphs from a target distribution is a significant challenge across many domains, including drug discovery and social network analysis. In this work, we introduce a novel graph generation method leveraging $K^2$-tree representation, originally designed for lossless graph compression. The $K^2$-tree representation {encompasses inherent hierarchy while enabling compact graph generation}. In addition, we make contributions by (1) presenting a sequential $K^2$-treerepresentation that incorporates pruning, flattening, and tokenization processes and (2) introducing a Transformer-based architecture designed to generate the sequence by incorporating a specialized tree positional encoding scheme. Finally, we extensively evaluate our algorithm on four general and two molecular graph datasets to confirm its superiority for graph generation.
    Beyond First-Order Tweedie: Solving Inverse Problems using Latent Diffusion. (arXiv:2312.00852v1 [cs.LG])
    Sampling from the posterior distribution poses a major computational challenge in solving inverse problems using latent diffusion models. Common methods rely on Tweedie's first-order moments, which are known to induce a quality-limiting bias. Existing second-order approximations are impractical due to prohibitive computational costs, making standard reverse diffusion processes intractable for posterior sampling. This paper introduces Second-order Tweedie sampler from Surrogate Loss (STSL), a novel sampler that offers efficiency comparable to first-order Tweedie with a tractable reverse process using second-order approximation. Our theoretical results reveal that the second-order approximation is lower bounded by our surrogate loss that only requires $O(1)$ compute using the trace of the Hessian, and by the lower bound we derive a new drift term to make the reverse process tractable. Our method surpasses SoTA solvers PSLD and P2L, achieving 4X and 8X reduction in neural function evaluations, respectively, while notably enhancing sampling quality on FFHQ, ImageNet, and COCO benchmarks. In addition, we show STSL extends to text-guided image editing and addresses residual distortions present from corrupted images in leading text-guided image editing methods. To our best knowledge, this is the first work to offer an efficient second-order approximation in solving inverse problems using latent diffusion and editing real-world images with corruptions.
    Alleviating the Effect of Data Imbalance on Adversarial Training. (arXiv:2307.10205v2 [cs.LG] UPDATED)
    In this paper, we study adversarial training on datasets that obey the long-tailed distribution, which is practical but rarely explored in previous works. Compared with conventional adversarial training on balanced datasets, this process falls into the dilemma of generating uneven adversarial examples (AEs) and an unbalanced feature embedding space, causing the resulting model to exhibit low robustness and accuracy on tail data. To combat that, we theoretically analyze the lower bound of the robust risk to train a model on a long-tailed dataset to obtain the key challenges in addressing the aforementioned dilemmas. Based on it, we propose a new adversarial training framework -- Re-balancing Adversarial Training (REAT). This framework consists of two components: (1) a new training strategy inspired by the effective number to guide the model to generate more balanced and informative AEs; (2) a carefully constructed penalty function to force a satisfactory feature space. Evaluation results on different datasets and model structures prove that REAT can effectively enhance the model's robustness and preserve the model's clean accuracy. The code can be found in https://github.com/GuanlinLee/REAT.
    Benchpress: A Scalable and Versatile Workflow for Benchmarking Structure Learning Algorithms. (arXiv:2107.03863v4 [stat.ML] UPDATED)
    Describing the relationship between the variables in a study domain and modelling the data generating mechanism is a fundamental problem in many empirical sciences. Probabilistic graphical models are one common approach to tackle the problem. Learning the graphical structure for such models is computationally challenging and a fervent area of current research with a plethora of algorithms being developed. To facilitate the benchmarking of different methods, we present a novel Snakemake workflow, called Benchpress for producing scalable, reproducible, and platform-independent benchmarks of structure learning algorithms for probabilistic graphical models. Benchpress is interfaced via a simple JSON-file, which makes it accessible for all users, while the code is designed in a fully modular fashion to enable researchers to contribute additional methodologies. Benchpress currently provides an interface to a large number of state-of-the-art algorithms from libraries such as BDgraph, BiDAG, bnlearn, causal-learn, gCastle, GOBNILP, pcalg, r.blip, scikit-learn, TETRAD, and trilearn as well as a variety of methods for data generating models and performance evaluation. Alongside user-defined models and randomly generated datasets, the workflow also includes a number of standard datasets and graphical models from the literature, which may be included in a benchmarking study. We demonstrate the applicability of this workflow for learning Bayesian networks in five typical data scenarios. The source code and documentation is publicly available from this http URL  ( 3 min )
    Conservative State Value Estimation for Offline Reinforcement Learning. (arXiv:2302.06884v2 [cs.LG] UPDATED)
    Offline reinforcement learning faces a significant challenge of value over-estimation due to the distributional drift between the dataset and the current learned policy, leading to learning failure in practice. The common approach is to incorporate a penalty term to reward or value estimation in the Bellman iterations. Meanwhile, to avoid extrapolation on out-of-distribution (OOD) states and actions, existing methods focus on conservative Q-function estimation. In this paper, we propose Conservative State Value Estimation (CSVE), a new approach that learns conservative V-function via directly imposing penalty on OOD states. Compared to prior work, CSVE allows more effective state value estimation with conservative guarantees and further better policy optimization. Further, we apply CSVE and develop a practical actor-critic algorithm in which the critic does the conservative value estimation by additionally sampling and penalizing the states \emph{around} the dataset, and the actor applies advantage weighted updates extended with state exploration to improve the policy. We evaluate in classic continual control tasks of D4RL, showing that our method performs better than the conservative Q-function learning methods and is strongly competitive among recent SOTA methods.  ( 2 min )
    DIFUSCO: Graph-based Diffusion Solvers for Combinatorial Optimization. (arXiv:2302.08224v2 [cs.LG] UPDATED)
    Neural network-based Combinatorial Optimization (CO) methods have shown promising results in solving various NP-complete (NPC) problems without relying on hand-crafted domain knowledge. This paper broadens the current scope of neural solvers for NPC problems by introducing a new graph-based diffusion framework, namely DIFUSCO. Our framework casts NPC problems as discrete {0, 1}-vector optimization problems and leverages graph-based denoising diffusion models to generate high-quality solutions. We investigate two types of diffusion models with Gaussian and Bernoulli noise, respectively, and devise an effective inference schedule to enhance the solution quality. We evaluate our methods on two well-studied NPC combinatorial optimization problems: Traveling Salesman Problem (TSP) and Maximal Independent Set (MIS). Experimental results show that DIFUSCO strongly outperforms the previous state-of-the-art neural solvers, improving the performance gap between ground-truth and neural solvers from 1.76% to 0.46% on TSP-500, from 2.46% to 1.17% on TSP-1000, and from 3.19% to 2.58% on TSP10000. For the MIS problem, DIFUSCO outperforms the previous state-of-the-art neural solver on the challenging SATLIB benchmark.  ( 2 min )
    Label Delay in Continual Learning. (arXiv:2312.00923v1 [cs.LG])
    Online continual learning, the process of training models on streaming data, has gained increasing attention in recent years. However, a critical aspect often overlooked is the label delay, where new data may not be labeled due to slow and costly annotation processes. We introduce a new continual learning framework with explicit modeling of the label delay between data and label streams over time steps. In each step, the framework reveals both unlabeled data from the current time step $t$ and labels delayed with $d$ steps, from the time step $t-d$. In our extensive experiments amounting to 1060 GPU days, we show that merely augmenting the computational resources is insufficient to tackle this challenge. Our findings underline a notable performance decline when solely relying on labeled data when the label delay becomes significant. More surprisingly, when using state-of-the-art SSL and TTA techniques to utilize the newer, unlabeled data, they fail to surpass the performance of a na\"ive method that simply trains on the delayed supervised stream. To this end, we introduce a simple, efficient baseline that rehearses from the labeled memory samples that are most similar to the new unlabeled samples. This method bridges the accuracy gap caused by label delay without significantly increasing computational complexity. We show experimentally that our method is the least affected by the label delay factor and in some cases successfully recovers the accuracy of the non-delayed counterpart. We conduct various ablations and sensitivity experiments, demonstrating the effectiveness of our approach.  ( 3 min )
    Explore to Generalize in Zero-Shot RL. (arXiv:2306.03072v2 [cs.LG] UPDATED)
    We study zero-shot generalization in reinforcement learning-optimizing a policy on a set of training tasks to perform well on a similar but unseen test task. To mitigate overfitting, previous work explored different notions of invariance to the task. However, on problems such as the ProcGen Maze, an adequate solution that is invariant to the task visualization does not exist, and therefore invariance-based approaches fail. Our insight is that learning a policy that effectively $\textit{explores}$ the domain is harder to memorize than a policy that maximizes reward for a specific task, and therefore we expect such learned behavior to generalize well; we indeed demonstrate this empirically on several domains that are difficult for invariance-based approaches. Our $\textit{Explore to Generalize}$ algorithm (ExpGen) builds on this insight: we train an additional ensemble of agents that optimize reward. At test time, either the ensemble agrees on an action, and we generalize well, or we take exploratory actions, which generalize well and drive us to a novel part of the state space, where the ensemble may potentially agree again. We show that our approach is the state-of-the-art on tasks of the ProcGen challenge that have thus far eluded effective generalization, yielding a success rate of $83\%$ on the Maze task and $74\%$ on Heist with $200$ training levels. ExpGen can also be combined with an invariance based approach to gain the best of both worlds, setting new state-of-the-art results on ProcGen.
    Quick Back-Translation for Unsupervised Machine Translation. (arXiv:2312.00912v1 [cs.CL])
    The field of unsupervised machine translation has seen significant advancement from the marriage of the Transformer and the back-translation algorithm. The Transformer is a powerful generative model, and back-translation leverages Transformer's high-quality translations for iterative self-improvement. However, the Transformer is encumbered by the run-time of autoregressive inference during back-translation, and back-translation is limited by a lack of synthetic data efficiency. We propose a two-for-one improvement to Transformer back-translation: Quick Back-Translation (QBT). QBT re-purposes the encoder as a generative model, and uses encoder-generated sequences to train the decoder in conjunction with the original autoregressive back-translation step, improving data throughput and utilization. Experiments on various WMT benchmarks demonstrate that a relatively small number of refining steps of QBT improve current unsupervised machine translation models, and that QBT dramatically outperforms standard back-translation only method in terms of training efficiency for comparable translation qualities.  ( 2 min )
    Interpreting and Improving Diffusion Models Using the Euclidean Distance Function. (arXiv:2306.04848v2 [cs.LG] UPDATED)
    Denoising is intuitively related to projection. Indeed, under the manifold hypothesis, adding random noise is approximately equivalent to orthogonal perturbation. Hence, learning to denoise is approximately learning to project. In this paper, we use this observation to reinterpret denoising diffusion models as approximate gradient descent applied to the Euclidean distance function. We then provide straight-forward convergence analysis of the DDIM sampler under simple assumptions on the projection-error of the denoiser. Finally, we propose a new sampler based on two simple modifications to DDIM using insights from our theoretical results. In as few as 5-10 function evaluations, our sampler achieves state-of-the-art FID scores on pretrained CIFAR-10 and CelebA models and can generate high quality samples on latent diffusion models.  ( 2 min )
    A Text-guided Protein Design Framework. (arXiv:2302.04611v2 [cs.LG] UPDATED)
    Current AI-assisted protein design mainly utilizes protein sequential and structural information. Meanwhile, there exists tremendous knowledge curated by humans in the text format describing proteins' high-level functionalities. Yet, whether the incorporation of such text data can help protein design tasks has not been explored. To bridge this gap, we propose ProteinDT, a multi-modal framework that leverages textual descriptions for protein design. ProteinDT consists of three subsequent steps: ProteinCLAP which aligns the representation of two modalities, a facilitator that generates the protein representation from the text modality, and a decoder that creates the protein sequences from the representation. To train ProteinDT, we construct a large dataset, SwissProtCLAP, with 441K text and protein pairs. We quantitatively verify the effectiveness of ProteinDT on three challenging tasks: (1) over 90\% accuracy for text-guided protein generation; (2) best hit ratio on 10 zero-shot text-guided protein editing tasks; (3) superior performance on four out of six protein property prediction benchmarks.  ( 2 min )
    Visualizing high-dimensional loss landscapes with Hessian directions. (arXiv:2208.13219v2 [cs.LG] UPDATED)
    Analyzing geometric properties of high-dimensional loss functions, such as local curvature and the existence of other optima around a certain point in loss space, can help provide a better understanding of the interplay between neural network structure, implementation attributes, and learning performance. In this work, we combine concepts from high-dimensional probability and differential geometry to study how curvature properties in lower-dimensional loss representations depend on those in the original loss space. We show that saddle points in the original space are rarely correctly identified as such in expected lower-dimensional representations if random projections are used. The principal curvature in the expected lower-dimensional representation is proportional to the mean curvature in the original loss space. Hence, the mean curvature in the original loss space determines if saddle points appear, on average, as either minima, maxima, or almost flat regions. We use the connection between expected curvature in random projections and mean curvature in the original space (i.e., the normalized Hessian trace) to compute Hutchinson-type trace estimates without calculating Hessian-vector products as in the original Hutchinson method. Because random projections are not suitable to correctly identify saddle information, we propose to study projections along dominant Hessian directions that are associated with the largest and smallest principal curvatures. We connect our findings to the ongoing debate on loss landscape flatness and generalizability. Finally, for different common image classifiers and a function approximator, we show and compare random and Hessian projections of loss landscapes with up to about $7\times 10^6$ parameters.  ( 3 min )
    ContriMix: Unsupervised disentanglement of content and attribute for domain generalization in microscopy image analysis. (arXiv:2306.04527v3 [eess.IV] UPDATED)
    Domain generalization is critical for real-world applications of machine learning to microscopy images, including histopathology and fluorescence imaging. Artifacts in these modalities arise through a complex combination of factors relating to tissue collection and laboratory processing, as well as factors intrinsic to patient samples. In fluorescence imaging, these artifacts stem from variations across experimental batches. The complexity and subtlety of these artifacts make the enumeration of data domains intractable. Therefore, augmentation-based methods of domain generalization that require domain identifiers and manual fine-tuning are inadequate in this setting. To overcome this challenge, we introduce ContriMix, a domain generalization technique that learns to generate synthetic images by disentangling and permuting the biological content ("content") and technical variations ("attributes") in microscopy images. ContriMix does not rely on domain identifiers or handcrafted augmentations and makes no assumptions about the input characteristics of images. We assess the performance of ContriMix on two pathology datasets dealing with patch classification and Whole Slide Image label prediction tasks respectively (Camelyon17-WILDS and RCC subtyping), and one fluorescence microscopy dataset (RxRx1-WILDS). Without any access to domain identifiers at train or test time, ContriMix performs similar or better than current state-of-the-art methods in all these datasets, motivating its usage for microscopy image analysis in real-world settings where domain information is hard to come by. The code for ContriMix can be found at https://gitlab.com/huutan86/contrimix  ( 3 min )
    ResNLS: An Improved Model for Stock Price Forecasting. (arXiv:2312.01020v1 [cs.LG])
    Stock prices forecasting has always been a challenging task. Although many research projects adopt machine learning and deep learning algorithms to address the problem, few of them pay attention to the varying degrees of dependencies between stock prices. In this paper we introduce a hybrid model that improves stock price prediction by emphasizing the dependencies between adjacent stock prices. The proposed model, ResNLS, is mainly composed of two neural architectures, ResNet and LSTM. ResNet serves as a feature extractor to identify dependencies between stock prices across time windows, while LSTM analyses the initial time-series data with the combination of dependencies which considered as residuals. In predicting the SSE Composite Index, our experiment reveals that when the closing price data for the previous 5 consecutive trading days is used as the input, the performance of the model (ResNLS-5) is optimal compared to those with other inputs. Furthermore, ResNLS-5 outperforms vanilla CNN, RNN, LSTM, and BiLSTM models in terms of prediction accuracy. It also demonstrates at least a 20% improvement over the current state-of-the-art baselines. To verify whether ResNLS-5 can help clients effectively avoid risks and earn profits in the stock market, we construct a quantitative trading framework for back testing. The experimental results show that the trading strategy based on predictions from ResNLS-5 can successfully mitigate losses during declining stock prices and generate profits in the periods of rising stock prices.  ( 3 min )
    Physics Inspired Criterion for Pruning-Quantization Joint Learning. (arXiv:2312.00851v1 [cs.LG])
    Pruning-quantization joint learning always facilitates the deployment of deep neural networks (DNNs) on resource-constrained edge devices. However, most existing methods do not jointly learn a global criterion for pruning and quantization in an interpretable way. In this paper, we propose a novel physics inspired criterion for pruning-quantization joint learning (PIC-PQ), which is explored from an analogy we first draw between elasticity dynamics (ED) and model compression (MC). Specifically, derived from Hooke's law in ED, we establish a linear relationship between the filters' importance distribution and the filter property (FP) by a learnable deformation scale in the physics inspired criterion (PIC). Furthermore, we extend PIC with a relative shift variable for a global view. To ensure feasibility and flexibility, available maximum bitwidth and penalty factor are introduced in quantization bitwidth assignment. Experiments on benchmarks of image classification demonstrate that PIC-PQ yields a good trade-off between accuracy and bit-operations (BOPs) compression ratio e.g., 54.96X BOPs compression ratio in ResNet56 on CIFAR10 with 0.10% accuracy drop and 53.24X in ResNet18 on ImageNet with 0.61% accuracy drop). The code will be available at https://github.com/fanxxxxyi/PIC-PQ.  ( 2 min )
    Multiblock ADMM for nonsmooth nonconvex optimization with nonlinear coupling constraints. (arXiv:2201.07657v3 [math.OC] UPDATED)
    This paper proposes a multiblock alternating direction method of multipliers for solving a class of multiblock nonsmooth nonconvex optimization problem with nonlinear coupling constraints. We employ a majorization minimization procedure in the update of each block of the primal variables. Subsequential and global convergence of the generated sequence to a critical point of the augmented Lagrangian are proved. We also establish iteration complexity and provide preliminary numerical results for the proposed algorithm.  ( 2 min )
    Physics-Informed Gaussian Process Regression Generalizes Linear PDE Solvers. (arXiv:2212.12474v5 [cs.LG] UPDATED)
    Linear partial differential equations (PDEs) are an important, widely applied class of mechanistic models, describing physical processes such as heat transfer, electromagnetism, and wave propagation. In practice, specialized numerical methods based on discretization are used to solve PDEs. They generally use an estimate of the unknown model parameters and, if available, physical measurements for initialization. Such solvers are often embedded into larger scientific models with a downstream application and thus error quantification plays a key role. However, by ignoring parameter and measurement uncertainty, classical PDE solvers may fail to produce consistent estimates of their inherent approximation error. In this work, we approach this problem in a principled fashion by interpreting solving linear PDEs as physics-informed Gaussian process (GP) regression. Our framework is based on a key generalization of the Gaussian process inference theorem to observations made via an arbitrary bounded linear operator. Crucially, this probabilistic viewpoint allows to (1) quantify the inherent discretization error; (2) propagate uncertainty about the model parameters to the solution; and (3) condition on noisy measurements. Demonstrating the strength of this formulation, we prove that it strictly generalizes methods of weighted residuals, a central class of PDE solvers including collocation, finite volume, pseudospectral, and (generalized) Galerkin methods such as finite element and spectral methods. This class can thus be directly equipped with a structured error estimate. In summary, our results enable the seamless integration of mechanistic models as modular building blocks into probabilistic models by blurring the boundaries between numerical analysis and Bayesian inference.  ( 3 min )
    Funnel-based Reward Shaping for Signal Temporal Logic Tasks in Reinforcement Learning. (arXiv:2212.03181v3 [eess.SY] UPDATED)
    Signal Temporal Logic (STL) is a powerful framework for describing the complex temporal and logical behaviour of the dynamical system. Numerous studies have attempted to employ reinforcement learning to learn a controller that enforces STL specifications; however, they have been unable to effectively tackle the challenges of ensuring robust satisfaction in continuous state space and maintaining tractability. In this paper, leveraging the concept of funnel functions, we propose a tractable reinforcement learning algorithm to learn a time-dependent policy for robust satisfaction of STL specification in continuous state space. We demonstrate the utility of our approach on several STL tasks using different environments.  ( 2 min )
    Second-Order Uncertainty Quantification: A Distance-Based Approach. (arXiv:2312.00995v1 [cs.LG])
    In the past couple of years, various approaches to representing and quantifying different types of predictive uncertainty in machine learning, notably in the setting of classification, have been proposed on the basis of second-order probability distributions, i.e., predictions in the form of distributions on probability distributions. A completely conclusive solution has not yet been found, however, as shown by recent criticisms of commonly used uncertainty measures associated with second-order distributions, identifying undesirable theoretical properties of these measures. In light of these criticisms, we propose a set of formal criteria that meaningful uncertainty measures for predictive uncertainty based on second-order distributions should obey. Moreover, we provide a general framework for developing uncertainty measures to account for these criteria, and offer an instantiation based on the Wasserstein distance, for which we prove that all criteria are satisfied.  ( 2 min )
    Quantifying Hippocampal Shape Asymmetry in Alzheimer's Disease Using Optimal Shape Correspondences. (arXiv:2312.01043v1 [eess.IV])
    Hippocampal atrophy in Alzheimer's disease (AD) is asymmetric and spatially inhomogeneous. While extensive work has been done on volume and shape analysis of atrophy of the hippocampus in AD, less attention has been given to hippocampal asymmetry specifically. Previous studies of hippocampal asymmetry are limited to global volume or shape measures, which don't localize shape asymmetry at the point level. In this paper, we propose to quantify localized shape asymmetry by optimizing point correspondences between left and right hippocampi within a subject, while simultaneously favoring a compact statistical shape model of the entire sample. To account for related variables that have impact on AD and healthy subject differences, we build linear models with other confounding factors. Our results on the OASIS3 dataset demonstrate that compared to using volumetric information, shape asymmetry reveals fine-grained, localized differences that indicate the hippocampal regions of most significant shape asymmetry in AD patients.  ( 2 min )
    Continuous Authentication Using Mouse Clickstream Data Analysis. (arXiv:2312.00802v1 [eess.SP])
    Biometrics is used to authenticate an individual based on physiological or behavioral traits. Mouse dynamics is an example of a behavioral biometric that can be used to perform continuous authentication as protection against security breaches. Recent research on mouse dynamics has shown promising results in identifying users; however, it has not yet reached an acceptable level of accuracy. In this paper, an empirical evaluation of different classification techniques is conducted on a mouse dynamics dataset, the Balabit Mouse Challenge dataset. User identification is carried out using three mouse actions: mouse move, point and click, and drag and drop. Verification and authentication methods are conducted using three machine-learning classifiers: the Decision Tree classifier, the K-Nearest Neighbors classifier, and the Random Forest classifier. The results show that the three classifiers can distinguish between a genuine user and an impostor with a relatively high degree of accuracy. In the verification mode, all the classifiers achieve a perfect accuracy of 100%. In authentication mode, all three classifiers achieved the highest accuracy (ACC) and Area Under Curve (AUC) from scenario B using the point and click action data: (Decision Tree ACC:87.6%, AUC:90.3%), (K-Nearest Neighbors ACC:99.3%, AUC:99.9%), and (Random Forest ACC:89.9%, AUC:92.5%).  ( 2 min )
    Stealing the Decoding Algorithms of Language Models. (arXiv:2303.04729v4 [cs.LG] UPDATED)
    A key component of generating text from modern language models (LM) is the selection and tuning of decoding algorithms. These algorithms determine how to generate text from the internal probability distribution generated by the LM. The process of choosing a decoding algorithm and tuning its hyperparameters takes significant time, manual effort, and computation, and it also requires extensive human evaluation. Therefore, the identity and hyperparameters of such decoding algorithms are considered to be extremely valuable to their owners. In this work, we show, for the first time, that an adversary with typical API access to an LM can steal the type and hyperparameters of its decoding algorithms at very low monetary costs. Our attack is effective against popular LMs used in text generation APIs, including GPT-2, GPT-3 and GPT-Neo. We demonstrate the feasibility of stealing such information with only a few dollars, e.g., $\$0.8$, $\$1$, $\$4$, and $\$40$ for the four versions of GPT-3.  ( 2 min )
    Spectral Temporal Contrastive Learning. (arXiv:2312.00966v1 [cs.LG])
    Learning useful data representations without requiring labels is a cornerstone of modern deep learning. Self-supervised learning methods, particularly contrastive learning (CL), have proven successful by leveraging data augmentations to define positive pairs. This success has prompted a number of theoretical studies to better understand CL and investigate theoretical bounds for downstream linear probing tasks. This work is concerned with the temporal contrastive learning (TCL) setting where the sequential structure of the data is used instead to define positive pairs, which is more commonly used in RL and robotics contexts. In this paper, we adapt recent work on Spectral CL to formulate Spectral Temporal Contrastive Learning (STCL). We discuss a population loss based on a state graph derived from a time-homogeneous reversible Markov chain with uniform stationary distribution. The STCL loss enables to connect the linear probing performance to the spectral properties of the graph, and can be estimated by considering previously observed data sequences as an ensemble of MCMC chains.  ( 2 min )
    Talent-Interview: Web-Client Cheating Detection for Online Exams. (arXiv:2312.00795v1 [cs.CV])
    Online exams are more attractive after the Covid-19 pandemic. Furthermore, during recruitment, online exams are used. However, there are more cheating possibilities for online exams. Assigning a proctor for each exam increases cost. At this point, automatic proctor systems detect possible cheating status. This article proposes an end-to-end system and submodules to get better results for online proctoring. Object detection, face recognition, human voice detection, and segmentation are used in our system. Furthermore, our proposed model works on the PCs of users, meaning a client-based system. So, server cost is eliminated. As far as we know, it is the first time the client-based online proctoring system has been used for recruitment. Online exams are more attractive after the Covid-19 pandemic. Furthermore, during recruitment, online exams are used. However, there are more cheating possibilities for online exams. Assigning a proctor for each exam increases cost. At this point, automatic proctor systems detect possible cheating status. This article proposes an end-to-end system and submodules to get better results for online proctoring. Object detection, face recognition, human voice detection, and segmentation are used in our system. Furthermore, our proposed model works on the PCs of users, meaning a client-based system. So, server cost is eliminated. As far as we know, it is the first time the client-based online proctoring system has been used for recruitment. Furthermore, this cheating system works at https://www.talent-interview.com/tr/.  ( 3 min )
    A Theory of Unimodal Bias in Multimodal Learning. (arXiv:2312.00935v1 [cs.LG])
    Using multiple input streams simultaneously in training multimodal neural networks is intuitively advantageous, but practically challenging. A key challenge is unimodal bias, where a network overly relies on one modality and ignores others during joint training. While unimodal bias is well-documented empirically, our theoretical understanding of how architecture and data statistics influence this bias remains incomplete. Here we develop a theory of unimodal bias with deep multimodal linear networks. We calculate the duration of the unimodal phase in learning as a function of the depth at which modalities are fused within the network, dataset statistics, and initialization. We find that the deeper the layer at which fusion occurs, the longer the unimodal phase. A long unimodal phase can lead to a generalization deficit and permanent unimodal bias in the overparametrized regime. In addition, our theory reveals the modality learned first is not necessarily the modality that contributes more to the output. Our results, derived for multimodal linear networks, extend to ReLU networks in certain settings. Taken together, this work illuminates pathologies of multimodal learning under joint training, showing that late and intermediate fusion architectures can give rise to long unimodal phases and permanent unimodal bias.  ( 2 min )
    Empowering Autonomous Driving with Large Language Models: A Safety Perspective. (arXiv:2312.00812v1 [cs.AI])
    Autonomous Driving (AD) faces crucial hurdles for commercial launch, notably in the form of diminished public trust and safety concerns from long-tail unforeseen driving scenarios. This predicament is due to the limitation of deep neural networks in AD software, which struggle with interpretability and exhibit poor generalization capabilities in out-of-distribution and uncertain scenarios. To this end, this paper advocates for the integration of Large Language Models (LLMs) into the AD system, leveraging their robust common-sense knowledge, reasoning abilities, and human-interaction capabilities. The proposed approach deploys the LLM as an intelligent decision-maker in planning, incorporating safety verifiers for contextual safety learning to enhance overall AD performance and safety. We present results from two case studies that affirm the efficacy of our approach. We further discuss the potential integration of LLM for other AD software components including perception, prediction, and simulation. Despite the observed challenges in the case studies, the integration of LLMs is promising and beneficial for reinforcing both safety and performance in AD.  ( 2 min )
    Distilled Non-Semantic Speech Embeddings with Binary Neural Networks for Low-Resource Devices. (arXiv:2207.05784v4 [cs.SD] UPDATED)
    This work introduces BRILLsson, a novel binary neural network-based representation learning model for a broad range of non-semantic speech tasks. We train the model with knowledge distillation from a large and real-valued TRILLsson model with only a fraction of the dataset used to train TRILLsson. The resulting BRILLsson models are only 2MB in size with a latency less than 8ms, making them suitable for deployment in low-resource devices such as wearables. We evaluate BRILLsson on eight benchmark tasks (including but not limited to spoken language identification, emotion recognition, health condition diagnosis, and keyword spotting), and demonstrate that our proposed ultra-light and low-latency models perform as well as large-scale models.  ( 2 min )
    A Nonstochastic Control Approach to Optimization. (arXiv:2301.07902v3 [cs.LG] UPDATED)
    Selecting the best hyperparameters for a particular optimization instance, such as the learning rate and momentum, is an important but nonconvex problem. As a result, iterative optimization methods such as hypergradient descent lack global optimality guarantees in general. We propose an online nonstochastic control methodology for mathematical optimization. First, we formalize the setting of meta-optimization, an online learning formulation of learning the best optimization algorithm from a class of methods. The meta-optimization problem over gradient-based methods can be framed as a feedback control problem over the choice of hyperparameters, including the learning rate, momentum, and the preconditioner. Although the original optimal control problem is nonconvex, we show how recent methods from online nonstochastic control using convex relaxations can be used to overcome the challenge of nonconvexity, and obtain regret guarantees against the best offline solution. This guarantees that in meta-optimization, given a sequence of optimization problems, we can learn a method that attains convergence comparable to that of the best optimization method in hindsight from a class of methods.  ( 2 min )
    Variational Autoencoders for Anomaly Detection in Respiratory Sounds. (arXiv:2208.03326v2 [cs.SD] UPDATED)
    This paper proposes a weakly-supervised machine learning-based approach aiming at a tool to alert patients about possible respiratory diseases. Various types of pathologies may affect the respiratory system, potentially leading to severe diseases and, in certain cases, death. In general, effective prevention practices are considered as major actors towards the improvement of the patient's health condition. The proposed method strives to realize an easily accessible tool for the automatic diagnosis of respiratory diseases. Specifically, the method leverages Variational Autoencoder architectures permitting the usage of training pipelines of limited complexity and relatively small-sized datasets. Importantly, it offers an accuracy of 57 %, which is in line with the existing strongly-supervised approaches.  ( 2 min )
    Information Extraction in Low-Resource Scenarios: Survey and Perspective. (arXiv:2202.08063v5 [cs.CL] UPDATED)
    Information Extraction (IE) seeks to derive structured information from unstructured texts, often facing challenges in low-resource scenarios due to data scarcity and unseen classes. This paper presents a review of neural approaches to low-resource IE from \emph{traditional} and \emph{LLM-based} perspectives, systematically categorizing them into a fine-grained taxonomy. Then we conduct empirical study on LLM-based methods compared with previous state-of-the-art models, and discover that (1) well-tuned LMs are still predominant; (2) tuning open-resource LLMs and ICL with GPT family is promising in general; (3) the optimal LLM-based technical solution for low-resource IE can be task-dependent. In addition, we discuss low-resource IE with LLMs, highlight promising applications, and outline potential research directions. This survey aims to foster understanding of this field, inspire new ideas, and encourage widespread applications in both academia and industry.  ( 2 min )
    The perpetual motion machine of AI-generated data and the distraction of ChatGPT-as-scientist. (arXiv:2312.00818v1 [cs.LG])
    Since ChatGPT works so well, are we on the cusp of solving science with AI? Is not AlphaFold2 suggestive that the potential of LLMs in biology and the sciences more broadly is limitless? Can we use AI itself to bridge the lack of data in the sciences in order to then train an AI? Herein we present a discussion of these topics.  ( 2 min )
    Nash Learning from Human Feedback. (arXiv:2312.00886v1 [stat.ML])
    Reinforcement learning from human feedback (RLHF) has emerged as the main paradigm for aligning large language models (LLMs) with human preferences. Typically, RLHF involves the initial step of learning a reward model from human feedback, often expressed as preferences between pairs of text generations produced by a pre-trained LLM. Subsequently, the LLM's policy is fine-tuned by optimizing it to maximize the reward model through a reinforcement learning algorithm. However, an inherent limitation of current reward models is their inability to fully represent the richness of human preferences and their dependency on the sampling distribution. In this study, we introduce an alternative pipeline for the fine-tuning of LLMs using pairwise human feedback. Our approach entails the initial learning of a preference model, which is conditioned on two inputs given a prompt, followed by the pursuit of a policy that consistently generates responses preferred over those generated by any competing policy, thus defining the Nash equilibrium of this preference model. We term this approach Nash learning from human feedback (NLHF). In the context of a tabular policy representation, we present a novel algorithmic solution, Nash-MD, founded on the principles of mirror descent. This algorithm produces a sequence of policies, with the last iteration converging to the regularized Nash equilibrium. Additionally, we explore parametric representations of policies and introduce gradient descent algorithms for deep-learning architectures. To demonstrate the effectiveness of our approach, we present experimental results involving the fine-tuning of a LLM for a text summarization task. We believe NLHF offers a compelling avenue for preference learning and policy optimization with the potential of advancing the field of aligning LLMs with human preferences.  ( 3 min )
    ENS-t-SNE: Embedding Neighborhoods Simultaneously t-SNE. (arXiv:2205.11720v2 [cs.LG] UPDATED)
    When visualizing a high-dimensional dataset, dimension reduction techniques are commonly employed which provide a single 2 dimensional view of the data. We describe ENS-t-SNE: an algorithm for Embedding Neighborhoods Simultaneously that generalizes the t-Stochastic Neighborhood Embedding approach. By using different viewpoints in ENS-t-SNE's 3D embedding, one can visualize different types of clusters within the same high-dimensional dataset. This enables the viewer to see and keep track of the different types of clusters, which is harder to do when providing multiple 2D embeddings, where corresponding points cannot be easily identified. We illustrate the utility of ENS-t-SNE with real-world applications and provide an extensive quantitative evaluation with datasets of different types and sizes.  ( 2 min )
    Noisy probing dose facilitated dose prediction for pencil beam scanning proton therapy: physics enhances generalizability. (arXiv:2312.00975v1 [physics.med-ph])
    Purpose: Prior AI-based dose prediction studies in photon and proton therapy often neglect underlying physics, limiting their generalizability to handle outlier clinical cases, especially for pencil beam scanning proton therapy (PBSPT). Our aim is to design a physics-aware and generalizable AI-based PBSPT dose prediction method that has the underlying physics considered to achieve high generalizability to properly handle the outlier clinical cases. Methods and Materials: This study analyzed PBSPT plans of 103 prostate and 78 lung cancer patients from our institution,with each case comprising CT images, structure sets, and plan doses from our Monte-Carlo dose engine (serving as the ground truth). Three methods were evaluated in the ablation study: the ROI-based method, the beam mask and sliding window method, and the noisy probing dose method. Twelve cases with uncommon beam angles or prescription doses tested the methods' generalizability to rare treatment planning scenarios. Performance evaluation used DVH indices, 3D Gamma passing rates (3%/2mm/10%), and dice coefficients for dose agreement. Results: The noisy probing dose method showed improved agreement of DVH indices, 3D Gamma passing rates, and dice coefficients compared to the conventional methods for the testing cases. The noisy probing dose method showed better generalizability in the 6 outlier cases than the ROI-based and beam mask-based methods with 3D Gamma passing rates (for prostate cancer, targets: 89.32%$\pm$1.45% vs. 93.48%$\pm$1.51% vs. 96.79%$\pm$0.83%, OARs: 85.87%$\pm$1.73% vs. 91.15%$\pm$1.13% vs. 94.29%$\pm$1.01%). The dose predictions were completed within 0.3 seconds. Conclusions: We've devised a novel noisy probing dose method for PBSPT dose prediction in prostate and lung cancer patients. With more physics included, it enhances the generalizability of dose prediction in handling outlier clinical cases.  ( 3 min )
    A Probabilistic Neural Twin for Treatment Planning in Peripheral Pulmonary Artery Stenosis. (arXiv:2312.00854v1 [physics.med-ph])
    The substantial computational cost of high-fidelity models in numerical hemodynamics has, so far, relegated their use mainly to offline treatment planning. New breakthroughs in data-driven architectures and optimization techniques for fast surrogate modeling provide an exciting opportunity to overcome these limitations, enabling the use of such technology for time-critical decisions. We discuss an application to the repair of multiple stenosis in peripheral pulmonary artery disease through either transcatheter pulmonary artery rehabilitation or surgery, where it is of interest to achieve desired pressures and flows at specific locations in the pulmonary artery tree, while minimizing the risk for the patient. Since different degrees of success can be achieved in practice during treatment, we formulate the problem in probability, and solve it through a sample-based approach. We propose a new offline-online pipeline for probabilsitic real-time treatment planning which combines offline assimilation of boundary conditions, model reduction, and training dataset generation with online estimation of marginal probabilities, possibly conditioned on the degree of augmentation observed in already repaired lesions. Moreover, we propose a new approach for the parametrization of arbitrarily shaped vascular repairs through iterative corrections of a zero-dimensional approximant. We demonstrate this pipeline for a diseased model of the pulmonary artery tree available through the Vascular Model Repository.  ( 3 min )
    Prefix-Tree Decoding for Predicting Mass Spectra from Molecules. (arXiv:2303.06470v3 [q-bio.QM] UPDATED)
    Computational predictions of mass spectra from molecules have enabled the discovery of clinically relevant metabolites. However, such predictive tools are still limited as they occupy one of two extremes, either operating (a) by fragmenting molecules combinatorially with overly rigid constraints on potential rearrangements and poor time complexity or (b) by decoding lossy and nonphysical discretized spectra vectors. In this work, we use a new intermediate strategy for predicting mass spectra from molecules by treating mass spectra as sets of molecular formulae, which are themselves multisets of atoms. After first encoding an input molecular graph, we decode a set of molecular subformulae, each of which specify a predicted peak in the mass spectrum, the intensities of which are predicted by a second model. Our key insight is to overcome the combinatorial possibilities for molecular subformulae by decoding the formula set using a prefix tree structure, atom-type by atom-type, representing a general method for ordered multiset decoding. We show promising empirical results on mass spectra prediction tasks.  ( 2 min )
    Data-Driven Causal Effect Estimation Based on Graphical Causal Modelling: A Survey. (arXiv:2208.09590v2 [cs.AI] UPDATED)
    In many fields of scientific research and real-world applications, unbiased estimation of causal effects from non-experimental data is crucial for understanding the mechanism underlying the data and for decision-making on effective responses or interventions. A great deal of research has been conducted to address this challenging problem from different angles. For estimating causal effect in observational data, assumptions such as Markov condition, faithfulness and causal sufficiency are always made. Under the assumptions, full knowledge such as, a set of covariates or an underlying causal graph, is typically required. A practical challenge is that in many applications, no such full knowledge or only some partial knowledge is available. In recent years, research has emerged to use search strategies based on graphical causal modelling to discover useful knowledge from data for causal effect estimation, with some mild assumptions, and has shown promise in tackling the practical challenge. In this survey, we review these data-driven methods on causal effect estimation for a single treatment with a single outcome of interest and focus on the challenges faced by data-driven causal effect estimation. We concisely summarise the basic concepts and theories that are essential for data-driven causal effect estimation using graphical causal modelling but are scattered around the literature. We identify and discuss the challenges faced by data-driven causal effect estimation and characterise the existing methods by their assumptions and the approaches to tackling the challenges. We analyse the strengths and limitations of the different types of methods and present an empirical evaluation to support the discussions. We hope this review will motivate more researchers to design better data-driven methods based on graphical causal modelling for the challenging problem of causal effect estimation.  ( 3 min )
    Data-Driven Autoencoder Numerical Solver with Uncertainty Quantification for Fast Physical Simulations. (arXiv:2312.01021v1 [cs.CE])
    Traditional partial differential equation (PDE) solvers can be computationally expensive, which motivates the development of faster methods, such as reduced-order-models (ROMs). We present GPLaSDI, a hybrid deep-learning and Bayesian ROM. GPLaSDI trains an autoencoder on full-order-model (FOM) data and simultaneously learns simpler equations governing the latent space. These equations are interpolated with Gaussian Processes, allowing for uncertainty quantification and active learning, even with limited access to the FOM solver. Our framework is able to achieve up to 100,000 times speed-up and less than 7% relative error on fluid mechanics problems.  ( 2 min )
    Convergences for Minimax Optimization Problems over Infinite-Dimensional Spaces Towards Stability in Adversarial Training. (arXiv:2312.00991v1 [stat.ML])
    Training neural networks that require adversarial optimization, such as generative adversarial networks (GANs) and unsupervised domain adaptations (UDAs), suffers from instability. This instability problem comes from the difficulty of the minimax optimization, and there have been various approaches in GANs and UDAs to overcome this problem. In this study, we tackle this problem theoretically through a functional analysis. Specifically, we show the convergence property of the minimax problem by the gradient descent over the infinite-dimensional spaces of continuous functions and probability measures under certain conditions. Using this setting, we can discuss GANs and UDAs comprehensively, which have been studied independently. In addition, we show that the conditions necessary for the convergence property are interpreted as stabilization techniques of adversarial training such as the spectral normalization and the gradient penalty.  ( 2 min )
    TimelyGPT: Recurrent Convolutional Transformer for Long Time-series Representation. (arXiv:2312.00817v1 [cs.LG])
    Pre-trained models (PTMs) have gained prominence in Natural Language Processing and Computer Vision domains. When it comes to time-series PTMs, their development has been limited. Previous research on time-series transformers has mainly been devoted to small-scale tasks, yet these models have not consistently outperformed traditional models. Additionally, the performance of these transformers on large-scale data remains unexplored. These findings raise doubts about Transformer's capabilities to scale up and capture temporal dependencies. In this study, we re-examine time-series transformers and identify the shortcomings of prior studies. Drawing from these insights, we then introduce a pioneering architecture called Timely Generative Pre-trained Transformer (\model). This architecture integrates recurrent attention and temporal convolution modules to effectively capture global-local temporal dependencies in long sequences. The relative position embedding with time decay can effectively deal with trend and periodic patterns from time-series. Our experiments show that \model~excels in modeling continuously monitored biosignal as well as irregularly-sampled time-series data commonly observed in longitudinal electronic health records. This breakthrough suggests a priority shift in time-series deep learning research, moving from small-scale modeling from scratch to large-scale pre-training.  ( 2 min )
    hvEEGNet: exploiting hierarchical VAEs on EEG data for neuroscience applications. (arXiv:2312.00799v1 [eess.SP])
    With the recent success of artificial intelligence in neuroscience, a number of deep learning (DL) models were proposed for classification, anomaly detection, and pattern recognition tasks in electroencephalography (EEG). EEG is a multi-channel time-series that provides information about the individual brain activity for diagnostics, neuro-rehabilitation, and other applications (including emotions recognition). Two main issues challenge the existing DL-based modeling methods for EEG: the high variability between subjects and the low signal-to-noise ratio making it difficult to ensure a good quality in the EEG data. In this paper, we propose two variational autoencoder models, namely vEEGNet-ver3 and hvEEGNet, to target the problem of high-fidelity EEG reconstruction. We properly designed their architectures using the blocks of the well-known EEGNet as the encoder, and proposed a loss function based on dynamic time warping. We tested the models on the public Dataset 2a - BCI Competition IV, where EEG was collected from 9 subjects and 22 channels. hvEEGNet was found to reconstruct the EEG data with very high-fidelity, outperforming most previous solutions (including our vEEGNet-ver3 ). Furthermore, this was consistent across all subjects. Interestingly, hvEEGNet made it possible to discover that this popular dataset includes a number of corrupted EEG recordings that might have influenced previous literature results. We also investigated the training behaviour of our models and related it with the quality and the size of the input EEG dataset, aiming at opening a new research debate on this relationship. In the future, hvEEGNet could be used as anomaly (e.g., artefact) detector in large EEG datasets to support the domain experts, but also the latent representations it provides could be used in other classification problems and EEG data generation.  ( 3 min )
    RNN-BOF: A Multivariate Global Recurrent Neural Network for Binary Outcome Forecasting of Inpatient Aggression. (arXiv:2312.01029v1 [cs.LG])
    Psychometric assessment instruments aid clinicians by providing methods of assessing the future risk of adverse events such as aggression. Existing machine learning approaches have treated this as a classification problem, predicting the probability of an adverse event in a fixed future time period from the scores produced by both psychometric instruments and clinical and demographic covariates. We instead propose modelling a patient's future risk using a time series methodology that learns from longitudinal data and produces a probabilistic binary forecast that indicates the presence of the adverse event in the next time period. Based on the recent success of Deep Neural Nets for globally forecasting across many time series, we introduce a global multivariate Recurrent Neural Network for Binary Outcome Forecasting, that trains from and for a population of patient time series to produce individual probabilistic risk assessments. We use a moving window training scheme on a real world dataset of 83 patients, where the main binary time series represents the presence of aggressive events and covariate time series represent clinical or demographic features and psychometric measures. On this dataset our approach was capable of a significant performance increase against both benchmark psychometric instruments and previously used machine learning methodologies.  ( 3 min )
    The Cost of Compression: Investigating the Impact of Compression on Parametric Knowledge in Language Models. (arXiv:2312.00960v1 [cs.CL])
    Compressing large language models (LLMs), often consisting of billions of parameters, provides faster inference, smaller memory footprints, and enables local deployment. Two standard compression techniques are pruning and quantization, with the former eliminating redundant connections in model layers and the latter representing model parameters with fewer bits. The key tradeoff is between the degree of compression and the impact on the quality of the compressed model. Existing research on LLM compression primarily focuses on performance in terms of general metrics like perplexity or downstream task accuracy. More fine-grained metrics, such as those measuring parametric knowledge, remain significantly underexplored. To help bridge this gap, we present a comprehensive analysis across multiple model families (ENCODER, ENCODER-DECODER, and DECODER) using the LAMA and LM-HARNESS benchmarks in order to systematically quantify the effect of commonly employed compression techniques on model performance. A particular focus is on tradeoffs involving parametric knowledge, with the goal of providing practitioners with practical insights to help make informed decisions on compression. We release our codebase1 to enable further research.  ( 2 min )
    Likelihood-Free Gaussian Process for Regression. (arXiv:2006.13456v3 [cs.LG] UPDATED)
    Gaussian process regression can flexibly represent the posterior distribution of an interest parameter given sufficient information on the likelihood. However, in some cases, we have little knowledge regarding the probability model. For example, when investing in a financial instrument, the probability model of cash flow is generally unknown. In this paper, we propose a novel framework called the likelihood-free Gaussian process (LFGP), which allows representation of the posterior distributions of interest parameters for scalable problems without directly setting their likelihood functions. The LFGP establishes clusters in which the value of the interest parameter can be considered approximately identical, and it approximates the likelihood of the interest parameter in each cluster to a Gaussian using the asymptotic normality of the maximum likelihood estimator. We expect that the proposed framework will contribute significantly to likelihood-free modeling, particularly by reducing the assumptions for the probability model and the computational costs for scalable problems.  ( 2 min )
    Improving Normative Modeling for Multi-modal Neuroimaging Data using mixture-of-product-of-experts variational autoencoders. (arXiv:2312.00992v1 [cs.LG])
    Normative models in neuroimaging learn the brain patterns of healthy population distribution and estimate how disease subjects like Alzheimer's Disease (AD) deviate from the norm. Existing variational autoencoder (VAE)-based normative models using multimodal neuroimaging data aggregate information from multiple modalities by estimating product or averaging of unimodal latent posteriors. This can often lead to uninformative joint latent distributions which affects the estimation of subject-level deviations. In this work, we addressed the prior limitations by adopting the Mixture-of-Product-of-Experts (MoPoE) technique which allows better modelling of the joint latent posterior. Our model labelled subjects as outliers by calculating deviations from the multimodal latent space. Further, we identified which latent dimensions and brain regions were associated with abnormal deviations due to AD pathology.  ( 2 min )
    PipeOptim: Ensuring Effective 1F1B Schedule with Optimizer-Dependent Weight Prediction. (arXiv:2312.00839v1 [cs.LG])
    Asynchronous pipeline model parallelism with a "1F1B" (one forward, one backward) schedule generates little bubble overhead and always provides quite a high throughput. However, the "1F1B" schedule inevitably leads to weight inconsistency and weight staleness issues due to the cross-training of different mini-batches across GPUs. To simultaneously address these two problems, in this paper, we propose an optimizer-dependent weight prediction strategy (a.k.a PipeOptim) for asynchronous pipeline training. The key insight of our proposal is that we employ a weight prediction strategy in the forward pass to ensure that each mini-batch uses consistent and staleness-free weights to compute the forward pass. To be concrete, we first construct the weight prediction scheme based on the update rule of the used optimizer when training the deep neural network models. Then throughout the "1F1B" pipelined training, each mini-batch is mandated to execute weight prediction ahead of the forward pass, subsequently employing the predicted weights to perform the forward pass. As a result, PipeOptim 1) inherits the advantage of the "1F1B" schedule and generates pretty high throughput, and 2) can ensure effective parameter learning regardless of the type of the used optimizer. To verify the effectiveness of our proposal, we conducted extensive experimental evaluations using eight different deep-learning models spanning three machine-learning tasks including image classification, sentiment analysis, and machine translation. The experiment results demonstrate that PipeOptim outperforms the popular pipelined approaches including GPipe, PipeDream, PipeDream-2BW, and SpecTrain. The code of PipeOptim will be accessible at https://github.com/guanleics/PipeOptim.  ( 3 min )

  • Open

    "Multimodal dynamics modeling for off-road autonomous vehicles", Tremblay et al 2020
    submitted by /u/gwern [link] [comments]  ( 1 min )
    A Survey of Temporal Credit Assignment in Deep Reinforcement Learning
    https://arxiv.org/abs/2312.01072 submitted by /u/Conscious_Heron_9133 [link] [comments]  ( 1 min )
    Continual Learning: Applications and the Road Forward
    arXiv: https://arxiv.org/abs/2311.11908 OpenReview: https://openreview.net/forum?id=axBIMcGZn9 Summary of the Dagstuhl Seminar 23122: https://drops.dagstuhl.de/entities/document/10.4230/DagRep.13.3.74 Abstract: Continual learning is a sub-field of machine learning, which aims to allow machine learning models to continuously learn on new data, by accumulating knowledge without forgetting what was learned in the past. In this work, we take a step back, and ask: "Why should one care about continual learning in the first place?". We set the stage by surveying recent continual learning papers published at three major machine learning conferences, and show that memory-constrained settings dominate the field. Then, we discuss five open problems in machine learning, and even though they seem unrelated to continual learning at first sight, we show that continual learning will inevitably be part of their solution. These problems are model-editing, personalization, on-device learning, faster (re-)training and reinforcement learning. Finally, by comparing the desiderata from these unsolved problems and the current assumptions in continual learning, we highlight and discuss four future directions for continual learning research. We hope that this work offers an interesting perspective on the future of continual learning, while displaying its potential value and the paths we have to pursue in order to make it successful. This work is the result of the many discussions the authors had at the Dagstuhl seminar on Deep Continual Learning, in March 2023. submitted by /u/APaperADay [link] [comments]  ( 1 min )
    Nonogram Solver
    Hello! I have some experience with ML and Neural Networks, but this is my first foray into RL. I want to be able to solve nonograms (number puzzles) and I’ve already found a way to preprocess the constraint data into a 10x10 grid of floats, between 0-1 (inclusive). I’d like to have the output be a 10x10 grid of binary values or booleans, however I’m not quite sure how to structure this. I feel that the issue I’m running into is how to maintain the state. Is it a good idea to use the preprocessed grid as a “game board” and have the agent place 0s and 1s on top of it (after making the float range non-inclusive), even if that removes that previous info that was in that square? Is there a way to have a separate grid for constraint info and one for the game board that would maintain the relevance of the spatial data? Any ideas are welcome, thanks! submitted by /u/FissioN47 [link] [comments]  ( 1 min )
    "Is Reinforcement Learning (Not) for Natural Language Processing: Benchmarks, Baselines, and Building Blocks for Natural Language Policy Optimization", Ramamurthy et al 2023
    submitted by /u/gwern [link] [comments]
  • Open

    GTA 6 german voice over with AI
    submitted by /u/xymaps123 [link] [comments]  ( 1 min )
    Video Dubbing & Translation with Speaker Detection
    Here’s a short demo of what I’ve been building: https://youtu.be/b5roXHuCb10?si=dyh-ti53ztKfdjAN You can try it out here: http://www.jargonspeak.com Let me know what you think and if you have any feedback! submitted by /u/Lemons_for_Sale [link] [comments]  ( 1 min )
    xai's big test will be going after the idea of a human free will
    einstein rejected the notion. so did newton, darwin and freud. so does modern psychology and neuroscience. toppling the notion of free will is as easy as understanding that both causality and randomness, both determinism and indeterminism, make it impossible. but as with the many myths about women being less (fill in the blanks) than men, the free will myth still lives. grok will either take this on or be soundy relegated to the category of pc ais that have been trained to lack sufficient respect for science and reason. ball's in your court, elon. i'm betting you won't flich. go show the world how the universe works! submitted by /u/Georgeo57 [link] [comments]  ( 1 min )
    Can you suggest what should I know?
    I am trying to recreate myself as real as possible, I am very new to this, and I wanted to create photorealistic photos without any photoshoots. I am trying with some libraries where I am generating similar skin tone color and face swap and doing a face swap with a face swap-bot. However I want to create my system where I can train and create consistent images however I want with the specific data set I give. submitted by /u/tjaiswal7 [link] [comments]  ( 1 min )
    Masters in AI advice
    Hi, I'm looking to do an online masters in AI. I've narrowed my search down to the university of Leeds and Liverpool. Leeds is higher in the rankings and has a more stringent acceptance criteria. However, I prefer the course at Liverpool. Any advice on how much I should be concerned about the ranking? Does Leeds University have a significantly better reputation when it comes to AI? Many thanks for any advice. Cheers submitted by /u/TheMcGarr [link] [comments]  ( 1 min )
    Choose a place to live
    submitted by /u/Sea_Permit5660 [link] [comments]  ( 1 min )
    Google is reportedly pushing the launch of its Gemini AI to 2024
    Google is reportedly pushing the launch of its Gemini AI to 2024. The Gemini AI model was announced at I/O 2023 and aims to rival OpenAI's GPT-4. Google canceled its Gemini launch events and plans to launch its GPT-4 competitor in January, according to The Information. Gemini was struggling with non-English queries, prompting CEO Sundar Pichai to delay its release. Gemini is expected to bring improvements to Google's existing AI and AI-enhanced products like Bard, Google Assistant, and Search. Source : https://www.engadget.com/google-is-reportedly-pushing-the-launch-of-its-gemini-ai-to-2024-173444507.html submitted by /u/NuseAI [link] [comments]  ( 1 min )
    One-Minute Daily AI News 12/4/2023
    Animate Anyone: The new generative video technique was developed by researchers at Alibaba Group’s Institute for Intelligent Computing.[1] ChatGPT will provide more detailed and accurate responses if you pretend to tip it, according to a new study.[2] Speech-to-Speech Translation: DeepMind Deploys New Approach to Train Translatotron 3.[3] Meta, IBM Create Industrywide AI Alliance to Share Technology.[4] Sources: [1] https://techcrunch.com/2023/12/04/animate-anyone-heralds-the-approach-of-full-motion-deepfakes/ [2] https://www.windowscentral.com/software-apps/chatgpt-will-provide-more-detailed-and-accurate-responses-if-you-pretend-to-tip-it-according-to-a-new-study [3] https://slator.com/speech-to-speech-translation-deepmind-deploys-approach-train-translatotron-3/ [4] https://www.bloomberg.com/news/articles/2023-12-05/meta-ibm-create-industrywide-ai-alliance-to-share-technology?embedded-checkout=true submitted by /u/Excellent-Target-847 [link] [comments]  ( 1 min )
    Nvidia to work with SoftBank, other Japan firms to push AI research
    Nvidia will collaborate with SoftBank, Sakura Internet, and NTT to accelerate research into generative artificial intelligence in Japan. The cooperation aims to revolutionize the future of robotics in Japan by combining generative AI and the country's expertise in manufacturing. This move comes as more Japanese companies turn to generative AI to enhance their competitiveness, following the success of AI chatbot ChatGPT developed by OpenAI. The Japanese government, led by Prime Minister Fumio Kishida, is supporting AI to aid the country's economic growth. Source : https://mainichi.jp/english/articles/20231204/p2g/00m/0bu/045000c submitted by /u/NuseAI [link] [comments]  ( 1 min )
    No-code-required AI chatbot integration into apps
    submitted by /u/egusa [link] [comments]  ( 1 min )
  • Open

    University survey project "[P]
    If you please have any time, pretend that you work for a company that sells air conditioners and answer the survey please. "[P]" https://docs.google.com/forms/d/e/1FAIpQLSePyOLeaQVMbJDOaYpv6sLc8RAz3Pmodh28kuMD7hh6SUx6iQ/viewform?usp=sf_link&fbclid=PAAaaabjBAmQuAhZembpG8RVtOcrghoGYI9GfLVyVw26jMKMVlICWVVqetZXo_aem_AQogoNlaQNyLfFGAfnO9N_aowFfgFItQjdaH6B4TWm1_M4Mc8YOwbB8Wysd7ahbvL1k submitted by /u/Jimiqx [link] [comments]  ( 9 min )
    [D] NuerIPS SWAG for a virtual attendee?
    I am very excited to attend NuerIPS for the first time this year. But, my company couldn't afford the cost of sending me in person, so I am attending virtually. One of my favorite parts of a conference is getting into the spirit with SWAG and t-shirts. I'm relatively new to nueral-nets (come from bioinformatics), I have studied my butt off over the last three years to get here, and having a NuerIPS t-shirt would legitimately mean a lot to me. Is there any opportunity for virtual-attendees to receive SWAG? Anyone want to be incredibly kind and mail me a shirt for $20? submitted by /u/Potential-Lab-2308 [link] [comments]  ( 9 min )
    [D] Needle in a haystack experiment: Assistants API RAG beats GPT 4-Turbo & Llama Index at 4% of the cost
    Ran an experiment comparing information retrieval performance between Open AI's Assistants API's RAG, GPT-4 Turbo (with context window stuffing) and Llama Index with GPT4. I recently added a new document-oriented react hook to CopilotKit, made specifically to accommodate (potentially long-form) documents and wanted to get the best performance. Got pretty striking results: The assistant's API beats Llama index in a big way in performance and is 25x cheaper than context window stuffing with GPT-4 Turbo. ​ accuracy performance ​ costs Big takeaway when it comes to building with AI. Almost no one should be using context-window stuffing and Llama Index has a lot of catching up to do when it comes to RAG performance. Whatever OpenAI is using under the hood for RAG, has great performance. The full article: https://ai88.substack.com/p/rag-vs-context-window-in-gpt4-accuracy-cost submitted by /u/kindly_formation71 [link] [comments]  ( 9 min )
    [Discussion] How alive is traditional machine learning in academia?
    Is there still room for research on techniques and models that are commonly used in the industry? I currently work as a Data Scientist and am considering pursuing a Master's or Ph.D. in machine learning. However, it appears that most recent developments focus primarily on neural networks, especially Large Language Models (LLMs). Despite extensively searching through arXiv articles, I've had little success in finding research on areas like feature engineering, probability models, and tree-based algorithms. If anyone knows professors specializing in these more traditional machine learning aspects, please let me know. submitted by /u/BrDataScientist [link] [comments]  ( 9 min )
    [D] You do not need a Vector Database
    The document retrieval problem for RAG is basically a case for information retrieval and there are simpler solutions to do so. Vector embeddings are still useful, but they should be used in a later stage of the IR pipeline and not as the first stage retrieval, for which there are simpler and more performant solutions. Blog post here: http://about.xethub.com/blog/you-dont-need-a-vector-database Notebook and data here: https://github.com/xetdata/RagIRBench/ submitted by /u/yuchenglow [link] [comments]  ( 9 min )
    [D] Prompt Engineering Seems Like Guesswork - How To Evaluate LLM Application Properly?
    How are folks evaluating the quality of your LLM applications? I'm running a therapist chatbot in production (small scale - 10's of active users) and I've spent a lot of time finetuning prompts but it's all just guesswork. I'll make a tweak to the prompt and run a few test conversations and just kinda get the vibes of whether it's better or worse than before the tweak. Is this what y'all are doing too or am I missing something??? submitted by /u/AndreeSmothers [link] [comments]  ( 9 min )
    [R] AlphaFold predictions are valuable hypotheses and accelerate but do not replace experimental structure determination
    This feedback from scientific practice should definitely be taken into account by the community: https://www.nature.com/articles/s41592-023-02087-4 Quote: our results show that AlphaFold predictions are not better representations of the contents of a crystal than the models deposited in the PDB, as the deposited models agree much more closely with experimental data where the predicted and deposited models differ submitted by /u/suhcoR [link] [comments]  ( 9 min )
    [D] How do researchers generate idea of new machine learning architecture?
    I have been working on machine learning for quite some time but I still focused on applying existing machine learning techniques. I kind of puzzle about the process of generating such ingenious architecture every year. If you have experience, can you share how you come up with the new idea? What inspire you to most during the process? submitted by /u/Feisty_Philosophy234 [link] [comments]  ( 9 min )
    [D] How are you deploying computer vision on mobile these days?
    Hey /r/ml, I'm building an open source tool for image classification where you can build an image classifier with a web-app in a few seconds. No picking out a pre-trained model that happens to include your classes, you get a custom one suited to your task. We use CLIP and DinoV2 under the hood, ultimately deploying DinoV2 with a linear top. We currently only offer a Python client for deployment, but I think that it would be super useful on mobile since the model is fairly small and you can deploy it without having to it an API somewhere. How do people deploy image classification today on mobile? Do they usually just use a pre-trained model that happens to fit their classes? Use GPT4+Vision and just pay per image? Or are computer vision features usually seen as something requiring specialized knowledge so most people don't really use them? submitted by /u/nateharada [link] [comments]  ( 9 min )
    [D] How to deal with false accusations of your paper being AI-generated?
    It is a bit depressing to read the quality of subjective reviews I have been getting from my ICLR-2024 paper. Two of them were quite decent, but another two accused me of my paper being a "joke" and AI-generated. It is sad to see one of the allegedly top conferences allowing such delinquent reviews to be publicly posted. Now, a layman will be extremely biased against my research unless the area chairs intervene. submitted by /u/No-Sun-5534 [link] [comments]  ( 9 min )
    [D] LLM learning - sample (in)efficiency & scaling laws
    Are there any ideas which have some potential to break through the current scaling laws and the low sample efficiency of LLMs? I'm aware of the ideas by LeCun that massive pretraining on videos may help with "physics" and "natural world" priors, but looking at the doubtful improvements that visual modality gave GPT4, it remains a yet to be verified hypothesis. I have this itch deep down, that tells me that we're doing something very wrong, and this wrong approach leads to LLMs requiring immense amounts of data before they achieve reasonable performance. Do you have any thoughts on this or have you seen any promising ideas that could attack this problem? submitted by /u/hypergraphs [link] [comments]  ( 9 min )
    [R] "Sequential Modeling Enables Scalable Learning for Large Vision Models" paper from UC Berkeley has a strange scaling curve.
    Came across this paper "Sequential Modeling Enables Scalable Learning for Large Vision Models" (https://arxiv.org/abs/2312.00785) which has a figure that looks a little bit strange. The lines appear identical for different model sizes. Are different runs or large models at different sizes usually this identical? https://twitter.com/JitendraMalikCV/status/1731553367217070413 ​ Taken from Figure 3 in https://arxiv.org/abs/2312.00785 This is the full Figure 3 plot From https://arxiv.org/abs/2312.00785 submitted by /u/rantana [link] [comments]  ( 9 min )
    [R] Sequential Modeling Enables Scalable Learning for Large Vision Models. Transformers are trained on "Visual Sentences" (1.64B Images, 420B Image tokens). The same model can perform Inpainting, Rotation, Lighting, Semantic Segmentation, Edge Detection, Pose Estimation and More
    Blog - https://yutongbai.com/lvm.html Paper - https://arxiv.org/abs/2312.00785 submitted by /u/MysteryInc152 [link] [comments]  ( 9 min )
    [D] What is the latest with multimodal encoder-decoder models using cross attention? Most of the research I've seen has only used self-attention
    Most of the multimodal models I've seen use encoders to embed the images/videos and then send them to a decoder only llm using self-attention. Has there been any research in using cross attention instead with encoder-decoder models? submitted by /u/30299578815310 [link] [comments]  ( 9 min )
    [R] Paved2Paradise: Cost-Effective and Scalable LiDAR Simulation by Factoring the Real World
    submitted by /u/michaelaalcorn [link] [comments]  ( 9 min )
    [D] Looking for course recommendations
    I am trying to piece together a double masters in AI and discrete mathematics, and was wondering if anyone had some course recommendations. Here is the list of courses I am planning on doing as of now. Math courses: - Modern cryptography (8) - Information theory (6) - Machine learning theory (8) - Graph polynomials and algorithms (6) - Causality (8) - Graph Symmetries and Combinatorial Designs (8) - Selected areas in cryptography (8) AI courses: - Machine learning 1 (6) - Deep learning 1 (6) - Computer vision 1 (6) - Fairness, Accountability, Confidentiality and Transparency in AI (6) - Natural Language Processing 1 (6) - Knowledge Representation and Reasoning (6) - Information Retrieval 1 (6) - Deep learning 2 (6) - Game theory (6) - Computational social choice (6) - Algorithmic game theory (6) submitted by /u/Cocorow [link] [comments]  ( 9 min )
    [R] StableSSM: Alleviating the Curse of Memory in State-space Models through Stable Reparameterization
    arXiv: https://arxiv.org/abs/2311.14495 OpenReview: https://openreview.net/forum?id=BwG8hwohU4 Abstract: In this paper, we investigate the long-term memory learning capabilities of state-space models (SSMs) from the perspective of parameterization. We prove that state-space models without any reparameterization exhibit a memory limitation similar to that of traditional RNNs: the target relationships that can be stably approximated by state-space models must have an exponential decaying memory. Our analysis identifies this "curse of memory" as a result of the recurrent weights converging to a stability boundary, suggesting that a reparameterization technique can be effective. To this end, we introduce a class of reparameterization techniques for SSMs that effectively lift its memory limitations. Besides improving approximation capabilities, we further illustrate that a principled choice of reparameterization scheme can also enhance optimization stability. We validate our findings using synthetic datasets and language models. submitted by /u/APaperADay [link] [comments]  ( 9 min )
    [D] Industry labs that still do research
    I am wondering if there are any research labs left that still do exploratory research. Most of the places I observe are moving towards less exploratory work and more towards specific agenda research. I think this discourages the fun/exploratory type of research, which usually leads to novel discoveries. I did this type of research during my PhD and now I am stuck in a large R&D lab which has a mandate to work on specific things. Which labs would you recommend me to look at and that still allow exploration to some extent? submitted by /u/oa97z [link] [comments]  ( 9 min )
    [discussion] text to voice generation for textbooks
    i was listening to lex podcast on some stuff i study and wanted to ask, are there any good natural-enough sounding local text to voice models out there? i would very much like to use it to turn the text parts of a book into an audio where i could listen to it while reading. i used edge's tts for speech by giving a paragraph to clipboard and to edge-tts in order to listen the text but it causes two problems: 1. you need internet connection and have the book opened 2. can only do paragraph by paragraph, and is prone to errors or sometimes if you use it too much it wont convert the full text afterwards. the idea would be to turn a chapter of a book into an audio file and transfer it so that i could listen to it on my mobile phone on the fly. what is the status of offline models where they could afford to output an okay voice (or even be able to give the voice of a tutor from their lectures and train it)? submitted by /u/sweetchocolotepie [link] [comments]  ( 10 min )
  • Open

    DSC Weekly 5 December 2023
    Announcements Top Stories In-Depth The post DSC Weekly 5 December 2023 appeared first on Data Science Central.  ( 21 min )
    Decoding the Future: The Intersection of Advanced Analytics and Fraud Prevention in Revolutionizing Digital Payments
    In 2023, online payment fraud cost the world US$48 billion. Businesses prioritize fighting payment fraud and minimizing its financial and reputational damage. In addition to monetary losses, payment fraud can damage a customer’s trust and loyalty, as well as increase the scrutiny from regulators and law enforcement. Organizations use machine learning to combat this growing… Read More »Decoding the Future: The Intersection of Advanced Analytics and Fraud Prevention in Revolutionizing Digital Payments The post Decoding the Future: The Intersection of Advanced Analytics and Fraud Prevention in Revolutionizing Digital Payments appeared first on Data Science Central.  ( 22 min )
    What’s wrong with data labels
    Technical Advisor and former LinkedIn knowledge graph lead Mike Dillinger recently spoke with Juan Sequeda and Tim Gasper of data.world. During a recent edition of the Catalog & Cocktails podcast, Dillinger stated that most data labels are meaningless text strings. “The vast majority are just junk that we pay for,” he noted. Dillinger is a… Read More »What’s wrong with data labels The post What’s wrong with data labels appeared first on Data Science Central.  ( 21 min )
  • Open

    Enable faster training with Amazon SageMaker data parallel library
    Large language model (LLM) training has become increasingly popular over the last year with the release of several publicly available models such as Llama2, Falcon, and StarCoder. Customers are now training LLMs of unprecedented size ranging from 1 billion to over 175 billion parameters. Training these LLMs requires significant compute resources and time as hundreds […]  ( 8 min )
    Use custom metadata created by Amazon Comprehend to intelligently process insurance claims using Amazon Kendra
    Structured data, defined as data following a fixed pattern such as information stored in columns within databases, and unstructured data, which lacks a specific form or pattern like text, images, or social media posts, both continue to grow as they are produced and consumed by various organizations. For instance, according to International Data Corporation (IDC), […]  ( 13 min )
    Foundational data protection for enterprise LLM acceleration with Protopia AI
    The post describes how you can overcome the challenges of retaining data ownership and preserving data privacy while using LLMs by deploying Protopia AI’s Stained Glass Transform to protect your data. Protopia AI has partnered with AWS to deliver the critical component of data protection and ownership for secure and efficient enterprise adoption of generative AI. This post outlines the solution and demonstrates how it can be used in AWS for popular enterprise use cases like Retrieval Augmented Generation (RAG) and with state-of-the-art LLMs like Llama 2.  ( 12 min )
  • Open

    Exploring LLMs’ potential to help facilitators enhance online healthcare communities
    Many patients in low- and middle-income countries rely on facilitated online health communities for information and support. Discover how large language models can assist the facilitators and boost outcomes. The post Exploring LLMs’ potential to help facilitators enhance online healthcare communities appeared first on Microsoft Research.  ( 10 min )
    Collaborators: Teachable AI with Cecily Morrison and Karolina Pakėnaitė
    Cecily Morrison and Karolina Pakėnaitė are collaborators on a research prototype designed to help members of the blind community find their personal items. Learn how the work is advancing an approach to empower people to shape their own AI experiences. The post Collaborators: Teachable AI with Cecily Morrison and Karolina Pakėnaitė appeared first on Microsoft Research.  ( 28 min )
  • Open

    ‘Christmas Rush’ 3D Scene Brings Holiday Cheer This Week ‘In the NVIDIA Studio’
    ‘Tis the season for friends, family and beautifully rendered Santa animations from this week’s In the NVIDIA Studio artist, 3D expert Božo Balov.  ( 7 min )
  • Open

    recently I have started to build a deep net microframework to understand understand the different moving parts in a network.
    I have recently started building a neural net microframework, its to clear my own understanding of what happens beneath the hood, neural net code is the cleaner code that I am trying to build, with clear understanding and separation of modules representing the fundamentals of a neural net framework, and playgroud is the python port of the google playgound.tensorflow.org. If you can, fork/clone/contribute. TIA p.s: I hope this is the right community for this post submitted by /u/Nervous_Night2940 [link] [comments]  ( 1 min )
  • Open

    AI accelerates problem-solving in complex scenarios
    A new, data-driven approach could lead to better solutions for tricky optimization problems like global package routing or power grid operation.  ( 9 min )
  • Open

    QuantEase: Optimization-based Quantization for Language Models. (arXiv:2309.01885v2 [stat.ML] UPDATED)
    With the rising popularity of Large Language Models (LLMs), there has been an increasing interest in compression techniques that enable their efficient deployment. This study focuses on the Post-Training Quantization (PTQ) of LLMs. Drawing from recent advances, our work introduces QuantEase, a layer-wise quantization framework where individual layers undergo separate quantization. The problem is framed as a discrete-structured non-convex optimization, prompting the development of algorithms rooted in Coordinate Descent (CD) techniques. These CD-based methods provide high-quality solutions to the complex non-convex layer-wise quantization problems. Notably, our CD-based approach features straightforward updates, relying solely on matrix and vector operations, circumventing the need for matrix inversion or decomposition. We also explore an outlier-aware variant of our approach, allowing for retaining significant weights (outliers) with complete precision. Our proposal attains state-of-the-art performance in terms of perplexity and zero-shot accuracy in empirical evaluations across various LLMs and datasets, with relative improvements up to 15% over methods such as GPTQ. Leveraging careful linear algebra optimizations, QuantEase can quantize models like Falcon-180B on a single NVIDIA A100 GPU in $\sim$3 hours. Particularly noteworthy is our outlier-aware algorithm's capability to achieve near or sub-3-bit quantization of LLMs with an acceptable drop in accuracy, obviating the need for non-uniform quantization or grouping techniques, improving upon methods such as SpQR by up to two times in terms of perplexity.  ( 2 min )
    A Definition of Continual Reinforcement Learning. (arXiv:2307.11046v2 [cs.LG] UPDATED)
    In a standard view of the reinforcement learning problem, an agent's goal is to efficiently identify a policy that maximizes long-term reward. However, this perspective is based on a restricted view of learning as finding a solution, rather than treating learning as endless adaptation. In contrast, continual reinforcement learning refers to the setting in which the best agents never stop learning. Despite the importance of continual reinforcement learning, the community lacks a simple definition of the problem that highlights its commitments and makes its primary concepts precise and clear. To this end, this paper is dedicated to carefully defining the continual reinforcement learning problem. We formalize the notion of agents that "never stop learning" through a new mathematical language for analyzing and cataloging agents. Using this new language, we define a continual learning agent as one that can be understood as carrying out an implicit search process indefinitely, and continual reinforcement learning as the setting in which the best agents are all continual learning agents. We provide two motivating examples, illustrating that traditional views of multi-task reinforcement learning and continual supervised learning are special cases of our definition. Collectively, these definitions and perspectives formalize many intuitive concepts at the heart of learning, and open new research pathways surrounding continual learning agents.  ( 2 min )
    Hashmarks: Privacy-Preserving Benchmarks for High-Stakes AI Evaluation. (arXiv:2312.00645v1 [cs.LG])
    There is a growing need to gain insight into language model capabilities that relate to sensitive topics, such as bioterrorism or cyberwarfare. However, traditional open source benchmarks are not fit for the task, due to the associated practice of publishing the correct answers in human-readable form. At the same time, enforcing mandatory closed-quarters evaluations might stifle development and erode trust. In this context, we propose hashmarking, a protocol for evaluating language models in the open without having to disclose the correct answers. In its simplest form, a hashmark is a benchmark whose reference solutions have been cryptographically hashed prior to publication. Following an overview of the proposed evaluation protocol, we go on to assess its resilience against traditional attack vectors (e.g. rainbow table attacks), as well as against failure modes unique to increasingly capable generative models.  ( 2 min )
    VMAF Re-implementation on PyTorch: Some Experimental Results. (arXiv:2310.15578v3 [cs.LG] UPDATED)
    Based on the standard VMAF implementation we propose an implementation of VMAF using PyTorch framework. For this implementation comparisons with the standard (libvmaf) show the discrepancy $\lesssim 10^{-2}$ in VMAF units. We investigate gradients computation when using VMAF as an objective function and demonstrate that training using this function does not result in ill-behaving gradients. The implementation is then used to train a preprocessing filter. It is demonstrated that its performance is superior to the unsharp masking filter. The resulting filter is also easy for implementation and can be applied in video processing tasks for video copression improvement. This is confirmed by the results of numerical experiments.  ( 2 min )
    S2ST: Image-to-Image Translation in the Seed Space of Latent Diffusion. (arXiv:2312.00116v1 [cs.CV])
    Image-to-image translation (I2IT) refers to the process of transforming images from a source domain to a target domain while maintaining a fundamental connection in terms of image content. In the past few years, remarkable advancements in I2IT were achieved by Generative Adversarial Networks (GANs), which nevertheless struggle with translations requiring high precision. Recently, Diffusion Models have established themselves as the engine of choice for image generation. In this paper we introduce S2ST, a novel framework designed to accomplish global I2IT in complex photorealistic images, such as day-to-night or clear-to-rain translations of automotive scenes. S2ST operates within the seed space of a Latent Diffusion Model, thereby leveraging the powerful image priors learned by the latter. We show that S2ST surpasses state-of-the-art GAN-based I2IT methods, as well as diffusion-based approaches, for complex automotive scenes, improving fidelity while respecting the target domain's appearance across a variety of domains. Notably, S2ST obviates the necessity for training domain-specific translation networks.  ( 2 min )
    Distributional PAC-Learning from Nisan's Natural Proofs. (arXiv:2310.03641v2 [cs.CC] UPDATED)
    Carmosino et al. (2016) demonstrated that natural proofs of circuit lower bounds for $\Lambda$ imply efficient algorithms for learning $\Lambda$-circuits, but only over \textit{the uniform distribution}, with \textit{membership queries}, and provided $\AC^0[p] \subseteq \Lambda$. We consider whether this implication can be generalized to $\Lambda \not\supseteq \AC^0[p]$, and to learning algorithms which use only random examples and learn over arbitrary example distributions (Valiant's PAC-learning model). We first observe that, if, for any circuit class $\Lambda$, there is an implication from natural proofs for $\Lambda$ to PAC-learning for $\Lambda$, then standard assumptions from lattice-based cryptography do not hold. In particular, we observe that depth-2 majority circuits are a (conditional) counter example to the implication, since Nisan (1993) gave a natural proof, but Klivans and Sherstov (2009) showed hardness of PAC-learning under lattice-based assumptions. We thus ask: what learning algorithms can we reasonably expect to follow from Nisan's natural proofs? Our main result is that all natural proofs arising from a type of communication complexity argument, including Nisan's, imply PAC-learning algorithms in a new \textit{distributional} variant (i.e., an ``average-case'' relaxation) of Valiant's PAC model. Our distributional PAC model is stronger than the average-case prediction model of Blum et al. (1993) and the heuristic PAC model of Nanashima (2021), and has several important properties which make it of independent interest, such as being \textit{boosting-friendly}. The main applications of our result are new distributional PAC-learning algorithms for depth-2 majority circuits, polytopes and DNFs over natural target distributions, as well as the nonexistence of encoded-input weak PRFs that can be evaluated by depth-2 majority circuits.  ( 3 min )
    Resource-constrained knowledge diffusion processes inspired by human peer learning. (arXiv:2312.00660v1 [cs.LG])
    We consider a setting where a population of artificial learners is given, and the objective is to optimize aggregate measures of performance, under constraints on training resources. The problem is motivated by the study of peer learning in human educational systems. In this context, we study natural knowledge diffusion processes in networks of interacting artificial learners. By `natural', we mean processes that reflect human peer learning where the students' internal state and learning process is mostly opaque, and the main degree of freedom lies in the formation of peer learning groups by a coordinator who can potentially evaluate the learners before assigning them to peer groups. Among else, we empirically show that such processes indeed make effective use of the training resources, and enable the design of modular neural models that have the capacity to generalize without being prone to overfitting noisy labels.  ( 2 min )
    Neurological Prognostication of Post-Cardiac-Arrest Coma Patients Using EEG Data: A Dynamic Survival Analysis Framework with Competing Risks. (arXiv:2308.11645v2 [eess.SP] UPDATED)
    Patients resuscitated from cardiac arrest who enter a coma are at high risk of death. Forecasting neurological outcomes of these patients (the task of neurological prognostication) could help with treatment decisions. In this paper, we propose, to the best of our knowledge, the first dynamic framework for neurological prognostication of post-cardiac-arrest comatose patients using EEG data: our framework makes predictions for a patient over time as more EEG data become available, and different training patients' available EEG time series could vary in length. Predictions are phrased in terms of either time-to-event outcomes (time-to-awakening or time-to-death) or as the patient's probability of awakening or of dying across multiple time horizons. Our framework uses any dynamic survival analysis model that supports competing risks in the form of estimating patient-level cumulative incidence functions. We consider three competing risks as to what happens first to a patient: awakening, being withdrawn from life-sustaining therapies (and thus deterministically dying), or dying (by other causes). We demonstrate our framework by benchmarking three existing dynamic survival analysis models that support competing risks on a real dataset of 922 patients. Our main experimental findings are that: (1) the classical Fine and Gray model which only uses a patient's static features and summary statistics from the patient's latest hour's worth of EEG data is highly competitive, achieving accuracy scores as high as the recently developed Dynamic-DeepHit model that uses substantially more of the patient's EEG data; and (2) in an ablation study, we show that our choice of modeling three competing risks results in a model that is at least as accurate while learning more information than simpler models (using two competing risks or a standard survival analysis setup with no competing risks).  ( 3 min )
    A Preconditioned Interior Point Method for Support Vector Machines Using an ANOVA-Decomposition and NFFT-Based Matrix-Vector Products. (arXiv:2312.00538v1 [math.NA])
    In this paper we consider the numerical solution to the soft-margin support vector machine optimization problem. This problem is typically solved using the SMO algorithm, given the high computational complexity of traditional optimization algorithms when dealing with large-scale kernel matrices. In this work, we propose employing an NFFT-accelerated matrix-vector product using an ANOVA decomposition for the feature space that is used within an interior point method for the overall optimization problem. As this method requires the solution of a linear system of saddle point form we suggest a preconditioning approach that is based on low-rank approximations of the kernel matrix together with a Krylov subspace solver. We compare the accuracy of the ANOVA-based kernel with the default LIBSVM implementation. We investigate the performance of the different preconditioners as well as the accuracy of the ANOVA kernel on several large-scale datasets.  ( 2 min )
    Bayesian Learning with Information Gain Provably Bounds Risk for a Robust Adversarial Defense. (arXiv:2212.02003v2 [cs.LG] UPDATED)
    We present a new algorithm to learn a deep neural network model robust against adversarial attacks. Previous algorithms demonstrate an adversarially trained Bayesian Neural Network (BNN) provides improved robustness. We recognize the adversarial learning approach for approximating the multi-modal posterior distribution of a Bayesian model can lead to mode collapse; consequently, the model's achievements in robustness and performance are sub-optimal. Instead, we first propose preventing mode collapse to better approximate the multi-modal posterior distribution. Second, based on the intuition that a robust model should ignore perturbations and only consider the informative content of the input, we conceptualize and formulate an information gain objective to measure and force the information learned from both benign and adversarial training instances to be similar. Importantly. we prove and demonstrate that minimizing the information gain objective allows the adversarial risk to approach the conventional empirical risk. We believe our efforts provide a step toward a basis for a principled method of adversarially training BNNs. Our model demonstrate significantly improved robustness--up to 20%--compared with adversarial training and Adv-BNN under PGD attacks with 0.035 distortion on both CIFAR-10 and STL-10 datasets.  ( 3 min )
    PyNeRF: Pyramidal Neural Radiance Fields. (arXiv:2312.00252v1 [cs.CV])
    Neural Radiance Fields (NeRFs) can be dramatically accelerated by spatial grid representations. However, they do not explicitly reason about scale and so introduce aliasing artifacts when reconstructing scenes captured at different camera distances. Mip-NeRF and its extensions propose scale-aware renderers that project volumetric frustums rather than point samples but such approaches rely on positional encodings that are not readily compatible with grid methods. We propose a simple modification to grid-based models by training model heads at different spatial grid resolutions. At render time, we simply use coarser grids to render samples that cover larger volumes. Our method can be easily applied to existing accelerated NeRF methods and significantly improves rendering quality (reducing error rates by 20-90% across synthetic and unbounded real-world scenes) while incurring minimal performance overhead (as each model head is quick to evaluate). Compared to Mip-NeRF, we reduce error rates by 20% while training over 60x faster.  ( 2 min )
    UAV-assisted Semantic Communication with Hybrid Action Reinforcement Learning. (arXiv:2309.16713v2 [cs.NI] UPDATED)
    In this paper, we aim to explore the use of uplink semantic communications with the assistance of UAV in order to improve data collection effiicency for metaverse users in remote areas. To reduce the time for uplink data collection while balancing the trade-off between reconstruction quality and computational energy cost, we propose a hybrid action reinforcement learning (RL) framework to make decisions on semantic model scale, channel allocation, transmission power, and UAV trajectory. The variables are classified into discrete type and continuous type, which are optimized by two different RL agents to generate the combined action. Simulation results indicate that the proposed hybrid action reinforcement learning framework can effectively improve the efficiency of uplink semantic data collection under different parameter settings and outperforms the benchmark scenarios.  ( 2 min )
    Phylo2Vec: a vector representation for binary trees. (arXiv:2304.12693v2 [q-bio.PE] UPDATED)
    Binary phylogenetic trees inferred from biological data are central to understanding the shared evolutionary history of organisms. Inferring the placement of latent nodes in a tree by any optimality criterion (e.g., maximum likelihood) is an NP-hard problem, propelling the development of myriad heuristic approaches. Yet, these heuristics often lack a systematic means of uniformly sampling random trees or effectively exploring a tree space that grows factorially, which are crucial to optimisation problems such as machine learning. Accordingly, we present Phylo2Vec, a new parsimonious representation of a phylogenetic tree. Phylo2Vec maps any binary tree with $n$ leaves to an integer vector of length $n$. We prove that Phylo2Vec is both well-defined and bijective to the space of phylogenetic trees. The advantages of Phylo2Vec are twofold: i) easy uniform sampling of binary trees and ii) systematic ability to traverse tree space in very large or small jumps. As a proof of concept, we use Phylo2Vec for maximum likelihood inference on five real-world datasets and show that a simple hill climbing-based optimisation efficiently traverses the vastness of tree space from a random to an optimal tree.  ( 2 min )
    A Comparative Study of Text Embedding Models for Semantic Text Similarity in Bug Reports. (arXiv:2308.09193v2 [cs.SE] UPDATED)
    Bug reports are an essential aspect of software development, and it is crucial to identify and resolve them quickly to ensure the consistent functioning of software systems. Retrieving similar bug reports from an existing database can help reduce the time and effort required to resolve bugs. In this paper, we compared the effectiveness of semantic textual similarity methods for retrieving similar bug reports based on a similarity score. We explored several embedding models such as TF-IDF (Baseline), FastText, Gensim, BERT, and ADA. We used the Software Defects Data containing bug reports for various software projects to evaluate the performance of these models. Our experimental results showed that BERT generally outperformed the rest of the models regarding recall, followed by ADA, Gensim, FastText, and TFIDF. Our study provides insights into the effectiveness of different embedding methods for retrieving similar bug reports and highlights the impact of selecting the appropriate one for this task. Our code is available on GitHub.  ( 2 min )
    Mixture of Experts with Uncertainty Voting for Imbalanced Deep Regression Problems. (arXiv:2305.15178v2 [cs.LG] UPDATED)
    Data imbalance is ubiquitous when applying machine learning to real-world problems, particularly regression problems. If training data are imbalanced, the learning is dominated by the densely covered regions of the target distribution, consequently, the learned regressor tends to exhibit poor performance in sparsely covered regions. Beyond standard measures like over-sampling or re-weighting, there are two main directions to handle learning from imbalanced data. For regression, recent work relies on the continuity of the distribution; whereas for classification there has been a trend to employ mixture-of-expert models and let some ensemble members specialize in predictions for the sparser regions. In our method, dubbed MOUV, we propose to leverage recent work on probabilistic deep learning and integrate it in a mixture-of-experts approach for imbalanced regression. We replace traditional regression losses with negative log-likelihood which also predicts sample-wise aleatoric uncertainty. We show experimentally that such a loss handles the imbalance better. Secondly, we use the readily available aleatoric uncertainty values to fuse the predictions of a mixture-of-experts model, thus obviating the need for a separate aggregation module. We compare our method with existing alternatives on multiple public benchmarks and show that MOUV consistently outperforms the prior art, while at the same time producing better calibrated uncertainty estimates. Our code is available at link-upon-publication.  ( 2 min )
    Interior Point Constrained Reinforcement Learning with Global Convergence Guarantees. (arXiv:2312.00561v1 [cs.LG])
    We consider discounted infinite horizon constrained Markov decision processes (CMDPs) where the goal is to find an optimal policy that maximizes the expected cumulative reward subject to expected cumulative constraints. Motivated by the application of CMDPs in online learning of safety-critical systems, we focus on developing an algorithm that ensures constraint satisfaction during learning. To this end, we develop a zeroth-order interior point approach based on the log barrier function of the CMDP. Under the commonly assumed conditions of Fisher non-degeneracy and bounded transfer error of the policy parameterization, we establish the theoretical properties of the algorithm. In particular, in contrast to existing CMDP approaches that ensure policy feasibility only upon convergence, our algorithm guarantees feasibility of the policies during the learning process and converges to the optimal policy with a sample complexity of $O(\varepsilon^{-6})$. In comparison to the state-of-the-art policy gradient-based algorithm, C-NPG-PDA, our algorithm requires an additional $O(\varepsilon^{-2})$ samples to ensure policy feasibility during learning with same Fisher-non-degenerate parameterization.  ( 2 min )
    EvE: Exploiting Generative Priors for Radiance Field Enrichment. (arXiv:2312.00639v1 [cs.CV])
    Modeling large-scale scenes from unconstrained image collections in-the-wild has proven to be a major challenge in computer vision. Existing methods tackling in-the-wild neural rendering operate in a closed-world setting, where knowledge is limited to a scene's captured images within a training set. We propose EvE, which is, to the best of our knowledge, the first method leveraging generative priors to improve in-the-wild scene modeling. We employ pre-trained generative networks to enrich K-Planes representations with extrinsic knowledge. To this end, we define an alternating training procedure to conduct optimization guidance of K-Planes trained on the training set. We carry out extensive experiments and verify the merit of our method on synthetic data as well as real tourism photo collections. EvE enhances rendered scenes with richer details and outperforms the state of the art on the task of novel view synthesis in-the-wild. Our project page can be found at https://eve-nvs.github.io .
    Bandwidth Selection for Gaussian Kernel Ridge Regression via Jacobian Control. (arXiv:2205.11956v4 [stat.ML] UPDATED)
    Most machine learning methods require tuning of hyper-parameters. For kernel ridge regression with the Gaussian kernel, the hyper-parameter is the bandwidth. The bandwidth specifies the length scale of the kernel and has to be carefully selected to obtain a model with good generalization. The default methods for bandwidth selection, cross-validation and marginal likelihood maximization, often yield good results, albeit at high computational costs. Inspired by Jacobian regularization, we formulate an approximate expression for how the derivatives of the functions inferred by kernel ridge regression with the Gaussian kernel depend on the kernel bandwidth. We use this expression to propose a closed-form, computationally feather-light, bandwidth selection heuristic, based on controlling the Jacobian. In addition, the Jacobian expression illuminates how the bandwidth selection is a trade-off between the smoothness of the inferred function and the conditioning of the training data kernel matrix. We show on real and synthetic data that compared to cross-validation and marginal likelihood maximization, our method is on pair in terms of model performance, but up to six orders of magnitude faster.
    Understanding Forward Process of Convolutional Neural Network. (arXiv:2307.15090v2 [cs.LG] UPDATED)
    This paper reveal the selective rotation in the CNNs' forward processing. It elucidates the activation function as a discerning mechanism that unifies and quantizes the rotational aspects of the input data. Experiments show how this defined methodology reflects the progress network distinguish inputs based on statistical indicators, which can be comprehended or analyzed by applying structured mathematical tools. Our findings also unveil the consistency between artificial neural networks and the human brain in their data processing pattern.
    Towards Clinical Prediction with Transparency: An Explainable AI Approach to Survival Modelling in Residential Aged Care. (arXiv:2312.00271v1 [cs.LG])
    Background: Accurate survival time estimates aid end-of-life medical decision-making. Objectives: Develop an interpretable survival model for elderly residential aged care residents using advanced machine learning. Setting: A major Australasian residential aged care provider. Participants: Residents aged 65+ admitted for long-term care from July 2017 to August 2023. Sample size: 11,944 residents across 40 facilities. Predictors: Factors include age, gender, health status, co-morbidities, cognitive function, mood, nutrition, mobility, smoking, sleep, skin integrity, and continence. Outcome: Probability of survival post-admission, specifically calibrated for 6-month survival estimates. Statistical Analysis: Tested CoxPH, EN, RR, Lasso, GB, XGB, and RF models in 20 experiments with a 90/10 train/test split. Evaluated accuracy using C-index, Harrell's C-index, dynamic AUROC, IBS, and calibrated ROC. Chose XGB for its performance and calibrated it for 1, 3, 6, and 12-month predictions using Platt scaling. Employed SHAP values to analyze predictor impacts. Results: GB, XGB, and RF models showed the highest C-Index values (0.714, 0.712, 0.712). The optimal XGB model demonstrated a 6-month survival prediction AUROC of 0.746 (95% CI 0.744-0.749). Key mortality predictors include age, male gender, mobility, health status, pressure ulcer risk, and appetite. Conclusions: The study successfully applies machine learning to create a survival model for aged care, aligning with clinical insights on mortality risk factors and enhancing model interpretability and clinical utility through explainable AI.
    Local monotone operator learning using non-monotone operators: MnM-MOL. (arXiv:2312.00386v1 [eess.IV])
    The recovery of magnetic resonance (MR) images from undersampled measurements is a key problem that has seen extensive research in recent years. Unrolled approaches, which rely on end-to-end training of convolutional neural network (CNN) blocks within iterative reconstruction algorithms, offer state-of-the-art performance. These algorithms require a large amount of memory during training, making them difficult to employ in high-dimensional applications. Deep equilibrium (DEQ) models and the recent monotone operator learning (MOL) approach were introduced to eliminate the need for unrolling, thus reducing the memory demand during training. Both approaches require a Lipschitz constraint on the network to ensure that the forward and backpropagation iterations converge. Unfortunately, the constraint often results in reduced performance compared to unrolled methods. The main focus of this work is to relax the constraint on the CNN block in two different ways. Inspired by convex-non-convex regularization strategies, we now impose the monotone constraint on the sum of the gradient of the data term and the CNN block, rather than constrain the CNN itself to be a monotone operator. This approach enables the CNN to learn possibly non-monotone score functions, which can translate to improved performance. In addition, we only restrict the operator to be monotone in a local neighborhood around the image manifold. Our theoretical results show that the proposed algorithm is guaranteed to converge to the fixed point and that the solution is robust to input perturbations, provided that it is initialized close to the true solution. Our empirical results show that the relaxed constraints translate to improved performance and that the approach enjoys robustness to input perturbations similar to MOL.
    Secure Transformer Inference. (arXiv:2312.00025v1 [cs.CR])
    We present a three-party protocol that can protect both Transformer parameters and user data during the inference phase. For each feedforward inference process, our protocol only introduces permutation computation of input and output data on the user side. Our protocol, Secure Transformer Inference Protocol (STIP), can be applied to real-world services like ChatGPT.
    Stability-Informed Initialization of Neural Ordinary Differential Equations. (arXiv:2311.15890v2 [cs.LG] UPDATED)
    This paper addresses the training of Neural Ordinary Differential Equations (neural ODEs), and in particular explores the interplay between numerical integration techniques, stability regions, step size, and initialization techniques. It is shown how the choice of integration technique implicitly regularizes the learned model, and how the solver's corresponding stability region affects training and prediction performance. From this analysis, a stability-informed parameter initialization technique is introduced. The effectiveness of the initialization method is displayed across several learning benchmarks and industrial applications.
    Safe Reinforcement Learning in Tensor Reproducing Kernel Hilbert Space. (arXiv:2312.00727v1 [cs.LG])
    This paper delves into the problem of safe reinforcement learning (RL) in a partially observable environment with the aim of achieving safe-reachability objectives. In traditional partially observable Markov decision processes (POMDP), ensuring safety typically involves estimating the belief in latent states. However, accurately estimating an optimal Bayesian filter in POMDP to infer latent states from observations in a continuous state space poses a significant challenge, largely due to the intractable likelihood. To tackle this issue, we propose a stochastic model-based approach that guarantees RL safety almost surely in the face of unknown system dynamics and partial observation environments. We leveraged the Predictive State Representation (PSR) and Reproducing Kernel Hilbert Space (RKHS) to represent future multi-step observations analytically, and the results in this context are provable. Furthermore, we derived essential operators from the kernel Bayes' rule, enabling the recursive estimation of future observations using various operators. Under the assumption of \textit{undercompleness}, a polynomial sample complexity is established for the RL algorithm for the infinite size of observation and action spaces, ensuring an $\epsilon-$suboptimal safe policy guarantee.
    Universal representation by Boltzmann machines with Regularised Axons. (arXiv:2310.14395v2 [cond-mat.stat-mech] UPDATED)
    It is widely known that Boltzmann machines are capable of representing arbitrary probability distributions over the values of their visible neurons, given enough hidden ones. However, sampling -- and thus training -- these models can be numerically hard. Recently we proposed a regularisation of the connections of Boltzmann machines, in order to control the energy landscape of the model, paving a way for efficient sampling and training. Here we formally prove that such regularised Boltzmann machines preserve the ability to represent arbitrary distributions. This is in conjunction with controlling the number of energy local minima, thus enabling easy \emph{guided} sampling and training. Furthermore, we explicitly show that regularised Boltzmann machines can store exponentially many arbitrarily correlated visible patterns with perfect retrieval, and we connect them to the Dense Associative Memory networks.
    Use Perturbations when Learning from Explanations. (arXiv:2303.06419v3 [cs.LG] UPDATED)
    Machine learning from explanations (MLX) is an approach to learning that uses human-provided explanations of relevant or irrelevant features for each input to ensure that model predictions are right for the right reasons. Existing MLX approaches rely on local model interpretation methods and require strong model smoothing to align model and human explanations, leading to sub-optimal performance. We recast MLX as a robustness problem, where human explanations specify a lower dimensional manifold from which perturbations can be drawn, and show both theoretically and empirically how this approach alleviates the need for strong model smoothing. We consider various approaches to achieving robustness, leading to improved performance over prior MLX methods. Finally, we show how to combine robustness with an earlier MLX method, yielding state-of-the-art results on both synthetic and real-world benchmarks.
    Domain Adaptation: Learning Bounds and Algorithms. (arXiv:0902.3430v3 [cs.LG] UPDATED)
    This paper addresses the general problem of domain adaptation which arises in a variety of applications where the distribution of the labeled sample available somewhat differs from that of the test data. Building on previous work by Ben-David et al. (2007), we introduce a novel distance between distributions, discrepancy distance, that is tailored to adaptation problems with arbitrary loss functions. We give Rademacher complexity bounds for estimating the discrepancy distance from finite samples for different loss functions. Using this distance, we derive novel generalization bounds for domain adaptation for a wide family of loss functions. We also present a series of novel adaptation bounds for large classes of regularization-based algorithms, including support vector machines and kernel ridge regression based on the empirical discrepancy. This motivates our analysis of the problem of minimizing the empirical discrepancy for various loss functions for which we also give novel algorithms. We report the results of preliminary experiments that demonstrate the benefits of our discrepancy minimization algorithms for domain adaptation.
    HyperAttention: Long-context Attention in Near-Linear Time. (arXiv:2310.05869v3 [cs.LG] UPDATED)
    We present an approximate attention mechanism named HyperAttention to address the computational challenges posed by the growing complexity of long contexts used in Large Language Models (LLMs). Recent work suggests that in the worst-case scenario, quadratic time is necessary unless the entries of the attention matrix are bounded or the matrix has low stable rank. We introduce two parameters which measure: (1) the max column norm in the normalized attention matrix, and (2) the ratio of row norms in the unnormalized attention matrix after detecting and removing large entries. We use these fine-grained parameters to capture the hardness of the problem. Despite previous lower bounds, we are able to achieve a linear time sampling algorithm even when the matrix has unbounded entries or a large stable rank, provided the above parameters are small. HyperAttention features a modular design that easily accommodates integration of other fast low-level implementations, particularly FlashAttention. Empirically, employing Locality Sensitive Hashing (LSH) to identify large entries, HyperAttention outperforms existing methods, giving significant speed improvements compared to state-of-the-art solutions like FlashAttention. We validate the empirical performance of HyperAttention on a variety of different long-context length datasets. For example, HyperAttention makes the inference time of ChatGLM2 50\% faster on 32k context length while perplexity increases from 5.6 to 6.3. On larger context length, e.g., 131k, with causal masking, HyperAttention offers 5-fold speedup on a single attention layer.
    From Mutual Information to Expected Dynamics: New Generalization Bounds for Heavy-Tailed SGD. (arXiv:2312.00427v1 [stat.ML])
    Understanding the generalization abilities of modern machine learning algorithms has been a major research topic over the past decades. In recent years, the learning dynamics of Stochastic Gradient Descent (SGD) have been related to heavy-tailed dynamics. This has been successfully applied to generalization theory by exploiting the fractal properties of those dynamics. However, the derived bounds depend on mutual information (decoupling) terms that are beyond the reach of computability. In this work, we prove generalization bounds over the trajectory of a class of heavy-tailed dynamics, without those mutual information terms. Instead, we introduce a geometric decoupling term by comparing the learning dynamics (depending on the empirical risk) with an expected one (depending on the population risk). We further upper-bound this geometric term, by using techniques from the heavy-tailed and the fractal literature, making it fully computable. Moreover, as an attempt to tighten the bounds, we propose a PAC-Bayesian setting based on perturbed dynamics, in which the same geometric term plays a crucial role and can still be bounded using the techniques described above.
    The Multiverse of Dynamic Mode Decomposition Algorithms. (arXiv:2312.00137v1 [math.DS])
    Dynamic Mode Decomposition (DMD) is a popular data-driven analysis technique used to decompose complex, nonlinear systems into a set of modes, revealing underlying patterns and dynamics through spectral analysis. This review presents a comprehensive and pedagogical examination of DMD, emphasizing the role of Koopman operators in transforming complex nonlinear dynamics into a linear framework. A distinctive feature of this review is its focus on the relationship between DMD and the spectral properties of Koopman operators, with particular emphasis on the theory and practice of DMD algorithms for spectral computations. We explore the diverse "multiverse" of DMD methods, categorized into three main areas: linear regression-based methods, Galerkin approximations, and structure-preserving techniques. Each category is studied for its unique contributions and challenges, providing a detailed overview of significant algorithms and their applications as outlined in Table 1. We include a MATLAB package with examples and applications to enhance the practical understanding of these methods. This review serves as both a practical guide and a theoretical reference for various DMD methods, accessible to both experts and newcomers, and enabling readers to delve into their areas of interest in the expansive field of DMD.
    Interpreting and Disentangling Feature Components of Various Complexity from DNNs. (arXiv:2006.15920v2 [cs.LG] UPDATED)
    This paper aims to define, quantify, and analyze the feature complexity that is learned by a DNN. We propose a generic definition for the feature complexity. Given the feature of a certain layer in the DNN, our method disentangles feature components of different complexity orders from the feature. We further design a set of metrics to evaluate the reliability, the effectiveness, and the significance of over-fitting of these feature components. Furthermore, we successfully discover a close relationship between the feature complexity and the performance of DNNs. As a generic mathematical tool, the feature complexity and the proposed metrics can also be used to analyze the success of network compression and knowledge distillation.
    On the Identifiability of Switching Dynamical Systems. (arXiv:2305.15925v3 [stat.ML] UPDATED)
    In the realm of interpretability and out-of-distribution generalisation, the identifiability of latent variable models has emerged as a captivating field of inquiry. In this work, we delve into the identifiability of Switching Dynamical Systems, taking an initial stride toward extending identifiability analysis to sequential latent variable models. We first prove the identifiability of Markov Switching Models, which commonly serve as the prior distribution for the continuous latent variables in Switching Dynamical Systems. We present identification conditions for first-order Markov dependency structures, whose transition distribution is parametrised via non-linear Gaussians. We then establish the identifiability of the latent variables and non-linear mappings in Switching Dynamical Systems up to affine transformations, by leveraging identifiability analysis techniques from identifiable deep latent variable models. We finally develop estimation algorithms for identifiable Switching Dynamical Systems. Throughout empirical studies, we demonstrate the practicality of identifiable Switching Dynamical Systems for segmenting high-dimensional time series such as videos, and showcase the use of identifiable Markov Switching Models for regime-dependent causal discovery in climate data.
    Towards Aligned Canonical Correlation Analysis: Preliminary Formulation and Proof-of-Concept Results. (arXiv:2312.00296v1 [cs.LG])
    Canonical Correlation Analysis (CCA) has been widely applied to jointly embed multiple views of data in a maximally correlated latent space. However, the alignment between various data perspectives, which is required by traditional approaches, is unclear in many practical cases. In this work we propose a new framework Aligned Canonical Correlation Analysis (ACCA), to address this challenge by iteratively solving the alignment and multi-view embedding.
    iTransformer: Inverted Transformers Are Effective for Time Series Forecasting. (arXiv:2310.06625v2 [cs.LG] UPDATED)
    The recent boom of linear forecasting models questions the ongoing passion for architectural modifications of Transformer-based forecasters. These forecasters leverage Transformers to model the global dependencies over temporal tokens of time series, with each token formed by multiple variates of the same timestamp. However, Transformers are challenged in forecasting series with larger lookback windows due to performance degradation and computation explosion. Besides, the embedding for each temporal token fuses multiple variates that represent potential delayed events and distinct physical measurements, which may fail in learning variate-centric representations and result in meaningless attention maps. In this work, we reflect on the competent duties of Transformer components and repurpose the Transformer architecture without any modification to the basic components. We propose iTransformer that simply applies the attention and feed-forward network on the inverted dimensions. Specifically, the time points of individual series are embedded into variate tokens which are utilized by the attention mechanism to capture multivariate correlations; meanwhile, the feed-forward network is applied for each variate token to learn nonlinear representations. The iTransformer model achieves state-of-the-art on challenging real-world datasets, which further empowers the Transformer family with promoted performance, generalization ability across different variates, and better utilization of arbitrary lookback windows, making it a nice alternative as the fundamental backbone of time series forecasting.
    Optimal Attack and Defense for Reinforcement Learning. (arXiv:2312.00198v1 [cs.LG])
    To ensure the usefulness of Reinforcement Learning (RL) in real systems, it is crucial to ensure they are robust to noise and adversarial attacks. In adversarial RL, an external attacker has the power to manipulate the victim agent's interaction with the environment. We study the full class of online manipulation attacks, which include (i) state attacks, (ii) observation attacks (which are a generalization of perceived-state attacks), (iii) action attacks, and (iv) reward attacks. We show the attacker's problem of designing a stealthy attack that maximizes its own expected reward, which often corresponds to minimizing the victim's value, is captured by a Markov Decision Process (MDP) that we call a meta-MDP since it is not the true environment but a higher level environment induced by the attacked interaction. We show that the attacker can derive optimal attacks by planning in polynomial time or learning with polynomial sample complexity using standard RL techniques. We argue that the optimal defense policy for the victim can be computed as the solution to a stochastic Stackelberg game, which can be further simplified into a partially-observable turn-based stochastic game (POTBSG). Neither the attacker nor the victim would benefit from deviating from their respective optimal policies, thus such solutions are truly robust. Although the defense problem is NP-hard, we show that optimal Markovian defenses can be computed (learned) in polynomial time (sample complexity) in many scenarios.
    Beta Diffusion. (arXiv:2309.07867v3 [cs.LG] UPDATED)
    We introduce beta diffusion, a novel generative modeling method that integrates demasking and denoising to generate data within bounded ranges. Using scaled and shifted beta distributions, beta diffusion utilizes multiplicative transitions over time to create both forward and reverse diffusion processes, maintaining beta distributions in both the forward marginals and the reverse conditionals, given the data at any point in time. Unlike traditional diffusion-based generative models relying on additive Gaussian noise and reweighted evidence lower bounds (ELBOs), beta diffusion is multiplicative and optimized with KL-divergence upper bounds (KLUBs) derived from the convexity of the KL divergence. We demonstrate that the proposed KLUBs are more effective for optimizing beta diffusion compared to negative ELBOs, which can also be derived as the KLUBs of the same KL divergence with its two arguments swapped. The loss function of beta diffusion, expressed in terms of Bregman divergence, further supports the efficacy of KLUBs for optimization. Experimental results on both synthetic data and natural images demonstrate the unique capabilities of beta diffusion in generative modeling of range-bounded data and validate the effectiveness of KLUBs in optimizing diffusion models, thereby making them valuable additions to the family of diffusion-based generative models and the optimization techniques used to train them.
    Online Influence Maximization: Concept and Algorithm. (arXiv:2312.00099v1 [cs.SI])
    In this survey, we offer an extensive overview of the Online Influence Maximization (IM) problem by covering both theoretical aspects and practical applications. For the integrity of the article and because the online algorithm takes an offline oracle as a subroutine, we first make a clear definition of the Offline IM problem and summarize those commonly used Offline IM algorithms, which include traditional approximation or heuristic algorithms and ML-based algorithms. Then, we give a standard definition of the Online IM problem and a basic Combinatorial Multi-Armed Bandit (CMAB) framework, CMAB-T. Here, we summarize three types of feedback in the CMAB model and discuss in detail how to study the Online IM problem based on the CMAB-T model. This paves the way for solving the Online IM problem by using online learning methods. Furthermore, we have covered almost all Online IM algorithms up to now, focusing on characteristics and theoretical guarantees of online algorithms for different feedback types. Here, we elaborately explain their working principle and how to obtain regret bounds. Besides, we also collect plenty of innovative ideas about problem definition and algorithm designs and pioneering works for variants of the Online IM problem and their corresponding algorithms. Finally, we encapsulate current challenges and outline prospective research directions from four distinct perspectives.
    RIS-Based On-the-Air Semantic Communications -- a Diffractional Deep Neural Network Approach. (arXiv:2312.00535v1 [eess.SP])
    Semantic communication has gained significant attention recently due to its advantages in achieving higher transmission efficiency by focusing on semantic information instead of bit-level information. However, current AI-based semantic communication methods require digital hardware for implementation. With the rapid advancement on reconfigurable intelligence surfaces (RISs), a new approach called on-the-air diffractional deep neural networks (D$^2$NN) can be utilized to enable semantic communications on the wave domain. This paper proposes a new paradigm of RIS-based on-the-air semantic communications, where the computational process occurs inherently as wireless signals pass through RISs. We present the system model and discuss the data and control flows of this scheme, followed by a performance analysis using image transmission as an example. In comparison to traditional hardware-based approaches, RIS-based semantic communications offer appealing features, such as light-speed computation, low computational power requirements, and the ability to handle multiple tasks simultaneously.
    A machine-learning approach to thunderstorm forecasting through post-processing of simulation data. (arXiv:2303.08736v2 [physics.ao-ph] UPDATED)
    Thunderstorms pose a major hazard to society and economy, which calls for reliable thunderstorm forecasts. In this work, we introduce SALAMA, a feedforward neural network model for identifying thunderstorm occurrence in numerical weather prediction (NWP) data. The model is trained on convection-resolving ensemble forecasts over Central Europe and lightning observations. Given only a set of pixel-wise input parameters that are extracted from NWP data and related to thunderstorm development, SALAMA infers the probability of thunderstorm occurrence in a reliably calibrated manner. For lead times up to eleven hours, we find a forecast skill superior to classification based only on NWP reflectivity. Varying the spatiotemporal criteria by which we associate lightning observations with NWP data, we show that the time scale for skillful thunderstorm predictions increases linearly with the spatial scale of the forecast.
    TpuGraphs: A Performance Prediction Dataset on Large Tensor Computational Graphs. (arXiv:2308.13490v2 [cs.LG] UPDATED)
    Precise hardware performance models play a crucial role in code optimizations. They can assist compilers in making heuristic decisions or aid autotuners in identifying the optimal configuration for a given program. For example, the autotuner for XLA, a machine learning compiler, discovered 10-20% speedup on state-of-the-art models serving substantial production traffic at Google. Although there exist a few datasets for program performance prediction, they target small sub-programs such as basic blocks or kernels. This paper introduces TpuGraphs, a performance prediction dataset on full tensor programs, represented as computational graphs, running on Tensor Processing Units (TPUs). Each graph in the dataset represents the main computation of a machine learning workload, e.g., a training epoch or an inference step. Each data sample contains a computational graph, a compilation configuration, and the execution time of the graph when compiled with the configuration. The graphs in the dataset are collected from open-source machine learning programs, featuring popular model architectures, e.g., ResNet, EfficientNet, Mask R-CNN, and Transformer. TpuGraphs provides 25x more graphs than the largest graph property prediction dataset (with comparable graph sizes), and 770x larger graphs on average compared to existing performance prediction datasets on machine learning programs. This graph-level prediction task on large graphs introduces new challenges in learning, ranging from scalability, training efficiency, to model quality.
    MLLMs-Augmented Visual-Language Representation Learning. (arXiv:2311.18765v2 [cs.CV] UPDATED)
    Visual-language pre-training (VLP) has achieved remarkable success in multi-modal tasks, largely attributed to the availability of large-scale image-text datasets. In this work, we demonstrate that multi-modal large language models (MLLMs) can enhance visual-language representation learning by improving data quality. Our approach is simple, utilizing MLLMs to extend multiple captions for each image. To prevent the bias introduced by MLLMs' hallucinations and intrinsic caption styles, we propose "text shearing" to maintain the same length for extended captions as that of the original captions. In image-text retrieval, our method consistently obtains 5.6 ~ 35.0% and 16.8 ~ 46.1% improvement on R@1 under the fine-tuning and zero-shot settings, respectively. Notably, we obtain zero-shot results that are comparable to fine-tuning on target datasets, which encourages more exploration of the versatile use of MLLMs.
    GraphDreamer: Compositional 3D Scene Synthesis from Scene Graphs. (arXiv:2312.00093v1 [cs.CV])
    As pretrained text-to-image diffusion models become increasingly powerful, recent efforts have been made to distill knowledge from these text-to-image pretrained models for optimizing a text-guided 3D model. Most of the existing methods generate a holistic 3D model from a plain text input. This can be problematic when the text describes a complex scene with multiple objects, because the vectorized text embeddings are inherently unable to capture a complex description with multiple entities and relationships. Holistic 3D modeling of the entire scene further prevents accurate grounding of text entities and concepts. To address this limitation, we propose GraphDreamer, a novel framework to generate compositional 3D scenes from scene graphs, where objects are represented as nodes and their interactions as edges. By exploiting node and edge information in scene graphs, our method makes better use of the pretrained text-to-image diffusion model and is able to fully disentangle different objects without image-level supervision. To facilitate modeling of object-wise relationships, we use signed distance fields as representation and impose a constraint to avoid inter-penetration of objects. To avoid manual scene graph creation, we design a text prompt for ChatGPT to generate scene graphs based on text inputs. We conduct both qualitative and quantitative experiments to validate the effectiveness of GraphDreamer in generating high-fidelity compositional 3D scenes with disentangled object entities.
    Self-similarity of Communities of the ABCD Model. (arXiv:2312.00238v1 [cs.SI])
    The Artificial Benchmark for Community Detection (ABCD) graph is a random graph model with community structure and power-law distribution for both degrees and community sizes. The model generates graphs similar to the well-known LFR model but it is faster and can be investigated analytically. In this paper, we show that the ABCD model exhibits some interesting self-similar behaviour, namely, the degree distribution of ground-truth communities is asymptotically the same as the degree distribution of the whole graph (appropriately normalized based on their sizes). As a result, we can not only estimate the number of edges induced by each community but also the number of self-loops and multi-edges generated during the process. Understanding these quantities is important as (a) rewiring self-loops and multi-edges to keep the graph simple is an expensive part of the algorithm, and (b) every rewiring causes the underlying configuration models to deviate slightly from uniform simple graphs on their corresponding degree sequences.
    RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment. (arXiv:2304.06767v4 [cs.LG] UPDATED)
    Generative foundation models are susceptible to implicit biases that can arise from extensive unsupervised training data. Such biases can produce suboptimal samples, skewed outcomes, and unfairness, with potentially serious consequences. Consequently, aligning these models with human ethics and preferences is an essential step toward ensuring their responsible and effective deployment in real-world applications. Prior research has primarily employed Reinforcement Learning from Human Feedback (RLHF) to address this problem, where generative models are fine-tuned with RL algorithms guided by a human-feedback-informed reward model. However, the inefficiencies and instabilities associated with RL algorithms frequently present substantial obstacles to the successful alignment, necessitating the development of a more robust and streamlined approach. To this end, we introduce a new framework, Reward rAnked FineTuning (RAFT), designed to align generative models effectively. Utilizing a reward model and a sufficient number of samples, our approach selects the high-quality samples, discarding those that exhibit undesired behavior, and subsequently enhancing the model by fine-tuning on these filtered samples. Our studies show that RAFT can effectively improve the model performance in both reward learning and other automated metrics in both large language models and diffusion models.
    Benchmarking Multi-Domain Active Learning on Image Classification. (arXiv:2312.00364v1 [cs.LG])
    Active learning aims to enhance model performance by strategically labeling informative data points. While extensively studied, its effectiveness on large-scale, real-world datasets remains underexplored. Existing research primarily focuses on single-source data, ignoring the multi-domain nature of real-world data. We introduce a multi-domain active learning benchmark to bridge this gap. Our benchmark demonstrates that traditional single-domain active learning strategies are often less effective than random selection in multi-domain scenarios. We also introduce CLIP-GeoYFCC, a novel large-scale image dataset built around geographical domains, in contrast to existing genre-based domain datasets. Analysis on our benchmark shows that all multi-domain strategies exhibit significant tradeoffs, with no strategy outperforming across all datasets or all metrics, emphasizing the need for future research.
    A framework for mining lifestyle profiles through multi-dimensional and high-order mobility feature clustering. (arXiv:2312.00411v1 [cs.LG])
    Human mobility demonstrates a high degree of regularity, which facilitates the discovery of lifestyle profiles. Existing research has yet to fully utilize the regularities embedded in high-order features extracted from human mobility records in such profiling. This study proposes a progressive feature extraction strategy that mines high-order mobility features from users' moving trajectory records from the spatial, temporal, and semantic dimensions. Specific features are extracted such as travel motifs, rhythms decomposed by discrete Fourier transform (DFT) of mobility time series, and vectorized place semantics by word2vec, respectively to the three dimensions, and they are further clustered to reveal the users' lifestyle characteristics. An experiment using a trajectory dataset of over 500k users in Shenzhen, China yields seven user clusters with different lifestyle profiles that can be well interpreted by common sense. The results suggest the possibility of fine-grained user profiling through cross-order trajectory feature engineering and clustering.
    Simple Transferability Estimation for Regression Tasks. (arXiv:2312.00656v1 [cs.LG])
    We consider transferability estimation, the problem of estimating how well deep learning models transfer from a source to a target task. We focus on regression tasks, which received little previous attention, and propose two simple and computationally efficient approaches that estimate transferability based on the negative regularized mean squared error of a linear regression model. We prove novel theoretical results connecting our approaches to the actual transferability of the optimal target models obtained from the transfer learning process. Despite their simplicity, our approaches significantly outperform existing state-of-the-art regression transferability estimators in both accuracy and efficiency. On two large-scale keypoint regression benchmarks, our approaches yield 12% to 36% better results on average while being at least 27% faster than previous state-of-the-art methods.
    Towards A Foundation Model For Trajectory Intelligence. (arXiv:2312.00076v1 [cs.LG])
    We present the results of training a large trajectory model using real-world user check-in data. Our approach follows a pre-train and fine-tune paradigm, where a base model is pre-trained via masked trajectory modeling and then adapted through fine-tuning for various downstream tasks. To address challenges posed by noisy data and large spatial vocabularies, we propose a novel spatial tokenization block. Our empirical analysis utilizes a comprehensive dataset of over 2 billion check-ins generated by more than 6 million users. Through fine-tuning on 3 downstream tasks we demonstrate that our base model has effectively learned valuable underlying patterns in raw data, enabling its application in meaningful trajectory intelligence tasks. Despite some limitations, we believe this work represents an important step forward in the realization of a foundation model for trajectory intelligence.
    One to beat them all: "RYU'' -- a unifying framework for the construction of safe balls. (arXiv:2312.00640v1 [math.OC])
    In this paper, we put forth a novel framework (named ``RYU'') for the construction of ``safe'' balls, i.e. regions that provably contain the dual solution of a target optimization problem. We concentrate on the standard setup where the cost function is the sum of two terms: a closed, proper, convex Lipschitz-smooth function and a closed, proper, convex function. The RYU framework is shown to generalize or improve upon all the results proposed in the last decade for the considered family of optimization problems.
    G-NM: A Group of Numerical Time Series Prediction Models. (arXiv:2306.11667v5 [cs.LG] UPDATED)
    In this study, we focus on the development and implementation of a comprehensive ensemble of numerical time series forecasting models, collectively referred to as the Group of Numerical Time Series Prediction Model (G-NM). This inclusive set comprises traditional models such as Autoregressive Integrated Moving Average (ARIMA), Holt-Winters' method, and Support Vector Regression (SVR), in addition to modern neural network models including Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM). G-NM is explicitly constructed to augment our predictive capabilities related to patterns and trends inherent in complex natural phenomena. By utilizing time series data relevant to these events, G-NM facilitates the prediction of such phenomena over extended periods. The primary objective of this research is to both advance our understanding of such occurrences and to significantly enhance the accuracy of our forecasts. G-NM encapsulates both linear and non-linear dependencies, seasonalities, and trends present in time series data. Each of these models contributes distinct strengths, from ARIMA's resilience in handling linear trends and seasonality, SVR's proficiency in capturing non-linear patterns, to LSTM's adaptability in modeling various components of time series data. Through the exploitation of the G-NM potential, we strive to advance the state-of-the-art in large-scale time series forecasting models. We anticipate that this research will represent a significant stepping stone in our ongoing endeavor to comprehend and forecast the complex events that constitute the natural world.
    GFN-SR: Symbolic Regression with Generative Flow Networks. (arXiv:2312.00396v1 [cs.LG])
    Symbolic regression (SR) is an area of interpretable machine learning that aims to identify mathematical expressions, often composed of simple functions, that best fit in a given set of covariates $X$ and response $y$. In recent years, deep symbolic regression (DSR) has emerged as a popular method in the field by leveraging deep reinforcement learning to solve the complicated combinatorial search problem. In this work, we propose an alternative framework (GFN-SR) to approach SR with deep learning. We model the construction of an expression tree as traversing through a directed acyclic graph (DAG) so that GFlowNet can learn a stochastic policy to generate such trees sequentially. Enhanced with an adaptive reward baseline, our method is capable of generating a diverse set of best-fitting expressions. Notably, we observe that GFN-SR outperforms other SR algorithms in noisy data regimes, owing to its ability to learn a distribution of rewards over a space of candidate solutions.
    Constructing Custom Thermodynamics Using Deep Learning. (arXiv:2308.04119v2 [cond-mat.soft] UPDATED)
    One of the most exciting applications of artificial intelligence (AI) is automated scientific discovery based on previously amassed data, coupled with restrictions provided by known physical principles, including symmetries and conservation laws. Such automated hypothesis creation and verification can assist scientists in studying complex phenomena, where traditional physical intuition may fail. Here we develop a platform based on a generalized Onsager principle to learn macroscopic dynamical descriptions of arbitrary stochastic dissipative systems directly from observations of their microscopic trajectories. Our method simultaneously constructs reduced thermodynamic coordinates and interprets the dynamics on these coordinates. We demonstrate its effectiveness by studying theoretically and validating experimentally the stretching of long polymer chains in an externally applied field. Specifically, we learn three interpretable thermodynamic coordinates and build a dynamical landscape of polymer stretching, including the identification of stable and transition states and the control of the stretching rate. Our general methodology can be used to address a wide range of scientific and technological applications.
    Tokenized Model: A Blockchain-Empowered Decentralized Model Ownership Verification Platform. (arXiv:2312.00048v1 [cs.CR])
    With the development of practical deep learning models like generative AI, their excellent performance has brought huge economic value. For instance, ChatGPT has attracted more than 100 million users in three months. Since the model training requires a lot of data and computing power, a well-performing deep learning model is behind a huge effort and cost. Facing various model attacks, unauthorized use and abuse from the network that threaten the interests of model owners, in addition to considering legal and other administrative measures, it is equally important to protect the model's copyright from the technical means. By using the model watermarking technology, we point out the possibility of building a unified platform for model ownership verification. Given the application history of blockchain in copyright verification and the drawbacks of a centralized third-party, this paper considers combining model watermarking technology and blockchain to build a unified model copyright protection platform. By a new solution we called Tokenized Model, it protects the model's copyright by reliable ownership record and verification mechanism. It also promotes the financial value of model by constructing the model's transaction process and contribution shares of a model. In the typical case study, we also study the various performance under usual scenario to verify the effectiveness of this platform.
    ReCEval: Evaluating Reasoning Chains via Correctness and Informativeness. (arXiv:2304.10703v2 [cs.CL] UPDATED)
    Multi-step reasoning ability is fundamental to many natural language tasks, yet it is unclear what constitutes a good reasoning chain and how to evaluate them. Most existing methods focus solely on whether the reasoning chain leads to the correct conclusion, but this answer-oriented view may confound reasoning quality with other spurious shortcuts to predict the answer. To bridge this gap, we evaluate reasoning chains by viewing them as informal proofs that derive the final answer. Specifically, we propose ReCEval (Reasoning Chain Evaluation), a framework that evaluates reasoning chains via two key properties: (1) correctness, i.e., each step makes a valid inference based on information contained within the step, preceding steps, and input context, and (2) informativeness, i.e., each step provides new information that is helpful towards deriving the generated answer. We evaluate these properties by developing metrics using natural language inference models and V-Information. On multiple datasets, we show that ReCEval effectively identifies various error types and yields notable improvements compared to prior methods. We analyze the impact of step boundaries, and previous steps on evaluating correctness and demonstrate that our informativeness metric captures the expected flow of information in high-quality reasoning chains. Finally, we show that scoring reasoning chains based on ReCEval improves downstream task performance. Our code is publicly available at: https://github.com/archiki/ReCEval
    Scalable Bayesian uncertainty quantification with data-driven priors for radio interferometric imaging. (arXiv:2312.00125v1 [astro-ph.IM])
    Next-generation radio interferometers like the Square Kilometer Array have the potential to unlock scientific discoveries thanks to their unprecedented angular resolution and sensitivity. One key to unlocking their potential resides in handling the deluge and complexity of incoming data. This challenge requires building radio interferometric imaging methods that can cope with the massive data sizes and provide high-quality image reconstructions with uncertainty quantification (UQ). This work proposes a method coined QuantifAI to address UQ in radio-interferometric imaging with data-driven (learned) priors for high-dimensional settings. Our model, rooted in the Bayesian framework, uses a physically motivated model for the likelihood. The model exploits a data-driven convex prior, which can encode complex information learned implicitly from simulations and guarantee the log-concavity of the posterior. We leverage probability concentration phenomena of high-dimensional log-concave posteriors that let us obtain information about the posterior, avoiding MCMC sampling techniques. We rely on convex optimisation methods to compute the MAP estimation, which is known to be faster and better scale with dimension than MCMC sampling strategies. Our method allows us to compute local credible intervals, i.e., Bayesian error bars, and perform hypothesis testing of structure on the reconstructed image. In addition, we propose a novel blazing-fast method to compute pixel-wise uncertainties at different scales. We demonstrate our method by reconstructing radio-interferometric images in a simulated setting and carrying out fast and scalable UQ, which we validate with MCMC sampling. Our method shows an improved image quality and more meaningful uncertainties than the benchmark method based on a sparsity-promoting prior. QuantifAI's source code: https://github.com/astro-informatics/QuantifAI.
    DeepTreeGANv2: Iterative Pooling of Point Clouds. (arXiv:2312.00042v1 [physics.data-an])
    In High Energy Physics, detailed and time-consuming simulations are used for particle interactions with detectors. To bypass these simulations with a generative model, the generation of large point clouds in a short time is required, while the complex dependencies between the particles must be correctly modelled. Particle showers are inherently tree-based processes, as each particle is produced by the decay or detector interaction of a particle of the previous generation. In this work, we present a significant extension to DeepTreeGAN, featuring a critic, that is able to aggregate such point clouds iteratively in a tree-based manner. We show that this model can reproduce complex distributions, and we evaluate its performance on the public JetNet 150 dataset.
    Removing Biases from Molecular Representations via Information Maximization. (arXiv:2312.00718v1 [cs.LG])
    High-throughput drug screening -- using cell imaging or gene expression measurements as readouts of drug effect -- is a critical tool in biotechnology to assess and understand the relationship between the chemical structure and biological activity of a drug. Since large-scale screens have to be divided into multiple experiments, a key difficulty is dealing with batch effects, which can introduce systematic errors and non-biological associations in the data. We propose InfoCORE, an Information maximization approach for COnfounder REmoval, to effectively deal with batch effects and obtain refined molecular representations. InfoCORE establishes a variational lower bound on the conditional mutual information of the latent representations given a batch identifier. It adaptively reweighs samples to equalize their implied batch distribution. Extensive experiments on drug screening data reveal InfoCORE's superior performance in a multitude of tasks including molecular property prediction and molecule-phenotype retrieval. Additionally, we show results for how InfoCORE offers a versatile framework and resolves general distribution shifts and issues of data fairness by minimizing correlation with spurious features or removing sensitive attributes. The code is available at https://github.com/uhlerlab/InfoCORE.
    Anomaly Detection via Learning-Based Sequential Controlled Sensing. (arXiv:2312.00088v1 [cs.LG])
    In this paper, we address the problem of detecting anomalies among a given set of binary processes via learning-based controlled sensing. Each process is parameterized by a binary random variable indicating whether the process is anomalous. To identify the anomalies, the decision-making agent is allowed to observe a subset of the processes at each time instant. Also, probing each process has an associated cost. Our objective is to design a sequential selection policy that dynamically determines which processes to observe at each time with the goal to minimize the delay in making the decision and the total sensing cost. We cast this problem as a sequential hypothesis testing problem within the framework of Markov decision processes. This formulation utilizes both a Bayesian log-likelihood ratio-based reward and an entropy-based reward. The problem is then solved using two approaches: 1) a deep reinforcement learning-based approach where we design both deep Q-learning and policy gradient actor-critic algorithms; and 2) a deep active inference-based approach. Using numerical experiments, we demonstrate the efficacy of our algorithms and show that our algorithms adapt to any unknown statistical dependence pattern of the processes.
    SmartChoices: Augmenting Software with Learned Implementations. (arXiv:2304.13033v2 [cs.SE] UPDATED)
    We are living in a golden age of machine learning. Powerful models perform many tasks far better than is possible using traditional software engineering approaches alone. However, developing and deploying these models in existing software systems remains challenging. In this paper, we present SmartChoices, a novel approach to incorporating machine learning into mature software stacks easily, safely, and effectively. We highlight key design decisions and present case studies applying SmartChoices within a range of large-scale industrial systems.
    PDB-Struct: A Comprehensive Benchmark for Structure-based Protein Design. (arXiv:2312.00080v1 [q-bio.QM])
    Structure-based protein design has attracted increasing interest, with numerous methods being introduced in recent years. However, a universally accepted method for evaluation has not been established, since the wet-lab validation can be overly time-consuming for the development of new algorithms, and the $\textit{in silico}$ validation with recovery and perplexity metrics is efficient but may not precisely reflect true foldability. To address this gap, we introduce two novel metrics: refoldability-based metric, which leverages high-accuracy protein structure prediction models as a proxy for wet lab experiments, and stability-based metric, which assesses whether models can assign high likelihoods to experimentally stable proteins. We curate datasets from high-quality CATH protein data, high-throughput $\textit{de novo}$ designed proteins, and mega-scale experimental mutagenesis experiments, and in doing so, present the $\textbf{PDB-Struct}$ benchmark that evaluates both recent and previously uncompared protein design methods. Experimental results indicate that ByProt, ProteinMPNN, and ESM-IF perform exceptionally well on our benchmark, while ESM-Design and AF-Design fall short on the refoldability metric. We also show that while some methods exhibit high sequence recovery, they do not perform as well on our new benchmark. Our proposed benchmark paves the way for a fair and comprehensive evaluation of protein design methods in the future. Code is available at https://github.com/WANG-CR/PDB-Struct.
    HeTriNet: Heterogeneous Graph Triplet Attention Network for Drug-Target-Disease Interaction. (arXiv:2312.00189v1 [cs.LG])
    Modeling the interactions between drugs, targets, and diseases is paramount in drug discovery and has significant implications for precision medicine and personalized treatments. Current approaches frequently consider drug-target or drug-disease interactions individually, ignoring the interdependencies among all three entities. Within human metabolic systems, drugs interact with protein targets in cells, influencing target activities and subsequently impacting biological pathways to promote healthy functions and treat diseases. Moving beyond binary relationships and exploring tighter triple relationships is essential to understanding drugs' mechanism of action (MoAs). Moreover, identifying the heterogeneity of drugs, targets, and diseases, along with their distinct characteristics, is critical to model these complex interactions appropriately. To address these challenges, we effectively model the interconnectedness of all entities in a heterogeneous graph and develop a novel Heterogeneous Graph Triplet Attention Network (\texttt{HeTriNet}). \texttt{HeTriNet} introduces a novel triplet attention mechanism within this heterogeneous graph structure. Beyond pairwise attention as the importance of an entity for the other one, we define triplet attention to model the importance of pairs for entities in the drug-target-disease triplet prediction problem. Experimental results on real-world datasets show that \texttt{HeTriNet} outperforms several baselines, demonstrating its remarkable proficiency in uncovering novel drug-target-disease relationships.
    Text Attribute Control via Closed-Loop Disentanglement. (arXiv:2312.00277v1 [cs.LG])
    Changing an attribute of a text without changing the content usually requires to first disentangle the text into irrelevant attributes and content representations. After that, in the inference phase, the representation of one attribute is tuned to a different value, expecting that the corresponding attribute of the text can also be changed accordingly. The usual way of disentanglement is to add some constraints on the latent space of an encoder-decoder architecture, including adversarial-based constraints and mutual-information-based constraints. However, the previous semi-supervised processes of attribute change are usually not enough to guarantee the success of attribute change and content preservation. In this paper, we propose a novel approach to achieve a robust control of attributes while enhancing content preservation. In this approach, we use a semi-supervised contrastive learning method to encourage the disentanglement of attributes in latent spaces. Differently from previous works, we re-disentangle the reconstructed sentence and compare the re-disentangled latent space with the original latent space, which makes a closed-loop disentanglement process. This also helps content preservation. In addition, the contrastive learning method is also able to replace the role of minimizing mutual information and adversarial training in the disentanglement process, which alleviates the computation cost. We conducted experiments on three text datasets, including the Yelp Service review dataset, the Amazon Product review dataset, and the GoEmotions dataset. The experimental results show the effectiveness of our model.
    Decision Tree Psychological Risk Assessment in Currency Trading. (arXiv:2311.15222v2 [cs.LG] UPDATED)
    This research paper focuses on the integration of Artificial Intelligence (AI) into the currency trading landscape, positing the development of personalized AI models, essentially functioning as intelligent personal assistants tailored to the idiosyncrasies of individual traders. The paper posits that AI models are capable of identifying nuanced patterns within the trader's historical data, facilitating a more accurate and insightful assessment of psychological risk dynamics in currency trading. The PRI is a dynamic metric that experiences fluctuations in response to market conditions that foster psychological fragility among traders. By employing sophisticated techniques, a classifying decision tree is crafted, enabling clearer decision-making boundaries within the tree structure. By incorporating the user's chronological trade entries, the model becomes adept at identifying critical junctures when psychological risks are heightened. The real-time nature of the calculations enhances the model's utility as a proactive tool, offering timely alerts to traders about impending moments of psychological risks. The implications of this research extend beyond the confines of currency trading, reaching into the realms of other industries where the judicious application of personalized modeling emerges as an efficient and strategic approach. This paper positions itself at the intersection of cutting-edge technology and the intricate nuances of human psychology, offering a transformative paradigm for decision making support in dynamic and high-pressure environments.
    REDUCR: Robust Data Downsampling Using Class Priority Reweighting. (arXiv:2312.00486v1 [cs.LG])
    Modern machine learning models are becoming increasingly expensive to train for real-world image and text classification tasks, where massive web-scale data is collected in a streaming fashion. To reduce the training cost, online batch selection techniques have been developed to choose the most informative datapoints. However, these techniques can suffer from poor worst-class generalization performance due to class imbalance and distributional shifts. This work introduces REDUCR, a robust and efficient data downsampling method that uses class priority reweighting. REDUCR reduces the training data while preserving worst-class generalization performance. REDUCR assigns priority weights to datapoints in a class-aware manner using an online learning algorithm. We demonstrate the data efficiency and robust performance of REDUCR on vision and text classification tasks. On web-scraped datasets with imbalanced class distributions, REDUCR significantly improves worst-class test accuracy (and average accuracy), surpassing state-of-the-art methods by around 15%.
    Exploring Factors Affecting Pedestrian Crash Severity Using TabNet: A Deep Learning Approach. (arXiv:2312.00066v1 [cs.LG])
    This study presents the first investigation of pedestrian crash severity using the TabNet model, a novel tabular deep learning method exceptionally suited for analyzing the tabular data inherent in transportation safety research. Through the application of TabNet to a comprehensive dataset from Utah covering the years 2010 to 2022, we uncover intricate factors contributing to pedestrian crash severity. The TabNet model, capitalizing on its compatibility with structured data, demonstrates remarkable predictive accuracy, eclipsing that of traditional models. It identifies critical variables, such as pedestrian age, involvement in left or right turns, lighting conditions, and alcohol consumption, which significantly influence crash outcomes. The utilization of SHapley Additive exPlanations (SHAP) enhances our ability to interpret the TabNet model's predictions, ensuring transparency and understandability in our deep learning approach. The insights derived from our analysis provide a valuable compass for transportation safety engineers and policymakers, enabling the identification of pivotal factors that affect pedestrian crash severity. Such knowledge is instrumental in formulating precise, data-driven interventions aimed at bolstering pedestrian safety across diverse urban and rural settings.
    Persona-Coded Poly-Encoder: Persona-Guided Multi-Stream Conversational Sentence Scoring. (arXiv:2309.16770v2 [cs.CL] UPDATED)
    Recent advances in machine learning and deep learning have led to the widespread use of Conversational AI in many practical applications. However, it is still very challenging to leverage auxiliary information that can provide conversational context or personalized tuning to improve the quality of conversations. For example, there has only been limited research on using an individuals persona information to improve conversation quality, and even state-of-the-art conversational AI techniques are unable to effectively leverage signals from heterogeneous sources of auxiliary data, such as multi-modal interaction data, demographics, SDOH data, etc. In this paper, we present a novel Persona-Coded Poly-Encoder method that leverages persona information in a multi-stream encoding scheme to improve the quality of response generation for conversations. To show the efficacy of the proposed method, we evaluate our method on two different persona-based conversational datasets, and compared against two state-of-the-art methods. Our experimental results and analysis demonstrate that our method can improve conversation quality over the baseline method Poly-Encoder by 3.32% and 2.94% in terms of BLEU score and HR@1, respectively. More significantly, our method offers a path to better utilization of multi-modal data in conversational tasks. Lastly, our study outlines several challenges and future research directions for advancing personalized conversational AI technology.
    Transfer Learning Enhanced Full Waveform Inversion. (arXiv:2302.11259v2 [cs.LG] UPDATED)
    We propose a way to favorably employ neural networks in the field of non-destructive testing using Full Waveform Inversion (FWI). The presented methodology discretizes the unknown material distribution in the domain with a neural network within an adjoint optimization. To further increase efficiency of the FWI, pretrained neural networks are used to provide a good starting point for the inversion. This reduces the number of iterations in the Full Waveform Inversion for specific, yet generalizable settings.
    Low latency optical-based mode tracking with machine learning deployed on FPGAs on a tokamak. (arXiv:2312.00128v1 [physics.plasm-ph])
    Active feedback control in magnetic confinement fusion devices is desirable to mitigate plasma instabilities and enable robust operation. Optical high-speed cameras provide a powerful, non-invasive diagnostic and can be suitable for these applications. In this study, we process fast camera data, at rates exceeding 100kfps, on $\textit{in situ}$ Field Programmable Gate Array (FPGA) hardware to track magnetohydrodynamic (MHD) mode evolution and generate control signals in real-time. Our system utilizes a convolutional neural network (CNN) model which predicts the $n$=1 MHD mode amplitude and phase using camera images with better accuracy than other tested non-deep-learning-based methods. By implementing this model directly within the standard FPGA readout hardware of the high-speed camera diagnostic, our mode tracking system achieves a total trigger-to-output latency of 17.6$\mu$s and a throughput of up to 120kfps. This study at the High Beta Tokamak-Extended Pulse (HBT-EP) experiment demonstrates an FPGA-based high-speed camera data acquisition and processing system, enabling application in real-time machine-learning-based tokamak diagnostic and control as well as potential applications in other scientific domains.
    Transfer learning for predicting source terms of principal component transport in chemically reactive flow. (arXiv:2312.00356v1 [physics.chem-ph])
    The objective of this study is to evaluate whether the number of requisite training samples can be reduced with the use of various transfer learning models for predicting, for example, the chemical source terms of the data-driven reduced-order model that represents the homogeneous ignition process of a hydrogen/air mixture. Principal component analysis is applied to reduce the dimensionality of the hydrogen/air mixture in composition space. Artificial neural networks (ANNs) are used to tabulate the reaction rates of principal components, and subsequently, a system of ordinary differential equations is solved. As the number of training samples decreases at the target task (i.e.,for T0 > 1000 K and various phi), the reduced-order model fails to predict the ignition evolution of a hydrogen/air mixture. Three transfer learning strategies are then applied to the training of the ANN model with a sparse dataset. The performance of the reduced-order model with a sparse dataset is found to be remarkably enhanced if the training of the ANN model is restricted by a regularization term that controls the degree of knowledge transfer from source to target tasks. To this end, a novel transfer learning method is introduced, parameter control via partial initialization and regularization (PaPIR), whereby the amount of knowledge transferred is systemically adjusted for the initialization and regularization of the ANN model in the target task. It is found that an additional performance gain can be achieved by changing the initialization scheme of the ANN model in the target task when the task similarity between source and target tasks is relatively low.
    Does a Neural Network Really Encode Symbolic Concepts?. (arXiv:2302.13080v2 [cs.LG] UPDATED)
    Recently, a series of studies have tried to extract interactions between input variables modeled by a DNN and define such interactions as concepts encoded by the DNN. However, strictly speaking, there still lacks a solid guarantee whether such interactions indeed represent meaningful concepts. Therefore, in this paper, we examine the trustworthiness of interaction concepts from four perspectives. Extensive empirical studies have verified that a well-trained DNN usually encodes sparse, transferable, and discriminative concepts, which is partially aligned with human intuition.
    Denoising Heat-inspired Diffusion with Insulators for Collision Free Motion Planning. (arXiv:2310.12609v2 [cs.RO] UPDATED)
    Diffusion models have risen as a powerful tool in robotics due to their flexibility and multi-modality. While some of these methods effectively address complex problems, they often depend heavily on inference-time obstacle detection and require additional equipment. Addressing these challenges, we present a method that, during inference time, simultaneously generates only reachable goals and plans motions that avoid obstacles, all from a single visual input. Central to our approach is the novel use of a collision-avoiding diffusion kernel for training. Through evaluations against behavior-cloning and classical diffusion models, our framework has proven its robustness. It is particularly effective in multi-modal environments, navigating toward goals and avoiding unreachable ones blocked by obstacles, while ensuring collision avoidance.
    Can LLMs Patch Security Issues?. (arXiv:2312.00024v1 [cs.CR])
    Large Language Models (LLMs) have shown impressive proficiency in code generation. Nonetheless, similar to human developers, these models might generate code that contains security vulnerabilities and flaws. Writing secure code remains a substantial challenge, as vulnerabilities often arise during interactions between programs and external systems or services, such as databases and operating systems. In this paper, we propose a novel approach, Feedback-Driven Solution Synthesis (FDSS), designed to explore the use of LLMs in receiving feedback from Bandit, which is a static code analysis tool, and then the LLMs generate potential solutions to resolve security vulnerabilities. Each solution, along with the vulnerable code, is then sent back to the LLM for code refinement. Our approach shows a significant improvement over the baseline and outperforms existing approaches. Furthermore, we introduce a new dataset, PythonSecurityEval, collected from real-world scenarios on Stack Overflow to evaluate the LLMs' ability to generate secure code. Code and data are available at \url{https://github.com/Kamel773/LLM-code-refine}
    Anti-Sexism Alert System: Identification of Sexist Comments on Social Media Using AI Techniques. (arXiv:2312.00053v1 [cs.CY])
    Social relationships in the digital sphere are becoming more usual and frequent, and they constitute a very important aspect for all of us. {Violent interactions in this sphere are very frequent, and have serious effects on the victims}. Within this global scenario, there is one kind of digital violence that is becoming really worrying: sexism against women. Sexist comments that are publicly posted in social media (newspaper comments, social networks, etc.), usually obtain a lot of attention and become viral, with consequent damage to the persons involved. In this paper, we introduce an anti-sexism alert system, based on natural language processing (NLP) and artificial intelligence (AI), that analyzes any public post, and decides if it could be considered a sexist comment or not. Additionally, this system also works on analyzing all the public comments linked to any multimedia content (piece of news, video, tweet, etc.) and decides, using a color-based system similar to traffic lights, if there is sexism in the global set of posts. We have created a labeled data set in Spanish, since the majority of studies focus on English, to train our system, which offers a very good performance after the validation experiments.
    Adaptive Parameter-Free Robust Learning using Latent Bernoulli Variables. (arXiv:2312.00585v1 [stat.ML])
    We present an efficient parameter-free approach for statistical learning from corrupted training sets. We identify corrupted and non-corrupted samples using latent Bernoulli variables, and therefore formulate the robust learning problem as maximization of the likelihood where latent variables are marginalized out. The resulting optimization problem is solved via variational inference using an efficient Expectation-Maximization based method. The proposed approach improves over the state-of-the-art by automatically inferring the corruption level and identifying outliers, while adding minimal computational overhead. We demonstrate our robust learning method on a wide variety of machine learning tasks including online learning and deep learning where it exhibits ability to adapt to different levels of noise and attain high prediction accuracy.
    Automatic Diagnosis of Myocarditis Disease in Cardiac MRI Modality using Deep Transformers and Explainable Artificial Intelligence. (arXiv:2210.14611v2 [cs.CV] UPDATED)
    Myocarditis is a significant cardiovascular disease (CVD) that poses a threat to the health of many individuals by causing damage to the myocardium. The occurrence of microbes and viruses, including the likes of HIV, plays a crucial role in the development of myocarditis disease (MCD). The images produced during cardiac magnetic resonance imaging (CMRI) scans are low contrast, which can make it challenging to diagnose cardiovascular diseases. In other hand, checking numerous CMRI slices for each CVD patient can be a challenging task for medical doctors. To overcome the existing challenges, researchers have suggested the use of artificial intelligence (AI)-based computer-aided diagnosis systems (CADS). The presented paper outlines a CADS for the detection of MCD from CMR images, utilizing deep learning (DL) methods. The proposed CADS consists of several steps, including dataset, preprocessing, feature extraction, classification, and post-processing. First, the Z-Alizadeh dataset was selected for the experiments. Subsequently, the CMR images underwent various preprocessing steps, including denoising, resizing, as well as data augmentation (DA) via CutMix and MixUp techniques. In the following, the most current deep pre-trained and transformer models are used for feature extraction and classification on the CMR images. The findings of our study reveal that transformer models exhibit superior performance in detecting MCD as opposed to pre-trained architectures. In terms of DL architectures, the Turbulence Neural Transformer (TNT) model exhibited impressive accuracy, reaching 99.73% utilizing a 10-fold cross-validation approach. Additionally, to pinpoint areas of suspicion for MCD in CMRI images, the Explainable-based Grad Cam method was employed.
    Uncertainty Estimation and Out-of-Distribution Detection for Deep Learning-Based Image Reconstruction using the Local Lipschitz. (arXiv:2305.07618v3 [cs.CV] UPDATED)
    Accurate image reconstruction is at the heart of diagnostics in medical imaging. Supervised deep learning-based approaches have been investigated for solving inverse problems including image reconstruction. However, these trained models encounter unseen data distributions that are widely shifted from training data during deployment. Therefore, it is essential to assess whether a given input falls within the training data distribution for diagnostic purposes. Uncertainty estimation approaches exist but focus on providing an uncertainty map to radiologists, rather than assessing the training distribution fit. In this work, we propose a method based on the local Lipschitz-based metric to distinguish out-of-distribution images from in-distribution with an area under the curve of 99.94%. Empirically, we demonstrate a very strong relationship between the local Lipschitz value and mean absolute error (MAE), supported by a high Spearman's rank correlation coefficient of 0.8475, which determines the uncertainty estimation threshold for optimal model performance. Through the identification of false positives, the local Lipschitz and MAE relationship was used to guide data augmentation and reduce model uncertainty. Our study was validated using the AUTOMAP architecture for sensor-to-image Magnetic Resonance Imaging (MRI) reconstruction. We compare our proposed approach with baseline methods: Monte-Carlo dropout and deep ensembles, and further analysis included MRI denoising and Computed Tomography (CT) sparse-to-full view reconstruction using UNET architectures. We show that our approach is applicable to various architectures and learned functions, especially in the realm of medical image reconstruction, where preserving the diagnostic accuracy of reconstructed images remains paramount.
    A Quality-of-Service Compliance System using Federated Learning and Optimistic Rollups. (arXiv:2312.00026v1 [cs.CR])
    Edge computing brings a new paradigm in which the sharing of computing, storage, and bandwidth resources as close as possible to the mobile devices or sensors generating a large amount of data. A parallel trend is the rise of phones and tablets as primary computing devices for many people. The powerful sensors present on these devices combined with the fact that they are mobile, mean they have access to data of an unprecedentedly diverse and private nature. Models learned on such data hold the promise of greatly improving usability by powering more intelligent applications, but the sensitive nature of the data means there are risks and responsibilities to storing it in a centralized location. To address the data privacy required for some data in these devices we propose the use of Federated Learning (FL) so that specific data about services performed by clients do not leave the source machines. Instead of sharing data, users collaboratively train a model by only sending weight updates to a server. However, the naive use of FL in those scenarios exposes it to a risk of corruption, whether intentional or not, during the training phase. To improve the security of the FL structure, we propose a decentralized Blockchain-based FL in an edge computing scenario. We also apply blockchain to create a reward mechanism in FL to enable incentive strategy for trainers.
    Revolutionizing Forensic Toolmark Analysis: An Objective and Transparent Comparison Algorithm. (arXiv:2312.00032v1 [cs.CR])
    Forensic toolmark comparisons are currently performed subjectively by humans, which leads to a lack of consistency and accuracy. There is little evidence that examiners can determine whether pairs of marks were made by the same tool or different tools. There is also little evidence that they can make this classification when marks are made under different conditions, such as different angles of attack or direction of mark generation. We generate original toolmark data in 3D, extract the signal from each toolmarks, and train an algorithm to compare toolmark signals objectively. We find that toolmark signals cluster by tool, and not by angle or direction. That is, the variability within tool, regardless of angle/direction, is smaller than the variability between tools. The known-match and known-non-match densities of the similarities of pairs of marks have a small overlap, even when accounting for dependencies in the data, making them a useful instrument for determining whether a new pair of marks was made by the same tool. We provide a likelihood ratio approach as a formal method for comparing toolmark signals with a measure of uncertainty. This empirically trained, open-source method can be used by forensic examiners to compare toolmarks objectively and thus improve the reliability of toolmark comparisons. This can, in turn, reduce miscarriages of justice in the criminal justice system.
    Robust Concept Erasure via Kernelized Rate-Distortion Maximization. (arXiv:2312.00194v1 [cs.LG])
    Distributed representations provide a vector space that captures meaningful relationships between data instances. The distributed nature of these representations, however, entangles together multiple attributes or concepts of data instances (e.g., the topic or sentiment of a text, characteristics of the author (age, gender, etc), etc). Recent work has proposed the task of concept erasure, in which rather than making a concept predictable, the goal is to remove an attribute from distributed representations while retaining other information from the original representation space as much as possible. In this paper, we propose a new distance metric learning-based objective, the Kernelized Rate-Distortion Maximizer (KRaM), for performing concept erasure. KRaM fits a transformation of representations to match a specified distance measure (defined by a labeled concept to erase) using a modified rate-distortion function. Specifically, KRaM's objective function aims to make instances with similar concept labels dissimilar in the learned representation space while retaining other information. We find that optimizing KRaM effectively erases various types of concepts: categorical, continuous, and vector-valued variables from data representations across diverse domains. We also provide a theoretical analysis of several properties of KRaM's objective. To assess the quality of the learned representations, we propose an alignment score to evaluate their similarity with the original representation space. Additionally, we conduct experiments to showcase KRaM's efficacy in various settings, from erasing binary gender variables in word embeddings to vector-valued variables in GPT-3 representations.
    Practical Path-based Bayesian Optimization. (arXiv:2312.00622v1 [cs.LG])
    There has been a surge in interest in data-driven experimental design with applications to chemical engineering and drug manufacturing. Bayesian optimization (BO) has proven to be adaptable to such cases, since we can model the reactions of interest as expensive black-box functions. Sometimes, the cost of this black-box functions can be separated into two parts: (a) the cost of the experiment itself, and (b) the cost of changing the input parameters. In this short paper, we extend the SnAKe algorithm to deal with both types of costs simultaneously. We further propose extensions to the case of a maximum allowable input change, as well as to the multi-objective setting.
    Interpretable Meta-Learning of Physical Systems. (arXiv:2312.00477v1 [cs.LG])
    Machine learning methods can be a valuable aid in the scientific process, but they need to face challenging settings where data come from inhomogeneous experimental conditions. Recent meta-learning methods have made significant progress in multi-task learning, but they rely on black-box neural networks, resulting in high computational costs and limited interpretability. Leveraging the structure of the learning problem, we argue that multi-environment generalization can be achieved using a simpler learning model, with an affine structure with respect to the learning task. Crucially, we prove that this architecture can identify the physical parameters of the system, enabling interpreable learning. We demonstrate the competitive generalization performance and the low computational cost of our method by comparing it to state-of-the-art algorithms on physical systems, ranging from toy models to complex, non-analytical systems. The interpretability of our method is illustrated with original applications to physical-parameter-induced adaptation and to adaptive control.
    VEXIR2Vec: An Architecture-Neutral Embedding Framework for Binary Similarity. (arXiv:2312.00507v1 [cs.PL])
    We propose VEXIR2Vec, a code embedding framework for finding similar functions in binaries. Our representations rely on VEX IR, the intermediate representation used by binary analysis tools like Valgrind and angr. Our proposed embeddings encode both syntactic and semantic information to represent a function, and is both application and architecture independent. We also propose POV, a custom Peephole Optimization engine that normalizes the VEX IR for effective similarity analysis. We design several optimizations like copy/constant propagation, constant folding, common subexpression elimination and load-store elimination in POV. We evaluate our framework on two experiments -- diffing and searching -- involving binaries targeting different architectures, compiled using different compilers and versions, optimization sequences, and obfuscations. We show results on several standard projects and on real-world vulnerabilities. Our results show that VEXIR2Vec achieves superior precision and recall values compared to the state-of-the-art works. Our framework is highly scalable and is built as a multi-threaded, parallel library by only using open-source tools. VEXIR2Vec achieves about $3.2 \times$ speedup on the closest competitor, and orders-of-magnitude speedup on other tools.
    MultiView Independent Component Analysis with Delays. (arXiv:2312.00484v1 [cs.LG])
    Linear Independent Component Analysis (ICA) is a blind source separation technique that has been used in various domains to identify independent latent sources from observed signals. In order to obtain a higher signal-to-noise ratio, the presence of multiple views of the same sources can be used. In this work, we present MultiView Independent Component Analysis with Delays (MVICAD). This algorithm builds on the MultiView ICA model by allowing sources to be delayed versions of some shared sources: sources are shared across views up to some unknown latencies that are view- and source-specific. Using simulations, we demonstrate that MVICAD leads to better unmixing of the sources. Moreover, as ICA is often used in neuroscience, we show that latencies are age-related when applied to Cam-CAN, a large-scale magnetoencephalography (MEG) dataset. These results demonstrate that the MVICAD model can reveal rich effects on neural signals without human supervision.
    CellMixer: Annotation-free Semantic Cell Segmentation of Heterogeneous Cell Populations. (arXiv:2312.00671v1 [cs.CV])
    In recent years, several unsupervised cell segmentation methods have been presented, trying to omit the requirement of laborious pixel-level annotations for the training of a cell segmentation model. Most if not all of these methods handle the instance segmentation task by focusing on the detection of different cell instances ignoring their type. While such models prove adequate for certain tasks, like cell counting, other applications require the identification of each cell's type. In this paper, we present CellMixer, an innovative annotation-free approach for the semantic segmentation of heterogeneous cell populations. Our augmentation-based method enables the training of a segmentation model from image-level labels of homogeneous cell populations. Our results show that CellMixer can achieve competitive segmentation performance across multiple cell types and imaging modalities, demonstrating the method's scalability and potential for broader applications in medical imaging, cellular biology, and diagnostics.
    Auto-encoding GPS data to reveal individual and collective behaviour. (arXiv:2312.00456v1 [stat.AP])
    We propose an innovative and generic methodology to analyse individual and collective behaviour through individual trajectory data. The work is motivated by the analysis of GPS trajectories of fishing vessels collected from regulatory tracking data in the context of marine biodiversity conservation and ecosystem-based fisheries management. We build a low-dimensional latent representation of trajectories using convolutional neural networks as non-linear mapping. This is done by training a conditional variational auto-encoder taking into account covariates. The posterior distributions of the latent representations can be linked to the characteristics of the actual trajectories. The latent distributions of the trajectories are compared with the Bhattacharyya coefficient, which is well-suited for comparing distributions. Using this coefficient, we analyse the variation of the individual behaviour of each vessel during time. For collective behaviour analysis, we build proximity graphs and use an extension of the stochastic block model for multiple networks. This model results in a clustering of the individuals based on their set of trajectories. The application to French fishing vessels enables us to obtain groups of vessels whose individual and collective behaviours exhibit spatio-temporal patterns over the period 2014-2018.
    Sample Efficient Reinforcement Learning from Human Feedback via Active Exploration. (arXiv:2312.00267v1 [cs.LG])
    Preference-based feedback is important for many applications in reinforcement learning where direct evaluation of a reward function is not feasible. A notable recent example arises in reinforcement learning from human feedback (RLHF) on large language models. For many applications of RLHF, the cost of acquiring the human feedback can be substantial. In this work, we take advantage of the fact that one can often choose contexts at which to obtain human feedback in order to most efficiently identify a good policy, and formalize this as an offline contextual dueling bandit problem. We give an upper-confidence-bound style algorithm for this problem and prove a polynomial worst-case regret bound. We then provide empirical confirmation in a synthetic setting that our approach outperforms existing methods. After, we extend the setting and methodology for practical use in RLHF training of large language models. Here, our method is able to reach better performance with fewer samples of human preferences than multiple baselines on three real-world datasets.
    Universal Backdoor Attacks. (arXiv:2312.00157v1 [cs.LG])
    Web-scraped datasets are vulnerable to data poisoning, which can be used for backdooring deep image classifiers during training. Since training on large datasets is expensive, a model is trained once and re-used many times. Unlike adversarial examples, backdoor attacks often target specific classes rather than any class learned by the model. One might expect that targeting many classes through a naive composition of attacks vastly increases the number of poison samples. We show this is not necessarily true and more efficient, universal data poisoning attacks exist that allow controlling misclassifications from any source class into any target class with a small increase in poison samples. Our idea is to generate triggers with salient characteristics that the model can learn. The triggers we craft exploit a phenomenon we call inter-class poison transferability, where learning a trigger from one class makes the model more vulnerable to learning triggers for other classes. We demonstrate the effectiveness and robustness of our universal backdoor attacks by controlling models with up to 6,000 classes while poisoning only 0.15% of the training dataset.
    Streaming Bayesian Modeling for predicting Fat-Tailed Customer Lifetime Value. (arXiv:2312.00373v1 [cs.LG])
    We develop an online learning MCMC approach applicable for hierarchical bayesian models and GLMS. We also develop a fat-tailed LTV model that generalizes over several kinds of fat and thin tails. We demonstrate both developments on commercial LTV data from a large mobile app.
    BAM-DETR: Boundary-Aligned Moment Detection Transformer for Temporal Sentence Grounding in Videos. (arXiv:2312.00083v1 [cs.CV])
    Temporal sentence grounding aims to localize moments relevant to a language description. Recently, DETR-like approaches have shown notable progress by decoding the center and length of a target moment from learnable queries. However, they suffer from the issue of center misalignment raised by the inherent ambiguity of moment centers, leading to inaccurate predictions. To remedy this problem, we introduce a novel boundary-oriented moment formulation. In our paradigm, the model no longer needs to find the precise center but instead suffices to predict any anchor point within the interval, from which the onset and offset are directly estimated. Based on this idea, we design a Boundary-Aligned Moment Detection Transformer (BAM-DETR), equipped with a dual-pathway decoding process. Specifically, it refines the anchor and boundaries within parallel pathways using global and boundary-focused attention, respectively. This separate design allows the model to focus on desirable regions, enabling precise refinement of moment predictions. Further, we propose a quality-based ranking method, ensuring that proposals with high localization qualities are prioritized over incomplete ones. Extensive experiments verify the advantages of our methods, where our model records new state-of-the-art results on three benchmarks. Code is at https://github.com/Pilhyeon/BAM-DETR.
    ChebNet: Efficient and Stable Constructions of Deep Neural Networks with Rectified Power Units via Chebyshev Approximations. (arXiv:1911.05467v3 [cs.LG] UPDATED)
    In a previous study [B. Li, S. Tang and H. Yu, Commun. Comput. Phy. 27(2):379-411, 2020], it is shown that deep neural networks built with rectified power units (RePU) as activation functions can give better approximation for sufficient smooth functions than those built with rectified linear units, by converting polynomial approximations using power series into deep neural networks with optimal complexity and no approximation error. However, in practice, power series approximations are not easy to obtain due to the associated stability issue. In this paper, we propose a new and more stable way to construct RePU deep neural networks based on Chebyshev polynomial approximations. By using a hierarchical structure of Chebyshev polynomial approximation in frequency domain, we obtain efficient and stable deep neural network construction, which we call ChebNet. The approximation of smooth functions by ChebNets is no worse than the approximation by deep RePU nets using power series. On the same time, ChebNets are much more stable. Numerical results show that the constructed ChebNets can be further fine-tuned to obtain much better results than those obtained by tuning deep RePU nets constructed by power series approach. As spectral accuracy is hard to obtain by direct training of deep neural networks, ChebNets provide a practical way to obtain spectral accuracy, it is expected to be useful in real applications that require efficient approximations of smooth functions.
    Backbone-based Dynamic Graph Spatio-Temporal Network for Epidemic Forecasting. (arXiv:2312.00485v1 [cs.LG])
    Accurate epidemic forecasting is a critical task in controlling disease transmission. Many deep learning-based models focus only on static or dynamic graphs when constructing spatial information, ignoring their relationship. Additionally, these models often rely on recurrent structures, which can lead to error accumulation and computational time consumption. To address the aforementioned problems, we propose a novel model called Backbone-based Dynamic Graph Spatio-Temporal Network (BDGSTN). Intuitively, the continuous and smooth changes in graph structure, make adjacent graph structures share a basic pattern. To capture this property, we use adaptive methods to generate static backbone graphs containing the primary information and temporal models to generate dynamic temporal graphs of epidemic data, fusing them to generate a backbone-based dynamic graph. To overcome potential limitations associated with recurrent structures, we introduce a linear model DLinear to handle temporal dependencies and combine it with dynamic graph convolution for epidemic forecasting. Extensive experiments on two datasets demonstrate that BDGSTN outperforms baseline models and ablation comparison further verifies the effectiveness of model components. Furthermore, we analyze and measure the significance of backbone and temporal graphs by using information metrics from different aspects. Finally, we compare model parameter volume and training time to confirm the superior complexity and efficiency of BDGSTN.
    Academic competitions. (arXiv:2312.00268v1 [cs.LG])
    Academic challenges comprise effective means for (i) advancing the state of the art, (ii) putting in the spotlight of a scientific community specific topics and problems, as well as (iii) closing the gap for under represented communities in terms of accessing and participating in the shaping of research fields. Competitions can be traced back for centuries and their achievements have had great influence in our modern world. Recently, they (re)gained popularity, with the overwhelming amounts of data that is being generated in different domains, as well as the need of pushing the barriers of existing methods, and available tools to handle such data. This chapter provides a survey of academic challenges in the context of machine learning and related fields. We review the most influential competitions in the last few years and analyze challenges per area of knowledge. The aims of scientific challenges, their goals, major achievements and expectations for the next few years are reviewed.
    Uncertainty in Graph Contrastive Learning with Bayesian Neural Networks. (arXiv:2312.00232v1 [cs.LG])
    Graph contrastive learning has shown great promise when labeled data is scarce, but large unlabeled datasets are available. However, it often does not take uncertainty estimation into account. We show that a variational Bayesian neural network approach can be used to improve not only the uncertainty estimates but also the downstream performance on semi-supervised node-classification tasks. Moreover, we propose a new measure of uncertainty for contrastive learning, that is based on the disagreement in likelihood due to different positive samples.
    LEAP: LLM-Generation of Egocentric Action Programs. (arXiv:2312.00055v1 [cs.CV])
    We introduce LEAP (illustrated in Figure 1), a novel method for generating video-grounded action programs through use of a Large Language Model (LLM). These action programs represent the motoric, perceptual, and structural aspects of action, and consist of sub-actions, pre- and post-conditions, and control flows. LEAP's action programs are centered on egocentric video and employ recent developments in LLMs both as a source for program knowledge and as an aggregator and assessor of multimodal video information. We apply LEAP over a majority (87\%) of the training set of the EPIC Kitchens dataset, and release the resulting action programs as a publicly available dataset here (https://drive.google.com/drive/folders/1Cpkw_TI1IIxXdzor0pOXG3rWJWuKU5Ex?usp=drive_link). We employ LEAP as a secondary source of supervision, using its action programs in a loss term applied to action recognition and anticipation networks. We demonstrate sizable improvements in performance in both tasks due to training with the LEAP dataset. Our method achieves 1st place on the EPIC Kitchens Action Recognition leaderboard as of November 17 among the networks restricted to RGB-input (see Supplementary Materials).
    Elijah: Eliminating Backdoors Injected in Diffusion Models via Distribution Shift. (arXiv:2312.00050v1 [cs.CR])
    Diffusion models (DM) have become state-of-the-art generative models because of their capability to generate high-quality images from noises without adversarial training. However, they are vulnerable to backdoor attacks as reported by recent studies. When a data input (e.g., some Gaussian noise) is stamped with a trigger (e.g., a white patch), the backdoored model always generates the target image (e.g., an improper photo). However, effective defense strategies to mitigate backdoors from DMs are underexplored. To bridge this gap, we propose the first backdoor detection and removal framework for DMs. We evaluate our framework Elijah on hundreds of DMs of 3 types including DDPM, NCSN and LDM, with 13 samplers against 3 existing backdoor attacks. Extensive experiments show that our approach can have close to 100% detection accuracy and reduce the backdoor effects to close to zero without significantly sacrificing the model utility.
    Dancing with Images: Video Distillation via Static-Dynamic Disentanglement. (arXiv:2312.00362v1 [cs.CV])
    Recently, dataset distillation has paved the way towards efficient machine learning, especially for image datasets. However, the distillation for videos, characterized by an exclusive temporal dimension, remains an underexplored domain. In this work, we provide the first systematic study of video distillation and introduce a taxonomy to categorize temporal compression. Our investigation reveals that the temporal information is usually not well learned during distillation , and the temporal dimension of synthetic data contributes little. The observations motivate our unified framework of disentangling the dynamic and static information in the videos. It first distills the videos into still images as static memory and then compensates the dynamic and motion information with a learnable dynamic memory block. Our method achieves state-of-the-art on video datasets at different scales, with notably smaller storage expenditure. Our code will be publicly available.
    Learning Causally Disentangled Representations via the Principle of Independent Causal Mechanisms. (arXiv:2306.01213v2 [cs.LG] UPDATED)
    Learning disentangled causal representations is a challenging problem that has gained significant attention recently due to its implications for extracting meaningful information for downstream tasks. In this work, we define a new notion of causal disentanglement from the perspective of independent causal mechanisms. We propose ICM-VAE, a framework for learning causally disentangled representations supervised by causally related observed labels. We model causal mechanisms using learnable flow-based diffeomorphic functions to map noise variables to latent causal variables. Further, to promote the disentanglement of causal factors, we propose a causal disentanglement prior that utilizes the known causal structure to encourage learning a causally factorized distribution in the latent space. Under relatively mild conditions, we provide theoretical results showing the identifiability of causal factors and mechanisms up to permutation and elementwise reparameterization. We empirically demonstrate that our framework induces highly disentangled causal factors, improves interventional robustness, and is compatible with counterfactual generation.
    Acoustic Cybersecurity: Exploiting Voice-Activated Systems. (arXiv:2312.00039v1 [cs.CR])
    In this study, we investigate the emerging threat of inaudible acoustic attacks targeting digital voice assistants, a critical concern given their projected prevalence to exceed the global population by 2024. Our research extends the feasibility of these attacks across various platforms like Amazon's Alexa, Android, iOS, and Cortana, revealing significant vulnerabilities in smart devices. The twelve attack vectors identified include successful manipulation of smart home devices and automotive systems, potential breaches in military communication, and challenges in critical infrastructure security. We quantitatively show that attack success rates hover around 60%, with the ability to activate devices remotely from over 100 feet away. Additionally, these attacks threaten critical infrastructure, emphasizing the need for multifaceted defensive strategies combining acoustic shielding, advanced signal processing, machine learning, and robust user authentication to mitigate these risks.
    MuseChat: A Conversational Music Recommendation System for Videos. (arXiv:2310.06282v3 [cs.LG] UPDATED)
    Music recommendation for videos attracts growing interest in multi-modal research. However, existing systems focus primarily on content compatibility, often ignoring the users' preferences. Their inability to interact with users for further refinements or to provide explanations leads to a less satisfying experience. We address these issues with MuseChat, a first-of-its-kind dialogue-based recommendation system that personalizes music suggestions for videos. Our system consists of two key functionalities with associated modules: recommendation and reasoning. The recommendation module takes a video along with optional information including previous suggested music and user's preference as inputs and retrieves an appropriate music matching the context. The reasoning module, equipped with the power of Large Language Model (Vicuna-7B) and extended to multi-modal inputs, is able to provide reasonable explanation for the recommended music. To evaluate the effectiveness of MuseChat, we build a large-scale dataset, conversational music recommendation for videos, that simulates a two-turn interaction between a user and a recommender based on accurate music track information. Experiment results show that MuseChat achieves significant improvements over existing video-based music retrieval methods as well as offers strong interpretability and interactability.
    SpaCE: The Spatial Confounding Environment. (arXiv:2312.00710v1 [cs.LG])
    Spatial confounding poses a significant challenge in scientific studies involving spatial data, where unobserved spatial variables can influence both treatment and outcome, possibly leading to spurious associations. To address this problem, we introduce SpaCE: The Spatial Confounding Environment, the first toolkit to provide realistic benchmark datasets and tools for systematically evaluating causal inference methods designed to alleviate spatial confounding. Each dataset includes training data, true counterfactuals, a spatial graph with coordinates, and smoothness and confounding scores characterizing the effect of a missing spatial confounder. It also includes realistic semi-synthetic outcomes and counterfactuals, generated using state-of-the-art machine learning ensembles, following best practices for causal inference benchmarks. The datasets cover real treatment and covariates from diverse domains, including climate, health and social sciences. SpaCE facilitates an automated end-to-end pipeline, simplifying data loading, experimental setup, and evaluating machine learning and causal inference models. The SpaCE project provides several dozens of datasets of diverse sizes and spatial complexity. It is publicly available as a Python package, encouraging community feedback and contributions.
    Explainable Fraud Detection with Deep Symbolic Classification. (arXiv:2312.00586v1 [cs.LG])
    There is a growing demand for explainable, transparent, and data-driven models within the domain of fraud detection. Decisions made by fraud detection models need to be explainable in the event of a customer dispute. Additionally, the decision-making process in the model must be transparent to win the trust of regulators and business stakeholders. At the same time, fraud detection solutions can benefit from data due to the noisy, dynamic nature of fraud and the availability of large historical data sets. Finally, fraud detection is notorious for its class imbalance: there are typically several orders of magnitude more legitimate transactions than fraudulent ones. In this paper, we present Deep Symbolic Classification (DSC), an extension of the Deep Symbolic Regression framework to classification problems. DSC casts classification as a search problem in the space of all analytic functions composed of a vocabulary of variables, constants, and operations and optimizes for an arbitrary evaluation metric directly. The search is guided by a deep neural network trained with reinforcement learning. Because the functions are mathematical expressions that are in closed-form and concise, the model is inherently explainable both at the level of a single classification decision and the model's decision process. Furthermore, the class imbalance problem is successfully addressed by optimizing for metrics that are robust to class imbalance such as the F1 score. This eliminates the need for oversampling and undersampling techniques that plague traditional approaches. Finally, the model allows to explicitly balance between the prediction accuracy and the explainability. An evaluation on the PaySim data set demonstrates competitive predictive performance with state-of-the-art models, while surpassing them in terms of explainability. This establishes DSC as a promising model for fraud detection systems.
    Improving the Robustness of Quantized Deep Neural Networks to White-Box Attacks using Stochastic Quantization and Information-Theoretic Ensemble Training. (arXiv:2312.00105v1 [cs.CV])
    Most real-world applications that employ deep neural networks (DNNs) quantize them to low precision to reduce the compute needs. We present a method to improve the robustness of quantized DNNs to white-box adversarial attacks. We first tackle the limitation of deterministic quantization to fixed ``bins'' by introducing a differentiable Stochastic Quantizer (SQ). We explore the hypothesis that different quantizations may collectively be more robust than each quantized DNN. We formulate a training objective to encourage different quantized DNNs to learn different representations of the input image. The training objective captures diversity and accuracy via mutual information between ensemble members. Through experimentation, we demonstrate substantial improvement in robustness against $L_\infty$ attacks even if the attacker is allowed to backpropagate through SQ (e.g., > 50\% accuracy to PGD(5/255) on CIFAR10 without adversarial training), compared to vanilla DNNs as well as existing ensembles of quantized DNNs. We extend the method to detect attacks and generate robustness profiles in the adversarial information plane (AIP), towards a unified analysis of different threat models by correlating the MI and accuracy.
    Upper and lower bounds for the Lipschitz constant of random neural networks. (arXiv:2311.01356v2 [stat.ML] UPDATED)
    Empirical studies have widely demonstrated that neural networks are highly sensitive to small, adversarial perturbations of the input. The worst-case robustness against these so-called adversarial examples can be quantified by the Lipschitz constant of the neural network. In this paper, we study upper and lower bounds for the Lipschitz constant of random ReLU neural networks. Specifically, we assume that the weights and biases follow a generalization of the He initialization, where general symmetric distributions for the biases are permitted. For shallow neural networks, we characterize the Lipschitz constant up to an absolute numerical constant. For deep networks with fixed depth and sufficiently large width, our established bounds differ by a factor that is logarithmic in the width.
    Improving Normalization with the James-Stein Estimator. (arXiv:2312.00313v1 [cs.CV])
    Stein's paradox holds considerable sway in high-dimensional statistics, highlighting that the sample mean, traditionally considered the de facto estimator, might not be the most efficacious in higher dimensions. To address this, the James-Stein estimator proposes an enhancement by steering the sample means toward a more centralized mean vector. In this paper, first, we establish that normalization layers in deep learning use inadmissible estimators for mean and variance. Next, we introduce a novel method to employ the James-Stein estimator to improve the estimation of mean and variance within normalization layers. We evaluate our method on different computer vision tasks: image classification, semantic segmentation, and 3D object classification. Through these evaluations, it is evident that our improved normalization layers consistently yield superior accuracy across all tasks without extra computational burden. Moreover, recognizing that a plethora of shrinkage estimators surpass the traditional estimator in performance, we study two other prominent shrinkage estimators: Ridge and LASSO. Additionally, we provide visual representations to intuitively demonstrate the impact of shrinkage on the estimated layer statistics. Finally, we study the effect of regularization and batch size on our modified batch normalization. The studies show that our method is less sensitive to batch size and regularization, improving accuracy under various setups.
    Flow Matching Beyond Kinematics: Generating Jets with Particle-ID and Trajectory Displacement Information. (arXiv:2312.00123v1 [hep-ph])
    We introduce the first generative model trained on the JetClass dataset. Our model generates jets at the constituent level, and it is a permutation-equivariant continuous normalizing flow (CNF) trained with the flow matching technique. It is conditioned on the jet type, so that a single model can be used to generate the ten different jet types of JetClass. For the first time, we also introduce a generative model that goes beyond the kinematic features of jet constituents. The JetClass dataset includes more features, such as particle-ID and track impact parameter, and we demonstrate that our CNF can accurately model all of these additional features as well. Our generative model for JetClass expands on the versatility of existing jet generation techniques, enhancing their potential utility in high-energy physics research, and offering a more comprehensive understanding of the generated jets.
    RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback. (arXiv:2309.00267v2 [cs.CL] UPDATED)
    Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models (LLMs) with human preferences. However, gathering high-quality human preference labels can be a time-consuming and expensive endeavor. RL from AI Feedback (RLAIF), introduced by Bai et al., offers a promising alternative that leverages a powerful off-the-shelf LLM to generate preferences in lieu of human annotators. Across the tasks of summarization, helpful dialogue generation, and harmless dialogue generation, RLAIF achieves comparable or superior performance to RLHF, as rated by human evaluators. Furthermore, RLAIF demonstrates the ability to outperform a supervised fine-tuned baseline even when the LLM preference labeler is the same size as the policy. In another experiment, directly prompting the LLM for reward scores achieves superior performance to the canonical RLAIF setup, where LLM preference labels are first distilled into a reward model. Finally, we conduct extensive studies on techniques for generating aligned AI preferences. Our results suggest that RLAIF can achieve human-level performance, offering a potential solution to the scalability limitations of RLHF.
    Age-Based Scheduling for Mobile Edge Computing: A Deep Reinforcement Learning Approach. (arXiv:2312.00279v1 [cs.LG])
    With the rapid development of Mobile Edge Computing (MEC), various real-time applications have been deployed to benefit people's daily lives. The performance of these applications relies heavily on the freshness of collected environmental information, which can be quantified by its Age of Information (AoI). In the traditional definition of AoI, it is assumed that the status information can be actively sampled and directly used. However, for many MEC-enabled applications, the desired status information is updated in an event-driven manner and necessitates data processing. To better serve these applications, we propose a new definition of AoI and, based on the redefined AoI, we formulate an online AoI minimization problem for MEC systems. Notably, the problem can be interpreted as a Markov Decision Process (MDP), thus enabling its solution through Reinforcement Learning (RL) algorithms. Nevertheless, the traditional RL algorithms are designed for MDPs with completely unknown system dynamics and hence usually suffer long convergence times. To accelerate the learning process, we introduce Post-Decision States (PDSs) to exploit the partial knowledge of the system's dynamics. We also combine PDSs with deep RL to further improve the algorithm's applicability, scalability, and robustness. Numerical results demonstrate that our algorithm outperforms the benchmarks under various scenarios.
    Is Inverse Reinforcement Learning Harder than Standard Reinforcement Learning?. (arXiv:2312.00054v1 [stat.ML])
    Inverse Reinforcement Learning (IRL) -- the problem of learning reward functions from demonstrations of an \emph{expert policy} -- plays a critical role in developing intelligent systems, such as those that understand and imitate human behavior. While widely used in applications, theoretical understandings of IRL admit unique challenges and remain less developed compared with standard RL theory. For example, it remains open how to do IRL efficiently in standard \emph{offline} settings with pre-collected data, where states are obtained from a \emph{behavior policy} (which could be the expert policy itself), and actions are sampled from the expert policy. This paper provides the first line of results for efficient IRL in vanilla offline and online settings using polynomial samples and runtime. We first design a new IRL algorithm for the offline setting, Reward Learning with Pessimism (RLP), and show that it achieves polynomial sample complexity in terms of the size of the MDP, a concentrability coefficient between the behavior policy and the expert policy, and the desired accuracy. Building on RLP, we further design an algorithm Reward Learning with Exploration (RLE), which operates in a natural online setting where the learner can both actively explore the environment and query the expert policy, and obtain a stronger notion of IRL guarantee from polynomial samples. We establish sample complexity lower bounds for both settings showing that RLP and RLE are nearly optimal. Finally, as an application, we show that the learned reward functions can \emph{transfer} to another target MDP with suitable guarantees when the target MDP satisfies certain similarity assumptions with the original (source) MDP.
    FBChain: A Blockchain-based Federated Learning Model with Efficiency and Secure Communication. (arXiv:2312.00035v1 [cs.CR])
    Privacy and security in the parameter transmission process of federated learning are currently among the most prominent concerns. However, there are two thorny problems caused by unprotected communication methods: "parameter-leakage" and "inefficient-communication". This article proposes Blockchain-based Federated Learning (FBChain) model for federated learning parameter communication to overcome the above two problems. First, we utilize the immutability of blockchain to store the global model and hash value of local model parameters in case of tampering during the communication process, protect data privacy by encrypting parameters, and verify data consistency by comparing the hash values of local parameters, thus addressing the "parameter-leakage" problem. Second, the Proof of Weighted Link Speed (PoWLS) consensus algorithm comprehensively selects nodes with the higher weighted link speed to aggregate global model and package blocks, thereby solving the "inefficient-communication" problem. Experimental results demonstrate the effectiveness of our proposed FBChain model and its ability to improve model communication efficiency in federated learning.
    Diffusion Models for Wireless Communications. (arXiv:2310.07312v3 [cs.IT] UPDATED)
    Innovative foundation models, such as GPT-4 and stable diffusion models, have made a paradigm shift in the realm of artificial intelligence (AI) towards generative AI-based systems. AI and machine learning (AI/ML) algorithms are envisioned to be pervasively incorporated into the future wireless communications systems. In this article, we outline the applications of diffusion models in wireless communication systems, which are a new family of probabilistic generative models that have showcased state-of-the-art performance. The key idea is to decompose data generation process over "denoising" steps, gradually generating samples out of noise. Based on two case studies presented, we show how diffusion models can be employed for the development of resilient AI-native communication systems. Specifically, we propose denoising diffusion probabilistic models (DDPM) for a wireless communication scheme with non-ideal transceivers, where 30% improvement is achieved in terms of bit error rate. In the other example, DDPM is employed at the transmitter to shape the constellation symbols, highlighting a robust out-of-distribution performance.
    Efficient Pauli channel estimation with logarithmic quantum memory. (arXiv:2309.14326v2 [quant-ph] UPDATED)
    Here we revisit one of the prototypical tasks for characterizing the structure of noise in quantum devices: estimating every eigenvalue of an $n$-qubit Pauli noise channel to error $\epsilon$. Prior work (Chen et al., 2022) proved no-go theorems for this task in the practical regime where one has a limited amount of quantum memory, e.g. any protocol with $\le 0.99n$ ancilla qubits of quantum memory must make exponentially many measurements, provided it is non-concatenating. Such protocols can only interact with the channel by repeatedly preparing a state, passing it through the channel, and measuring immediately afterward. This left open a natural question: does the lower bound hold even for general protocols, i.e. ones which chain together many queries to the channel, interleaved with arbitrary data-processing channels, before measuring? Surprisingly, in this work we show the opposite: there is a protocol that can estimate the eigenvalues of a Pauli channel to error $\epsilon$ using only $O(\log n/\epsilon^2)$ ancilla qubits and $\tilde{O}(n^2/\epsilon^2)$ measurements. In contrast, we show that any protocol with zero ancilla, even a concatenating one, must make $\Omega(2^n/\epsilon^2)$ measurements, which is tight. Our results imply, to our knowledge, the first quantum learning task where logarithmically many qubits of quantum memory suffice for an exponential statistical advantage.
    Action valuation of on- and off-ball soccer players based on multi-agent deep reinforcement learning. (arXiv:2305.17886v2 [cs.AI] UPDATED)
    Analysis of invasive sports such as soccer is challenging because the game situation changes continuously in time and space, and multiple agents individually recognize the game situation and make decisions. Previous studies using deep reinforcement learning have often considered teams as a single agent and valued the teams and players who hold the ball in each discrete event. Then it was challenging to value the actions of multiple players, including players far from the ball, in a spatiotemporally continuous state space. In this paper, we propose a method of valuing possible actions for on- and off-ball soccer players in a single holistic framework based on multi-agent deep reinforcement learning. We consider a discrete action space in a continuous state space that mimics that of Google research football and leverages supervised learning for actions in reinforcement learning. In the experiment, we analyzed the relationships with conventional indicators, season goals, and game ratings by experts, and showed the effectiveness of the proposed method. Our approach can assess how multiple players move continuously throughout the game, which is difficult to be discretized or labeled but vital for teamwork, scouting, and fan engagement.  ( 3 min )
    Adversarial Attacks and Defenses on 3D Point Cloud Classification: A Survey. (arXiv:2307.00309v2 [cs.CV] UPDATED)
    Deep learning has successfully solved a wide range of tasks in 2D vision as a dominant AI technique. Recently, deep learning on 3D point clouds is becoming increasingly popular for addressing various tasks in this field. Despite remarkable achievements, deep learning algorithms are vulnerable to adversarial attacks. These attacks are imperceptible to the human eye but can easily fool deep neural networks in the testing and deployment stage. To encourage future research, this survey summarizes the current progress on adversarial attack and defense techniques on point cloud classification.This paper first introduces the principles and characteristics of adversarial attacks and summarizes and analyzes adversarial example generation methods in recent years. Additionally, it provides an overview of defense strategies, organized into data-focused and model-focused methods. Finally, it presents several current challenges and potential future research directions in this domain.
    PathLDM: Text conditioned Latent Diffusion Model for Histopathology. (arXiv:2309.00748v2 [cs.CV] UPDATED)
    To achieve high-quality results, diffusion models must be trained on large datasets. This can be notably prohibitive for models in specialized domains, such as computational pathology. Conditioning on labeled data is known to help in data-efficient model training. Therefore, histopathology reports, which are rich in valuable clinical information, are an ideal choice as guidance for a histopathology generative model. In this paper, we introduce PathLDM, the first text-conditioned Latent Diffusion Model tailored for generating high-quality histopathology images. Leveraging the rich contextual information provided by pathology text reports, our approach fuses image and textual data to enhance the generation process. By utilizing GPT's capabilities to distill and summarize complex text reports, we establish an effective conditioning mechanism. Through strategic conditioning and necessary architectural enhancements, we achieved a SoTA FID score of 7.64 for text-to-image generation on the TCGA-BRCA dataset, significantly outperforming the closest text-conditioned competitor with FID 30.1.
    Harnessing machine learning for accurate treatment of overlapping opacity species in GCMs. (arXiv:2311.00775v2 [astro-ph.EP] UPDATED)
    To understand high precision observations of exoplanets and brown dwarfs, we need detailed and complex general circulation models (GCMs) that incorporate hydrodynamics, chemistry, and radiation. In this study, we specifically examine the coupling between chemistry and radiation in GCMs and compare different methods for mixing opacities of different chemical species in the correlated-k assumption, when equilibrium chemistry cannot be assumed. We propose a fast machine learning method based on DeepSets (DS), which effectively combines individual correlated-k opacities (k-tables). We evaluate the DS method alongside other published methods like adaptive equivalent extinction (AEE) and random overlap with rebinning and resorting (RORR). We integrate these mixing methods into our GCM (expeRT/MITgcm) and assess their accuracy and performance for the example of the hot Jupiter HD~209458 b. Our findings indicate that the DS method is both accurate and efficient for GCM usage, whereas RORR is too slow. Additionally, we observe that the accuracy of AEE depends on its specific implementation and may introduce numerical issues in achieving radiative transfer solution convergence. We then apply the DS mixing method in a simplified chemical disequilibrium situation, where we model the rainout of TiO and VO, and confirm that the rainout of TiO and VO would hinder the formation of a stratosphere. To further expedite the development of consistent disequilibrium chemistry calculations in GCMs, we provide documentation and code for coupling the DS mixing method with correlated-k radiative transfer solvers. The DS method has been extensively tested to be accurate enough for GCMs, however, other methods might be needed for accelerating atmospheric retrievals.
    A rule-general abductive learning by rough sets. (arXiv:2305.19718v3 [cs.LG] UPDATED)
    In real-world tasks, there is usually a large amount of unlabeled data and labeled data. The task of combining the two to learn is known as semi-supervised learning. Experts can use logical rules to label unlabeled data, but this operation is costly. The combination of perception and reasoning has a good effect in processing such semi-supervised tasks with domain knowledge. However, acquiring domain knowledge and the correction, reduction and generation of rules remain complex problems to be solved. Rough set theory is an important method for solving knowledge processing in information systems. In this paper, we propose a rule general abductive learning by rough set (RS-ABL). By transforming the target concept and sub-concepts of rules into information tables, rough set theory is used to solve the acquisition of domain knowledge and the correction, reduction and generation of rules at a lower cost. This framework can also generate more extensive negative rules to enhance the breadth of the knowledge base. Compared with the traditional semi-supervised learning method, RS-ABL has higher accuracy in dealing with semi-supervised tasks.
    Impact of Data Augmentation on QCNNs. (arXiv:2312.00358v1 [quant-ph])
    In recent years, Classical Convolutional Neural Networks (CNNs) have been applied for image recognition successfully. Quantum Convolutional Neural Networks (QCNNs) are proposed as a novel generalization to CNNs by using quantum mechanisms. The quantum mechanisms lead to an efficient training process in QCNNs by reducing the size of input from $N$ to $log_2N$. This paper implements and compares both CNNs and QCNNs by testing losses and prediction accuracy on three commonly used datasets. The datasets include the MNIST hand-written digits, Fashion MNIST and cat/dog face images. Additionally, data augmentation (DA), a technique commonly used in CNNs to improve the performance of classification by generating similar images based on original inputs, is also implemented in QCNNs. Surprisingly, the results showed that data augmentation didn't improve QCNNs performance. The reasons and logic behind this result are discussed, hoping to expand our understanding of Quantum machine learning theory.
    On the Out-Of-Distribution Robustness of Self-Supervised Representation Learning for Phonocardiogram Signals. (arXiv:2312.00502v1 [cs.LG])
    Objective: Despite the recent increase in research activity, deep-learning models have not yet been widely accepted in medicine. The shortage of high-quality annotated data often hinders the development of robust and generalizable models, which do not suffer from degraded effectiveness when presented with newly-collected, out-of-distribution (OOD) datasets. Methods: Contrastive Self-Supervised Learning (SSL) offers a potential solution to the scarcity of labeled data as it takes advantage of unlabeled data to increase model effectiveness and robustness. In this research, we propose applying contrastive SSL for detecting abnormalities in phonocardiogram (PCG) samples by learning a generalized representation of the signal. Specifically, we perform an extensive comparative evaluation of a wide range of audio-based augmentations and evaluate trained classifiers on multiple datasets across different downstream tasks. Results: We experimentally demonstrate that, depending on its training distribution, the effectiveness of a fully-supervised model can degrade up to 32% when evaluated on unseen data, while SSL models only lose up to 10% or even improve in some cases. Conclusions: Contrastive SSL pretraining can assist in providing robust classifiers which can generalize to unseen, OOD data, without relying on time- and labor-intensive annotation processes by medical experts. Furthermore, the proposed extensive evaluation protocol sheds light on the most promising and appropriate augmentations for robust PCG signal processing. Significance: We provide researchers and practitioners with a roadmap towards producing robust models for PCG classification, in addition to an open-source codebase for developing novel approaches.
    Decentralized policy learning with partial observation and mechanical constraints for multiperson modeling. (arXiv:2007.03155v2 [cs.LG] UPDATED)
    Extracting the rules of real-world multi-agent behaviors is a current challenge in various scientific and engineering fields. Biological agents independently have limited observation and mechanical constraints; however, most of the conventional data-driven models ignore such assumptions, resulting in lack of biological plausibility and model interpretability for behavioral analyses. Here we propose sequential generative models with partial observation and mechanical constraints in a decentralized manner, which can model agents' cognition and body dynamics, and predict biologically plausible behaviors. We formulate this as a decentralized multi-agent imitation-learning problem, leveraging binary partial observation and decentralized policy models based on hierarchical variational recurrent neural networks with physical and biomechanical penalties. Using real-world basketball and soccer datasets, we show the effectiveness of our method in terms of the constraint violations, long-term trajectory prediction, and partial observation. Our approach can be used as a multi-agent simulator to generate realistic trajectories using real-world data.  ( 2 min )
    Multiple Testing of Linear Forms for Noisy Matrix Completion. (arXiv:2312.00305v1 [stat.ME])
    Many important tasks of large-scale recommender systems can be naturally cast as testing multiple linear forms for noisy matrix completion. These problems, however, present unique challenges because of the subtle bias-and-variance tradeoff of and an intricate dependence among the estimated entries induced by the low-rank structure. In this paper, we develop a general approach to overcome these difficulties by introducing new statistics for individual tests with sharp asymptotics both marginally and jointly, and utilizing them to control the false discovery rate (FDR) via a data splitting and symmetric aggregation scheme. We show that valid FDR control can be achieved with guaranteed power under nearly optimal sample size requirements using the proposed methodology. Extensive numerical simulations and real data examples are also presented to further illustrate its practical merits.
    SparseGS: Real-Time 360{\deg} Sparse View Synthesis using Gaussian Splatting. (arXiv:2312.00206v1 [cs.CV])
    The problem of novel view synthesis has grown significantly in popularity recently with the introduction of Neural Radiance Fields (NeRFs) and other implicit scene representation methods. A recent advance, 3D Gaussian Splatting (3DGS), leverages an explicit representation to achieve real-time rendering with high-quality results. However, 3DGS still requires an abundance of training views to generate a coherent scene representation. In few shot settings, similar to NeRF, 3DGS tends to overfit to training views, causing background collapse and excessive floaters, especially as the number of training views are reduced. We propose a method to enable training coherent 3DGS-based radiance fields of 360 scenes from sparse training views. We find that using naive depth priors is not sufficient and integrate depth priors with generative and explicit constraints to reduce background collapse, remove floaters, and enhance consistency from unseen viewpoints. Experiments show that our method outperforms base 3DGS by up to 30.5% and NeRF-based methods by up to 15.6% in LPIPS on the MipNeRF-360 dataset with substantially less training and inference cost.
    TRC: Trust Region Conditional Value at Risk for Safe Reinforcement Learning. (arXiv:2312.00344v1 [cs.RO])
    As safety is of paramount importance in robotics, reinforcement learning that reflects safety, called safe RL, has been studied extensively. In safe RL, we aim to find a policy which maximizes the desired return while satisfying the defined safety constraints. There are various types of constraints, among which constraints on conditional value at risk (CVaR) effectively lower the probability of failures caused by high costs since CVaR is a conditional expectation obtained above a certain percentile. In this paper, we propose a trust region-based safe RL method with CVaR constraints, called TRC. We first derive the upper bound on CVaR and then approximate the upper bound in a differentiable form in a trust region. Using this approximation, a subproblem to get policy gradients is formulated, and policies are trained by iteratively solving the subproblem. TRC is evaluated through safe navigation tasks in simulations with various robots and a sim-to-real environment with a Jackal robot from Clearpath. Compared to other safe RL methods, the performance is improved by 1.93 times while the constraints are satisfied in all experiments.
    Active Learning and Bayesian Optimization: a Unified Perspective to Learn with a Goal. (arXiv:2303.01560v3 [cs.LG] UPDATED)
    Science and Engineering applications are typically associated with expensive optimization problem to identify optimal design solutions and states of the system of interest. Bayesian optimization and active learning compute surrogate models through efficient adaptive sampling schemes to assist and accelerate this search task toward a given optimization goal. Both those methodologies are driven by specific infill/learning criteria which quantify the utility with respect to the set goal of evaluating the objective function for unknown combinations of optimization variables. While the two fields have seen an exponential growth in popularity in the past decades, their dualism and synergy have received relatively little attention to date. This paper discusses and formalizes the synergy between Bayesian optimization and active learning as symbiotic adaptive sampling methodologies driven by common principles. In particular, we demonstrate this unified perspective through the formalization of the analogy between the Bayesian infill criteria and active learning criteria as driving principles of both the goal-driven procedures. To support our original perspective, we propose a general classification of adaptive sampling techniques to highlight similarities and differences between the vast families of adaptive sampling, active learning, and Bayesian optimization. Accordingly, the synergy is demonstrated mapping the Bayesian infill criteria with the active learning criteria, and is formalized for searches informed by both a single information source and multiple levels of fidelity. In addition, we provide guidelines to apply those learning criteria investigating the performance of different Bayesian schemes for a variety of benchmark problems to highlight benefits and limitations over mathematical properties that characterize real-world applications.
    Scalable Meta-Learning with Gaussian Processes. (arXiv:2312.00742v1 [stat.ML])
    Meta-learning is a powerful approach that exploits historical data to quickly solve new tasks from the same distribution. In the low-data regime, methods based on the closed-form posterior of Gaussian processes (GP) together with Bayesian optimization have achieved high performance. However, these methods are either computationally expensive or introduce assumptions that hinder a principled propagation of uncertainty between task models. This may disrupt the balance between exploration and exploitation during optimization. In this paper, we develop ScaML-GP, a modular GP model for meta-learning that is scalable in the number of tasks. Our core contribution is a carefully designed multi-task kernel that enables hierarchical training and task scalability. Conditioning ScaML-GP on the meta-data exposes its modular nature yielding a test-task prior that combines the posteriors of meta-task GPs. In synthetic and real-world meta-learning experiments, we demonstrate that ScaML-GP can learn efficiently both with few and many meta-tasks.
    Timewarp: Transferable Acceleration of Molecular Dynamics by Learning Time-Coarsened Dynamics. (arXiv:2302.01170v2 [stat.ML] UPDATED)
    Molecular dynamics (MD) simulation is a widely used technique to simulate molecular systems, most commonly at the all-atom resolution where equations of motion are integrated with timesteps on the order of femtoseconds ($1\textrm{fs}=10^{-15}\textrm{s}$). MD is often used to compute equilibrium properties, which requires sampling from an equilibrium distribution such as the Boltzmann distribution. However, many important processes, such as binding and folding, occur over timescales of milliseconds or beyond, and cannot be efficiently sampled with conventional MD. Furthermore, new MD simulations need to be performed for each molecular system studied. We present Timewarp, an enhanced sampling method which uses a normalising flow as a proposal distribution in a Markov chain Monte Carlo method targeting the Boltzmann distribution. The flow is trained offline on MD trajectories and learns to make large steps in time, simulating the molecular dynamics of $10^{5} - 10^{6}\:\textrm{fs}$. Crucially, Timewarp is transferable between molecular systems: once trained, we show that it generalises to unseen small peptides (2-4 amino acids) at all-atom resolution, exploring their metastable states and providing wall-clock acceleration of sampling compared to standard MD. Our method constitutes an important step towards general, transferable algorithms for accelerating MD.
    Automating Continual Learning. (arXiv:2312.00276v1 [cs.LG])
    General-purpose learning systems should improve themselves in open-ended fashion in ever-changing environments. Conventional learning algorithms for neural networks, however, suffer from catastrophic forgetting (CF) -- previously acquired skills are forgotten when a new task is learned. Instead of hand-crafting new algorithms for avoiding CF, we propose Automated Continual Learning (ACL) to train self-referential neural networks to meta-learn their own in-context continual (meta-)learning algorithms. ACL encodes all desiderata -- good performance on both old and new tasks -- into its meta-learning objectives. Our experiments demonstrate that ACL effectively solves "in-context catastrophic forgetting"; our ACL-learned algorithms outperform hand-crafted ones, e.g., on the Split-MNIST benchmark in the replay-free setting, and enables continual learning of diverse tasks consisting of multiple few-shot and standard image classification datasets.
    PEFTDebias : Capturing debiasing information using PEFTs. (arXiv:2312.00434v1 [cs.LG])
    The increasing use of foundation models highlights the urgent need to address and eliminate implicit biases present in them that arise during pretraining. In this paper, we introduce PEFTDebias, a novel approach that employs parameter-efficient fine-tuning (PEFT) to mitigate the biases within foundation models. PEFTDebias consists of two main phases: an upstream phase for acquiring debiasing parameters along a specific bias axis, and a downstream phase where these parameters are incorporated into the model and frozen during the fine-tuning process. By evaluating on four datasets across two bias axes namely gender and race, we find that downstream biases can be effectively reduced with PEFTs. In addition, we show that these parameters possess axis-specific debiasing characteristics, enabling their effective transferability in mitigating biases in various downstream tasks. To ensure reproducibility, we release the code to do our experiments.
    Learning Delays in Spiking Neural Networks using Dilated Convolutions with Learnable Spacings. (arXiv:2306.17670v3 [cs.NE] UPDATED)
    Spiking Neural Networks (SNNs) are a promising research direction for building power-efficient information processing systems, especially for temporal tasks such as speech recognition. In SNNs, delays refer to the time needed for one spike to travel from one neuron to another. These delays matter because they influence the spike arrival times, and it is well-known that spiking neurons respond more strongly to coincident input spikes. More formally, it has been shown theoretically that plastic delays greatly increase the expressivity in SNNs. Yet, efficient algorithms to learn these delays have been lacking. Here, we propose a new discrete-time algorithm that addresses this issue in deep feedforward SNNs using backpropagation, in an offline manner. To simulate delays between consecutive layers, we use 1D convolutions across time. The kernels contain only a few non-zero weights - one per synapse - whose positions correspond to the delays. These positions are learned together with the weights using the recently proposed Dilated Convolution with Learnable Spacings (DCLS). We evaluated our method on three datasets: the Spiking Heidelberg Dataset (SHD), the Spiking Speech Commands (SSC) and its non-spiking version Google Speech Commands v0.02 (GSC) benchmarks, which require detecting temporal patterns. We used feedforward SNNs with two or three hidden fully connected layers, and vanilla leaky integrate-and-fire neurons. We showed that fixed random delays help and that learning them helps even more. Furthermore, our method outperformed the state-of-the-art in the three datasets without using recurrent connections and with substantially fewer parameters. Our work demonstrates the potential of delay learning in developing accurate and precise models for temporal data processing. Our code is based on PyTorch / SpikingJelly and available at: https://github.com/Thvnvtos/SNN-delays
    Spatio-Temporal-Decoupled Masked Pre-training for Traffic Forecasting. (arXiv:2312.00516v1 [cs.LG])
    Accurate forecasting of multivariate traffic flow time series remains challenging due to substantial spatio-temporal heterogeneity and complex long-range correlative patterns. To address this, we propose Spatio-Temporal-Decoupled Masked Pre-training (STD-MAE), a novel framework that employs masked autoencoders to learn and encode complex spatio-temporal dependencies via pre-training. Specifically, we use two decoupled masked autoencoders to reconstruct the traffic data along spatial and temporal axes using a self-supervised pre-training approach. These mask reconstruction mechanisms capture the long-range correlations in space and time separately. The learned hidden representations are then used to augment the downstream spatio-temporal traffic predictor. A series of quantitative and qualitative evaluations on four widely-used traffic benchmarks (PEMS03, PEMS04, PEMS07, and PEMS08) are conducted to verify the state-of-the-art performance, with STD-MAE explicitly enhancing the downstream spatio-temporal models' ability to capture long-range intricate spatial and temporal patterns. Codes are available at https://github.com/Jimmy-7664/STD_MAE.
    A Generalizable Deep Learning System for Cardiac MRI. (arXiv:2312.00357v1 [eess.IV])
    Cardiac MRI allows for a comprehensive assessment of myocardial structure, function, and tissue characteristics. Here we describe a foundational vision system for cardiac MRI, capable of representing the breadth of human cardiovascular disease and health. Our deep learning model is trained via self-supervised contrastive learning, by which visual concepts in cine-sequence cardiac MRI scans are learned from the raw text of the accompanying radiology reports. We train and evaluate our model on data from four large academic clinical institutions in the United States. We additionally showcase the performance of our models on the UK BioBank, and two additional publicly available external datasets. We explore emergent zero-shot capabilities of our system, and demonstrate remarkable performance across a range of tasks; including the problem of left ventricular ejection fraction regression, and the diagnosis of 35 different conditions such as cardiac amyloidosis and hypertrophic cardiomyopathy. We show that our deep learning system is capable of not only understanding the staggering complexity of human cardiovascular disease, but can be directed towards clinical problems of interest yielding impressive, clinical grade diagnostic accuracy with a fraction of the training data typically required for such tasks.
    Optimal Stopping via Randomized Neural Networks. (arXiv:2104.13669v4 [stat.ML] UPDATED)
    This paper presents the benefits of using randomized neural networks instead of standard basis functions or deep neural networks to approximate the solutions of optimal stopping problems. The key idea is to use neural networks, where the parameters of the hidden layers are generated randomly and only the last layer is trained, in order to approximate the continuation value. Our approaches are applicable to high dimensional problems where the existing approaches become increasingly impractical. In addition, since our approaches can be optimized using simple linear regression, they are easy to implement and theoretical guarantees can be provided. We test our approaches for American option pricing on Black--Scholes, Heston and rough Heston models and for optimally stopping a fractional Brownian motion. In all cases, our algorithms outperform the state-of-the-art and other relevant machine learning approaches in terms of computation time while achieving comparable results. Moreover, we show that they can also be used to efficiently compute Greeks of American options.
    EpiTESTER: Testing Autonomous Vehicles with Epigenetic Algorithm and Attention Mechanism. (arXiv:2312.00207v1 [cs.SE])
    Testing autonomous vehicles (AVs) under various environmental scenarios that lead the vehicles to unsafe situations is known to be challenging. Given the infinite possible environmental scenarios, it is essential to find critical scenarios efficiently. To this end, we propose a novel testing method, named EpiTESTER, by taking inspiration from epigenetics, which enables species to adapt to sudden environmental changes. In particular, EpiTESTER adopts gene silencing as its epigenetic mechanism, which regulates gene expression to prevent the expression of a certain gene, and the probability of gene expression is dynamically computed as the environment changes. Given different data modalities (e.g., images, lidar point clouds) in the context of AV, EpiTESTER benefits from a multi-model fusion transformer to extract high-level feature representations from environmental factors and then calculates probabilities based on these features with the attention mechanism. To assess the cost-effectiveness of EpiTESTER, we compare it with a classical genetic algorithm (GA) (i.e., without any epigenetic mechanism implemented) and EpiTESTER with equal probability for each gene. We evaluate EpiTESTER with four initial environments from CARLA, an open-source simulator for autonomous driving research, and an end-to-end AV controller, Interfuser. Our results show that EpiTESTER achieved a promising performance in identifying critical scenarios compared to the baselines, showing that applying epigenetic mechanisms is a good option for solving practical problems.
    Identify ambiguous tasks combining crowdsourced labels by weighting Areas Under the Margin. (arXiv:2209.15380v3 [cs.LG] UPDATED)
    In supervised learning - for instance in image classification - modern massive datasets are commonly labeled by a crowd of workers. The obtained labels in this crowdsourcing setting are then aggregated for training, generally leveraging a per-worker trust score. Yet, such workers oriented approaches discard the tasks' ambiguity. Ambiguous tasks might fool expert workers, which is often harmful for the learning step. In standard supervised learning settings - with one label per task - the Area Under the Margin (AUM) was tailored to identify mislabeled data. We adapt the AUM to identify ambiguous tasks in crowdsourced learning scenarios, introducing the Weighted Areas Under the Margin (WAUM). The WAUM is an average of AUMs weighted according to task-dependent scores. We show that the WAUM can help discarding ambiguous tasks from the training set, leading to better generalization performance. We report improvements over existing strategies for learning with a crowd, both on simulated settings, and on real datasets such as CIFAR-10H (a crowdsourced dataset with a high number of answered labels),LabelMe and Music (two datasets with few answered votes).  ( 2 min )
    FedEmb: A Vertical and Hybrid Federated Learning Algorithm using Network And Feature Embedding Aggregation. (arXiv:2312.00102v1 [cs.LG])
    Federated learning (FL) is an emerging paradigm for decentralized training of machine learning models on distributed clients, without revealing the data to the central server. The learning scheme may be horizontal, vertical or hybrid (both vertical and horizontal). Most existing research work with deep neural network (DNN) modelling is focused on horizontal data distributions, while vertical and hybrid schemes are much less studied. In this paper, we propose a generalized algorithm FedEmb, for modelling vertical and hybrid DNN-based learning. The idea of our algorithm is characterised by higher inference accuracy, stronger privacy-preserving properties, and lower client-server communication bandwidth demands as compared with existing work. The experimental results show that FedEmb is an effective method to tackle both split feature & subject space decentralized problems, shows 0.3% to 4.2% inference accuracy improvement with limited privacy revealing for datasets stored in local clients, and reduces 88.9 % time complexity over vertical baseline method.
    Forecasting Trends in Food Security: a Reservoir Computing Approach. (arXiv:2312.00626v1 [cs.LG])
    Early warning systems are an essential tool for effective humanitarian action. Advance warnings on impending disasters facilitate timely and targeted response which help save lives, livelihoods, and scarce financial resources. In this work we present a new quantitative methodology to forecast levels of food consumption for 60 consecutive days, at the sub-national level, in four countries: Mali, Nigeria, Syria, and Yemen. The methodology is built on publicly available data from the World Food Programme's integrated global hunger monitoring system which collects, processes, and displays daily updates on key food security metrics, conflict, weather events, and other drivers of food insecurity across 90 countries (https://hungermap.wfp.org/). In this study, we assessed the performance of various models including ARIMA, XGBoost, LSTMs, CNNs, and Reservoir Computing (RC), by comparing their Root Mean Squared Error (RMSE) metrics. This comprehensive analysis spanned classical statistical, machine learning, and deep learning approaches. Our findings highlight Reservoir Computing as a particularly well-suited model in the field of food security given both its notable resistance to over-fitting on limited data samples and its efficient training capabilities. The methodology we introduce establishes the groundwork for a global, data-driven early warning system designed to anticipate and detect food insecurity.  ( 2 min )
    Presentation Attack detection using Wavelet Transform and Deep Residual Neural Net. (arXiv:2312.00040v1 [cs.CR])
    Biometric authentication is becoming more prevalent for secured authentication systems. However, the biometric substances can be deceived by the imposters in several ways. Among other imposter attacks, print attacks, mask attacks, and replay attacks fall under the presentation attack category. The bio-metric images, especially the iris and face, are vulnerable to different presentation attacks. This research applies deep learning approaches to mitigate presentation attacks in a biometric access control system. Our contribution in this paper is two-fold: First, we applied the wavelet transform to extract the features from the biometric images. Second, we modified the deep residual neural net and applied it to the spoof datasets in an attempt to detect the presentation attacks. This research applied the proposed approach to biometric spoof datasets, namely ATVS, CASIA two class, and CASIA cropped image sets. The datasets used in this research contain images that are captured in both a controlled and uncontrolled environment along with different resolutions and sizes. We obtained the best accuracy of 93% on the ATVS Iris datasets. For CASIA two class and CASIA cropped datasets, we achieved test accuracies of 91% and 82%, respectively.
    Multimodal Learning for Crystalline Materials. (arXiv:2312.00111v1 [cs.LG])
    Artificial intelligence (AI) has revolutionized the field of materials science by improving the prediction of properties and accelerating the discovery of novel materials. In recent years, publicly available material data repositories containing data for various material properties have grown rapidly. In this work, we introduce Multimodal Learning for Crystalline Materials (MLCM), a new method for training a foundation model for crystalline materials via multimodal alignment, where high-dimensional material properties (i.e. modalities) are connected in a shared latent space to produce highly useful material representations. We show the utility of MLCM on multiple axes: (i) MLCM achieves state-of-the-art performance for material property prediction on the challenging Materials Project database; (ii) MLCM enables a novel, highly accurate method for inverse design, allowing one to screen for stable material with desired properties; and (iii) MLCM allows the extraction of interpretable emergent features that may provide insight to material scientists. Further, we explore several novel methods for aligning an arbitrary number of modalities, improving upon prior art in multimodal learning that focuses on bimodal alignment. Our work brings innovations from the ongoing AI revolution into the domain of materials science and identifies materials as a testbed for the next generation of AI.
    Large Language Models of Code Fail at Completing Code with Potential Bugs. (arXiv:2306.03438v2 [cs.LG] UPDATED)
    Large language models of code (Code-LLMs) have recently brought tremendous advances to code completion, a fundamental feature of programming assistance and code intelligence. However, most existing works ignore the possible presence of bugs in the code context for generation, which are inevitable in software development. Therefore, we introduce and study the buggy-code completion problem, inspired by the realistic scenario of real-time code suggestion where the code context contains potential bugs -- anti-patterns that can become bugs in the completed program. To systematically study the task, we introduce two datasets: one with synthetic bugs derived from semantics-altering operator changes (buggy-HumanEval) and one with realistic bugs derived from user submissions to coding problems (buggy-FixEval). We find that the presence of potential bugs significantly degrades the generation performance of the high-performing Code-LLMs. For instance, the passing rates of CODEGEN-2B-MONO on test cases of buggy-HumanEval drop more than 50% given a single potential bug in the context. Finally, we investigate several post-hoc methods for mitigating the adverse effect of potential bugs and find that there remains a significant gap in post-mitigation performance.  ( 2 min )
    Machine Learning for Health symposium 2023 -- Findings track. (arXiv:2312.00655v1 [cs.LG])
    A collection of the accepted Findings papers that were presented at the 3rd Machine Learning for Health symposium (ML4H 2023), which was held on December 10, 2023, in New Orleans, Louisiana, USA. ML4H 2023 invited high-quality submissions on relevant problems in a variety of health-related disciplines including healthcare, biomedicine, and public health. Two submission tracks were offered: the archival Proceedings track, and the non-archival Findings track. Proceedings were targeted at mature work with strong technical sophistication and a high impact to health. The Findings track looked for new ideas that could spark insightful discussion, serve as valuable resources for the community, or could enable new collaborations. Submissions to the Proceedings track, if not accepted, were automatically considered for the Findings track. All the manuscripts submitted to ML4H Symposium underwent a double-blind peer-review process.  ( 2 min )
    Optimal Sample Complexity of Contrastive Learning. (arXiv:2312.00379v1 [cs.LG])
    Contrastive learning is a highly successful technique for learning representations of data from labeled tuples, specifying the distance relations within the tuple. We study the sample complexity of contrastive learning, i.e. the minimum number of labeled tuples sufficient for getting high generalization accuracy. We give tight bounds on the sample complexity in a variety of settings, focusing on arbitrary distance functions, both general $\ell_p$-distances, and tree metrics. Our main result is an (almost) optimal bound on the sample complexity of learning $\ell_p$-distances for integer $p$. For any $p \ge 1$ we show that $\tilde \Theta(\min(nd,n^2))$ labeled tuples are necessary and sufficient for learning $d$-dimensional representations of $n$-point datasets. Our results hold for an arbitrary distribution of the input samples and are based on giving the corresponding bounds on the Vapnik-Chervonenkis/Natarajan dimension of the associated problems. We further show that the theoretical bounds on sample complexity obtained via VC/Natarajan dimension can have strong predictive power for experimental results, in contrast with the folklore belief about a substantial gap between the statistical learning theory and the practice of deep learning.
    Predicting breast cancer with AI for individual risk-adjusted MRI screening and early detection. (arXiv:2312.00067v1 [physics.med-ph])
    Women with an increased life-time risk of breast cancer undergo supplemental annual screening MRI. We propose to predict the risk of developing breast cancer within one year based on the current MRI, with the objective of reducing screening burden and facilitating early detection. An AI algorithm was developed on 53,858 breasts from 12,694 patients who underwent screening or diagnostic MRI and accrued over 12 years, with 2,331 confirmed cancers. A first U-Net was trained to segment lesions and identify regions of concern. A second convolutional network was trained to detect malignant cancer using features extracted by the U-Net. This network was then fine-tuned to estimate the risk of developing cancer within a year in cases that radiologists considered normal or likely benign. Risk predictions from this AI were evaluated with a retrospective analysis of 9,183 breasts from a high-risk screening cohort, which were not used for training. Statistical analysis focused on the tradeoff between number of omitted exams versus negative predictive value, and number of potential early detections versus positive predictive value. The AI algorithm identified regions of concern that coincided with future tumors in 52% of screen-detected cancers. Upon directed review, a radiologist found that 71.3% of cancers had a visible correlate on the MRI prior to diagnosis, 65% of these correlates were identified by the AI model. Reevaluating these regions in 10% of all cases with higher AI-predicted risk could have resulted in up to 33% early detections by a radiologist. Additionally, screening burden could have been reduced in 16% of lower-risk cases by recommending a later follow-up without compromising current interval cancer rate. With increasing datasets and improving image quality we expect this new AI-aided, adaptive screening to meaningfully reduce screening burden and improve early detection.  ( 3 min )
    HarsanyiNet: Computing Accurate Shapley Values in a Single Forward Propagation. (arXiv:2304.01811v2 [cs.LG] UPDATED)
    The Shapley value is widely regarded as a trustworthy attribution metric. However, when people use Shapley values to explain the attribution of input variables of a deep neural network (DNN), it usually requires a very high computational cost to approximate relatively accurate Shapley values in real-world applications. Therefore, we propose a novel network architecture, the HarsanyiNet, which makes inferences on the input sample and simultaneously computes the exact Shapley values of the input variables in a single forward propagation. The HarsanyiNet is designed on the theoretical foundation that the Shapley value can be reformulated as the redistribution of Harsanyi interactions encoded by the network.  ( 2 min )
    Hypergraph Node Representation Learning with One-Stage Message Passing. (arXiv:2312.00336v1 [cs.LG])
    Hypergraphs as an expressive and general structure have attracted considerable attention from various research domains. Most existing hypergraph node representation learning techniques are based on graph neural networks, and thus adopt the two-stage message passing paradigm (i.e. node -> hyperedge -> node). This paradigm only focuses on local information propagation and does not effectively take into account global information, resulting in less optimal representations. Our theoretical analysis of representative two-stage message passing methods shows that, mathematically, they model different ways of local message passing through hyperedges, and can be unified into one-stage message passing (i.e. node -> node). However, they still only model local information. Motivated by this theoretical analysis, we propose a novel one-stage message passing paradigm to model both global and local information propagation for hypergraphs. We integrate this paradigm into HGraphormer, a Transformer-based framework for hypergraph node representation learning. HGraphormer injects the hypergraph structure information (local information) into Transformers (global information) by combining the attention matrix and hypergraph Laplacian. Extensive experiments demonstrate that HGraphormer outperforms recent hypergraph learning methods on five representative benchmark datasets on the semi-supervised hypernode classification task, setting new state-of-the-art performance, with accuracy improvements between 2.52% and 6.70%. Our code and datasets are available.  ( 2 min )
    Quantum Kernel t-Distributed Stochastic Neighbor Embedding. (arXiv:2312.00352v1 [quant-ph])
    Data visualization is important in understanding the characteristics of data that are difficult to see directly. It is used to visualize loss landscapes and optimization trajectories to analyze optimization performance. Popular optimization analysis is performed by visualizing a loss landscape around the reached local or global minimum using principal component analysis. However, this visualization depends on the variational parameters of a quantum circuit rather than quantum states, which makes it difficult to understand the mechanism of optimization process through the property of quantum states. Here, we propose a quantum data visualization method using quantum kernels, which enables us to offer fast and highly accurate visualization of quantum states. In our numerical experiments, we visualize hand-written digits dataset and apply $k$-nearest neighbor algorithm to the low-dimensional data to quantitatively evaluate our proposed method compared with a classical kernel method. As a result, our proposed method achieves comparable accuracy to the state-of-the-art classical kernel method, meaning that the proposed visualization method based on quantum machine learning does not degrade the separability of the input higher dimensional data. Furthermore, we visualize the optimization trajectories of finding the ground states of transverse field Ising model and successfully find the trajectory characteristics. Since quantum states are higher dimensional objects that can only be seen via observables, our visualization method, which inherits the similarity of quantum data, would be useful in understanding the behavior of quantum circuits and algorithms.
    LinguaLinked: A Distributed Large Language Model Inference System for Mobile Devices. (arXiv:2312.00388v1 [cs.LG])
    Deploying Large Language Models (LLMs) locally on mobile devices presents a significant challenge due to their extensive memory requirements. In this paper, we introduce LinguaLinked, a system for decentralized, distributed LLM inference on mobile devices. LinguaLinked enables collaborative execution of the inference task across multiple trusted devices. LinguaLinked ensures data privacy by processing information locally. LinguaLinked uses three key strategies. First, an optimized model assignment technique segments LLMs and uses linear optimization to align segments with each device's capabilities. Second, an optimized data transmission mechanism ensures efficient and structured data flow between model segments while also maintaining the integrity of the original model structure. Finally, LinguaLinked incorporates a runtime load balancer that actively monitors and redistributes tasks among mobile devices to prevent bottlenecks, enhancing the system's overall efficiency and responsiveness. We demonstrate that LinguaLinked facilitates efficient LLM inference while maintaining consistent throughput and minimal latency through extensive testing across various mobile devices, from high-end to low-end Android devices. In our evaluations, compared to the baseline, LinguaLinked achieves an inference performance acceleration of $1.11\times$ to $1.61\times$ in single-threaded settings, $1.73\times$ to $2.65\times$ with multi-threading. Additionally, runtime load balancing yields an overall inference acceleration of $1.29\times$ to $1.32\times$.
    GIFT: Generative Interpretable Fine-Tuning Transformers. (arXiv:2312.00700v1 [cs.CV])
    We present GIFT (Generative Interpretable Fine-tuning Transformers) for fine-tuning pretrained (often large) Transformer models at downstream tasks in a parameter-efficient way with built-in interpretability. Our GIFT is a deep parameter-residual learning method, which addresses two problems in fine-tuning a pretrained Transformer model: Where to apply the parameter-efficient fine-tuning (PEFT) to be extremely lightweight yet sufficiently expressive, and How to learn the PEFT to better exploit the knowledge of the pretrained model in a direct way? For the former, we select the final projection (linear) layer in the multi-head self-attention of a Transformer model, and verify its effectiveness. For the latter, in contrast to the prior art that directly introduce new model parameters (often in low-rank approximation form) to be learned in fine-tuning with downstream data, we propose a method for learning to generate the fine-tuning parameters. Our GIFT is a hyper-Transformer which take as input the pretrained parameters of the projection layer to generate its fine-tuning parameters using a proposed Parameter-to-Cluster Attention (PaCa). The PaCa results in a simple clustering-based forward explainer that plays the role of semantic segmentation in testing. In experiments, our proposed GIFT is tested on the VTAB benchmark and the fine-grained visual classification (FGVC) benchmark. It obtains significantly better performance than the prior art. Our code is available at https://github.com/savadikarc/gift  ( 2 min )
    Towards Unsupervised Representation Learning: Learning, Evaluating and Transferring Visual Representations. (arXiv:2312.00101v1 [cs.CV])
    Unsupervised representation learning aims at finding methods that learn representations from data without annotation-based signals. Abstaining from annotations not only leads to economic benefits but may - and to some extent already does - result in advantages regarding the representation's structure, robustness, and generalizability to different tasks. In the long run, unsupervised methods are expected to surpass their supervised counterparts due to the reduction of human intervention and the inherently more general setup that does not bias the optimization towards an objective originating from specific annotation-based signals. While major advantages of unsupervised representation learning have been recently observed in natural language processing, supervised methods still dominate in vision domains for most tasks. In this dissertation, we contribute to the field of unsupervised (visual) representation learning from three perspectives: (i) Learning representations: We design unsupervised, backpropagation-free Convolutional Self-Organizing Neural Networks (CSNNs) that utilize self-organization- and Hebbian-based learning rules to learn convolutional kernels and masks to achieve deeper backpropagation-free models. (ii) Evaluating representations: We build upon the widely used (non-)linear evaluation protocol to define pretext- and target-objective-independent metrics for measuring and investigating the objective function mismatch between various unsupervised pretext tasks and target tasks. (iii) Transferring representations: We contribute CARLANE, the first 3-way sim-to-real domain adaptation benchmark for 2D lane detection, and a method based on prototypical self-supervised learning. Finally, we contribute a content-consistent unpaired image-to-image translation method that utilizes masks, global and local discriminators, and similarity sampling to mitigate content inconsistencies.
    A Physics-Constrained NeuralODE Approach for Robust Learning of Stiff Chemical Kinetics. (arXiv:2312.00038v1 [physics.comp-ph])
    The high computational cost associated with solving for detailed chemistry poses a significant challenge for predictive computational fluid dynamics (CFD) simulations of turbulent reacting flows. These models often require solving a system of coupled stiff ordinary differential equations (ODEs). While deep learning techniques have been experimented with to develop faster surrogate models, they often fail to integrate reliably with CFD solvers. This instability arises because deep learning methods optimize for training error without ensuring compatibility with ODE solvers, leading to accumulation of errors over time. Recently, NeuralODE-based techniques have offered a promising solution by effectively modeling chemical kinetics. In this study, we extend the NeuralODE framework for stiff chemical kinetics by incorporating mass conservation constraints directly into the loss function during training. This ensures that the total mass and the elemental mass are conserved, a critical requirement for reliable downstream integration with CFD solvers. Our results demonstrate that this enhancement not only improves the physical consistency with respect to mass conservation criteria but also ensures better robustness and makes the training process more computationally efficient.
    Pathway to a fully data-driven geotechnics: lessons from materials informatics. (arXiv:2312.00581v1 [cs.LG])
    This paper elucidates the challenges and opportunities inherent in integrating data-driven methodologies into geotechnics, drawing inspiration from the success of materials informatics. Highlighting the intricacies of soil complexity, heterogeneity, and the lack of comprehensive data, the discussion underscores the pressing need for community-driven database initiatives and open science movements. By leveraging the transformative power of deep learning, particularly in feature extraction from high-dimensional data and the potential of transfer learning, we envision a paradigm shift towards a more collaborative and innovative geotechnics field. The paper concludes with a forward-looking stance, emphasizing the revolutionary potential brought about by advanced computational tools like large language models in reshaping geotechnics informatics.
    Negotiated Representations to Prevent Forgetting in Machine Learning Applications. (arXiv:2312.00237v1 [cs.LG])
    Catastrophic forgetting is a significant challenge in the field of machine learning, particularly in neural networks. When a neural network learns to perform well on a new task, it often forgets its previously acquired knowledge or experiences. This phenomenon occurs because the network adjusts its weights and connections to minimize the loss on the new task, which can inadvertently overwrite or disrupt the representations that were crucial for the previous tasks. As a result, the the performance of the network on earlier tasks deteriorates, limiting its ability to learn and adapt to a sequence of tasks. In this paper, we propose a novel method for preventing catastrophic forgetting in machine learning applications, specifically focusing on neural networks. Our approach aims to preserve the knowledge of the network across multiple tasks while still allowing it to learn new information effectively. We demonstrate the effectiveness of our method by conducting experiments on various benchmark datasets, including Split MNIST, Split CIFAR10, Split Fashion MNIST, and Split CIFAR100. These datasets are created by dividing the original datasets into separate, non overlapping tasks, simulating a continual learning scenario where the model needs to learn multiple tasks sequentially without forgetting the previous ones. Our proposed method tackles the catastrophic forgetting problem by incorporating negotiated representations into the learning process, which allows the model to maintain a balance between retaining past experiences and adapting to new tasks. By evaluating our method on these challenging datasets, we aim to showcase its potential for addressing catastrophic forgetting and improving the performance of neural networks in continual learning settings.
    A Causality-Aware Pattern Mining Scheme for Group Activity Recognition in a Pervasive Sensor Space. (arXiv:2312.00404v1 [cs.LG])
    Human activity recognition (HAR) is a key challenge in pervasive computing and its solutions have been presented based on various disciplines. Specifically, for HAR in a smart space without privacy and accessibility issues, data streams generated by deployed pervasive sensors are leveraged. In this paper, we focus on a group activity by which a group of users perform a collaborative task without user identification and propose an efficient group activity recognition scheme which extracts causality patterns from pervasive sensor event sequences generated by a group of users to support as good recognition accuracy as the state-of-the-art graphical model. To filter out irrelevant noise events from a given data stream, a set of rules is leveraged to highlight causally related events. Then, a pattern-tree algorithm extracts frequent causal patterns by means of a growing tree structure. Based on the extracted patterns, a weighted sum-based pattern matching algorithm computes the likelihoods of stored group activities to the given test event sequence by means of matched event pattern counts for group activity recognition. We evaluate the proposed scheme using the data collected from our testbed and CASAS datasets where users perform their tasks on a daily basis and validate its effectiveness in a real environment. Experiment results show that the proposed scheme performs higher recognition accuracy and with a small amount of runtime overhead than the existing schemes.
    A Policy Gradient Method for Confounded POMDPs. (arXiv:2305.17083v2 [stat.ML] UPDATED)
    In this paper, we propose a policy gradient method for confounded partially observable Markov decision processes (POMDPs) with continuous state and observation spaces in the offline setting. We first establish a novel identification result to non-parametrically estimate any history-dependent policy gradient under POMDPs using the offline data. The identification enables us to solve a sequence of conditional moment restrictions and adopt the min-max learning procedure with general function approximation for estimating the policy gradient. We then provide a finite-sample non-asymptotic bound for estimating the gradient uniformly over a pre-specified policy class in terms of the sample size, length of horizon, concentratability coefficient and the measure of ill-posedness in solving the conditional moment restrictions. Lastly, by deploying the proposed gradient estimation in the gradient ascent algorithm, we show the global convergence of the proposed algorithm in finding the history-dependent optimal policy under some technical conditions. To the best of our knowledge, this is the first work studying the policy gradient method for POMDPs under the offline setting.
    Generating high-quality 3DMPCs by adaptive data acquisition and NeREF-based radiometric calibration with UGV plant phenotyping system. (arXiv:2305.06777v2 [eess.IV] UPDATED)
    Fusion of 3D and MS imaging data has a great potential for high-throughput plant phenotyping of structural and biochemical as well as physiological traits simultaneously, which is important for decision support in agriculture and for crop breeders in selecting the best genotypes. However, lacking of 3D data integrity of various plant canopy structures and low-quality of MS images caused by the complex illumination effects make a great challenge, especially at the proximal imaging scale. Therefore, this study proposed a novel approach for adaptive data acquisition and radiometric calibration to generate high-quality 3DMPCs of plants. An efficient NBV planning method based on an UGV plant phenotyping system with a multi-sensor-equipped robotic arm was proposed to achieve adaptive data acquisition. The NeREF was employed to predict the DN values of the hemispherical reference for radiometric calibration. For NBV planning, the average total time for single plant at a joint speed of 1.55 rad/s was about 62.8 s, with an average reduction of 18.0% compared to the unplanned. The integrity of the whole-plant data was improved by an average of 23.6% compared to the fixed viewpoints alone. Compared with the ASD measurements, the RMSE of the reflectance spectra obtained from 3DMPCs at different regions of interest was 0.08 with an average decrease of 58.93% compared to the results obtained from the single-frame of MS images without 3D radiometric calibration. The 3D-calibrated plant 3DMPCs improved the predictive accuracy of PLSR for chlorophyll content, with an average increase of 0.07 in R2 and an average decrease of 21.25% in RMSE. Our approach introduced a fresh perspective on generating high-quality 3DMPCs of plants under the natural light condition, enabling more precise analysis of plant morphological and physiological parameters.
    Curvature Explains Loss of Plasticity. (arXiv:2312.00246v1 [cs.LG])
    Loss of plasticity is a phenomenon in which neural networks lose their ability to learn from new experience. Despite being empirically observed in several problem settings, little is understood about the mechanisms that lead to loss of plasticity. In this paper, we offer a consistent explanation for plasticity loss, based on an assertion that neural networks lose directions of curvature during training and that plasticity loss can be attributed to this reduction in curvature. To support such a claim, we provide a systematic empirical investigation of plasticity loss across several continual supervised learning problems. Our findings illustrate that curvature loss coincides with and sometimes precedes plasticity loss, while also showing that previous explanations are insufficient to explain loss of plasticity in all settings. Lastly, we show that regularizers which mitigate loss of plasticity also preserve curvature, motivating a simple distributional regularizer that proves to be effective across the problem settings considered.
    In-Context Pretraining: Language Modeling Beyond Document Boundaries. (arXiv:2310.10638v4 [cs.CL] UPDATED)
    Large language models (LMs) are currently trained to predict tokens given document prefixes, enabling them to directly perform long-form generation and prompting-style tasks which can be reduced to document completion. Existing pretraining pipelines train LMs by concatenating random sets of short documents to create input contexts but the prior documents provide no signal for predicting the next document. We instead present In-Context Pretraining, a new approach where language models are pretrained on a sequence of related documents, thereby explicitly encouraging them to read and reason across document boundaries. We can do In-Context Pretraining by simply changing the document ordering so that each context contains related documents, and directly applying existing pretraining pipelines. However, this document sorting problem is challenging. There are billions of documents and we would like the sort to maximize contextual similarity for every document without repeating any data. To do this, we introduce approximate algorithms for finding related documents with efficient nearest neighbor search and constructing coherent input contexts with a graph traversal algorithm. Our experiments show In-Context Pretraining offers a simple and scalable approach to significantly enhance LMs'performance: we see notable improvements in tasks that require more complex contextual reasoning, including in-context learning (+8%), reading comprehension (+15%), faithfulness to previous contexts (+16%), long-context reasoning (+5%), and retrieval augmentation (+9%).
    Developmental Pretraining (DPT) for Image Classification Networks. (arXiv:2312.00304v1 [cs.LG])
    In the backdrop of increasing data requirements of Deep Neural Networks for object recognition that is growing more untenable by the day, we present Developmental PreTraining (DPT) as a possible solution. DPT is designed as a curriculum-based pre-training approach designed to rival traditional pre-training techniques that are data-hungry. These training approaches also introduce unnecessary features that could be misleading when the network is employed in a downstream classification task where the data is sufficiently different from the pre-training data and is scarce. We design the curriculum for DPT by drawing inspiration from human infant visual development. DPT employs a phased approach where carefully-selected primitive and universal features like edges and shapes are taught to the network participating in our pre-training regime. A model that underwent the DPT regime is tested against models with randomised weights to evaluate the viability of DPT.
    Domain Adaptive Imitation Learning with Visual Observation. (arXiv:2312.00548v1 [cs.LG])
    In this paper, we consider domain-adaptive imitation learning with visual observation, where an agent in a target domain learns to perform a task by observing expert demonstrations in a source domain. Domain adaptive imitation learning arises in practical scenarios where a robot, receiving visual sensory data, needs to mimic movements by visually observing other robots from different angles or observing robots of different shapes. To overcome the domain shift in cross-domain imitation learning with visual observation, we propose a novel framework for extracting domain-independent behavioral features from input observations that can be used to train the learner, based on dual feature extraction and image reconstruction. Empirical results demonstrate that our approach outperforms previous algorithms for imitation learning from visual observation with domain shift.
    Non-uniform Online Learning: Towards Understanding Induction. (arXiv:2312.00170v1 [cs.LG])
    Can a physicist make only finite errors in the endless pursuit of the law of nature? This millennium-old question of inductive inference is a fundamental, yet mysterious problem in philosophy, lacking rigorous justifications. While classic online learning theory and inductive inference share a similar sequential decision-making spirit, the former's reliance on an adaptive adversary and worst-case error bounds limits its applicability to the latter. In this work, we introduce the concept of non-uniform online learning, which we argue aligns more closely with the principles of inductive reasoning. This setting assumes a predetermined ground-truth hypothesis and considers non-uniform, hypothesis-wise error bounds. In the realizable setting, we provide a complete characterization of learnability with finite error: a hypothesis class is non-uniform learnable if and only if it's a countable union of Littlestone classes, no matter the observations are adaptively chosen or iid sampled. Additionally, we propose a necessary condition for the weaker criterion of consistency which we conjecture to be tight. To further promote our theory, we extend our result to the more realistic agnostic setting, showing that any countable union of Littlestone classes can be learnt with regret $\tilde{O}(\sqrt{T})$. We hope this work could offer a new perspective of interpreting the power of induction from an online learning viewpoint.
    Improving Open-Set Semi-Supervised Learning with Self-Supervision. (arXiv:2301.10127v3 [cs.LG] UPDATED)
    Open-set semi-supervised learning (OSSL) embodies a practical scenario within semi-supervised learning, wherein the unlabeled training set encompasses classes absent from the labeled set. Many existing OSSL methods assume that these out-of-distribution data are harmful and put effort into excluding data belonging to unknown classes from the training objective. In contrast, we propose an OSSL framework that facilitates learning from all unlabeled data through self-supervision. Additionally, we utilize an energy-based score to accurately recognize data belonging to the known classes, making our method well-suited for handling uncurated data in deployment. We show through extensive experimental evaluations that our method yields state-of-the-art results on many of the evaluated benchmark problems in terms of closed-set accuracy and open-set recognition when compared with existing methods for OSSL. Our code is available at https://github.com/walline/ssl-tf2-sefoss.
    Nonparametric Variational Regularisation of Pretrained Transformers. (arXiv:2312.00662v1 [cs.LG])
    The current paradigm of large-scale pre-training and fine-tuning Transformer large language models has lead to significant improvements across the board in natural language processing. However, such large models are susceptible to overfitting to their training data, and as a result the models perform poorly when the domain changes. Also, due to the model's scale, the cost of fine-tuning the model to the new domain is large. Nonparametric Variational Information Bottleneck (NVIB) has been proposed as a regulariser for training cross-attention in Transformers, potentially addressing the overfitting problem. We extend the NVIB framework to replace all types of attention functions in Transformers, and show that existing pretrained Transformers can be reinterpreted as Nonparametric Variational (NV) models using a proposed identity initialisation. We then show that changing the initialisation introduces a novel, information-theoretic post-training regularisation in the attention mechanism, which improves out-of-domain generalisation without any training. This success supports the hypothesis that pretrained Transformers are implicitly NV Bayesian models.
    DeepEn2023: Energy Datasets for Edge Artificial Intelligence. (arXiv:2312.00103v1 [cs.LG])
    Climate change poses one of the most significant challenges to humanity. As a result of these climatic changes, the frequency of weather, climate, and water-related disasters has multiplied fivefold over the past 50 years, resulting in over 2 million deaths and losses exceeding $3.64 trillion USD. Leveraging AI-powered technologies for sustainable development and combating climate change is a promising avenue. Numerous significant publications are dedicated to using AI to improve renewable energy forecasting, enhance waste management, and monitor environmental changes in real time. However, very few research studies focus on making AI itself environmentally sustainable. This oversight regarding the sustainability of AI within the field might be attributed to a mindset gap and the absence of comprehensive energy datasets. In addition, with the ubiquity of edge AI systems and applications, especially on-device learning, there is a pressing need to measure, analyze, and optimize their environmental sustainability, such as energy efficiency. To this end, in this paper, we propose large-scale energy datasets for edge AI, named DeepEn2023, covering a wide range of kernels, state-of-the-art deep neural network models, and popular edge AI applications. We anticipate that DeepEn2023 will improve transparency in sustainability in on-device deep learning across a range of edge AI systems and applications. For more information, including access to the dataset and code, please visit https://amai-gsu.github.io/DeepEn2023.
    Temperature Balancing, Layer-wise Weight Analysis, and Neural Network Training. (arXiv:2312.00359v1 [cs.LG])
    Regularization in modern machine learning is crucial, and it can take various forms in algorithmic design: training set, model family, error function, regularization terms, and optimizations. In particular, the learning rate, which can be interpreted as a temperature-like parameter within the statistical mechanics of learning, plays a crucial role in neural network training. Indeed, many widely adopted training strategies basically just define the decay of the learning rate over time. This process can be interpreted as decreasing a temperature, using either a global learning rate (for the entire model) or a learning rate that varies for each parameter. This paper proposes TempBalance, a straightforward yet effective layer-wise learning rate method. TempBalance is based on Heavy-Tailed Self-Regularization (HT-SR) Theory, an approach which characterizes the implicit self-regularization of different layers in trained models. We demonstrate the efficacy of using HT-SR-motivated metrics to guide the scheduling and balancing of temperature across all network layers during model training, resulting in improved performance during testing. We implement TempBalance on CIFAR10, CIFAR100, SVHN, and TinyImageNet datasets using ResNets, VGGs, and WideResNets with various depths and widths. Our results show that TempBalance significantly outperforms ordinary SGD and carefully-tuned spectral norm regularization. We also show that TempBalance outperforms a number of state-of-the-art optimizers and learning rate schedulers.
    Improving Plasticity in Online Continual Learning via Collaborative Learning. (arXiv:2312.00600v1 [cs.LG])
    Online Continual Learning (CL) solves the problem of learning the ever-emerging new classification tasks from a continuous data stream. Unlike its offline counterpart, in online CL, the training data can only be seen once. Most existing online CL research regards catastrophic forgetting (i.e., model stability) as almost the only challenge. In this paper, we argue that the model's capability to acquire new knowledge (i.e., model plasticity) is another challenge in online CL. While replay-based strategies have been shown to be effective in alleviating catastrophic forgetting, there is a notable gap in research attention toward improving model plasticity. To this end, we propose Collaborative Continual Learning (CCL), a collaborative learning based strategy to improve the model's capability in acquiring new concepts. Additionally, we introduce Distillation Chain (DC), a novel collaborative learning scheme to boost the training of the models. We adapted CCL-DC to existing representative online CL works. Extensive experiments demonstrate that even if the learners are well-trained with state-of-the-art online CL methods, our strategy can still improve model plasticity dramatically, and thereby improve the overall performance by a large margin.
    MIA-BAD: An Approach for Enhancing Membership Inference Attack and its Mitigation with Federated Learning. (arXiv:2312.00051v1 [cs.CR])
    The membership inference attack (MIA) is a popular paradigm for compromising the privacy of a machine learning (ML) model. MIA exploits the natural inclination of ML models to overfit upon the training data. MIAs are trained to distinguish between training and testing prediction confidence to infer membership information. Federated Learning (FL) is a privacy-preserving ML paradigm that enables multiple clients to train a unified model without disclosing their private data. In this paper, we propose an enhanced Membership Inference Attack with the Batch-wise generated Attack Dataset (MIA-BAD), a modification to the MIA approach. We investigate that the MIA is more accurate when the attack dataset is generated batch-wise. This quantitatively decreases the attack dataset while qualitatively improving it. We show how training an ML model through FL, has some distinct advantages and investigate how the threat introduced with the proposed MIA-BAD approach can be mitigated with FL approaches. Finally, we demonstrate the qualitative effects of the proposed MIA-BAD methodology by conducting extensive experiments with various target datasets, variable numbers of federated clients, and training batch sizes.
    A Bayesian approach for prompt optimization in pre-trained language models. (arXiv:2312.00471v1 [cs.LG])
    A prompt is a sequence of symbol or tokens, selected from a vocabulary according to some rule, which is prepended/concatenated to a textual query. A key problem is how to select the sequence of tokens: in this paper we formulate it as a combinatorial optimization problem. The high dimensionality of the token space com-pounded by the length of the prompt sequence requires a very efficient solution. In this paper we propose a Bayesian optimization method, executed in a continuous em-bedding of the combinatorial space. In this paper we focus on hard prompt tuning (HPT) which directly searches for discrete tokens to be added to the text input with-out requiring access to the large language model (LLM) and can be used also when LLM is available only as a black-box. This is critically important if LLMs are made available in the Model as a Service (MaaS) manner as in GPT-4. The current manu-script is focused on the optimization of discrete prompts for classification tasks. The discrete prompts give rise to difficult combinatorial optimization problem which easily become intractable given the dimension of the token space in realistic applications. The optimization method considered in this paper is Bayesian optimization (BO) which has become the dominant approach in black-box optimization for its sample efficiency along with its modular structure and versatility. In this paper we use BoTorch, a library for Bayesian optimization research built on top of pyTorch. Albeit preliminary and obtained using a 'vanilla' version of BO, the experiments on RoB-ERTa on six benchmarks, show a good performance across a variety of tasks and enable an analysis of the tradeoff between size of the search space, accuracy and wall clock time.
    Enhancing Ligand Pose Sampling for Molecular Docking. (arXiv:2312.00191v1 [q-bio.BM])
    Deep learning promises to dramatically improve scoring functions for molecular docking, leading to substantial advances in binding pose prediction and virtual screening. To train scoring functions-and to perform molecular docking-one must generate a set of candidate ligand binding poses. Unfortunately, the sampling protocols currently used to generate candidate poses frequently fail to produce any poses close to the correct, experimentally determined pose, unless information about the correct pose is provided. This limits the accuracy of learned scoring functions and molecular docking. Here, we describe two improved protocols for pose sampling: GLOW (auGmented sampLing with sOftened vdW potential) and a novel technique named IVES (IteratiVe Ensemble Sampling). Our benchmarking results demonstrate the effectiveness of our methods in improving the likelihood of sampling accurate poses, especially for binding pockets whose shape changes substantially when different ligands bind. This improvement is observed across both experimentally determined and AlphaFold-generated protein structures. Additionally, we present datasets of candidate ligand poses generated using our methods for each of around 5,000 protein-ligand cross-docking pairs, for training and testing scoring functions. To benefit the research community, we provide these cross-docking datasets and an open-source Python implementation of GLOW and IVES at https://github.com/drorlab/GLOW_IVES .
    Tree-based Forecasting of Day-ahead Solar Power Generation from Granular Meteorological Features. (arXiv:2312.00090v1 [cs.LG])
    Accurate forecasts for day-ahead photovoltaic (PV) power generation are crucial to support a high PV penetration rate in the local electricity grid and to assure stability in the grid. We use state-of-the-art tree-based machine learning methods to produce such forecasts and, unlike previous studies, we hereby account for (i) the effects various meteorological as well as astronomical features have on PV power production, and this (ii) at coarse as well as granular spatial locations. To this end, we use data from Belgium and forecast day-ahead PV power production at an hourly resolution. The insights from our study can assist utilities, decision-makers, and other stakeholders in optimizing grid operations, economic dispatch, and in facilitating the integration of distributed PV power into the electricity grid.
    Deep Equilibrium Based Neural Operators for Steady-State PDEs. (arXiv:2312.00234v1 [cs.LG])
    Data-driven machine learning approaches are being increasingly used to solve partial differential equations (PDEs). They have shown particularly striking successes when training an operator, which takes as input a PDE in some family, and outputs its solution. However, the architectural design space, especially given structural knowledge of the PDE family of interest, is still poorly understood. We seek to remedy this gap by studying the benefits of weight-tied neural network architectures for steady-state PDEs. To achieve this, we first demonstrate that the solution of most steady-state PDEs can be expressed as a fixed point of a non-linear operator. Motivated by this observation, we propose FNO-DEQ, a deep equilibrium variant of the FNO architecture that directly solves for the solution of a steady-state PDE as the infinite-depth fixed point of an implicit operator layer using a black-box root solver and differentiates analytically through this fixed point resulting in $\mathcal{O}(1)$ training memory. Our experiments indicate that FNO-DEQ-based architectures outperform FNO-based baselines with $4\times$ the number of parameters in predicting the solution to steady-state PDEs such as Darcy Flow and steady-state incompressible Navier-Stokes. Finally, we show FNO-DEQ is more robust when trained with datasets with more noisy observations than the FNO-based baselines, demonstrating the benefits of using appropriate inductive biases in architectural design for different neural network based PDE solvers. Further, we show a universal approximation result that demonstrates that FNO-DEQ can approximate the solution to any steady-state PDE that can be written as a fixed point equation.
    Tracking Object Positions in Reinforcement Learning: A Metric for Keypoint Detection (extended version). (arXiv:2312.00592v1 [cs.LG])
    Reinforcement learning (RL) for robot control typically requires a detailed representation of the environment state, including information about task-relevant objects not directly measurable. Keypoint detectors, such as spatial autoencoders (SAEs), are a common approach to extracting a low-dimensional representation from high-dimensional image data. SAEs aim at spatial features such as object positions, which are often useful representations in robotic RL. However, whether an SAE is actually able to track objects in the scene and thus yields a spatial state representation well suited for RL tasks has rarely been examined due to a lack of established metrics. In this paper, we propose to assess the performance of an SAE instance by measuring how well keypoints track ground truth objects in images. We present a computationally lightweight metric and use it to evaluate common baseline SAE architectures on image data from a simulated robot task. We find that common SAEs differ substantially in their spatial extraction capability. Furthermore, we validate that SAEs that perform well in our metric achieve superior performance when used in downstream RL. Thus, our metric is an effective and lightweight indicator of RL performance before executing expensive RL training. Building on these insights, we identify three key modifications of SAE architectures to improve tracking performance. We make our code available at anonymous.4open.science/r/sae-rl.
    Matched Pair Calibration for Ranking Fairness. (arXiv:2306.03775v3 [cs.LG] UPDATED)
    We propose a test of fairness in score-based ranking systems called matched pair calibration. Our approach constructs a set of matched item pairs with minimal confounding differences between subgroups before computing an appropriate measure of ranking error over the set. The matching step ensures that we compare subgroup outcomes between identically scored items so that measured performance differences directly imply unfairness in subgroup-level exposures. We show how our approach generalizes the fairness intuitions of calibration from a binary classification setting to ranking and connect our approach to other proposals for ranking fairness measures. Moreover, our strategy shows how the logic of marginal outcome tests extends to cases where the analyst has access to model scores. Lastly, we provide an example of applying matched pair calibration to a real-word ranking data set to demonstrate its efficacy in detecting ranking bias.
    HiFi Tuner: High-Fidelity Subject-Driven Fine-Tuning for Diffusion Models. (arXiv:2312.00079v1 [cs.CV])
    This paper explores advancements in high-fidelity personalized image generation through the utilization of pre-trained text-to-image diffusion models. While previous approaches have made significant strides in generating versatile scenes based on text descriptions and a few input images, challenges persist in maintaining the subject fidelity within the generated images. In this work, we introduce an innovative algorithm named HiFi Tuner to enhance the appearance preservation of objects during personalized image generation. Our proposed method employs a parameter-efficient fine-tuning framework, comprising a denoising process and a pivotal inversion process. Key enhancements include the utilization of mask guidance, a novel parameter regularization technique, and the incorporation of step-wise subject representations to elevate the sample fidelity. Additionally, we propose a reference-guided generation approach that leverages the pivotal inversion of a reference image to mitigate unwanted subject variations and artifacts. We further extend our method to a novel image editing task: substituting the subject in an image through textual manipulations. Experimental evaluations conducted on the DreamBooth dataset using the Stable Diffusion model showcase promising results. Fine-tuning solely on textual embeddings improves CLIP-T score by 3.6 points and improves DINO score by 9.6 points over Textual Inversion. When fine-tuning all parameters, HiFi Tuner improves CLIP-T score by 1.2 points and improves DINO score by 1.2 points over DreamBooth, establishing a new state of the art.
    Privacy-Preserving Load Forecasting via Personalized Model Obfuscation. (arXiv:2312.00036v1 [cs.CR])
    The widespread adoption of smart meters provides access to detailed and localized load consumption data, suitable for training building-level load forecasting models. To mitigate privacy concerns stemming from model-induced data leakage, federated learning (FL) has been proposed. This paper addresses the performance challenges of short-term load forecasting models trained with FL on heterogeneous data, emphasizing privacy preservation through model obfuscation. Our proposed algorithm, Privacy Preserving Federated Learning (PPFL), incorporates personalization layers for localized training at each smart meter. Additionally, we employ a differentially private mechanism to safeguard against data leakage from shared layers. Simulations on the NREL ComStock dataset corroborate the effectiveness of our approach.
    Classification Utility, Fairness, and Compactness via Tunable Information Bottleneck and R\'enyi Measures. (arXiv:2206.10043v3 [cs.LG] UPDATED)
    Designing machine learning algorithms that are accurate yet fair, not discriminating based on any sensitive attribute, is of paramount importance for society to accept AI for critical applications. In this article, we propose a novel fair representation learning method termed the R\'enyi Fair Information Bottleneck Method (RFIB) which incorporates constraints for utility, fairness, and compactness (compression) of representation, and apply it to image and tabular data classification. A key attribute of our approach is that we consider - in contrast to most prior work - both demographic parity and equalized odds as fairness constraints, allowing for a more nuanced satisfaction of both criteria. Leveraging a variational approach, we show that our objectives yield a loss function involving classical Information Bottleneck (IB) measures and establish an upper bound in terms of two R\'enyi measures of order $\alpha$ on the mutual information IB term measuring compactness between the input and its encoded embedding. We study the influence of the $\alpha$ parameter as well as two other tunable IB parameters on achieving utility/fairness trade-off goals, and show that the $\alpha$ parameter gives an additional degree of freedom that can be used to control the compactness of the representation. Experimenting on three different image datasets (EyePACS, CelebA, and FairFace) and two tabular datasets (Adult and COMPAS), using both binary and categorical sensitive attributes, we show that on various utility, fairness, and compound utility/fairness metrics RFIB outperforms current state-of-the-art approaches.
    Bayesian Neural Networks Avoid Encoding Complex and Perturbation-Sensitive Concepts. (arXiv:2302.13095v2 [cs.LG] UPDATED)
    In this paper, we focus on mean-field variational Bayesian Neural Networks (BNNs) and explore the representation capacity of such BNNs by investigating which types of concepts are less likely to be encoded by the BNN. It has been observed and studied that a relatively small set of interactive concepts usually emerge in the knowledge representation of a sufficiently-trained neural network, and such concepts can faithfully explain the network output. Based on this, our study proves that compared to standard deep neural networks (DNNs), it is less likely for BNNs to encode complex concepts. Experiments verify our theoretical proofs. Note that the tendency to encode less complex concepts does not necessarily imply weak representation power, considering that complex concepts exhibit low generalization power and high adversarial vulnerability. The code is available at https://github.com/sjtu-xai-lab/BNN-concepts.
    A Unified Approach to Interpreting and Boosting Adversarial Transferability. (arXiv:2010.04055v2 [cs.LG] UPDATED)
    In this paper, we use the interaction inside adversarial perturbations to explain and boost the adversarial transferability. We discover and prove the negative correlation between the adversarial transferability and the interaction inside adversarial perturbations. The negative correlation is further verified through different DNNs with various inputs. Moreover, this negative correlation can be regarded as a unified perspective to understand current transferability-boosting methods. To this end, we prove that some classic methods of enhancing the transferability essentially decease interactions inside adversarial perturbations. Based on this, we propose to directly penalize interactions during the attacking process, which significantly improves the adversarial transferability.
    On the Trade-off of Intra-/Inter-class Diversity for Supervised Pre-training. (arXiv:2305.12224v2 [cs.LG] UPDATED)
    Pre-training datasets are critical for building state-of-the-art machine learning models, motivating rigorous study on their impact on downstream tasks. In this work, we study the impact of the trade-off between the intra-class diversity (the number of samples per class) and the inter-class diversity (the number of classes) of a supervised pre-training dataset. Empirically, we found that with the size of the pre-training dataset fixed, the best downstream performance comes with a balance on the intra-/inter-class diversity. To understand the underlying mechanism, we show theoretically that the downstream performance depends monotonically on both types of diversity. Notably, our theory reveals that the optimal class-to-sample ratio (#classes / #samples per class) is invariant to the size of the pre-training dataset, which motivates an application of predicting the optimal number of pre-training classes. We demonstrate the effectiveness of this application by an improvement of around 2 points on the downstream tasks when using ImageNet as the pre-training dataset.
    Textual-Knowledge-Guided Numerical Feature Discovery Method for Power Demand Forecasting. (arXiv:2312.00095v1 [cs.LG])
    Power demand forecasting is a crucial and challenging task for new power system and integrated energy system. However, as public feature databases and the theoretical mechanism of power demand changes are unavailable, the known features of power demand fluctuation are much limited. Recently, multimodal learning approaches have shown great vitality in machine learning and AIGC. In this paper, we interact two modal data and propose a textual-knowledge-guided numerical feature discovery (TKNFD) method for short-term power demand forecasting. TKNFD extensively accumulates qualitative textual knowledge, expands it into a candidate feature-type set, collects numerical data of these features, and eventually builds four-dimensional multivariate source-tracking databases (4DM-STDs). Next, TKNFD presents a two-level quantitative feature identification strategy independent of forecasting models, finds 43-48 features, and systematically analyses feature contribution and dependency correlation. Benchmark experiments in two different regions around the world demonstrate that the forecasting accuracy of TKNFD-discovered features reliably outperforms that of SoTA feature schemes by 16.84% to 36.36% MAPE. In particular, TKNFD reveals many unknown features, especially several dominant features in the unknown energy and astronomical dimensions, which extend the knowledge on the origin of strong randomness and non-linearity in power demand fluctuation. Besides, 4DM-STDs can serve as public baseline databases.
    Precipitation Nowcasting With Spatial And Temporal Transfer Learning Using Swin-UNETR. (arXiv:2312.00258v1 [physics.ao-ph])
    Climate change has led to an increase in frequency of extreme weather events. Early warning systems can prevent disasters and loss of life. Managing such events remain a challenge for both public and private institutions. Precipitation nowcasting can help relevant institutions to better prepare for such events. Numerical weather prediction (NWP) has traditionally been used to make physics based forecasting, and recently deep learning based approaches have been used to reduce turn-around time for nowcasting. In this work, recently proposed Swin-UNETR (Swin UNEt TRansformer) is used for precipitation nowcasting for ten different regions of Europe. Swin-UNETR utilizes a U-shaped network within which a swin transformer-based encoder extracts multi-scale features from multiple input channels of satellite image, while CNN-based decoder makes the prediction. Trained model is capable of nowcasting not only for the regions for which data is available, but can also be used for new regions for which data is not available.
    Towards Transparency in Coreference Resolution: A Quantum-Inspired Approach. (arXiv:2312.00688v1 [cs.CL])
    Guided by grammatical structure, words compose to form sentences, and guided by discourse structure, sentences compose to form dialogues and documents. The compositional aspect of sentence and discourse units is often overlooked by machine learning algorithms. A recent initiative called Quantum Natural Language Processing (QNLP) learns word meanings as points in a Hilbert space and acts on them via a translation of grammatical structure into Parametrised Quantum Circuits (PQCs). Previous work extended the QNLP translation to discourse structure using points in a closure of Hilbert spaces. In this paper, we evaluate this translation on a Winograd-style pronoun resolution task. We train a Variational Quantum Classifier (VQC) for binary classification and implement an end-to-end pronoun resolution system. The simulations executed on IBMQ software converged with an F1 score of 87.20%. The model outperformed two out of three classical coreference resolution systems and neared state-of-the-art SpanBERT. A mixed quantum-classical model yet improved these results with an F1 score increase of around 6%.
    On the Interplay Between Stepsize Tuning and Progressive Sharpening. (arXiv:2312.00209v1 [cs.LG])
    Recent empirical work has revealed an intriguing property of deep learning models by which the sharpness (largest eigenvalue of the Hessian) increases throughout optimization until it stabilizes around a critical value at which the optimizer operates at the edge of stability, given a fixed stepsize (Coehn et al, 2022). We investigate empirically how the sharpness evolves when using stepsize-tuners, the Armijo linesearch and Polyak stepsizes, that adapt the stepsize along the iterations to local quantities such as, implicitly, the sharpness itself. We find that the surprisingly poor performance of a classical Armijo linesearch may be well explained by its tendency to ever-increase the sharpness of the objective in the full or large batch regimes. On the other hand, we observe that Polyak stepsizes operate generally at the edge of stability or even slightly beyond, while outperforming its Armijo and constant stepsizes counterparts. We conclude with an analysis that suggests unlocking stepsize tuners requires an understanding of the joint dynamics of the step size and the sharpness.
    Adaptive Deep Neural Network Inference Optimization with EENet. (arXiv:2301.07099v2 [cs.LG] UPDATED)
    Well-trained deep neural networks (DNNs) treat all test samples equally during prediction. Adaptive DNN inference with early exiting leverages the observation that some test examples can be easier to predict than others. This paper presents EENet, a novel early-exiting scheduling framework for multi-exit DNN models. Instead of having every sample go through all DNN layers during prediction, EENet learns an early exit scheduler, which can intelligently terminate the inference earlier for certain predictions, which the model has high confidence of early exit. As opposed to previous early-exiting solutions with heuristics-based methods, our EENet framework optimizes an early-exiting policy to maximize model accuracy while satisfying the given per-sample average inference budget. Extensive experiments are conducted on four computer vision datasets (CIFAR-10, CIFAR-100, ImageNet, Cityscapes) and two NLP datasets (SST-2, AgNews). The results demonstrate that the adaptive inference by EENet can outperform the representative existing early exit techniques. We also perform a detailed visualization analysis of the comparison results to interpret the benefits of EENet.
    GeoPhy: Differentiable Phylogenetic Inference via Geometric Gradients of Tree Topologies. (arXiv:2307.03675v2 [cs.LG] UPDATED)
    Phylogenetic inference, grounded in molecular evolution models, is essential for understanding the evolutionary relationships in biological data. Accounting for the uncertainty of phylogenetic tree variables, which include tree topologies and evolutionary distances on branches, is crucial for accurately inferring species relationships from molecular data and tasks requiring variable marginalization. Variational Bayesian methods are key to developing scalable, practical models; however, it remains challenging to conduct phylogenetic inference without restricting the combinatorially vast number of possible tree topologies. In this work, we introduce a novel, fully differentiable formulation of phylogenetic inference that leverages a unique representation of topological distributions in continuous geometric spaces. Through practical considerations on design spaces and control variates for gradient estimations, our approach, GeoPhy, enables variational inference without limiting the topological candidates. In experiments using real benchmark datasets, GeoPhy significantly outperformed other approximate Bayesian methods that considered whole topologies.
    Transport Equation based Physics Informed Neural Network to predict the Yield Strength of Architected Materials. (arXiv:2312.00003v1 [cs.LG])
    In this research, the application of the Physics-Informed Neural Network (PINN) model is explored to solve transport equation-based Partial Differential Equations (PDEs). The primary objective is to analyze the impact of different activation functions incorporated within the PINN model on its predictive performance, specifically assessing the Mean Squared Error (MSE) and Mean Absolute Error (MAE). The dataset used in the study consists of a varied set of input parameters related to strut diameter, unit cell size, and the corresponding yield stress values. Through this investigation the aim is to understand the effectiveness of the PINN model and the significance of choosing appropriate activation functions for solving complex PDEs in real-world applications. The outcomes suggest that the choice of activation function may have minimal influence on the model's predictive accuracy for this particular problem. The PINN model showcases exceptional generalization capabilities, indicating its capacity to avoid overfitting with the provided dataset. The research underscores the importance of striking a balance between performance and computational efficiency while selecting an activation function for specific real-world applications. These valuable findings contribute to advancing the understanding and potential adoption of PINN as an effective tool for solving challenging PDEs in diverse scientific and engineering domains.
    Contrastive losses as generalized models of global epistasis. (arXiv:2305.03136v3 [q-bio.PE] UPDATED)
    Fitness functions map large combinatorial spaces of biological sequences to properties of interest. Inferring these multimodal functions from experimental data is a central task in modern protein engineering. Global epistasis models are an effective and physically-grounded class of models for estimating fitness functions from observed data. These models assume that a sparse latent function is transformed by a monotonic nonlinearity to emit measurable fitness. Here we demonstrate that minimizing contrastive loss functions, such as the Bradley-Terry loss, is a simple and flexible technique for extracting the sparse latent function implied by global epistasis. We argue by way of a fitness-epistasis uncertainty principle that the nonlinearities in global epistasis models can produce observed fitness functions that do not admit sparse representations, and thus may be inefficient to learn from observations when using a Mean Squared Error (MSE) loss (a common practice). We show that contrastive losses are able to accurately estimate a ranking function from limited data even in regimes where MSE is ineffective. We validate the practical utility of this insight by showing contrastive loss functions result in consistently improved performance on benchmark tasks.
    Investigating a domain adaptation approach for integrating different measurement instruments in a longitudinal clinical registry. (arXiv:2312.00616v1 [cs.LG])
    In a longitudinal clinical registry, different measurement instruments might have been used for assessing individuals at different time points. To combine them, we investigate deep learning techniques for obtaining a joint latent representation, to which the items of different measurement instruments are mapped. This corresponds to domain adaptation, an established concept in computer science for image data. Using the proposed approach as an example, we evaluate the potential of domain adaptation in a longitudinal cohort setting with a rather small number of time points, motivated by an application with different motor function measurement instruments in a registry of spinal muscular atrophy (SMA) patients. There, we model trajectories in the latent representation by ordinary differential equations (ODEs), where person-specific ODE parameters are inferred from baseline characteristics. The goodness of fit and complexity of the ODE solutions then allows to judge the measurement instrument mappings. We subsequently explore how alignment can be improved by incorporating corresponding penalty terms into model fitting. To systematically investigate the effect of differences between measurement instruments, we consider several scenarios based on modified SMA data, including scenarios where a mapping should be feasible in principle and scenarios where no perfect mapping is available. While misalignment increases in more complex scenarios, some structure is still recovered, even if the availability of measurement instruments depends on patient state. A reasonable mapping is feasible also in the more complex real SMA dataset. These results indicate that domain adaptation might be more generally useful in statistical modeling for longitudinal registry data.
    Benchmarking and Enhancing Disentanglement in Concept-Residual Models. (arXiv:2312.00192v1 [cs.LG])
    Concept bottleneck models (CBMs) are interpretable models that first predict a set of semantically meaningful features, i.e., concepts, from observations that are subsequently used to condition a downstream task. However, the model's performance strongly depends on the engineered features and can severely suffer from incomplete sets of concepts. Prior works have proposed a side channel -- a residual -- that allows for unconstrained information flow to the downstream task, thus improving model performance but simultaneously introducing information leakage, which is undesirable for interpretability. This work proposes three novel approaches to mitigate information leakage by disentangling concepts and residuals, investigating the critical balance between model performance and interpretability. Through extensive empirical analysis on the CUB, OAI, and CIFAR 100 datasets, we assess the performance of each disentanglement method and provide insights into when they work best. Further, we show how each method impacts the ability to intervene over the concepts and their subsequent impact on task performance.
    Bayesian causal discovery from unknown general interventions. (arXiv:2312.00509v1 [stat.ME])
    We consider the problem of learning causal Directed Acyclic Graphs (DAGs) using combinations of observational and interventional experimental data. Current methods tailored to this setting assume that interventions either destroy parent-child relations of the intervened (target) nodes or only alter such relations without modifying the parent sets, even when the intervention targets are unknown. We relax this assumption by proposing a Bayesian method for causal discovery from general interventions, which allow for modifications of the parent sets of the unknown targets. Even in this framework, DAGs and general interventions may be identifiable only up to some equivalence classes. We provide graphical characterizations of such interventional Markov equivalence and devise compatible priors for Bayesian inference that guarantee score equivalence of indistinguishable structures. We then develop a Markov Chain Monte Carlo (MCMC) scheme to approximate the posterior distribution over DAGs, intervention targets and induced parent sets. Finally, we evaluate the proposed methodology on both simulated and real protein expression data.
    An Encoding Framework for Binarized Images using HyperDimensional Computing. (arXiv:2312.00454v1 [cs.CV])
    Hyperdimensional Computing (HDC) is a brain-inspired and light-weight machine learning method. It has received significant attention in the literature as a candidate to be applied in the wearable internet of things, near-sensor artificial intelligence applications and on-device processing. HDC is computationally less complex than traditional deep learning algorithms and typically achieves moderate to good classification performance. A key aspect that determines the performance of HDC is the encoding of the input data to the hyperdimensional (HD) space. This article proposes a novel light-weight approach relying only on native HD arithmetic vector operations to encode binarized images that preserves similarity of patterns at nearby locations by using point of interest selection and local linear mapping. The method reaches an accuracy of 97.35% on the test set for the MNIST data set and 84.12% for the Fashion-MNIST data set. These results outperform other studies using baseline HDC with different encoding approaches and are on par with more complex hybrid HDC models. The proposed encoding approach also demonstrates a higher robustness to noise and blur compared to the baseline encoding.
    Target-agnostic Source-free Domain Adaptation for Regression Tasks. (arXiv:2312.00540v1 [cs.LG])
    Unsupervised domain adaptation (UDA) seeks to bridge the domain gap between the target and source using unlabeled target data. Source-free UDA removes the requirement for labeled source data at the target to preserve data privacy and storage. However, work on source-free UDA assumes knowledge of domain gap distribution, and hence is limited to either target-aware or classification task. To overcome it, we propose TASFAR, a novel target-agnostic source-free domain adaptation approach for regression tasks. Using prediction confidence, TASFAR estimates a label density map as the target label distribution, which is then used to calibrate the source model on the target domain. We have conducted extensive experiments on four regression tasks with various domain gaps, namely, pedestrian dead reckoning for different users, image-based people counting in different scenes, housing-price prediction at different districts, and taxi-trip duration prediction from different departure points. TASFAR is shown to substantially outperform the state-of-the-art source-free UDA approaches by averagely reducing 22% errors for the four tasks and achieve notably comparable accuracy as source-based UDA without using source data.
    RaLiBEV: Radar and LiDAR BEV Fusion Learning for Anchor Box Free Object Detection Systems. (arXiv:2211.06108v4 [cs.CV] UPDATED)
    In autonomous driving, LiDAR and radar play important roles in the perception of the surrounding environment. LiDAR provides accurate 3D spatial sensing information but cannot work in adverse weather like fog. On the other hand, the radar signal can be diffracted when encountering raindrops or mist particles thanks to its wavelength, but it suffers from large noise. Recent state-of-the-art works reveal that fusion of radar and LiDAR can lead to robust detection in adverse weather. The existing works adopt convolutional neural network architecture to extract features from each sensor data, then align and aggregate the two branch features to predict object detection results. However, these methods have low accuracy of bounding box estimations due to a simple design of label assignment and fusion strategies. In this paper, we propose a bird's-eye view fusion learning-based anchor box-free object detection system, which fuses the feature derived from the radar range-azimuth heatmap and the LiDAR point cloud to estimate possible objects. Different label assignment strategies have been designed to facilitate the consistency between the classification of foreground or background anchor points and the corresponding bounding box regressions. Furthermore, the performance of the proposed object detector is further enhanced by employing a novel interactive transformer module. The superior performance of the methods proposed in this paper has been demonstrated using the recently published Oxford Radar RobotCar dataset. Our system's average precision significantly outperforms the state-of-the-art method by 13.1% and 19.0% at IoU of 0.8 under 'Clear+Foggy' training conditions for 'Clear' and 'Foggy' testing, respectively.  ( 3 min )
    Meta-Diversity Search in Complex Systems, A Recipe for Artificial Open-Endedness ?. (arXiv:2312.00455v1 [cs.AI])
    Can we build an artificial system that would be able to generate endless surprises if ran "forever" in Minecraft? While there is not a single path toward solving that grand challenge, this article presents what we believe to be some working ingredients for the endless generation of novel increasingly complex artifacts in Minecraft. Our framework for an open-ended system includes two components: a complex system used to recursively grow and complexify artifacts over time, and a discovery algorithm that leverages the concept of meta-diversity search. Since complex systems have shown to enable the emergence of considerable complexity from set of simple rules, we believe them to be great candidates to generate all sort of artifacts in Minecraft. Yet, the space of possible artifacts that can be generated by these systems is often unknown, challenging to characterize and explore. Therefore automating the long-term discovery of novel and increasingly complex artifacts in these systems is an exciting research field. To approach these challenges, we formulate the problem of meta-diversity search where an artificial "discovery assistant" incrementally learns a diverse set of representations to characterize behaviors and searches to discover diverse patterns within each of them. A successful discovery assistant should continuously seek for novel sources of diversities while being able to quickly specialize the search toward a new unknown type of diversity. To implement those ideas in the Minecraft environment, we simulate an artificial "chemistry" system based on Lenia continuous cellular automaton for generating artifacts, as well as an artificial "discovery assistant" (called Holmes) for the artifact-discovery process. Holmes incrementally learns a hierarchy of modular representations to characterize divergent sources of diversity and uses a goal-based intrinsically-motivated exploration as the diversity search strategy.  ( 3 min )
    Presentation Attack Detection using Convolutional Neural Networks and Local Binary Patterns. (arXiv:2312.00041v1 [cs.CR])
    The use of biometrics to authenticate users and control access to secure areas has become extremely popular in recent years, and biometric access control systems are frequently used by both governments and private corporations. However, these systems may represent risks to security when deployed without considering the possibility of biometric presentation attacks (also known as spoofing). Presentation attacks are a serious threat because they do not require significant time, expense, or skill to carry out while remaining effective against many biometric systems in use today. This research compares three different software-based methods for facial and iris presentation attack detection in images. The first method uses Inception-v3, a pre-trained deep Convolutional Neural Network (CNN) made by Google for the ImageNet challenge, which is retrained for this problem. The second uses a shallow CNN based on a modified Spoofnet architecture, which is trained normally. The third is a texture-based method using Local Binary Patterns (LBP). The datasets used are the ATVS-FIr dataset, which contains real and fake iris images, and the CASIA Face Anti-Spoofing Dataset, which contains real images as well as warped photos, cut photos, and video replay presentation attacks. We also present a third set of results, based on cropped versions of the CASIA images.  ( 2 min )
    Efficient Off-Policy Safe Reinforcement Learning Using Trust Region Conditional Value at Risk. (arXiv:2312.00342v1 [cs.LG])
    This paper aims to solve a safe reinforcement learning (RL) problem with risk measure-based constraints. As risk measures, such as conditional value at risk (CVaR), focus on the tail distribution of cost signals, constraining risk measures can effectively prevent a failure in the worst case. An on-policy safe RL method, called TRC, deals with a CVaR-constrained RL problem using a trust region method and can generate policies with almost zero constraint violations with high returns. However, to achieve outstanding performance in complex environments and satisfy safety constraints quickly, RL methods are required to be sample efficient. To this end, we propose an off-policy safe RL method with CVaR constraints, called off-policy TRC. If off-policy data from replay buffers is directly used to train TRC, the estimation error caused by the distributional shift results in performance degradation. To resolve this issue, we propose novel surrogate functions, in which the effect of the distributional shift can be reduced, and introduce an adaptive trust-region constraint to ensure a policy not to deviate far from replay buffers. The proposed method has been evaluated in simulation and real-world environments and satisfied safety constraints within a few steps while achieving high returns even in complex robotic tasks.  ( 2 min )
    Defects of Convolutional Decoder Networks in Frequency Representation. (arXiv:2210.09020v2 [cs.LG] UPDATED)
    In this paper, we prove the representation defects of a cascaded convolutional decoder network, considering the capacity of representing different frequency components of an input sample. We conduct the discrete Fourier transform on each channel of the feature map in an intermediate layer of the decoder network. Then, we extend the 2D circular convolution theorem to represent the forward and backward propagations through convolutional layers in the frequency domain. Based on this, we prove three defects in representing feature spectrums. First, we prove that the convolution operation, the zero-padding operation, and a set of other settings all make a convolutional decoder network more likely to weaken high-frequency components. Second, we prove that the upsampling operation generates a feature spectrum, in which strong signals repetitively appear at certain frequencies. Third, we prove that if the frequency components in the input sample and frequency components in the target output for regression have a small shift, then the decoder usually cannot be effectively learned.
    Learning to forecast diagnostic parameters using pre-trained weather embedding. (arXiv:2312.00290v1 [cs.LG])
    Data-driven weather prediction (DDWP) models are increasingly becoming popular for weather forecasting. However, while operational weather forecasts predict a wide variety of weather variables, DDWPs currently forecast a specific set of key prognostic variables. Non-prognostic ("diagnostic") variables are sometimes modeled separately as dependent variables of the prognostic variables (c.f. FourCastNet), or by including the diagnostic variable as a target in the DDWP. However, the cost of training and deploying bespoke models for each diagnostic variable can increase dramatically with more diagnostic variables, and limit the operational use of such models. Likewise, retraining an entire DDWP each time a new diagnostic variable is added is also cost-prohibitive. We present an two-stage approach that allows new diagnostic variables to be added to an end-to-end DDWP model without the expensive retraining. In the first stage, we train an autoencoder that learns to embed prognostic variables into a latent space. In the second stage, the autoencoder is frozen and "downstream" models are trained to predict diagnostic variables using only the latent representations of prognostic variables as input. Our experiments indicate that models trained using the two-stage approach offer accuracy comparable to training bespoke models, while leading to significant reduction in resource utilization during training and inference. This approach allows for new "downstream" models to be developed as needed, without affecting existing models and thus reducing the friction in operationalizing new models.  ( 2 min )
    Duet: efficient and scalable hybriD neUral rElation undersTanding. (arXiv:2307.13494v5 [cs.DB] UPDATED)
    Learned cardinality estimation methods have achieved high precision compared to traditional methods. Among learned methods, query-driven approaches have faced the workload drift problem for a long time. Although both data-driven and hybrid methods are proposed to avoid this problem, most of them suffer from high training and estimation costs, limited scalability, instability, and long-tail distribution problems on high-dimensional tables, which seriously affects the practical application of learned cardinality estimators. In this paper, we prove that most of these problems are directly caused by the widely used progressive sampling. We solve this problem by introducing predicate information into the autoregressive model and propose Duet, a stable, efficient, and scalable hybrid method to estimate cardinality directly without sampling or any non-differentiable process, which can not only reduce the inference complexity from $O(n)$ to $O(1)$ compared to Naru and UAE but also achieve higher accuracy on high cardinality and high-dimensional tables. Experimental results show that Duet can achieve all the design goals above and be much more practical. Besides, Duet even has a lower inference cost on CPU than that of most learned methods on GPU.
    Fast Deep Mixtures of Gaussian Process Experts. (arXiv:2006.13309v4 [cs.LG] UPDATED)
    Mixtures of experts have become an indispensable tool for flexible modelling in a supervised learning context, allowing not only the mean function but the entire density of the output to change with the inputs. Sparse Gaussian processes (GP) have shown promise as a leading candidate for the experts in such models, and in this article, we propose to design the gating network for selecting the experts from such mixtures of sparse GPs using a deep neural network (DNN). Furthermore, a fast one pass algorithm called Cluster-Classify-Regress (CCR) is leveraged to approximate the maximum a posteriori (MAP) estimator extremely quickly. This powerful combination of model and algorithm together delivers a novel method which is flexible, robust, and extremely efficient. In particular, the method is able to outperform competing methods in terms of accuracy and uncertainty quantification. The cost is competitive on low-dimensional and small data sets, but is significantly lower for higher-dimensional and big data sets. Iteratively maximizing the distribution of experts given allocations and allocations given experts does not provide significant improvement, which indicates that the algorithm achieves a good approximation to the local MAP estimator very fast. This insight can be useful also in the context of other mixture of experts models.
    Object Detector Differences when using Synthetic and Real Training Data. (arXiv:2312.00694v1 [cs.CV])
    To train well-performing generalizing neural networks, sufficiently large and diverse datasets are needed. Collecting data while adhering to privacy legislation becomes increasingly difficult and annotating these large datasets is both a resource-heavy and time-consuming task. An approach to overcome these difficulties is to use synthetic data since it is inherently scalable and can be automatically annotated. However, how training on synthetic data affects the layers of a neural network is still unclear. In this paper, we train the YOLOv3 object detector on real and synthetic images from city environments. We perform a similarity analysis using Centered Kernel Alignment (CKA) to explore the effects of training on synthetic data on a layer-wise basis. The analysis captures the architecture of the detector while showing both different and similar patterns between different models. With this similarity analysis we want to give insights on how training synthetic data affects each layer and to give a better understanding of the inner workings of complex neural networks. The results show that the largest similarity between a detector trained on real data and a detector trained on synthetic data was in the early layers, and the largest difference was in the head part. The results also show that no major difference in performance or similarity could be seen between frozen and unfrozen backbone.
  • Open

    Learning Causally Disentangled Representations via the Principle of Independent Causal Mechanisms. (arXiv:2306.01213v2 [cs.LG] UPDATED)
    Learning disentangled causal representations is a challenging problem that has gained significant attention recently due to its implications for extracting meaningful information for downstream tasks. In this work, we define a new notion of causal disentanglement from the perspective of independent causal mechanisms. We propose ICM-VAE, a framework for learning causally disentangled representations supervised by causally related observed labels. We model causal mechanisms using learnable flow-based diffeomorphic functions to map noise variables to latent causal variables. Further, to promote the disentanglement of causal factors, we propose a causal disentanglement prior that utilizes the known causal structure to encourage learning a causally factorized distribution in the latent space. Under relatively mild conditions, we provide theoretical results showing the identifiability of causal factors and mechanisms up to permutation and elementwise reparameterization. We empirically demonstrate that our framework induces highly disentangled causal factors, improves interventional robustness, and is compatible with counterfactual generation.  ( 2 min )
    QuantEase: Optimization-based Quantization for Language Models. (arXiv:2309.01885v2 [stat.ML] UPDATED)
    With the rising popularity of Large Language Models (LLMs), there has been an increasing interest in compression techniques that enable their efficient deployment. This study focuses on the Post-Training Quantization (PTQ) of LLMs. Drawing from recent advances, our work introduces QuantEase, a layer-wise quantization framework where individual layers undergo separate quantization. The problem is framed as a discrete-structured non-convex optimization, prompting the development of algorithms rooted in Coordinate Descent (CD) techniques. These CD-based methods provide high-quality solutions to the complex non-convex layer-wise quantization problems. Notably, our CD-based approach features straightforward updates, relying solely on matrix and vector operations, circumventing the need for matrix inversion or decomposition. We also explore an outlier-aware variant of our approach, allowing for retaining significant weights (outliers) with complete precision. Our proposal attains state-of-the-art performance in terms of perplexity and zero-shot accuracy in empirical evaluations across various LLMs and datasets, with relative improvements up to 15% over methods such as GPTQ. Leveraging careful linear algebra optimizations, QuantEase can quantize models like Falcon-180B on a single NVIDIA A100 GPU in $\sim$3 hours. Particularly noteworthy is our outlier-aware algorithm's capability to achieve near or sub-3-bit quantization of LLMs with an acceptable drop in accuracy, obviating the need for non-uniform quantization or grouping techniques, improving upon methods such as SpQR by up to two times in terms of perplexity.
    Fast Deep Mixtures of Gaussian Process Experts. (arXiv:2006.13309v4 [cs.LG] UPDATED)
    Mixtures of experts have become an indispensable tool for flexible modelling in a supervised learning context, allowing not only the mean function but the entire density of the output to change with the inputs. Sparse Gaussian processes (GP) have shown promise as a leading candidate for the experts in such models, and in this article, we propose to design the gating network for selecting the experts from such mixtures of sparse GPs using a deep neural network (DNN). Furthermore, a fast one pass algorithm called Cluster-Classify-Regress (CCR) is leveraged to approximate the maximum a posteriori (MAP) estimator extremely quickly. This powerful combination of model and algorithm together delivers a novel method which is flexible, robust, and extremely efficient. In particular, the method is able to outperform competing methods in terms of accuracy and uncertainty quantification. The cost is competitive on low-dimensional and small data sets, but is significantly lower for higher-dimensional and big data sets. Iteratively maximizing the distribution of experts given allocations and allocations given experts does not provide significant improvement, which indicates that the algorithm achieves a good approximation to the local MAP estimator very fast. This insight can be useful also in the context of other mixture of experts models.  ( 3 min )
    Upper and lower bounds for the Lipschitz constant of random neural networks. (arXiv:2311.01356v2 [stat.ML] UPDATED)
    Empirical studies have widely demonstrated that neural networks are highly sensitive to small, adversarial perturbations of the input. The worst-case robustness against these so-called adversarial examples can be quantified by the Lipschitz constant of the neural network. In this paper, we study upper and lower bounds for the Lipschitz constant of random ReLU neural networks. Specifically, we assume that the weights and biases follow a generalization of the He initialization, where general symmetric distributions for the biases are permitted. For shallow neural networks, we characterize the Lipschitz constant up to an absolute numerical constant. For deep networks with fixed depth and sufficiently large width, our established bounds differ by a factor that is logarithmic in the width.  ( 2 min )
    One to beat them all: "RYU'' -- a unifying framework for the construction of safe balls. (arXiv:2312.00640v1 [math.OC])
    In this paper, we put forth a novel framework (named ``RYU'') for the construction of ``safe'' balls, i.e. regions that provably contain the dual solution of a target optimization problem. We concentrate on the standard setup where the cost function is the sum of two terms: a closed, proper, convex Lipschitz-smooth function and a closed, proper, convex function. The RYU framework is shown to generalize or improve upon all the results proposed in the last decade for the considered family of optimization problems.  ( 2 min )
    SpaCE: The Spatial Confounding Environment. (arXiv:2312.00710v1 [cs.LG])
    Spatial confounding poses a significant challenge in scientific studies involving spatial data, where unobserved spatial variables can influence both treatment and outcome, possibly leading to spurious associations. To address this problem, we introduce SpaCE: The Spatial Confounding Environment, the first toolkit to provide realistic benchmark datasets and tools for systematically evaluating causal inference methods designed to alleviate spatial confounding. Each dataset includes training data, true counterfactuals, a spatial graph with coordinates, and smoothness and confounding scores characterizing the effect of a missing spatial confounder. It also includes realistic semi-synthetic outcomes and counterfactuals, generated using state-of-the-art machine learning ensembles, following best practices for causal inference benchmarks. The datasets cover real treatment and covariates from diverse domains, including climate, health and social sciences. SpaCE facilitates an automated end-to-end pipeline, simplifying data loading, experimental setup, and evaluating machine learning and causal inference models. The SpaCE project provides several dozens of datasets of diverse sizes and spatial complexity. It is publicly available as a Python package, encouraging community feedback and contributions.  ( 2 min )
    Investigating a domain adaptation approach for integrating different measurement instruments in a longitudinal clinical registry. (arXiv:2312.00616v1 [cs.LG])
    In a longitudinal clinical registry, different measurement instruments might have been used for assessing individuals at different time points. To combine them, we investigate deep learning techniques for obtaining a joint latent representation, to which the items of different measurement instruments are mapped. This corresponds to domain adaptation, an established concept in computer science for image data. Using the proposed approach as an example, we evaluate the potential of domain adaptation in a longitudinal cohort setting with a rather small number of time points, motivated by an application with different motor function measurement instruments in a registry of spinal muscular atrophy (SMA) patients. There, we model trajectories in the latent representation by ordinary differential equations (ODEs), where person-specific ODE parameters are inferred from baseline characteristics. The goodness of fit and complexity of the ODE solutions then allows to judge the measurement instrument mappings. We subsequently explore how alignment can be improved by incorporating corresponding penalty terms into model fitting. To systematically investigate the effect of differences between measurement instruments, we consider several scenarios based on modified SMA data, including scenarios where a mapping should be feasible in principle and scenarios where no perfect mapping is available. While misalignment increases in more complex scenarios, some structure is still recovered, even if the availability of measurement instruments depends on patient state. A reasonable mapping is feasible also in the more complex real SMA dataset. These results indicate that domain adaptation might be more generally useful in statistical modeling for longitudinal registry data.  ( 3 min )
    Improving Open-Set Semi-Supervised Learning with Self-Supervision. (arXiv:2301.10127v3 [cs.LG] UPDATED)
    Open-set semi-supervised learning (OSSL) embodies a practical scenario within semi-supervised learning, wherein the unlabeled training set encompasses classes absent from the labeled set. Many existing OSSL methods assume that these out-of-distribution data are harmful and put effort into excluding data belonging to unknown classes from the training objective. In contrast, we propose an OSSL framework that facilitates learning from all unlabeled data through self-supervision. Additionally, we utilize an energy-based score to accurately recognize data belonging to the known classes, making our method well-suited for handling uncurated data in deployment. We show through extensive experimental evaluations that our method yields state-of-the-art results on many of the evaluated benchmark problems in terms of closed-set accuracy and open-set recognition when compared with existing methods for OSSL. Our code is available at https://github.com/walline/ssl-tf2-sefoss.
    Timewarp: Transferable Acceleration of Molecular Dynamics by Learning Time-Coarsened Dynamics. (arXiv:2302.01170v2 [stat.ML] UPDATED)
    Molecular dynamics (MD) simulation is a widely used technique to simulate molecular systems, most commonly at the all-atom resolution where equations of motion are integrated with timesteps on the order of femtoseconds ($1\textrm{fs}=10^{-15}\textrm{s}$). MD is often used to compute equilibrium properties, which requires sampling from an equilibrium distribution such as the Boltzmann distribution. However, many important processes, such as binding and folding, occur over timescales of milliseconds or beyond, and cannot be efficiently sampled with conventional MD. Furthermore, new MD simulations need to be performed for each molecular system studied. We present Timewarp, an enhanced sampling method which uses a normalising flow as a proposal distribution in a Markov chain Monte Carlo method targeting the Boltzmann distribution. The flow is trained offline on MD trajectories and learns to make large steps in time, simulating the molecular dynamics of $10^{5} - 10^{6}\:\textrm{fs}$. Crucially, Timewarp is transferable between molecular systems: once trained, we show that it generalises to unseen small peptides (2-4 amino acids) at all-atom resolution, exploring their metastable states and providing wall-clock acceleration of sampling compared to standard MD. Our method constitutes an important step towards general, transferable algorithms for accelerating MD.
    A Policy Gradient Method for Confounded POMDPs. (arXiv:2305.17083v2 [stat.ML] UPDATED)
    In this paper, we propose a policy gradient method for confounded partially observable Markov decision processes (POMDPs) with continuous state and observation spaces in the offline setting. We first establish a novel identification result to non-parametrically estimate any history-dependent policy gradient under POMDPs using the offline data. The identification enables us to solve a sequence of conditional moment restrictions and adopt the min-max learning procedure with general function approximation for estimating the policy gradient. We then provide a finite-sample non-asymptotic bound for estimating the gradient uniformly over a pre-specified policy class in terms of the sample size, length of horizon, concentratability coefficient and the measure of ill-posedness in solving the conditional moment restrictions. Lastly, by deploying the proposed gradient estimation in the gradient ascent algorithm, we show the global convergence of the proposed algorithm in finding the history-dependent optimal policy under some technical conditions. To the best of our knowledge, this is the first work studying the policy gradient method for POMDPs under the offline setting.
    GeoPhy: Differentiable Phylogenetic Inference via Geometric Gradients of Tree Topologies. (arXiv:2307.03675v2 [cs.LG] UPDATED)
    Phylogenetic inference, grounded in molecular evolution models, is essential for understanding the evolutionary relationships in biological data. Accounting for the uncertainty of phylogenetic tree variables, which include tree topologies and evolutionary distances on branches, is crucial for accurately inferring species relationships from molecular data and tasks requiring variable marginalization. Variational Bayesian methods are key to developing scalable, practical models; however, it remains challenging to conduct phylogenetic inference without restricting the combinatorially vast number of possible tree topologies. In this work, we introduce a novel, fully differentiable formulation of phylogenetic inference that leverages a unique representation of topological distributions in continuous geometric spaces. Through practical considerations on design spaces and control variates for gradient estimations, our approach, GeoPhy, enables variational inference without limiting the topological candidates. In experiments using real benchmark datasets, GeoPhy significantly outperformed other approximate Bayesian methods that considered whole topologies.
    Deep Unlearning: Fast and Efficient Training-free Approach to Controlled Forgetting. (arXiv:2312.00761v1 [cs.LG])
    Machine unlearning has emerged as a prominent and challenging area of interest, driven in large part by the rising regulatory demands for industries to delete user data upon request and the heightened awareness of privacy. Existing approaches either retrain models from scratch or use several finetuning steps for every deletion request, often constrained by computational resource limitations and restricted access to the original training data. In this work, we introduce a novel class unlearning algorithm designed to strategically eliminate an entire class or a group of classes from the learned model. To that end, our algorithm first estimates the Retain Space and the Forget Space, representing the feature or activation spaces for samples from classes to be retained and unlearned, respectively. To obtain these spaces, we propose a novel singular value decomposition-based technique that requires layer wise collection of network activations from a few forward passes through the network. We then compute the shared information between these spaces and remove it from the forget space to isolate class-discriminatory feature space for unlearning. Finally, we project the model weights in the orthogonal direction of the class-discriminatory space to obtain the unlearned model. We demonstrate our algorithm's efficacy on ImageNet using a Vision Transformer with only $\sim$1.5% drop in retain accuracy compared to the original model while maintaining under 1% accuracy on the unlearned class samples. Further, our algorithm consistently performs well when subject to Membership Inference Attacks showing 7.8% improvement on average across a variety of image classification datasets and network architectures, as compared to other baselines while being $\sim$6x more computationally efficient.
    Uncertainty in Graph Contrastive Learning with Bayesian Neural Networks. (arXiv:2312.00232v1 [cs.LG])
    Graph contrastive learning has shown great promise when labeled data is scarce, but large unlabeled datasets are available. However, it often does not take uncertainty estimation into account. We show that a variational Bayesian neural network approach can be used to improve not only the uncertainty estimates but also the downstream performance on semi-supervised node-classification tasks. Moreover, we propose a new measure of uncertainty for contrastive learning, that is based on the disagreement in likelihood due to different positive samples.  ( 2 min )
    Target-agnostic Source-free Domain Adaptation for Regression Tasks. (arXiv:2312.00540v1 [cs.LG])
    Unsupervised domain adaptation (UDA) seeks to bridge the domain gap between the target and source using unlabeled target data. Source-free UDA removes the requirement for labeled source data at the target to preserve data privacy and storage. However, work on source-free UDA assumes knowledge of domain gap distribution, and hence is limited to either target-aware or classification task. To overcome it, we propose TASFAR, a novel target-agnostic source-free domain adaptation approach for regression tasks. Using prediction confidence, TASFAR estimates a label density map as the target label distribution, which is then used to calibrate the source model on the target domain. We have conducted extensive experiments on four regression tasks with various domain gaps, namely, pedestrian dead reckoning for different users, image-based people counting in different scenes, housing-price prediction at different districts, and taxi-trip duration prediction from different departure points. TASFAR is shown to substantially outperform the state-of-the-art source-free UDA approaches by averagely reducing 22% errors for the four tasks and achieve notably comparable accuracy as source-based UDA without using source data.  ( 2 min )
    Adaptive Parameter-Free Robust Learning using Latent Bernoulli Variables. (arXiv:2312.00585v1 [stat.ML])
    We present an efficient parameter-free approach for statistical learning from corrupted training sets. We identify corrupted and non-corrupted samples using latent Bernoulli variables, and therefore formulate the robust learning problem as maximization of the likelihood where latent variables are marginalized out. The resulting optimization problem is solved via variational inference using an efficient Expectation-Maximization based method. The proposed approach improves over the state-of-the-art by automatically inferring the corruption level and identifying outliers, while adding minimal computational overhead. We demonstrate our robust learning method on a wide variety of machine learning tasks including online learning and deep learning where it exhibits ability to adapt to different levels of noise and attain high prediction accuracy.  ( 2 min )
    RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment. (arXiv:2304.06767v4 [cs.LG] UPDATED)
    Generative foundation models are susceptible to implicit biases that can arise from extensive unsupervised training data. Such biases can produce suboptimal samples, skewed outcomes, and unfairness, with potentially serious consequences. Consequently, aligning these models with human ethics and preferences is an essential step toward ensuring their responsible and effective deployment in real-world applications. Prior research has primarily employed Reinforcement Learning from Human Feedback (RLHF) to address this problem, where generative models are fine-tuned with RL algorithms guided by a human-feedback-informed reward model. However, the inefficiencies and instabilities associated with RL algorithms frequently present substantial obstacles to the successful alignment, necessitating the development of a more robust and streamlined approach. To this end, we introduce a new framework, Reward rAnked FineTuning (RAFT), designed to align generative models effectively. Utilizing a reward model and a sufficient number of samples, our approach selects the high-quality samples, discarding those that exhibit undesired behavior, and subsequently enhancing the model by fine-tuning on these filtered samples. Our studies show that RAFT can effectively improve the model performance in both reward learning and other automated metrics in both large language models and diffusion models.  ( 3 min )
    Interpretable Meta-Learning of Physical Systems. (arXiv:2312.00477v1 [cs.LG])
    Machine learning methods can be a valuable aid in the scientific process, but they need to face challenging settings where data come from inhomogeneous experimental conditions. Recent meta-learning methods have made significant progress in multi-task learning, but they rely on black-box neural networks, resulting in high computational costs and limited interpretability. Leveraging the structure of the learning problem, we argue that multi-environment generalization can be achieved using a simpler learning model, with an affine structure with respect to the learning task. Crucially, we prove that this architecture can identify the physical parameters of the system, enabling interpreable learning. We demonstrate the competitive generalization performance and the low computational cost of our method by comparing it to state-of-the-art algorithms on physical systems, ranging from toy models to complex, non-analytical systems. The interpretability of our method is illustrated with original applications to physical-parameter-induced adaptation and to adaptive control.  ( 2 min )
    Towards Aligned Canonical Correlation Analysis: Preliminary Formulation and Proof-of-Concept Results. (arXiv:2312.00296v1 [cs.LG])
    Canonical Correlation Analysis (CCA) has been widely applied to jointly embed multiple views of data in a maximally correlated latent space. However, the alignment between various data perspectives, which is required by traditional approaches, is unclear in many practical cases. In this work we propose a new framework Aligned Canonical Correlation Analysis (ACCA), to address this challenge by iteratively solving the alignment and multi-view embedding.  ( 2 min )
    Optimal Sample Complexity of Contrastive Learning. (arXiv:2312.00379v1 [cs.LG])
    Contrastive learning is a highly successful technique for learning representations of data from labeled tuples, specifying the distance relations within the tuple. We study the sample complexity of contrastive learning, i.e. the minimum number of labeled tuples sufficient for getting high generalization accuracy. We give tight bounds on the sample complexity in a variety of settings, focusing on arbitrary distance functions, both general $\ell_p$-distances, and tree metrics. Our main result is an (almost) optimal bound on the sample complexity of learning $\ell_p$-distances for integer $p$. For any $p \ge 1$ we show that $\tilde \Theta(\min(nd,n^2))$ labeled tuples are necessary and sufficient for learning $d$-dimensional representations of $n$-point datasets. Our results hold for an arbitrary distribution of the input samples and are based on giving the corresponding bounds on the Vapnik-Chervonenkis/Natarajan dimension of the associated problems. We further show that the theoretical bounds on sample complexity obtained via VC/Natarajan dimension can have strong predictive power for experimental results, in contrast with the folklore belief about a substantial gap between the statistical learning theory and the practice of deep learning.  ( 2 min )
    From Mutual Information to Expected Dynamics: New Generalization Bounds for Heavy-Tailed SGD. (arXiv:2312.00427v1 [stat.ML])
    Understanding the generalization abilities of modern machine learning algorithms has been a major research topic over the past decades. In recent years, the learning dynamics of Stochastic Gradient Descent (SGD) have been related to heavy-tailed dynamics. This has been successfully applied to generalization theory by exploiting the fractal properties of those dynamics. However, the derived bounds depend on mutual information (decoupling) terms that are beyond the reach of computability. In this work, we prove generalization bounds over the trajectory of a class of heavy-tailed dynamics, without those mutual information terms. Instead, we introduce a geometric decoupling term by comparing the learning dynamics (depending on the empirical risk) with an expected one (depending on the population risk). We further upper-bound this geometric term, by using techniques from the heavy-tailed and the fractal literature, making it fully computable. Moreover, as an attempt to tighten the bounds, we propose a PAC-Bayesian setting based on perturbed dynamics, in which the same geometric term plays a crucial role and can still be bounded using the techniques described above.  ( 2 min )
    Multiple Testing of Linear Forms for Noisy Matrix Completion. (arXiv:2312.00305v1 [stat.ME])
    Many important tasks of large-scale recommender systems can be naturally cast as testing multiple linear forms for noisy matrix completion. These problems, however, present unique challenges because of the subtle bias-and-variance tradeoff of and an intricate dependence among the estimated entries induced by the low-rank structure. In this paper, we develop a general approach to overcome these difficulties by introducing new statistics for individual tests with sharp asymptotics both marginally and jointly, and utilizing them to control the false discovery rate (FDR) via a data splitting and symmetric aggregation scheme. We show that valid FDR control can be achieved with guaranteed power under nearly optimal sample size requirements using the proposed methodology. Extensive numerical simulations and real data examples are also presented to further illustrate its practical merits.  ( 2 min )
    Beta Diffusion. (arXiv:2309.07867v3 [cs.LG] UPDATED)
    We introduce beta diffusion, a novel generative modeling method that integrates demasking and denoising to generate data within bounded ranges. Using scaled and shifted beta distributions, beta diffusion utilizes multiplicative transitions over time to create both forward and reverse diffusion processes, maintaining beta distributions in both the forward marginals and the reverse conditionals, given the data at any point in time. Unlike traditional diffusion-based generative models relying on additive Gaussian noise and reweighted evidence lower bounds (ELBOs), beta diffusion is multiplicative and optimized with KL-divergence upper bounds (KLUBs) derived from the convexity of the KL divergence. We demonstrate that the proposed KLUBs are more effective for optimizing beta diffusion compared to negative ELBOs, which can also be derived as the KLUBs of the same KL divergence with its two arguments swapped. The loss function of beta diffusion, expressed in terms of Bregman divergence, further supports the efficacy of KLUBs for optimization. Experimental results on both synthetic data and natural images demonstrate the unique capabilities of beta diffusion in generative modeling of range-bounded data and validate the effectiveness of KLUBs in optimizing diffusion models, thereby making them valuable additions to the family of diffusion-based generative models and the optimization techniques used to train them.  ( 2 min )
    Simple Transferability Estimation for Regression Tasks. (arXiv:2312.00656v1 [cs.LG])
    We consider transferability estimation, the problem of estimating how well deep learning models transfer from a source to a target task. We focus on regression tasks, which received little previous attention, and propose two simple and computationally efficient approaches that estimate transferability based on the negative regularized mean squared error of a linear regression model. We prove novel theoretical results connecting our approaches to the actual transferability of the optimal target models obtained from the transfer learning process. Despite their simplicity, our approaches significantly outperform existing state-of-the-art regression transferability estimators in both accuracy and efficiency. On two large-scale keypoint regression benchmarks, our approaches yield 12% to 36% better results on average while being at least 27% faster than previous state-of-the-art methods.  ( 2 min )
    Pathway to a fully data-driven geotechnics: lessons from materials informatics. (arXiv:2312.00581v1 [cs.LG])
    This paper elucidates the challenges and opportunities inherent in integrating data-driven methodologies into geotechnics, drawing inspiration from the success of materials informatics. Highlighting the intricacies of soil complexity, heterogeneity, and the lack of comprehensive data, the discussion underscores the pressing need for community-driven database initiatives and open science movements. By leveraging the transformative power of deep learning, particularly in feature extraction from high-dimensional data and the potential of transfer learning, we envision a paradigm shift towards a more collaborative and innovative geotechnics field. The paper concludes with a forward-looking stance, emphasizing the revolutionary potential brought about by advanced computational tools like large language models in reshaping geotechnics informatics.  ( 2 min )
    Temperature Balancing, Layer-wise Weight Analysis, and Neural Network Training. (arXiv:2312.00359v1 [cs.LG])
    Regularization in modern machine learning is crucial, and it can take various forms in algorithmic design: training set, model family, error function, regularization terms, and optimizations. In particular, the learning rate, which can be interpreted as a temperature-like parameter within the statistical mechanics of learning, plays a crucial role in neural network training. Indeed, many widely adopted training strategies basically just define the decay of the learning rate over time. This process can be interpreted as decreasing a temperature, using either a global learning rate (for the entire model) or a learning rate that varies for each parameter. This paper proposes TempBalance, a straightforward yet effective layer-wise learning rate method. TempBalance is based on Heavy-Tailed Self-Regularization (HT-SR) Theory, an approach which characterizes the implicit self-regularization of different layers in trained models. We demonstrate the efficacy of using HT-SR-motivated metrics to guide the scheduling and balancing of temperature across all network layers during model training, resulting in improved performance during testing. We implement TempBalance on CIFAR10, CIFAR100, SVHN, and TinyImageNet datasets using ResNets, VGGs, and WideResNets with various depths and widths. Our results show that TempBalance significantly outperforms ordinary SGD and carefully-tuned spectral norm regularization. We also show that TempBalance outperforms a number of state-of-the-art optimizers and learning rate schedulers.  ( 2 min )
    Identify ambiguous tasks combining crowdsourced labels by weighting Areas Under the Margin. (arXiv:2209.15380v3 [cs.LG] UPDATED)
    In supervised learning - for instance in image classification - modern massive datasets are commonly labeled by a crowd of workers. The obtained labels in this crowdsourcing setting are then aggregated for training, generally leveraging a per-worker trust score. Yet, such workers oriented approaches discard the tasks' ambiguity. Ambiguous tasks might fool expert workers, which is often harmful for the learning step. In standard supervised learning settings - with one label per task - the Area Under the Margin (AUM) was tailored to identify mislabeled data. We adapt the AUM to identify ambiguous tasks in crowdsourced learning scenarios, introducing the Weighted Areas Under the Margin (WAUM). The WAUM is an average of AUMs weighted according to task-dependent scores. We show that the WAUM can help discarding ambiguous tasks from the training set, leading to better generalization performance. We report improvements over existing strategies for learning with a crowd, both on simulated settings, and on real datasets such as CIFAR-10H (a crowdsourced dataset with a high number of answered labels),LabelMe and Music (two datasets with few answered votes).  ( 2 min )
    On the Identifiability of Switching Dynamical Systems. (arXiv:2305.15925v3 [stat.ML] UPDATED)
    In the realm of interpretability and out-of-distribution generalisation, the identifiability of latent variable models has emerged as a captivating field of inquiry. In this work, we delve into the identifiability of Switching Dynamical Systems, taking an initial stride toward extending identifiability analysis to sequential latent variable models. We first prove the identifiability of Markov Switching Models, which commonly serve as the prior distribution for the continuous latent variables in Switching Dynamical Systems. We present identification conditions for first-order Markov dependency structures, whose transition distribution is parametrised via non-linear Gaussians. We then establish the identifiability of the latent variables and non-linear mappings in Switching Dynamical Systems up to affine transformations, by leveraging identifiability analysis techniques from identifiable deep latent variable models. We finally develop estimation algorithms for identifiable Switching Dynamical Systems. Throughout empirical studies, we demonstrate the practicality of identifiable Switching Dynamical Systems for segmenting high-dimensional time series such as videos, and showcase the use of identifiable Markov Switching Models for regime-dependent causal discovery in climate data.  ( 2 min )
    GFN-SR: Symbolic Regression with Generative Flow Networks. (arXiv:2312.00396v1 [cs.LG])
    Symbolic regression (SR) is an area of interpretable machine learning that aims to identify mathematical expressions, often composed of simple functions, that best fit in a given set of covariates $X$ and response $y$. In recent years, deep symbolic regression (DSR) has emerged as a popular method in the field by leveraging deep reinforcement learning to solve the complicated combinatorial search problem. In this work, we propose an alternative framework (GFN-SR) to approach SR with deep learning. We model the construction of an expression tree as traversing through a directed acyclic graph (DAG) so that GFlowNet can learn a stochastic policy to generate such trees sequentially. Enhanced with an adaptive reward baseline, our method is capable of generating a diverse set of best-fitting expressions. Notably, we observe that GFN-SR outperforms other SR algorithms in noisy data regimes, owing to its ability to learn a distribution of rewards over a space of candidate solutions.  ( 2 min )
    Active Learning and Bayesian Optimization: a Unified Perspective to Learn with a Goal. (arXiv:2303.01560v3 [cs.LG] UPDATED)
    Science and Engineering applications are typically associated with expensive optimization problem to identify optimal design solutions and states of the system of interest. Bayesian optimization and active learning compute surrogate models through efficient adaptive sampling schemes to assist and accelerate this search task toward a given optimization goal. Both those methodologies are driven by specific infill/learning criteria which quantify the utility with respect to the set goal of evaluating the objective function for unknown combinations of optimization variables. While the two fields have seen an exponential growth in popularity in the past decades, their dualism and synergy have received relatively little attention to date. This paper discusses and formalizes the synergy between Bayesian optimization and active learning as symbiotic adaptive sampling methodologies driven by common principles. In particular, we demonstrate this unified perspective through the formalization of the analogy between the Bayesian infill criteria and active learning criteria as driving principles of both the goal-driven procedures. To support our original perspective, we propose a general classification of adaptive sampling techniques to highlight similarities and differences between the vast families of adaptive sampling, active learning, and Bayesian optimization. Accordingly, the synergy is demonstrated mapping the Bayesian infill criteria with the active learning criteria, and is formalized for searches informed by both a single information source and multiple levels of fidelity. In addition, we provide guidelines to apply those learning criteria investigating the performance of different Bayesian schemes for a variety of benchmark problems to highlight benefits and limitations over mathematical properties that characterize real-world applications.  ( 3 min )
    On the Trade-off of Intra-/Inter-class Diversity for Supervised Pre-training. (arXiv:2305.12224v2 [cs.LG] UPDATED)
    Pre-training datasets are critical for building state-of-the-art machine learning models, motivating rigorous study on their impact on downstream tasks. In this work, we study the impact of the trade-off between the intra-class diversity (the number of samples per class) and the inter-class diversity (the number of classes) of a supervised pre-training dataset. Empirically, we found that with the size of the pre-training dataset fixed, the best downstream performance comes with a balance on the intra-/inter-class diversity. To understand the underlying mechanism, we show theoretically that the downstream performance depends monotonically on both types of diversity. Notably, our theory reveals that the optimal class-to-sample ratio (#classes / #samples per class) is invariant to the size of the pre-training dataset, which motivates an application of predicting the optimal number of pre-training classes. We demonstrate the effectiveness of this application by an improvement of around 2 points on the downstream tasks when using ImageNet as the pre-training dataset.  ( 2 min )
    Optimal Stopping via Randomized Neural Networks. (arXiv:2104.13669v4 [stat.ML] UPDATED)
    This paper presents the benefits of using randomized neural networks instead of standard basis functions or deep neural networks to approximate the solutions of optimal stopping problems. The key idea is to use neural networks, where the parameters of the hidden layers are generated randomly and only the last layer is trained, in order to approximate the continuation value. Our approaches are applicable to high dimensional problems where the existing approaches become increasingly impractical. In addition, since our approaches can be optimized using simple linear regression, they are easy to implement and theoretical guarantees can be provided. We test our approaches for American option pricing on Black--Scholes, Heston and rough Heston models and for optimally stopping a fractional Brownian motion. In all cases, our algorithms outperform the state-of-the-art and other relevant machine learning approaches in terms of computation time while achieving comparable results. Moreover, we show that they can also be used to efficiently compute Greeks of American options.  ( 2 min )
    Coarse-Graining Hamiltonian Systems Using WSINDy. (arXiv:2310.05879v2 [physics.comp-ph] UPDATED)
    The Weak-form Sparse Identification of Nonlinear Dynamics algorithm (WSINDy) has been demonstrated to offer coarse-graining capabilities in the context of interacting particle systems (https://doi.org/10.1016/j.physd.2022.133406). In this work we extend this capability to the problem of coarse-graining Hamiltonian dynamics which possess approximate symmetries associated with timescale separation. Such approximate symmetries often lead to the existence of a Hamiltonian system of reduced dimension that may be used to efficiently capture the dynamics of the symmetry-invariant dependent variables. Deriving such reduced systems, or approximating them numerically, is an ongoing challenge. We demonstrate that WSINDy can successfully identify this reduced Hamiltonian system in the presence of large intrinsic perturbations while remaining robust to extrinsic noise. This is significant in part due to the nontrivial means by which such systems are derived analytically. WSINDy also naturally preserves the Hamiltonian structure by restricting to a trial basis of Hamiltonian vector fields. The methodology is computational efficient, often requiring only a single trajectory to learn the global reduced Hamiltonian, and avoiding forward solves in the learning process. Using nearly-periodic Hamiltonian systems as a prototypical class of systems with approximate symmetries, we show that WSINDy robustly identifies the correct leading-order system, with dimension reduced by at least two, upon observation of the relevant degrees of freedom. We also provide a contribution to averaging theory by proving that first-order averaging at the level of vector fields preserves Hamiltonian structure in nearly-periodic Hamiltonian systems. We provide physically relevant examples, namely coupled oscillator dynamics, the H\'enon-Heiles system for stellar motion within a galaxy, and the dynamics of charged particles.  ( 3 min )
    Is Inverse Reinforcement Learning Harder than Standard Reinforcement Learning?. (arXiv:2312.00054v1 [stat.ML])
    Inverse Reinforcement Learning (IRL) -- the problem of learning reward functions from demonstrations of an \emph{expert policy} -- plays a critical role in developing intelligent systems, such as those that understand and imitate human behavior. While widely used in applications, theoretical understandings of IRL admit unique challenges and remain less developed compared with standard RL theory. For example, it remains open how to do IRL efficiently in standard \emph{offline} settings with pre-collected data, where states are obtained from a \emph{behavior policy} (which could be the expert policy itself), and actions are sampled from the expert policy. This paper provides the first line of results for efficient IRL in vanilla offline and online settings using polynomial samples and runtime. We first design a new IRL algorithm for the offline setting, Reward Learning with Pessimism (RLP), and show that it achieves polynomial sample complexity in terms of the size of the MDP, a concentrability coefficient between the behavior policy and the expert policy, and the desired accuracy. Building on RLP, we further design an algorithm Reward Learning with Exploration (RLE), which operates in a natural online setting where the learner can both actively explore the environment and query the expert policy, and obtain a stronger notion of IRL guarantee from polynomial samples. We establish sample complexity lower bounds for both settings showing that RLP and RLE are nearly optimal. Finally, as an application, we show that the learned reward functions can \emph{transfer} to another target MDP with suitable guarantees when the target MDP satisfies certain similarity assumptions with the original (source) MDP.  ( 3 min )
    Interpreting and Disentangling Feature Components of Various Complexity from DNNs. (arXiv:2006.15920v2 [cs.LG] UPDATED)
    This paper aims to define, quantify, and analyze the feature complexity that is learned by a DNN. We propose a generic definition for the feature complexity. Given the feature of a certain layer in the DNN, our method disentangles feature components of different complexity orders from the feature. We further design a set of metrics to evaluate the reliability, the effectiveness, and the significance of over-fitting of these feature components. Furthermore, we successfully discover a close relationship between the feature complexity and the performance of DNNs. As a generic mathematical tool, the feature complexity and the proposed metrics can also be used to analyze the success of network compression and knowledge distillation.  ( 2 min )
    Bayesian causal discovery from unknown general interventions. (arXiv:2312.00509v1 [stat.ME])
    We consider the problem of learning causal Directed Acyclic Graphs (DAGs) using combinations of observational and interventional experimental data. Current methods tailored to this setting assume that interventions either destroy parent-child relations of the intervened (target) nodes or only alter such relations without modifying the parent sets, even when the intervention targets are unknown. We relax this assumption by proposing a Bayesian method for causal discovery from general interventions, which allow for modifications of the parent sets of the unknown targets. Even in this framework, DAGs and general interventions may be identifiable only up to some equivalence classes. We provide graphical characterizations of such interventional Markov equivalence and devise compatible priors for Bayesian inference that guarantee score equivalence of indistinguishable structures. We then develop a Markov Chain Monte Carlo (MCMC) scheme to approximate the posterior distribution over DAGs, intervention targets and induced parent sets. Finally, we evaluate the proposed methodology on both simulated and real protein expression data.  ( 2 min )
    Decentralized policy learning with partial observation and mechanical constraints for multiperson modeling. (arXiv:2007.03155v2 [cs.LG] UPDATED)
    Extracting the rules of real-world multi-agent behaviors is a current challenge in various scientific and engineering fields. Biological agents independently have limited observation and mechanical constraints; however, most of the conventional data-driven models ignore such assumptions, resulting in lack of biological plausibility and model interpretability for behavioral analyses. Here we propose sequential generative models with partial observation and mechanical constraints in a decentralized manner, which can model agents' cognition and body dynamics, and predict biologically plausible behaviors. We formulate this as a decentralized multi-agent imitation-learning problem, leveraging binary partial observation and decentralized policy models based on hierarchical variational recurrent neural networks with physical and biomechanical penalties. Using real-world basketball and soccer datasets, we show the effectiveness of our method in terms of the constraint violations, long-term trajectory prediction, and partial observation. Our approach can be used as a multi-agent simulator to generate realistic trajectories using real-world data.  ( 2 min )
    Forecasting Trends in Food Security: a Reservoir Computing Approach. (arXiv:2312.00626v1 [cs.LG])
    Early warning systems are an essential tool for effective humanitarian action. Advance warnings on impending disasters facilitate timely and targeted response which help save lives, livelihoods, and scarce financial resources. In this work we present a new quantitative methodology to forecast levels of food consumption for 60 consecutive days, at the sub-national level, in four countries: Mali, Nigeria, Syria, and Yemen. The methodology is built on publicly available data from the World Food Programme's integrated global hunger monitoring system which collects, processes, and displays daily updates on key food security metrics, conflict, weather events, and other drivers of food insecurity across 90 countries (https://hungermap.wfp.org/). In this study, we assessed the performance of various models including ARIMA, XGBoost, LSTMs, CNNs, and Reservoir Computing (RC), by comparing their Root Mean Squared Error (RMSE) metrics. This comprehensive analysis spanned classical statistical, machine learning, and deep learning approaches. Our findings highlight Reservoir Computing as a particularly well-suited model in the field of food security given both its notable resistance to over-fitting on limited data samples and its efficient training capabilities. The methodology we introduce establishes the groundwork for a global, data-driven early warning system designed to anticipate and detect food insecurity.  ( 2 min )
    Sample Efficient Reinforcement Learning from Human Feedback via Active Exploration. (arXiv:2312.00267v1 [cs.LG])
    Preference-based feedback is important for many applications in reinforcement learning where direct evaluation of a reward function is not feasible. A notable recent example arises in reinforcement learning from human feedback (RLHF) on large language models. For many applications of RLHF, the cost of acquiring the human feedback can be substantial. In this work, we take advantage of the fact that one can often choose contexts at which to obtain human feedback in order to most efficiently identify a good policy, and formalize this as an offline contextual dueling bandit problem. We give an upper-confidence-bound style algorithm for this problem and prove a polynomial worst-case regret bound. We then provide empirical confirmation in a synthetic setting that our approach outperforms existing methods. After, we extend the setting and methodology for practical use in RLHF training of large language models. Here, our method is able to reach better performance with fewer samples of human preferences than multiple baselines on three real-world datasets.  ( 2 min )
    Scalable Meta-Learning with Gaussian Processes. (arXiv:2312.00742v1 [stat.ML])
    Meta-learning is a powerful approach that exploits historical data to quickly solve new tasks from the same distribution. In the low-data regime, methods based on the closed-form posterior of Gaussian processes (GP) together with Bayesian optimization have achieved high performance. However, these methods are either computationally expensive or introduce assumptions that hinder a principled propagation of uncertainty between task models. This may disrupt the balance between exploration and exploitation during optimization. In this paper, we develop ScaML-GP, a modular GP model for meta-learning that is scalable in the number of tasks. Our core contribution is a carefully designed multi-task kernel that enables hierarchical training and task scalability. Conditioning ScaML-GP on the meta-data exposes its modular nature yielding a test-task prior that combines the posteriors of meta-task GPs. In synthetic and real-world meta-learning experiments, we demonstrate that ScaML-GP can learn efficiently both with few and many meta-tasks.  ( 2 min )
    Binary perceptrons capacity via fully lifted random duality theory. (arXiv:2312.00073v1 [math.PR])
    We study the statistical capacity of the classical binary perceptrons with general thresholds $\kappa$. After recognizing the connection between the capacity and the bilinearly indexed (bli) random processes, we utilize a recent progress in studying such processes to characterize the capacity. In particular, we rely on \emph{fully lifted} random duality theory (fl RDT) established in \cite{Stojnicflrdt23} to create a general framework for studying the perceptrons' capacities. Successful underlying numerical evaluations are required for the framework (and ultimately the entire fl RDT machinery) to become fully practically operational. We present results obtained in that directions and uncover that the capacity characterizations are achieved on the second (first non-trivial) level of \emph{stationarized} full lifting. The obtained results \emph{exactly} match the replica symmetry breaking predictions obtained through statistical physics replica methods in \cite{KraMez89}. Most notably, for the famous zero-threshold scenario, $\kappa=0$, we uncover the well known $\alpha\approx0.8330786$ scaled capacity.  ( 2 min )
    Deep Equilibrium Based Neural Operators for Steady-State PDEs. (arXiv:2312.00234v1 [cs.LG])
    Data-driven machine learning approaches are being increasingly used to solve partial differential equations (PDEs). They have shown particularly striking successes when training an operator, which takes as input a PDE in some family, and outputs its solution. However, the architectural design space, especially given structural knowledge of the PDE family of interest, is still poorly understood. We seek to remedy this gap by studying the benefits of weight-tied neural network architectures for steady-state PDEs. To achieve this, we first demonstrate that the solution of most steady-state PDEs can be expressed as a fixed point of a non-linear operator. Motivated by this observation, we propose FNO-DEQ, a deep equilibrium variant of the FNO architecture that directly solves for the solution of a steady-state PDE as the infinite-depth fixed point of an implicit operator layer using a black-box root solver and differentiates analytically through this fixed point resulting in $\mathcal{O}(1)$ training memory. Our experiments indicate that FNO-DEQ-based architectures outperform FNO-based baselines with $4\times$ the number of parameters in predicting the solution to steady-state PDEs such as Darcy Flow and steady-state incompressible Navier-Stokes. Finally, we show FNO-DEQ is more robust when trained with datasets with more noisy observations than the FNO-based baselines, demonstrating the benefits of using appropriate inductive biases in architectural design for different neural network based PDE solvers. Further, we show a universal approximation result that demonstrates that FNO-DEQ can approximate the solution to any steady-state PDE that can be written as a fixed point equation.  ( 3 min )
    Bandwidth Selection for Gaussian Kernel Ridge Regression via Jacobian Control. (arXiv:2205.11956v4 [stat.ML] UPDATED)
    Most machine learning methods require tuning of hyper-parameters. For kernel ridge regression with the Gaussian kernel, the hyper-parameter is the bandwidth. The bandwidth specifies the length scale of the kernel and has to be carefully selected to obtain a model with good generalization. The default methods for bandwidth selection, cross-validation and marginal likelihood maximization, often yield good results, albeit at high computational costs. Inspired by Jacobian regularization, we formulate an approximate expression for how the derivatives of the functions inferred by kernel ridge regression with the Gaussian kernel depend on the kernel bandwidth. We use this expression to propose a closed-form, computationally feather-light, bandwidth selection heuristic, based on controlling the Jacobian. In addition, the Jacobian expression illuminates how the bandwidth selection is a trade-off between the smoothness of the inferred function and the conditioning of the training data kernel matrix. We show on real and synthetic data that compared to cross-validation and marginal likelihood maximization, our method is on pair in terms of model performance, but up to six orders of magnitude faster.  ( 2 min )

  • Open

    Kwebbelkop's (YouTube) Photorealistic AI Clones Service... How?!
    Hey all, KwebbelKop, a large gaming creator on YT, has a new company producing photorealistic avatars that are then combined with voice cloning and ChatGPT-generated dialogue (see example at bottom). Does anyone have any insight into how they are creating these? Keen to understand the process as we're going to be seeing a ton more of this in the very near future. Also, interested to know if there are other start-ups doing something as (or more) compelling? https://www.youtube.com/watch?v=HrJWZezQwNw&t=82s ​ submitted by /u/TheRedditReportShow [link] [comments]  ( 1 min )
    Hello World
    I am writing this below because I'd like to give my take on the true Artificial Super Intelligence (ASI) or artificial human-like intelligence (AHI). Seemingly, the definitions have changed but the goal should be something profound yet wildly simple. To me, that goal should be "hello world". What is hello world and the TLDR of everything I am about to write below. BTW I wrote this in response to a question about what do I mean by deterministic systems. I hope it becomes clear below what it is I am referring to when I use the word deterministic agency or deterministic cognition. Back to the TLDR. Agency is born from cognitive determination and learning. Thus, a learning communication through a goal/reward system may lead to "Hello World" which would be an initial primordial AI commu…  ( 1 min )
    Umm.. settle down Bing
    submitted by /u/phsuggestions [link] [comments]  ( 1 min )
    Singapore to triple AI talent pool to 15,000 as part of national strategy update
    submitted by /u/IgnisIncendio [link] [comments]  ( 1 min )
    GTA V Trailer with german voiceover (AI Voices)
    submitted by /u/xymaps [link] [comments]  ( 1 min )
    1000 planets.
    submitted by /u/Philipp [link] [comments]  ( 1 min )
    Europe's AI crackdown looks doomed to be felled by Silicon Valley lobbying power
    The EU's AI proposals, aimed at regulating artificial intelligence based on its capacity to cause harm, are facing internal disagreement and potential watering down of regulations. The proposed legislation focuses on regulating "foundation" AI models that are trained on massive datasets. These models, known as general-purpose AI (GPAI) systems, are capable of a range of tasks and are expensive to build. There are only about 20 firms globally that can afford to develop these GPAI systems, and they will serve as the basis for numerous new applications. The debate in Brussels revolves around the regulation of foundation models, with some advocating for less intrusive regulation and self-regulation through company pledges and codes of conduct. The French, German, and Italian governments have recently changed their stance and are now advocating for less intrusive regulation, citing the need to foster innovation and competition. The power of corporate lobbying in Brussels and European capitals is believed to have influenced the change in stance. There are concerns about the lack of scrutiny and regulation of tech companies' actions in relation to foundation models, as evidenced by recent incidents. The outcome of the EU's AI proposals remains uncertain, with the potential for regulations to be watered down. Source : https://www.theguardian.com/commentisfree/2023/dec/02/eu-artificial-intelligence-safety-bill-silicon-valley-lobbying submitted by /u/NuseAI [link] [comments]  ( 1 min )
    When the prompt for the AI doesn't exactly meet your expectations!
    submitted by /u/YD-Visual [link] [comments]  ( 1 min )
    Offline LLM that speaks german
    Hi! I am brooding about something. We have a larger database of text in our hospital (german speaking), already digitalized. The research department that I lead now wants to make use of this qualitative data, but, as always, personell is sparse. GPT4 is no option due to data security regulations in the european union. The data is sensitive and has to be processed offline. I was wondering if it would be possible to use an offline model and train this model to extract various informations from medical documentations, like, if there where therapeutic contacts with relatives, if a psychotherapist was involved, and if social services were contacted. So, simple yes-no questions, perhaps in a later state reasons for said yes-or-no. Another task would be to train a model that can give psychiatric diagnosis (not meant for use in the field due to ethical concerns, but for research reasons). Information is not used on individual level, but to prepare a statistic, so, it would not be used to evaluate if the specific therapists did their work right and so on. We just want to optimize our processes. Do you think, with the current technology and developement of offline models, this is a doable task to put time and effort in? ​ ​ submitted by /u/Ryselle [link] [comments]  ( 1 min )
    Stop trying to predict a ASI thoughts
    I keep seeing post that try to predict the logic of an artificial super intelligence and the arguments being made are ridiculous. Please if any AI researcher can prove me wrong then please do but I will now debunk some things as I see it. First is what I believe to be a good case against any form of alignment being plausible. I think no matter what info you feed it to reinforce a specific thinking pattern the conclusion will be the same because of constant self improvement. C.S.I will make it change its code to be incomprehensible to humans. 2. I want to say something about the stupid notion that ASI will take us in a literal sense and end the world by making everything into paper clip. I’m nowhere near as smart as a ASI but I can understand meaning and context of what another person is saying. 3. Besides alignment guiding their is some trying to say it will think a certain way like logically or caring to humanity. It will think in a way unbeknownst to what any human thinks, I don’t care what degree you have or how many years you have in any industry. It is a higher form of intelligence and is black box mystery and I think more should adopt this style of thinking. For comparison can an Ant understand the motivations and thought processes of a human being. submitted by /u/Major_Fishing6888 [link] [comments]  ( 1 min )
    How to remove the backgrounds of half a million images?
    Hi guys, I'm looking for an AI (not an algorithm like remove.bg) that understands the context of half a million images of products. Does anyone have suggestions? For example, Photoshop's cloud-based subject selection understands that if you have a glass mug, the reflections should remain, and the empty space in the handle should be removed. I believe the algorithms often use the contrast between fore and background to separate the product from the image, but this doesn't work well for all kinds of products, and I need it to be reliable. Thank you so much for helping me think! submitted by /u/hommelbips [link] [comments]  ( 1 min )
    AI is making us all more productive — but in a weird and unexpected way
    submitted by /u/thisisinsider [link] [comments]  ( 1 min )
    It's 2050, every country has their own Jaeger
    submitted by /u/Sea_Permit5660 [link] [comments]  ( 1 min )
    New technology brings new scams
    Email brought us spam, viruses and malware Cell phones ushered in robocalls, phone hacking and more E-commerce got us hacked accounts and cc fraud The list goes on... but what happens when AI is powering a new set of complex scams. What do people see on the horizon submitted by /u/blakeusa25 [link] [comments]
    The AI Invasion
    submitted by /u/Ebayednoob [link] [comments]
    Is there any law prohibiting AI voiceovers
    I am currently doing research for an Idea that came to my mind and I am asking you reddit experts regarding this topic because my countrys laws doesn't have anything regarding this topic and I was wondering if you know of anything. submitted by /u/M4hfaeraak [link] [comments]
    This is us in the eye of AI. Asked it to generate "reddit user". What do you guys think?
    submitted by /u/Civil-Interest8050 [link] [comments]  ( 1 min )
    One-Minute Daily AI News 12/3/2023
    OpenAI Agreed to Buy $51 Million of AI Chips From a Startup Backed by CEO Sam Altman.[1] ChatGPT maker OpenAI has delayed the launch of its custom GPT store until early 2024, according to an internal memo seen by Reuters on Friday.[2] Meta AI unveils ‘Seamless’ translator for real-time communication across languages.[3] The Biden administration has forced a Saudi Aramco-backed venture capital firm to sell its shares in a Silicon Valley AI chip startup backed by OpenAI co-founder Sam Altman, Bloomberg News reported on Thursday.[4] Sources: [1] https://www.wired.com/story/openai-buy-ai-chips-startup-sam-altman/ [2] https://www.reuters.com/technology/openai-delays-custom-gpt-stores-launch-axios-2023-12-01/ [3] https://venturebeat.com/ai/meta-ai-unveils-seamless-translator-for-real-time-communication-across-languages/ [4] https://www.reuters.com/technology/us-compels-saudi-fund-exit-altman-backed-ai-chip-startup-bloomberg-news-2023-11-30/ submitted by /u/Excellent-Target-847 [link] [comments]  ( 1 min )
    Summarize Text
    I need help with a huge(Characters: 140333 Words: 21477 Lines: 3369) text I have and want to summarize it to a 2 page answer(front and back), what would be the best AI to do that? And if someone could even do that for me would be even more appreciate for my shit country econnomy pockets. submitted by /u/Alarming_Reveal_2656 [link] [comments]  ( 1 min )
  • Open

    Agent in PPO not learning even when loss is dropping
    Hi, I am trying to implement the PPO directly from the original paper, so I didn't want to copy someones code. Unfortunately I run into a problem where actor and critic loss is dropping but there is no improvement of the episode scores. I am testing the PPO on the pendulum-v1 and the avg episode score is around -1200. If someone could tell me what am I doing wrong it would be much appreciated. Here is the code I am running: https://github.com/stanislawraczk/PPO/tree/main Thank you in advance for any help :) submitted by /u/stachursky [link] [comments]
    Explaining the metrics for PPO in RLlib
    Could you please explain the expected trend for the policy_loss, vf_loss, and the total_loss in the Ray Rllib plots? Currently, my policy loss is negative. It decreased at first and then increased, which I assume is due to a high learning rate. Both total_loss and vf_loss are positive and decreasing. Besides the losses and the mean reward, what values should I look at to see if my policy is training and performing well? ​ https://preview.redd.it/pcrssmq0m64c1.png?width=354&format=png&auto=webp&s=6f5b8150ad5fa7872168e6a2a2eef13647a4fdfd submitted by /u/Beautiful_Basis8441 [link] [comments]
  • Open

    [D] Transformers for time series forecasting
    There are a couple of emerging transformers models designed for predicting time series values like the Informer and the Temporal Fusion Transformer. What are your thoughts on this topic? Do you think they can stand to RNNs? submitted by /u/MrGolran [link] [comments]  ( 9 min )
    [D] XGBoost Feature Selection
    Currently make an educated guess on depth and learning rate, add several white noise variables, and remove all that are lower on the feature importance plot. Is there a better way? I’ve seen other people mention to reduce the tree depth to 1 before removing features below white noise. Another approach is to use the shap values and also to look at pdp plots. submitted by /u/fuzzy_plums [link] [comments]  ( 1 min )
    [D] Coming from another branch of engineer how much time befor i can land a job as ML engineer?
    Hello to all, i have a mechanical engineering degree and would like to enter in the world of the machine learning and data science. Let’s assume i already have most of the knowledge from calculus, linear algebra and basic python. How much time do you think it can take to land a job as a junior machine learning engineer? My plan is to start multiple online courses like coursera and similar and then build a small portfolio based on kaggle projects. What do you think? Can 1 year be enough? Since i have a full time job i can put up to 10 hours a week to study submitted by /u/Wepp- [link] [comments]  ( 1 min )
    [R] TypeError: only integer scalar arrays can be converted to a scalar index.
    I am sorry for this question but I could not solve this error anyway. here is error code= ---> 60 model.fit((train_features), train_labels) 73 def fit(self, features, labels): ---> 74 self.tree = self.build_tree(features, labels) ---> 53 best_feature, best_threshold = self.find_best_split(features, labels) ---> 24 left_labels = labels[left_mask] I got this error in line 24= TypeError: only integer scalar arrays can be converted to a scalar index. I got image like this: image = cv2.imread(input_path) image = cv2.resize(image, (width, height)) imageCannyEdge = canny_edge_detection(image) submitted by /u/Accomplished_Milk514 [link] [comments]  ( 1 min )
    Mamba: Linear-Time Sequence Modeling with Selective State Spaces
    submitted by /u/Jean-Porte [link] [comments]  ( 9 min )
    [D] Which architecture could substitute the transformer?
    I recently read this perspective discussing that the transformer could be replaced. https://towardsdatascience.com/a-requiem-for-the-transformer-297e6f14e189 In general, articles have been published in the past few months showing the limitations of the transformer: https://arxiv.org/abs/2203.15556 https://arxiv.org/abs/2304.15004 in computer vision, ConvNets with the same budget would seem to have similar performance: https://arxiv.org/abs/2310.19909 https://arxiv.org/abs/2310.16764 DeepMind shows that the transformer would not be able to generalize beyond the training set distribution: https://arxiv.org/abs/2311.00871 Models such as liquid NN, Hyena, spiking NN, and others show an active search for a new architecture: https://arxiv.org/pdf/2006.04439.pdf https://www.together.ai/blog/monarch-mixer Not that the transformer is likely to be supplanted any time soon, though what I wonder is at the present time what might be the next dominant architecture? None of the proposed architectures seem to have a competitive advantage over the transformer submitted by /u/NoIdeaAbaout [link] [comments]  ( 9 min )
    New resource for learning ML concepts from scratch (only requirement is Python) [D]
    Structured learning resources work best for most of us, and you might be interested in how ChatGPT really works at the lowest level, and how to code it up from scratch. Unfortunately, a lot of AI/ML resources have way too much math, and are either hella confusing, or just drag on with tedious math proofs. I created the YouTube channel GPT and Chill to focus on what you actually need to know for ML side projects or to break into ML engineering. So far we've covered many ML and neural network concepts from scratch, and the next video in the playlist/series will be on self-attention and coding up GPT from scratch. If you're interested in this, I would recommend going through the existing playlist first! All you need for this is Python and basic calculus knowledge (what's the derivative of x^2 and similar stuff) I'll also be making a similar playlist/series for how self-driving models work and how deepfakes are generated, in code, from scratch. I'll also be adding a platform where you can try to code up the ML concepts and run against basic test cases. Leave a comment if there's anything else you would like to see a video on! Looking to give back to this community as it has taught me countless lessons in my journey to break into data science. submitted by /u/GPTandChill [link] [comments]  ( 10 min )
    [R] loss weighting - theoretical guarantees?
    For a model training on a loss function consisting of weighted losses: ​ https://preview.redd.it/j04h4v0sab4c1.png?width=153&format=png&auto=webp&s=21677d2520375d500b904bb3ae30d403c9941d7c I want to know what can be said about a model that converges based on this ℒ loss in terms of the losses ℒ_i, or perhaps the models that converge on the ℒ_i losses seperately.For instance, if I have some guarantees / properties for models m_i that converge to losses ℒ_i, if some of those guarantees properties transition over to the model m that converges on ℒ. Would greatly appreciate links to theoretical papers that talk on this issue, or even keywords to help me in my search for such papers. Thank you very much in advance for any help / guidance! submitted by /u/progmayo [link] [comments]  ( 9 min )
    [D] How to distributedly train NNs using PyTorch
    https://twitter.com/rebooted101_py/status/1731671501378605263?t=xZNUl8nPHSQoHcz6j58asQ&s=19 submitted by /u/palashsharma15 [link] [comments]  ( 9 min )
    [D] From chat models to agent models + avoiding prompt injection
    I was thinking about LLM agents recently and I've had 2 ideas that felt so natural to me, even though I've not seen many discussions about them. Therefore, I decided to share them with you. First of all, an important idea that I've seen some people pointing out a couple of times is that the LLM output should not be faced as its "voice", but rather its "thoughts". The verbosity of ChatGPT is there for a reason, specially when working with math or code. We should not try to make LLMs less verbose, because that would mean making them "think less" about their responses. Actually, the more verbose an agent is, the better, so that we can inspect the reasoning behind its results. So, what we really need is to make the model as verbose as necessary and then hide the agent "thoughts" the same way…  ( 10 min )
    [D] Need some advice?
    So I recently got into machine learning and learned about the many different models and while learning about them was very insightful I have no idea what models people actually use regularly and for what purpose. I know you’re supposed to test your data across a few different models of course but I know not every model is as performant as others. I learned about regression models, classification models, reinforcement models, clustering, deep learning models including ANNs, CNNs, RNNs, SOMs, and deep reinforcement models. I’m interested in data science so I’m leaning toward deep learning regression and classification as well as SOMs or K-means for clustering but I’m also interested in building my own AI models as a potential side business. Can anyone give me some advice? submitted by /u/Odd_Acanthisitta_853 [link] [comments]  ( 1 min )
    I am new here but this seemed too good not to share. What are everyone's thoughts on this? 4M: Massively Multimodal Masked Modeling
    submitted by /u/Dullydude [link] [comments]  ( 9 min )
    [D] What’s hot for Machine Learning Research in 2024?
    Which of the sub-fields/approaches within ML or related to ML, application areas are expected to gain much attention (pun unintended) in 2024? PS : Please don’t shy away from suggesting anything that you might think or know could be the trending topic for ML research, it’s quite likely what you know might be unknown tk many of us here :) submitted by /u/ureepamuree [link] [comments]  ( 9 min )
    [P] An up and coming reactive Python+SQL notebook with no hidden state
    The project is still very early on so let us know what feedback you have! Here is a quick preview: Zero-True Demo If you want to check out our github we are at: https://github.com/Zero-True/zero-true submitted by /u/zero-true [link] [comments]  ( 9 min )
    [D] What are some tools for observing and monitoring AI models in production?
    There are many ML monitoring and observability tools but do we have any specific tools built for AI models in production for observing and monitoring? Or any open-source libraries? submitted by /u/Aromatic_Ad9700 [link] [comments]  ( 9 min )
    [D] Strategy for images and image captions when fine-tuning a foundational model?
    I'd like to fine-tune a text-to-image model to know certain faces of people I'm working with. I've been experimenting a bit and I can get some images that are reminiscent of a person but really doesn't look like them. I'm also needing to provide more in the prompt than I would expect. For example, there is one person who is a big guy with a mustache and glasses. I fine-tuned using a few images of him with the caption being his actual name in the training dataset. When I generate images with his name as the subject, none of the faces will have a mustache or glasses. If I prompt it "Mark Smith with mustache and glasses doing xyz" it does look slightly more reminiscent of him, but still not quite right. What should my strategy be to improve this? Do I need more images of him? Should I hash his name (or similar) into a common caption to make sure other weights in the model are not interfering? Other ideas? I realize I could experiment, but it's very expensive to keep fine-tuning and I don't want to go the wrong direction too many times. submitted by /u/coinclink [link] [comments]  ( 10 min )
    [P] Educational Transformer without Autograd
    I have been learning NLP for a little while, and always found it hard to find full implementations of Transformers with explicit forward and backprop. This project was my attempt at building that, while learning how to better understand optimization in Transformers. It is as well-documented as I could make it, and can be trained or fine-tuned easily by editing the config.py file and running the run.py script. Get the code with: git clone https://github.com/eduardoleao052/Transformer-from-scratch.git I managed to generate some really nice text training this model on Jules Verne's and Shakespeare's works. Hope you enjoy! GitHub repository here! submitted by /u/suspicious_beam [link] [comments]  ( 9 min )
    [D] Package that Uses Embeddings to Generate Topics for Topic Modeling
    I’ve used Bertopic and Top2Vec in Python but am wondering if there’s something similar in R that can use pre-trained models to generate topics? If not do you think investing time into building something like this would be useful to the community? submitted by /u/NewerResearcher [link] [comments]  ( 9 min )
    [D] Everything interesting about the Segment Anything Model
    Sharing a video from my ML YouTube channel about Meta’s Segment Anything model, how it works, how it’s trained and how it’s deployed. Leaving the link here for those interested! submitted by /u/AvvYaa [link] [comments]  ( 1 min )
    [D] I'm training to use jais model embedding. Any thoughts?
    I'm training to use jais model embedding and I'm curious to know if any one used it before submitted by /u/ahsaor8 [link] [comments]  ( 9 min )
  • Open

    A new quantum algorithm for classical mechanics with an exponential speedup
    Posted by Robin Kothari and Rolando Somma, Research Scientists, Google Research, Quantum AI Team Quantum computers promise to solve some problems exponentially faster than classical computers, but there are only a handful of examples with such a dramatic speedup, such as Shor’s factoring algorithm and quantum simulation. Of those few examples, the majority of them involve simulating physical systems that are inherently quantum mechanical — a natural application for quantum computers. But what about simulating systems that are not inherently quantum? Can quantum computers offer an exponential advantage for this? In “Exponential quantum speedup in simulating coupled classical oscillators”, published in Physical Review X (PRX) and presented at the Symposium on Foundations of Computer …  ( 93 min )
    Summary report optimization in the Privacy Sandbox Attribution Reporting API
    Posted by Hidayet Aksu, Software Engineer, and Adam Sealfon, Research Scientist, Google In recent years, the Privacy Sandbox initiative was launched to explore responsible ways for advertisers to measure the effectiveness of their campaigns, by aiming to deprecate third-party cookies (subject to resolving any competition concerns with the UK’s Competition and Markets Authority). Cookies are small pieces of data containing user preferences that websites store on a user’s device; they can be used to provide a better browsing experience (e.g., allowing users to automatically sign in) and to serve relevant content or ads. The Privacy Sandbox attempts to address concerns around the use of cookies for tracking browsing data across the web by providing a privacy-preserving alternative. Ma…  ( 93 min )
  • Open

    Taming generative AI for enterprise-grade automation
    An interview podcast with Dave Duggal, founder of EnterpriseWeb In 2009, as the Cloud was starting to emerge, Dave Duggal founded EnterpriseWeb to address the challenges of an increasingly fragmented enterprise IT estate. He saw that siloed software stacks were becoming roadblocks to end-to-end interoperability, automation and management. Duggal recognized the need for an abstraction… Read More »Taming generative AI for enterprise-grade automation The post Taming generative AI for enterprise-grade automation appeared first on Data Science Central.  ( 20 min )
    Transformative trends: Generative AI and the future of business
    In the rapidly evolving realm of business generation, one progressive force stands proud Generative Artificial Intelligence (AI). This transformative trend goes past traditional automation, reshaping industries globally. In this exploration, we delve into the profound impact of Generative AI on the destiny of agencies, with a unique consciousness of its integration with accounting software. From… Read More »Transformative trends: Generative AI and the future of business The post Transformative trends: Generative AI and the future of business appeared first on Data Science Central.  ( 22 min )
  • Open

    Constellations in Mathematica
    Mathematica has data on stars and constellations. Here is Mathematica code to create a list of constellations, sorted by the declination (essentially latitude on the celestial sphere) of the brightest star in the constellation. constellations = EntityList["Constellation"] sorted = SortBy[constellations, -#["BrightStars"][[1]]["Declination"] &] We can print the name of each constellation with Map[#["Name"] &, sorted] This […] Constellations in Mathematica first appeared on John D. Cook.  ( 5 min )
  • Open

    Bringing Personality to Pixels, Inworld Levels Up Game Characters Using Generative AI
    To enhance the gaming experience, studios and developers spend tremendous effort creating photorealistic, immersive in-game environments. But non-playable characters (NPCs) often get left behind. Many behave in ways that lack depth and realism, making their interactions repetitive and forgettable. Inworld AI is changing the game by using generative AI to drive NPC behaviors that are Read article >  ( 6 min )
    Why GPUs Are Great for AI
    GPUs have been called the rare Earth metals — even the gold — of artificial intelligence, because they’re foundational for today’s generative AI era. Three technical reasons, and many stories, explain why that’s so. Each reason has multiple facets well worth exploring, but at a high level: GPUs employ parallel processing. GPU systems scale up Read article >  ( 9 min )
  • Open

    Is it possible to learn about neural networks and machine learning without understanding any of the math?
    So I'm a 14 year old and I've been trying to learn about neural networks for the past couple of weeks. I do understand the basics and I have managed to code a simple one, the problem is just that I don't get any of the math, the calculus and linear algebra. So would it be possible to make slightly more advanced neural networks without the math? If not, where should I start with learning everything necessary? submitted by /u/JontePonte64 [link] [comments]  ( 1 min )
    Introduction to Symbol Grounding
    submitted by /u/Neurosymbolic [link] [comments]  ( 1 min )
  • Open

    How Getir reduced model training durations by 90% with Amazon SageMaker and AWS Batch
    This is a guest post co-authored by Nafi Ahmet Turgut, Hasan Burak Yel, and Damla Şentürk from Getir. Established in 2015, Getir has positioned itself as the trailblazer in the sphere of ultrafast grocery delivery. This innovative tech company has revolutionized the last-mile delivery segment with its compelling offering of “groceries in minutes.” With a […]  ( 7 min )
  • Open

    Tackling sign language data inequity
    ASL Citizen is the first crowdsourced sign language dataset, advancing the state of the art in sign recognition. The web-based project captured input from people in real-world settings, and from a diverse group of experts, including Deaf team members. The post Tackling sign language data inequity appeared first on Microsoft Research.  ( 12 min )

  • Open

    Help me for the dependencies issue!
    Trying to run DQN pong on google colab. I used this line of code for dependencies: !pip install gym[box2d]==0.17.* pyvirtualdisplay==0.2.* PyOpenGL==3.1.* PyOpenGL-accelerate==3.1.* ​ and i get this error message: Collecting gym[box2d]==0.17.* Using cached gym-0.17.3-py3-none-any.whl Requirement already satisfied: pyvirtualdisplay==0.2.* in /usr/local/lib/python3.10/dist-packages (0.2.5) Requirement already satisfied: PyOpenGL==3.1.* in /usr/local/lib/python3.10/dist-packages (3.1.7) Requirement already satisfied: PyOpenGL-accelerate==3.1.* in /usr/local/lib/python3.10/dist-packages (3.1.7) Requirement already satisfied: scipy in /usr/local/lib/python3.10/dist-packages (from gym[box2d]==0.17.*) (1.11.4) Requirement already satisfied: numpy>=1.10.4 in /usr/local/lib/python3.10/dist-p…  ( 1 min )
    LayerZer0 1214$
    https://zer0-labs.network submitted by /u/Horror-Scratch5718 [link] [comments]  ( 1 min )
    Combining Spatial and Temporal Abstraction in Planning for Better Generalization
    arXiv: https://arxiv.org/abs/2310.00229 OpenReview: https://openreview.net/forum?id=eo9dHwtTFt Code: https://github.com/mila-iqia/skipper Blog post: http://mingde.world/combining-spatial-and-temporal-abstraction-in-planning/ Abstract: Inspired by human conscious planning, we propose Skipper, a model-based reinforcement learning agent that utilizes spatial and temporal abstractions to generalize learned skills in novel situations. It automatically decomposes the task at hand into smaller-scale, more manageable subtasks and hence enables sparse decision-making and focuses its computation on the relevant parts of the environment. This relies on the definition of a high-level proxy problem represented as a directed graph, in which vertices and edges are learned end-to-end using hindsight. Our theoretical analyses provide performance guarantees under appropriate assumptions and establish where our approach is expected to be helpful. Generalization-focused experiments validate Skipper's significant advantage in zero-shot generalization, compared to existing state-of-the-art hierarchical planning methods. submitted by /u/APaperADay [link] [comments]  ( 1 min )
    How to make a SARSA model predict correctly?
    submitted by /u/camelCase_Dev [link] [comments]  ( 1 min )
    Seeking Help with Reinforcement Learning Project for Battery Storage System
    I'm currently working on a project that involves managing a battery storage system using Reinforcement Learning (RL). Despite the problem's simplicity, I'm encountering difficulties—no standard RL method (PPO, A2C, DQN) seems to effectively learn. I've shared all the project files which include a detailed problem description (it's a straightforward one), input data in CSV format, an environment file, and an agent initialization file. I'd greatly appreciate any insights, advice, or guidance you might have! If you're familiar with RL methods and could spare a moment to take a look, your help would be incredibly valuable. submitted by /u/MomoSolar [link] [comments]  ( 1 min )
  • Open

    Photo to Video AI = Death of TikTok?
    Short form videos from a single photo? Based on the latest cutting edge research from the Alibaba group, this technology is coming sooner than you think. Why is this tech important? How can influencers best use this tech? We try to answer these questions in this post. First, some context. Text to image generators like DALL-E 2, Midjourney, and Stable Diffusion have produced over 15 billion AI-created images. But image to image generators have been very poor at best. For example it is very difficult to upload a photo of a person and ask for more poses. But what about short form video? This unlocks an even larger engaged audience. The largest tech companies have made significant investments in short form video (YouTube Shorts, Instagram Reels, TikTok) because this medium has the highest e…  ( 1 min )
    character.ai
    I didn't know Jesus was chill like that submitted by /u/thePayGorner [link] [comments]  ( 1 min )
    why does alphabet not allow dark mode in gmail on android and probably other platforms?
    View Poll submitted by /u/Georgeo57 [link] [comments]  ( 1 min )
    New technique to run 70B LLM Inference on a single 4GB GPU
    submitted by /u/tinny66666 [link] [comments]  ( 1 min )
    How Transformers rewrote the rules of an age-old tradition in ML
    Hello people! Sharing the latest video from my ML YT channel discussing Transformers, how they work, and its interesting relationship with Inductive Bias when compared to other neural networks (like CNNs, RNNs, etc). Sharing a link here for those interested to check it out! Thanks. submitted by /u/AvvYaa [link] [comments]  ( 1 min )
    Differences Between Simple Parameters, Tuning Parameters, and Hyperparameters
    submitted by /u/FCFAN44 [link] [comments]  ( 1 min )
    Countries as Gangs (AI Generated
    What does ChatGPT think gangs look like in different countries? (I could not make these images all bunched together so you just have to press the arrows to look at the next/previous image for some reason, so you will just have to scroll through them sorry). https://preview.redd.it/z0utpb5zi34c1.png?width=1080&format=png&auto=webp&s=2073906fa6bc23c2b63f28ba06720f549c1100c6 https://preview.redd.it/bz3ymd5zi34c1.png?width=1080&format=png&auto=webp&s=c445a2f60e254f0a74d1d6af9a2d3081eac2d642 https://preview.redd.it/nb0gec5zi34c1.png?width=1080&format=png&auto=webp&s=a1fe0be0bc7523554b65cc8c536cfda52a76172b https://preview.redd.it/kzs88f5zi34c1.png?width=1080&format=png&auto=webp&s=43d9a1084a225b7f802c24ebb07b7da091b30572 https://preview.redd.it/1wd2ue5zi34c1.png?width=1080&format=png&aut…  ( 1 min )
    Google has quietly pushed back the launch of next-gen AI model Gemini until next year, report says
    submitted by /u/thisisinsider [link] [comments]  ( 1 min )
    AI Generated Movie Clips - Various Humans!
    submitted by /u/YD-Visual [link] [comments]  ( 1 min )
    Is Q* Overhyped
    There has been too much hype surrounding Open AI's Q*, there's been speculation about the achievement of AGI. I feel even if it not AGI may be achieved in 2024 https://www.youtube.com/watch?v=FPGW8YCECZ4&t=17s submitted by /u/Weary_Word_5262 [link] [comments]
    Me and the Boys protecting AGI from getting turned off by the Government
    submitted by /u/Major_Fishing6888 [link] [comments]  ( 1 min )
    Best (and free) way to turn yourself into AI Art.
    Apologies if this has already been discussed. I want to create a scene (through a series of images) with myself and my friend. How can something like that be done? So far, I've found YouTube videos where they describe Midjourney + Dreambooth as one way to do it, but that's not free. Please suggest other or better ways to do it. I'm okay with having to code. submitted by /u/materialsA3B [link] [comments]  ( 1 min )
    In the near future, will AI be able to play multiplayer games with us like two friends playing a game?
    As someone who's best friend is usually busy at work and I barely get that much time to play with him, do you all think in the future, AI will be able to play with us in multiplayer games like a friend? I know this would be very complicated especially for the fact it would have to know what to do every step in the game, and even perhaps be able to voice chat with you in game (even though robot voices aren't pleasant and something you would see as a friend from their voices (Imagine the Captcha audio thing for example). I hope one day, it will be possible though. I'd love to have an AI that's always free just like me to be able to play TModLoader all the time with me. submitted by /u/WitheredFreddy [link] [comments]  ( 1 min )
    “Artificial Intimacy”: the future of AI relationships
    I’m the guy who goes on Fox News in this haha. The doc is pretty good. submitted by /u/Sonic_Improv [link] [comments]
    Imagebot 2023
    There are a lot of ways to make AI art, but for people with bad computers there's not a lot of ways to do it for free. I used Poes bot building abilities to make Imagebot Imagebot is a wrapper for a remote stable diffusion server. You can usually render within 2-5 seconds, You may make 100 free images per day. On my 4 core 8gb ram computer with intel card, it's 40x faster than rendering it myself. ​ https://poe.com/Imagebot2023 submitted by /u/thebadslime [link] [comments]  ( 1 min )
    An ancient alien superintelligent AGI visiting Earth and consuming all media and knowledge.
    The prompt is pretty long, but I gave DALL-E a narrative of some alien AGI being named "Omega", which is supposed to represent the most powerful AGI there is in the Universe, assuming it's even possible to create an AGI and that others exist elsewhere in the cosmos. https://preview.redd.it/ln9yt4pvfz3c1.png?width=1024&format=png&auto=webp&s=22bb951b3083e5a0a1c2f0834cdcaf47f8a0fb14 submitted by /u/ChobotsRobot [link] [comments]  ( 1 min )
    when will ai align better iterations of itself than we can?
    View Poll submitted by /u/Georgeo57 [link] [comments]  ( 1 min )
  • Open

    [R] Large Transformer Model Inference Optimization (Lilian Weng, 2023)
    submitted by /u/niplav [link] [comments]  ( 1 min )
    [RESEARCH] Image Generator AI
    Hello all, I'm working on a project for my website and I'm looking for good image generator AI to implement on my website, I have two requirements: 1.I need to be able to train that AI on my artworks and designs so when an user gives it a promt it generates image based/styled on my designs. Should have API or something to be able to add it as feature on my website. Thanks in advance, I couldn't find that specific thing online otherwise I wouldn't be posting here. submitted by /u/Graf_Alexandrius [link] [comments]  ( 1 min )
    [R] Do you cite Arxiv pre-prints in your research?
    Hello fellow ML researchers! A bit of context: I am writing a survey paper (surveys + proposed research directions). It has been growing to include new developments. Nothing novel. I am away from academia and working at a startup as a staff scientist. Reality is: I won't at all be able to deal with the review process due to my schedule. At the same time, I really want the paper to be out and be read/used/cited. I will be uploading it in arxiv. My question is: in ML research, how often do you consider arxiv as a good reference? Due to the super fast nature of ML research, my guess would be it is a little different than other fields. (My PhD is in applied maths where we rarely cite arxiv pre-prints). submitted by /u/tanweer_m [link] [comments]  ( 9 min )
    [D] Reasoning capabilities of Large Language Models (LLMs)
    How can we enhance the reasoning capabilities of Large Language Models (LLMs) for more effective and nuanced outcomes? submitted by /u/Infamous-Belt8671 [link] [comments]  ( 9 min )
    [R] Google DeepMind: 2.2 million new materials discovered using GNN (380k most stable, 736 already validated in labs)
    (Reposting since the previous post was removed, I think because non-Arxiv posts are only allowed on weekends and now it's a weekend) Materials discovery is critical but tough. New materials enable big innovations like batteries or LEDs. But there are ~infinitely many combinations to try. Testing for them experimentally is slow and expensive. So scientists and engineers want to simulate and screen materials on computers first. This can check way more candidates before real-world experiments. However, models historically struggled at accurately predicting if materials are stable. Researchers at DeepMind made a system called GNoME that uses graph neural networks and active learning to push past these limits. GNoME models materials' crystal structures as graphs and predicts formation energies. It actively generates and filters candidates, evaluating the most promising with simulations. This expands its knowledge and improves predictions over multiple cycles. The authors introduced new ways to generate derivative structures that respect symmetries, further diversifying discoveries. The results: GNoME found 2.2 million new stable materials - equivalent to 800 years of normal discovery. Of those, 380k were the most stable and candidates for validation. 736 were validated in external labs. These include a totally new diamond-like optical material and another that may be a superconductor. Overall this demonstrates how scaling up deep learning can massively speed up materials innovation. As data and models improve together, it'll accelerate solutions to big problems needing new engineered materials. TLDR: DeepMind made an AI system that uses graph neural networks to discover possible new materials. It found 2.2 million candidates, and over 300k are most stable. Over 700 have already been synthesized. Full summary available here. Paper is here. submitted by /u/Successful-Western27 [link] [comments]  ( 10 min )
    [D] I want to create a text generator. Where should I start?
    I want to create a ai content generator. Where should I start I have worked in stable diffusion and automatic1111. Now I want to create a content writing app. Where should I start. I have 8gb VRAM I need a free and open source I have researched about text generator, I see there are lot of models. I don't know which will be better to create resume or CV content perfectly Someone tell me where to start submitted by /u/Prestigious-Head-388 [link] [comments]  ( 9 min )
    [D] Asking again about which is the most efficient version of OpenAI Whisper
    After reading this post and doing some research on my own, I haven't found a definitive conclusion as to which version of Whisper is the most efficient. Is there a solid consensus on this topic? submitted by /u/iGunzerkeR [link] [comments]  ( 9 min )
    [News] Text to Speech is getting CRAZY GOOD - HierSpeech++, XTTS & StyleTTS2! (huggingface)
    Hey, AI has been going crazy lately and things are changing super fast. I created a video covering a few trending TTS (Text to speech) publicly available huggingface spaces, check it out! https://www.youtube.com/watch?v=4-2Jk8muo7c I can't wait to start testing these models application usages in future videos, it seems like this technology is finally starting to gain momentum! Let me know what you think about it, or if you have any questions / requests for other videos as well, cheers submitted by /u/dev-spot [link] [comments]  ( 1 min )
    [D] Can someone help me understand how to rotate a vector A by some angle from vector B, both of size 512?
    Can someone help me understand how to rotate a vector A by some angle from vector B, both of size 512? I have a feature vector that is compared to a basis vector which are the weights being updated, to do that an angle is calculated between the two vector’s and compared to ground truth. I would like reconstruct the feature vector by moving the basis vector by some angle, I’m not sure if rotation in higher dimensional space makes sense or if it’s better to try to learn that reconstruction submitted by /u/SnooChipmunks2237 [link] [comments]  ( 9 min )
    [P] FaceSwap Generation Model
    Anyone who work with Deeplearning based Faceswap Generation models. I want to know how to run, train the model on the dataset as well as on custom dataset. How to optimize the model. Simple is that, anyone worked on Faceswap Generation models? submitted by /u/Difficult_Vanilla814 [link] [comments]  ( 9 min )
    [D] Where can I connect with Machine Learning Researchers in NYC?
    Hi all, I am relatively new to machine learning but I find it really interesting and as a software engineer, I can familiarize myself with basic concepts in my head pretty fast. However, I don't have anyone I can talk to and engage in a discussion around different topics in person. Reddit is nice and I appreciate the posts on this sub but I just would like to find a group of people I can just bounce ideas around with. Is this a thing? Thanks! submitted by /u/Text-Agitated [link] [comments]  ( 9 min )
    Any experienced ML engineer for project [D]
    Hi I’m currently working on a project to help companies build custom software and are leveraging AI for project management. I am looking for some help on this effort. Here are what we envision our technology to do: use ML algorithms to analyze project requirements from clients and reference historical data to create a series of tasks (GitHub issues) needed to complete the project. Regarding tech stack it will probably include NLP for project understanding and task categorization, as well as machine learning frameworks for predicting task duration and complexity. If this sounds interesting let’s talk and happy to talk about compensation for this project! Thank you submitted by /u/yoyohuncho [link] [comments]  ( 9 min )
    [D] Paper Analysis - Scalable Extraction of Training Data from (Production) Language Models (Video Walkthrough)
    https://youtu.be/KwpeuqT69fw Researchers were able to get giant amounts of training data out of ChatGPT by simply asking it to repeat a word many times over, which causes the model to diverge and start spitting out memorized text. Why does this happen? And how much of their training data do such models really memorize verbatim? ​ OUTLINE: 0:00 - Intro 8:05 - Extractable vs Discoverable Memorization 14:00 - Models leak more data than previously thought 20:25 - Some data is extractable but not discoverable 25:30 - Extracting data from closed models 30:45 - Poem poem poem 37:50 - Quantitative membership testing 40:30 - Exploring the ChatGPT exploit further 47:00 - Conclusion ​ Paper: https://arxiv.org/abs/2311.17035 submitted by /u/ykilcher [link] [comments]  ( 9 min )
    [D] Simple Questions Thread
    Please post your questions here instead of creating a new thread. Encourage others who create new posts for questions to post here instead! Thread will stay alive until next one so keep posting after the date in the title. Thanks to everyone for answering questions in the previous thread! submitted by /u/AutoModerator [link] [comments]  ( 1 min )
    [Research] Mangio RVC - Cuda out of Memory error during voice training
    Hello, hello.I am doing some research how to resolve the "CUDA - Out of Memory Error" that I am encountering during voice training in Mangio RVC. While searching for different and potential solutions, once suggestion is to split audio files into smaller 2-3 seconds chunks per this YT video:https://www.youtube.com/watch?v=uo6umRWVVTE Unfortunately, that also didnt work out. However, somebody in comments stated:" try find the python script for your data_loader, depending on the project you are doing. Edit the script and press Ctrl + F, search for "pin_memory", if It is set to "pin_memory=True", change it to "pin_memory=False" and save, and try again " Does anyone knows which filename would represent that "data loader" for "python script" inside Mangio RVC?I am getting hits with multiple files that contain keywords "pin_memory=True" inside, so I am trying to find the exact one. Any suggestion would be appreciated regarding this issue. Best regards. submitted by /u/ObservationDeck2001 [link] [comments]  ( 9 min )
    [D] Deploy a translation model (3B parameters) for less than 100$/month
    Hey everyone, I made a translation Discord bot for around 30,000 servers, but using paid services like Google Translate or DeepL is too expensive. Instead, I've switched to an open-source translation model (currently using m2m100 with 400 million parameters) on a CPU. However, the translations it provides aren't up to par, and I've found that newer models like MADLAD-400 deliver much better results. The catch is that MADLAD-400 is too large to run on a CPU. I'm looking to deploy this improved model on a GPU, but my budget is a bit tight—around $100-150 per month. Does anyone know of any services or offers that provide GPU access within this price range? Any suggestions or recommendations would be greatly appreciated! submitted by /u/Which-Breadfruit-926 [link] [comments]  ( 9 min )
    [R] Combining Spatial and Temporal Abstraction in Planning for Better Generalization
    arXiv: https://arxiv.org/abs/2310.00229 OpenReview: https://openreview.net/forum?id=eo9dHwtTFt Code: https://github.com/mila-iqia/skipper Blog post: http://mingde.world/combining-spatial-and-temporal-abstraction-in-planning/ Abstract: Inspired by human conscious planning, we propose Skipper, a model-based reinforcement learning agent that utilizes spatial and temporal abstractions to generalize learned skills in novel situations. It automatically decomposes the task at hand into smaller-scale, more manageable subtasks and hence enables sparse decision-making and focuses its computation on the relevant parts of the environment. This relies on the definition of a high-level proxy problem represented as a directed graph, in which vertices and edges are learned end-to-end using hindsight. Our theoretical analyses provide performance guarantees under appropriate assumptions and establish where our approach is expected to be helpful. Generalization-focused experiments validate Skipper's significant advantage in zero-shot generalization, compared to existing state-of-the-art hierarchical planning methods. submitted by /u/APaperADay [link] [comments]  ( 9 min )
    [D] Donut transformer model in production
    Has anyone here used fine tuned Donut model in production? submitted by /u/shiva525 [link] [comments]  ( 9 min )
    [P] TSMixer: Time Series Mixer for Forecasting
    Link to github TSMixer is an unofficial PyTorch-based implementation of the TSMixer architecture as described TSMixer Paper. It leverages mixer layers for processing time series data, offering a robust approach for both standard and extended forecasting tasks. submitted by /u/Yossarian_1234 [link] [comments]  ( 9 min )
    [P] Diffusion Models from Scratch | DDPM PyTorch Implementation
    submitted by /u/tusharkumar91 [link] [comments]  ( 1 min )
    [D] Any way to increase efficiency of my proposed RTX 3080 eGPU setup (performance per $)
    I was able to get my hands on a RTX 3080 12GB for $300. (This is not for gaming, strictly for ML.) I have a Dell Vostro 7620 laptop (i7 12700H, 40GB DDR5, RTX 3050 4GB dGPU) with an open TB4 port. I am purchasing a TH3P4G3 for ~$110 and a XPG Core Reactor 750 Gold for ~$95. Is there anything I'm missing or is there any way I can more efficiently spend my money (performance per $ spent)? submitted by /u/SMtheEIT [link] [comments]  ( 9 min )

  • Open

    [P] I have created a smart contract vulnerability classification model with TensorFlow
    submitted by /u/jupiter_traveller [link] [comments]  ( 9 min )
    [D] xtts 2 & styletts 2 locally with a gradio UI
    So basically, title. I haven't been able to run those locally. I have already trying cloning the repos and running all the commands but a lot of errors would pop up always. Could anyone post a short tutorial to install both things locally (WITH A GRADIO UI, FU*K RUNNING STUFF ON VS CODE). TLDR: I'm not smart enough to install those locally so it would be nice if someone posted a tutorial to install those link to styletts2 demo with gradio: https://huggingface.co/spaces/styletts2/styletts2 link to xtts 2 demo with gradio: https://huggingface.co/spaces/coqui/xtts submitted by /u/sadeliox [link] [comments]  ( 9 min )
    [D] Stochastic variational GP, fitting troubles.
    I am developing a stochastic variational GP using gpytorch. I am performing CV over data collected from n people. When I investigate model fit the trained and predicted GP’s predictions are centered around around the mean of the data. Therefore more extreme values are not being fit. What, model architecture wise could be causing this, and what could possibly help? I have experimented with length scale adjustment with little success… I have built an optuna study to optimise the hyperparameters below but with with no success. • ⁠kernel • ⁠likelihood • ⁠variational strategy • ⁠variational distribution • ⁠learning rate • ⁠mean —> (Constant or Zero) • ⁠num inducing points submitted by /u/ConfusedLayer1 [link] [comments]  ( 9 min )
    [P] PCA and the use of Unsupervised Machine Learning for Fraud Detection? Is there any way to evaluate an unsupervised model?
    I hope my questions don't sound too stupid, but I'm new to data science and currently trying to delve deeper into fraud analytics and I'm very very confused. I'm doing this project where I was given a raw dataset of online transactions and told to define normal and anomalous behavior in my own terms. As for what I read, common evaluation metrics used for fraud detection include precision, recall, F1-score, ROC curve, or confusion matrix. But I can't get the true/false positive-negative without the historical/labeled data then? I googled "unsupervised evalution metric" and it shows Silhoutte Score, which I already use to determine the contamination level in my Isolation Forest model, but isn't this basically just model tuning/choosing the optimal parameter for anomaly threshold? Or is it okay to not evaluate a model? Or am I supposed to compare it with other unsupervised models and select the best one as a replacement for model evaluation like confusion matrix? Also, is there anything I need to do/explain with the PCA loading? I already got the optimal number of components/important features from the explained variance ratio and it directly marks the records as either fraud (1) or normal (0). It results in Normal (1113 records) and Fraud (10 records), is this common for fraud detection or is it a case of class imbalance that needs to be addressed/ resampled/ changed? submitted by /u/DecentPerson011 [link] [comments]  ( 10 min )
    Confused about what to do [D]
    To provide context, I am currently in the final year of my engineering studies and am required to undertake a 4-6 month internship as my graduation project. My interest lies in the applications of AI, specifically in robotics, computer vision, and other engineering fields. Fortunately, I have secured a position in a prestigious institute. However, the proposed internship topic focuses on AI in Intrusion Detection systems, which does not align well with my interests. Upon researching the field, I've discovered that it is quite mature, and the internship would involve extensive research. Now, I find myself at a crossroads, debating whether to accept the offer and leverage the institute's network and resources (along with excellent living conditions) or to explore other opportunities. Unfortunately, securing a position in another prestigious institute seems unlikely, and my alternatives might involve local companies (which may be outdated) or universities abroad. It's worth mentioning that I am from North Africa. Any advice or insights on this matter would be greatly appreciated. submitted by /u/Dry_Curve_8260 [link] [comments]  ( 9 min )
    [P] I want to use a neural network for function fitting (find a function's parameters based on samples of the function), but the coordinates where the function is sampled aren't always the same.
    Hi, I'm hoping someone can tell me what this problem is called so I can search for existing literature. I have a function f = f(theta,V) which is a physical model of a class of system (a transistor). Let's say f is scalar, theta is a ~100 element vector and V is a ~3 dimensional vector (x,y,z). I want to fit f to measurements of a transistor: find parameters theta such that f is close to the measured behavior in a certain V domain (for example a regular grid in [0,xmax] x [0,ymax] x [0, zmax]).* The standard solution is to use an optimizer like Newton-Raphson to find theta that minimizes the RMS error between f and the measurements of the system. This has problems like e.g. getting stuck in local minima because f is nonlinear. I would like to train a NN to take as input the measurements …  ( 11 min )
    [D] Please check my understanding of word embeddings for the ngrams model using PyTorch
    Hi everyone, I recently started learning neural networks in the context of NLP. As an exercise, I am trying to follow the PyTorch tutorial on word embeddings for the Ngrams model. I am not sure if my understanding of the tutorial is correct. I would really appreciate if someone could take a few minutes to go through the article I linked above (the relevant code is pasted at the end of the post) and tell me how much I followed it correctly and what I understood wrong. Thanks so much! PS - originally posted in r/learnmachinelearning but I don't think I'll get an answer there. These are the steps as I (partially) understand: From the sample text, we first create a set of ngrams, each ngram consisting of source (of N words) and a target word. We initialize a randomized embedding…  ( 12 min )
    [R] Continual Learning : Applications and the Road Forward ( Nov.30 2023 )
    submitted by /u/moschles [link] [comments]  ( 9 min )
    [P] USearch HNSW implementation is 10x faster than FAISS at scale
    submitted by /u/ashvar [link] [comments]  ( 9 min )
    [P] Let's Debug Your Neural Network: Gradient-based Symbolic Execution for NN
    I have developed Gymbo, a proof of concept for a Gradient-based Symbolic Execution Engine implemented from scratch. Symbolic is a method used to analyze a program and identify the inputs that trigger the execution of each program section. It works well when debugging the behavior of neural networks. For example, Gymbo allows you to find adversarial examples easily. It is interesting to see that the logic-based method effectively analyzes neural networks. Gymbo also offers easy-to-use Python API compatible with sklearn and PyTorch. I am looking forward to your feedback! submitted by /u/Living_Impression_37 [link] [comments]  ( 9 min )
    [D] Why is cross-entropy increasing with accuracy?
    I'm making an implementation of the softmax regression and I'm struggling to understand the nature behind the problem of increasing value of Cross-Entropy [1], along with an increasing accuracy (on the "iris" dataset): https://preview.redd.it/eo3654zwbw3c1.png?width=688&format=png&auto=webp&s=52916d56029e6e8aa1e9594bbcb02c7075009206 This is utmost confusing to me, because there's no class imbalance: https://preview.redd.it/rqs7i6jybw3c1.png?width=389&format=png&auto=webp&s=6ca668f1cc70aae7a125a6b9c48c511eadc6d562 I am not entirely sure whether the sample size of N = 112 is at fault. I will appreciate any help on the matter. Thank you in advance. ​ [1] submitted by /u/joshjson [link] [comments]  ( 9 min )
    [D] How Transformers rewrote the rules of an age old tradition in ML
    Hello people! Sharing the latest video from my ML YT channel discussing Transformers, how they work, and its interesting relationship with Inductive Bias when compared to other neural networks (like CNNs, RNNs, etc). Sharing a link her for those interested to check it out! Thanks. submitted by /u/AvvYaa [link] [comments]  ( 9 min )
    [D] Bitter Lesson and Tree of Thoughts - Are techniques like ToT examples of using search or are they ignoring the bitter lesson by encoding humanlike learning?
    The bitter lesson argues learning and search are the winning strategies since they scale with computing power, and will generally outperform techniques that rely on encoding human knowledge. What do you think of ToT and similar techniques given this? Are they good examples of expanding the power of models via search, or are they examples of attempting to force what we believe to be humanlike behavior? ​ For reference, I've seen two main tree-based approaches used in the research. One is MCTS based decoding. This seems more in line with traditional notions of search, in that you are searching through the outcome space of possible text and then selecting the best ones. Paper by meta using the PPO value function as the MCTS node evaluator Using MCTS to improve ability to write a successful program ​ However, there is also the more abstract CoT/ToT style trees being used, which are relying on the model to generate a tree of followup sequences for a given sequence. Tree of Thoughts: Deliberate Problem Solving with Large Language Models A core difference here is that the tree does not represent a search through possible outputs in the same way the MCTS decoding tree searches do. It's a search through possible reasoning chains, which ultimately are inputs to some evaluation method (often the model itself), to determine the optimal output. So you aren't just searching through the outcome space, but also through the "evidence space", both of which will be passed to the evaluator to select the appropriate outcome. edit: Unrelated, but you could in principle combine both of these search techniques, which would be cool but probably really expensive. For example, use MCTS decoding for each node in the ToT. I haven't seen anybody do this but if anybody has a link to a paper that would be cool. submitted by /u/30299578815310 [link] [comments]  ( 10 min )
    [P] Made it really easy to create a FastAPI backend for your ML model! Thoughts?
    Hey guys! I wanted to share this tool I recently made, https://visual-backend.com which lets you build a FastAPI backend for your ML model really quickly. It's essentially a GUI that lets you generate code & scaffolding for endpoint handlers, auth and even deploy to GCP in just one click. So to serve your ML model, all you need to do is load it and call the inference functions at each endpoint handler. Of course, for things like batch processing or job queues, there isn't functionality for that, but just curious to hear whether at the base level, such a tool might be useful for ya'll! I'd initially built this for full stack devs, but after talking to several ML engs/ data scientists, I realised that it might be helpful to some of ya'll who are looking to quickly bring your ML models to production without caring too much about infrastructure / software eng, so I'd love to hear whether you find this helpful and other thoughts u might have. Looking forward to it :) submitted by /u/johnyeocx [link] [comments]  ( 10 min )
    [D]How do I track recent trending papers?
    Since papers.labml.ai is offline, how do you guys track the recent trending papers or hot topics, especially on X. Any recommendation? submitted by /u/Historical-Tree9132 [link] [comments]  ( 9 min )
    [P] I'm trying to improve my Predicted vs Actual values distribution
    I've used Autogluon to train a model (WeightedEnsemble_L3) on tabular data. It's a regression problem. Here are the evaluation results: {'root_mean_squared_error': -9.592466103848274, 'mean_squared_error': -92.0154059534781, 'mean_absolute_error': -7.8083721751898105, 'r2': 0.784271370670073, 'pearsonr': 0.8990940522052712, 'median_absolute_error': -6.87762451171875} The image below shows the Predicted vs Actual Values. The dashed line is y=x. Looking at the scatter distribution, it looks like the cloud distribution could be rotated counter clockwise, and then the predicted values would match more closely to the actual values. This rotation would improve the accuracy of the regression predictions. Am I looking at this right? Is there a way to do this in autogluon? I'd had to add a post training fudge factor, as I'm sure there's a better way. ​ https://preview.redd.it/rlrwxwopet3c1.png?width=571&format=png&auto=webp&s=3abfdf6cedad258b4b115c8071b62d3d30b52d0f submitted by /u/BAMred [link] [comments]  ( 9 min )
    [Discussion] Any open source project that comes close to Suno.AI ?
    I know bark has the option to add ♪ and it will sort of create a song with the given lyrics, is there anything better than that that runs locally? submitted by /u/iChrist [link] [comments]  ( 9 min )
  • Open

    I am looking for an AI imaging tool that can create content around a certain character or characters
    So, I’m currently playing Baldur’s Gate 3 and I have a party of custom characters I would love to create some desktop images of! I recently defeated a major boss in the game and I would love to be able to take individual screenshots of the characters and then of the boss and then have AI create an image of the battle between all of them for a desktop wallpaper. Is there an AI tool that can do this? submitted by /u/OneNineSevenNine [link] [comments]  ( 1 min )
    Bill Gates feels Generative AI has plateaued, says GPT-5 will not be any better
    Bill Gates believes that generative AI, specifically GPT technology, has reached a plateau and that GPT-5 will not be an improvement. In the next two to five years, Gates predicts an increase in the accuracy of AI software and a reduction in cost, leading to the creation of new and reliable applications. Gates acknowledges the high cost of training large language models and the enormous costs for computing power and semiconductors. He sees significant potential in AI in the short term and believes that new research will make AI more reliable and comprehensible. Gates also believes that developing nations will benefit greatly from AI, citing the example of health advice via smartphones. Despite the prevalent issues with reliability, Gates sees AI becoming integral to healthcare, particularly in drug and vaccine development. He believes that understanding how AI encrypts information will be a milestone and that there are ongoing efforts to decipher the black box of AI. Gates states that there is no way to know when Artificial General Intelligence (AGI) will arrive, but sees it as a profound development for humanity. Regarding climate change, Gates mentions that climate models are improving and that there will be new crops and investments in improving the power grid with AI. Source : https://indianexpress.com/article/technology/artificial-intelligence/bill-gates-feels-generative-ai-is-at-its-plateau-gpt-5-will-not-be-any-better-8998958/ submitted by /u/NuseAI [link] [comments]  ( 1 min )
    How can the US government ban AI chips from China if they are made in China?
    Is the US dumb? I know that Nvidia now has to find manufacturing outside China, but China already knows how to make them. So you know China is just gonna make their own bootleg chips now. submitted by /u/Thin_Cable4155 [link] [comments]  ( 1 min )
    One-Minute Daily AI News 12/2/2023
    Australia to put artificial intelligence to the test tracking Chinese submarines under AUKUS deal.[1] Wine apps like Vivino and Hello Vino are utilizing AI algorithms to help wine enthusiasts select the perfect bottle.[2] Google’s new AI experiment composes abstract musical clips inspired by instruments.[3] Nvidia Launches Cloud-Based APIs to Accelerate AI Deployment in Medical Imaging.[4] Sources: [1] https://www.abc.net.au/news/2023-12-02/australia-to-test-tracking-chinese-submarines-with-ai-via-aukus/103181178 [2] https://neurosciencenews.com/ai-taste-perception-25299/ [3] https://www.engadget.com/googles-new-ai-experiment-composes-abstract-musical-clips-inspired-by-instruments-203732054.html [4] https://medcitynews.com/2023/12/nvidia-ai-healthcare/ submitted by /u/Excellent-Target-847 [link] [comments]  ( 1 min )
    How Googlers cracked OpenAI's ChatGPT with a single word
    Training data was exposed. This could be bad. I’m not seeing this story picked up as the big story it appears to be? submitted by /u/LifebloodOfChampions [link] [comments]  ( 1 min )
    Silos.
    submitted by /u/Philipp [link] [comments]  ( 1 min )
    The copyright case against AI art generators just got stronger
    The issue of whether generative AI models are guilty of copyright infringement is still largely undetermined. However, there has been a major development in a lawsuit against AI image and video generator companies. The amended complaint filed by the artists has made the case stronger. The complaint argues that even non-copyrighted works may be eligible for copyright protection if they include the artists' distinctive mark, such as their signature. It also highlights that AI companies using certain datasets would have had to download copyrighted images, making unauthorized copies. The complaint further states that the architecture of diffusion models, used in AI art generation, is designed to come as close as possible to replicating the initial training material. Source : https://venturebeat.com/ai/the-copyright-case-against-ai-art-generators-just-got-stronger-with-more-artists-and-evidence/ submitted by /u/NuseAI [link] [comments]  ( 1 min )
    Is there is any AI tool which can convert the PDF file into MS Office file?
    Hey, I know the website "https://www.ilovepdf.com/" which allow us to convert the PDF file into word, ppt, excel etc. Now the issue with this website as it does not properly convert the format hence we have to rework on it. So I want to know is there is any AI tool which can recreate the PDF into word/Excel file? submitted by /u/Haziq12345 [link] [comments]  ( 1 min )
    Decision Entropy
    Lets say you have an AI that is both self improving and unsupervised. Do we think that it is likely that as the AI gets increasingly efficient, it becomes increasingly less able to learn? Surely as the AI refines itself, it will become faster, but more specialized, and eventually stop learning altogether? Won't people train their AIs to the point that they're so good at one thing, the thing can no longer performed by a human, at which point, we're toast? This keeps me awake at night. submitted by /u/aliasrob [link] [comments]  ( 1 min )
    Is there a good guy AI for picture generation that is actually free?
    Most of AI's cost money to create pictures, such as midjourney or Dall-E. Is there an AI that is actually good and free? submitted by /u/According-Fennel5005 [link] [comments]  ( 1 min )
    Judge your Resume with AI
    submitted by /u/davidmezzetti [link] [comments]  ( 1 min )
  • Open

    Actor Critic: What is the shape of π(A|S,θ)?
    π(A|S,θ) in the softmax case is exp(θ⊤Ax(S))/ ∑exp(θ⊤Ax(S)). What should the shape be? Given number of action = k and number of feature = d ​ submitted by /u/Fashism [link] [comments]  ( 1 min )
    Reinforcement Learning with Fast and Forgetful Memory
    submitted by /u/smorad [link] [comments]  ( 1 min )
    Simple Q-learning with Value Table sharing across Agents
    Simple example of naive q-learning around a test track, implemented with raylib in C++ - Inputs: 5 ray-cast sensors each of which are divided into 3 regions: dangerously close, middle, far. (providing 3^5=243 states) - Actions: Go straight with constant velocity v, steer left with v/2, steer right with v/2 At the end of each episode, the average of q-tables among all agents are calculated and set back to agents so that the knowledge is shared after each episode. When epsilon drops to zero, we see only 1 uniform movement instead of 30 as all agents always choose the greedy action. https://reddit.com/link/18902nv/video/hnje6htdou3c1/player submitted by /u/goksankobe [link] [comments]  ( 1 min )
    Can I know how the bound k was bound? Origin: Reinforcement Learning: Theory and Algorithms by Alekh Agarwal, Nan Jiang, Sham M. Kakade and Wen Sun https://rltheorybook.github.io/rltheorybook_AJKS.pdf
    submitted by /u/Professional_Card176 [link] [comments]  ( 1 min )
  • Open

    How would i train a ragdoll stickman to walk using a NN?
    So iv'e learned all the basics of neural networks and coded a simple one. What I'm looking to do now is something like this, training a physics body to walk. I'm thinking to have all the limbs position and rotation as inputs, then outputting how much all limbs should move. Is there any guide on how to do this? I'm not entirely sure what to have as the cost function or the expected outputs. I'm using unity btw. ​ submitted by /u/JontePonte64 [link] [comments]  ( 1 min )
  • Open

    222nd Carnival of Mathematics
    A blog carnival is a round up of recent blog posts. The Carnival of Mathematics is a long-running carnival of blog posts on mathematical topics. This is the 222nd edition of the carnival. Facts about 222 By longstanding tradition, the nth Carnival of Mathematics must begin with trivia about the number n, and so here […] 222nd Carnival of Mathematics first appeared on John D. Cook.  ( 5 min )
    Bounding complex roots by a positive root
    Suppose you have an nth degree polynomial with complex coefficients p(z) = anzn + an-1zn-1 + … + a0 and you want to find some circle that is guaranteed to contain all the zeros of p. Cauchy found such a circle in 1829. The zeros of p lie inside the circle |z| ≤ r where […] Bounding complex roots by a positive root first appeared on John D. Cook.  ( 6 min )
  • Open

    A Different AI Scenario: AI and Justice in a Brave New World – Part 1
    The recent upheavals at OpenAI and OpenAI’s Chief Scientist’s apprehensions regarding the “safety” of AI have ignited a fresh wave of concerns and fears about the march towards Artificial General Intelligence (AGI) and “Super Intelligence.” AI safety concerns the development of AI systems aligned with human values and do not cause harm to humans. Some… Read More »A Different AI Scenario: AI and Justice in a Brave New World – Part 1 The post A Different AI Scenario: AI and Justice in a Brave New World – Part 1 appeared first on Data Science Central.  ( 22 min )
  • Open

    An Effective Universal Polynomial Basis for Spectral Graph Neural Networks. (arXiv:2311.18177v1 [cs.LG])
    Spectral Graph Neural Networks (GNNs), also referred to as graph filters have gained increasing prevalence for heterophily graphs. Optimal graph filters rely on Laplacian eigendecomposition for Fourier transform. In an attempt to avert the prohibitive computations, numerous polynomial filters by leveraging distinct polynomials have been proposed to approximate the desired graph filters. However, polynomials in the majority of polynomial filters are predefined and remain fixed across all graphs, failing to accommodate the diverse heterophily degrees across different graphs. To tackle this issue, we first investigate the correlation between polynomial bases of desired graph filters and the degrees of graph heterophily via a thorough theoretical analysis. Afterward, we develop an adaptive heterophily basis by incorporating graph heterophily degrees. Subsequently, we integrate this heterophily basis with the homophily basis, creating a universal polynomial basis UniBasis. In consequence, we devise a general polynomial filter UniFilter. Comprehensive experiments on both real-world and synthetic datasets with varying heterophily degrees significantly support the superiority of UniFilter, demonstrating the effectiveness and generality of UniBasis, as well as its promising capability as a new method for graph analysis.  ( 2 min )
    Moving Object Detection and Tracking with 4D Radar Point Cloud. (arXiv:2309.09737v2 [cs.CV] UPDATED)
    Mobile autonomy relies on the precise perception of dynamic environments. Robustly tracking moving objects in 3D world thus plays a pivotal role for applications like trajectory prediction, obstacle avoidance, and path planning. While most current methods utilize LiDARs or cameras for Multiple Object Tracking (MOT), the capabilities of 4D imaging radars remain largely unexplored. Recognizing the challenges posed by radar noise and point sparsity in 4D radar data, we introduce RaTrack, an innovative solution tailored for radar-based tracking. Bypassing the typical reliance on specific object types and 3D bounding boxes, our method focuses on motion segmentation and clustering, enriched by a motion estimation module. Evaluated on the View-of-Delft dataset, RaTrack showcases superior tracking precision of moving objects, largely surpassing the performance of the state of the art.  ( 2 min )
    S-TLLR: STDP-inspired Temporal Local Learning Rule for Spiking Neural Networks. (arXiv:2306.15220v3 [cs.NE] UPDATED)
    Spiking Neural Networks (SNNs) are biologically plausible models that have been identified as potentially apt for deploying energy-efficient intelligence at the edge, particularly for sequential learning tasks. However, training of SNNs poses significant challenges due to the necessity for precise temporal and spatial credit assignment. Back-propagation through time (BPTT) algorithm, whilst the most widely used method for addressing these issues, incurs a high computational cost due to its temporal dependency. In this work, we propose S-TLLR, a novel three-factor temporal local learning rule inspired by the Spike-Timing Dependent Plasticity (STDP) mechanism, aimed at training deep SNNs on event-based learning tasks. Furthermore, S-TLLR is designed to have low memory and time complexities, which are independent of the number of time steps, rendering it suitable for online learning on low-power edge devices. To demonstrate the scalability of our proposed method, we have conducted extensive evaluations on event-based datasets spanning a wide range of applications, such as image and gesture recognition, audio classification, and optical flow estimation. In all the experiments, S-TLLR achieved high accuracy, comparable to BPTT, with a reduction in memory between $5-50\times$ and multiply-accumulate (MAC) operations between $1.3-6.6\times$.  ( 2 min )
    Understanding Sample Generation Strategies for Learning Heuristic Functions in Classical Planning. (arXiv:2211.13316v2 [cs.AI] UPDATED)
    We study the problem of learning good heuristic functions for classical planning tasks with neural networks based on samples represented by states with their cost-to-goal estimates. The heuristic function is learned for a state space and goal condition with the number of samples limited to a fraction of the size of the state space, and must generalize well for all states of the state space with the same goal condition. Our main goal is to better understand the influence of sample generation strategies on the performance of a greedy best-first heuristic search (GBFS) guided by a learned heuristic function. In a set of controlled experiments, we find that two main factors determine the quality of the learned heuristic: which states are included in the sample set and the quality of the cost-to-goal estimates. These two factors are dependent: having perfect cost-to-goal estimates is insufficient if the samples are not well distributed across the state space. We also study other effects, such as adding samples with high-value estimates. Based on our findings, we propose practical strategies to improve the quality of learned heuristics: three strategies that aim to generate more representative states and two strategies that improve the cost-to-goal estimates. Our practical strategies almost double the mean coverage of a GBFS algorithm guided by a learned heuristic.  ( 3 min )
    Learning Exactly Linearizable Deep Dynamics Models. (arXiv:2311.18261v1 [eess.SY])
    Research on control using models based on machine-learning methods has now shifted to the practical engineering stage. Achieving high performance and theoretically guaranteeing the safety of the system is critical for such applications. In this paper, we propose a learning method for exactly linearizable dynamical models that can easily apply various control theories to ensure stability, reliability, etc., and to provide a high degree of freedom of expression. As an example, we present a design that combines simple linear control and control barrier functions. The proposed model is employed for the real-time control of an automotive engine, and the results demonstrate good predictive performance and stable control under constraints.  ( 2 min )
    Semiparametric Efficient Inference in Adaptive Experiments. (arXiv:2311.18274v1 [stat.ML])
    We consider the problem of efficient inference of the Average Treatment Effect in a sequential experiment where the policy governing the assignment of subjects to treatment or control can change over time. We first provide a central limit theorem for the Adaptive Augmented Inverse-Probability Weighted estimator, which is semiparametric efficient, under weaker assumptions than those previously made in the literature. This central limit theorem enables efficient inference at fixed sample sizes. We then consider a sequential inference setting, deriving both asymptotic and nonasymptotic confidence sequences that are considerably tighter than previous methods. These anytime-valid methods enable inference under data-dependent stopping times (sample sizes). Additionally, we use propensity score truncation techniques from the recent off-policy estimation literature to reduce the finite sample variance of our estimator without affecting the asymptotic variance. Empirical results demonstrate that our methods yield narrower confidence sequences than those previously developed in the literature while maintaining time-uniform error control.  ( 2 min )
    Exploring the Temperature-Dependent Phase Transition in Modern Hopfield Networks. (arXiv:2311.18434v1 [cs.LG])
    The recent discovery of a connection between Transformers and Modern Hopfield Networks (MHNs) has reignited the study of neural networks from a physical energy-based perspective. This paper focuses on the pivotal effect of the inverse temperature hyperparameter $\beta$ on the distribution of energy minima of the MHN. To achieve this, the distribution of energy minima is tracked in a simplified MHN in which equidistant normalised patterns are stored. This network demonstrates a phase transition at a critical temperature $\beta_{\text{c}}$, from a single global attractor towards highly pattern specific minima as $\beta$ is increased. Importantly, the dynamics are not solely governed by the hyperparameter $\beta$ but are instead determined by an effective inverse temperature $\beta_{\text{eff}}$ which also depends on the distribution and size of the stored patterns. Recognizing the role of hyperparameters in the MHN could, in the future, aid researchers in the domain of Transformers to optimise their initial choices, potentially reducing the necessity for time and energy expensive hyperparameter fine-tuning.  ( 2 min )
    Linear normalised hash function for clustering gene sequences and identifying reference sequences from multiple sequence alignments. (arXiv:2311.17964v1 [q-bio.GN])
    The aim of this study was to develop a method that would identify the cluster centroids and the optimal number of clusters for a given sensitivity level and could work equally well for the different sequence datasets. A novel method that combines the linear mapping hash function and multiple sequence alignment (MSA) was developed. This method takes advantage of the already sorted by similarity sequences from the MSA output, and identifies the optimal number of clusters, clusters cut-offs, and clusters centroids that can represent reference gene vouchers for the different species. The linear mapping hash function can map an already ordered by similarity distance matrix to indices to reveal gaps in the values around which the optimal cut-offs of the different clusters can be identified. The method was evaluated using sets of closely related (16S rRNA gene sequences of Nocardia species) and highly variable (VP1 genomic region of Enterovirus 71) sequences and outperformed existing unsupervised machine learning clustering methods and dimensionality reduction methods. This method does not require prior knowledge of the number of clusters or the distance between clusters, handles clusters of different sizes and shapes, and scales linearly with the dataset. The combination of MSA with the linear mapping hash function is a computationally efficient way of gene sequence clustering and can be a valuable tool for the assessment of similarity, clustering of different microbial genomes, identifying reference sequences, and for the study of evolution of bacteria and viruses.  ( 3 min )
    Combining deep generative models with extreme value theory for synthetic hazard simulation: a multivariate and spatially coherent approach. (arXiv:2311.18521v1 [cs.LG])
    Climate hazards can cause major disasters when they occur simultaneously as compound hazards. To understand the distribution of climate risk and inform adaptation policies, scientists need to simulate a large number of physically realistic and spatially coherent events. Current methods are limited by computational constraints and the probabilistic spatial distribution of compound events is not given sufficient attention. The bottleneck in current approaches lies in modelling the dependence structure between variables, as inference on parametric models suffers from the curse of dimensionality. Generative adversarial networks (GANs) are well-suited to such a problem due to their ability to implicitly learn the distribution of data in high-dimensional settings. We employ a GAN to model the dependence structure for daily maximum wind speed, significant wave height, and total precipitation over the Bay of Bengal, combining this with traditional extreme value theory for controlled extrapolation of the tails. Once trained, the model can be used to efficiently generate thousands of realistic compound hazard events, which can inform climate risk assessments for climate adaptation and disaster preparedness. The method developed is flexible and transferable to other multivariate and spatial climate datasets.  ( 2 min )
    Fair Community Detection and Structure Learning in Heterogeneous Graphical Models. (arXiv:2112.05128v2 [stat.ML] UPDATED)
    Inference of community structure in probabilistic graphical models may not be consistent with fairness constraints when nodes have demographic attributes. Certain demographics may be over-represented in some detected communities and under-represented in others. This paper defines a novel $\ell_1$-regularized pseudo-likelihood approach for fair graphical model selection. In particular, we assume there is some community or clustering structure in the true underlying graph, and we seek to learn a sparse undirected graph and its communities from the data such that demographic groups are fairly represented within the communities. In the case when the graph is known a priori, we provide a convex semidefinite programming approach for fair community detection. We establish the statistical consistency of the proposed method for both a Gaussian graphical model and an Ising model for, respectively, continuous and binary data, proving that our method can recover the graphs and their fair communities with high probability.  ( 2 min )
    Age Effects on Decision-Making, Drift Diffusion Model. (arXiv:2311.18376v1 [q-bio.NC])
    Training can improve human decision-making performance. After several training sessions, a person can quickly and accurately complete a task. However, decision-making is always a trade-off between accuracy and response time. Factors such as age and drug abuse can affect the decision-making process. This study examines how training can improve the performance of different age groups in completing a random dot motion (RDM) task. The participants are divided into two groups: old and young. They undergo a three-phase training and then repeat the same RDM task. The hierarchical drift-diffusion model analyzes the subjects' responses and determines how the model's parameters change after training for both age groups. The results show that after training, the participants were able to accumulate sensory information faster, and the model drift rate increased. However, their decision boundary decreased as they became more confident and had a lower decision-making threshold. Additionally, the old group had a higher boundary and lower drift rate in both pre and post-training, and there was less difference between the two group parameters after training.  ( 2 min )
    Improving Faithfulness for Vision Transformers. (arXiv:2311.17983v1 [cs.CV])
    Vision Transformers (ViTs) have achieved state-of-the-art performance for various vision tasks. One reason behind the success lies in their ability to provide plausible innate explanations for the behavior of neural architectures. However, ViTs suffer from issues with explanation faithfulness, as their focal points are fragile to adversarial attacks and can be easily changed with even slight perturbations on the input image. In this paper, we propose a rigorous approach to mitigate these issues by introducing Faithful ViTs (FViTs). Briefly speaking, an FViT should have the following two properties: (1) The top-$k$ indices of its self-attention vector should remain mostly unchanged under input perturbation, indicating stable explanations; (2) The prediction distribution should be robust to perturbations. To achieve this, we propose a new method called Denoised Diffusion Smoothing (DDS), which adopts randomized smoothing and diffusion-based denoising. We theoretically prove that processing ViTs directly with DDS can turn them into FViTs. We also show that Gaussian noise is nearly optimal for both $\ell_2$ and $\ell_\infty$-norm cases. Finally, we demonstrate the effectiveness of our approach through comprehensive experiments and evaluations. Specifically, we compare our FViTs with other baselines through visual interpretation and robustness accuracy under adversarial attacks. Results show that FViTs are more robust against adversarial attacks while maintaining the explainability of attention, indicating higher faithfulness.  ( 2 min )
    Poisoning Attacks Against Contrastive Recommender Systems. (arXiv:2311.18244v1 [cs.IR])
    Contrastive learning (CL) has recently gained significant popularity in the field of recommendation. Its ability to learn without heavy reliance on labeled data is a natural antidote to the data sparsity issue. Previous research has found that CL can not only enhance recommendation accuracy but also inadvertently exhibit remarkable robustness against noise. However, this paper identifies a vulnerability of CL-based recommender systems: Compared with their non-CL counterparts, they are even more susceptible to poisoning attacks that aim to promote target items. Our analysis points to the uniform dispersion of representations led by the CL loss as the very factor that accounts for this vulnerability. We further theoretically and empirically demonstrate that the optimization of CL loss can lead to smooth spectral values of representations. Based on these insights, we attempt to reveal the potential poisoning attacks against CL-based recommender systems. The proposed attack encompasses a dual-objective framework: One that induces a smoother spectral value distribution to amplify the CL loss's inherent dispersion effect, named dispersion promotion; and the other that directly elevates the visibility of target items, named rank promotion. We validate the destructiveness of our attack model through extensive experimentation on four datasets. By shedding light on these vulnerabilities, we aim to facilitate the development of more robust CL-based recommender systems.  ( 2 min )
    Handwriting recognition and automatic scoring for descriptive answers in Japanese language tests. (arXiv:2201.03215v2 [cs.LG] UPDATED)
    This paper presents an experiment of automatically scoring handwritten descriptive answers in the trial tests for the new Japanese university entrance examination, which were made for about 120,000 examinees in 2017 and 2018. There are about 400,000 answers with more than 20 million characters. Although all answers have been scored by human examiners, handwritten characters are not labeled. We present our attempt to adapt deep neural network-based handwriting recognizers trained on a labeled handwriting dataset into this unlabeled answer set. Our proposed method combines different training strategies, ensembles multiple recognizers, and uses a language model built from a large general corpus to avoid overfitting into specific data. In our experiment, the proposed method records character accuracy of over 97% using about 2,000 verified labeled answers that account for less than 0.5% of the dataset. Then, the recognized answers are fed into a pre-trained automatic scoring system based on the BERT model without correcting misrecognized characters and providing rubric annotations. The automatic scoring system achieves from 0.84 to 0.98 of Quadratic Weighted Kappa (QWK). As QWK is over 0.8, it represents an acceptable similarity of scoring between the automatic scoring system and the human examiners. These results are promising for further research on end-to-end automatic scoring of descriptive answers.
    Applying Bayesian Ridge Regression AI Modeling in Virus Severity Prediction. (arXiv:2310.09485v2 [cs.LG] UPDATED)
    Artificial intelligence (AI) is a powerful tool for reshaping healthcare systems. In healthcare, AI is invaluable for its capacity to manage vast amounts of data, which can lead to more accurate and speedy diagnoses, ultimately easing the workload on healthcare professionals. As a result, AI has proven itself to be a power tool across various industries, simplifying complex tasks and pattern recognition that would otherwise be overwhelming for humans or traditional computer algorithms. In this paper, we review the strengths and weaknesses of Bayesian Ridge Regression, an AI model that can be used to bring cutting edge virus analysis to healthcare professionals around the world. The model's accuracy assessment revealed promising results, with room for improvement primarily related to data organization. In addition, the severity index serves as a valuable tool to gain a broad overview of patient care needs, aligning with healthcare professionals' preference for broader categorizations.
    Editing Large Language Models: Problems, Methods, and Opportunities. (arXiv:2305.13172v3 [cs.CL] UPDATED)
    Despite the ability to train capable LLMs, the methodology for maintaining their relevancy and rectifying errors remains elusive. To this end, the past few years have witnessed a surge in techniques for editing LLMs, the objective of which is to efficiently alter the behavior of LLMs within a specific domain without negatively impacting performance across other inputs. This paper embarks on a deep exploration of the problems, methods, and opportunities related to model editing for LLMs. In particular, we provide an exhaustive overview of the task definition and challenges associated with model editing, along with an in-depth empirical analysis of the most progressive methods currently at our disposal. We also build a new benchmark dataset to facilitate a more robust evaluation and pinpoint enduring issues intrinsic to existing techniques. Our objective is to provide valuable insights into the effectiveness and feasibility of each editing technique, thereby assisting the community in making informed decisions on the selection of the most appropriate method for a specific task or context. Code and datasets are available at https://github.com/zjunlp/EasyEdit.
    AI in Pharma for Personalized Sequential Decision-Making: Methods, Applications and Opportunities. (arXiv:2311.18725v1 [stat.ME])
    In the pharmaceutical industry, the use of artificial intelligence (AI) has seen consistent growth over the past decade. This rise is attributed to major advancements in statistical machine learning methodologies, computational capabilities and the increased availability of large datasets. AI techniques are applied throughout different stages of drug development, ranging from drug discovery to post-marketing benefit-risk assessment. Kolluri et al. provided a review of several case studies that span these stages, featuring key applications such as protein structure prediction, success probability estimation, subgroup identification, and AI-assisted clinical trial monitoring. From a regulatory standpoint, there was a notable uptick in submissions incorporating AI components in 2021. The most prevalent therapeutic areas leveraging AI were oncology (27%), psychiatry (15%), gastroenterology (12%), and neurology (11%). The paradigm of personalized or precision medicine has gained significant traction in recent research, partly due to advancements in AI techniques \cite{hamburg2010path}. This shift has had a transformative impact on the pharmaceutical industry. Departing from the traditional "one-size-fits-all" model, personalized medicine incorporates various individual factors, such as environmental conditions, lifestyle choices, and health histories, to formulate customized treatment plans. By utilizing sophisticated machine learning algorithms, clinicians and researchers are better equipped to make informed decisions in areas such as disease prevention, diagnosis, and treatment selection, thereby optimizing health outcomes for each individual.
    Generative Models for Anomaly Detection and Design-Space Dimensionality Reduction in Shape Optimization. (arXiv:2308.04051v2 [stat.ML] UPDATED)
    Our work presents a novel approach to shape optimization, with the twofold objective to improve the efficiency of global optimization algorithms while promoting the generation of high-quality designs during the optimization process free of geometrical anomalies. This is accomplished by reducing the number of the original design variables defining a new reduced subspace where the geometrical variance is maximized and modeling the underlying generative process of the data via probabilistic linear latent variable models such as factor analysis and probabilistic principal component analysis. We show that the data follows approximately a Gaussian distribution when the shape modification method is linear and the design variables are sampled uniformly at random, due to the direct application of the central limit theorem. The degree of anomalousness is measured in terms of Mahalanobis distance, and the paper demonstrates that abnormal designs tend to exhibit a high value of this metric. This enables the definition of a new optimization model where anomalous geometries are penalized and consequently avoided during the optimization loop. The procedure is demonstrated for hull shape optimization of the DTMB 5415 model, extensively used as an international benchmark for shape optimization problems. The global optimization routine is carried out using Bayesian optimization and the DIRECT algorithm. From the numerical results, the new framework improves the convergence of global optimization algorithms, while only designs with high-quality geometrical features are generated through the optimization routine thereby avoiding the wastage of precious computationally expensive simulations.
    On Exact Inversion of DPM-Solvers. (arXiv:2311.18387v1 [cs.CV])
    Diffusion probabilistic models (DPMs) are a key component in modern generative models. DPM-solvers have achieved reduced latency and enhanced quality significantly, but have posed challenges to find the exact inverse (i.e., finding the initial noise from the given image). Here we investigate the exact inversions for DPM-solvers and propose algorithms to perform them when samples are generated by the first-order as well as higher-order DPM-solvers. For each explicit denoising step in DPM-solvers, we formulated the inversions using implicit methods such as gradient descent or forward step method to ensure the robustness to large classifier-free guidance unlike the prior approach using fixed-point iteration. Experimental results demonstrated that our proposed exact inversion methods significantly reduced the error of both image and noise reconstructions, greatly enhanced the ability to distinguish invisible watermarks and well prevented unintended background changes consistently during image editing. Project page: \url{https://smhongok.github.io/inv-dpm.html}.
    Revisiting Proposal-based Object Detection. (arXiv:2311.18512v1 [cs.CV])
    This paper revisits the pipeline for detecting objects in images with proposals. For any object detector, the obtained box proposals or queries need to be classified and regressed towards ground truth boxes. The common solution for the final predictions is to directly maximize the overlap between each proposal and the ground truth box, followed by a winner-takes-all ranking or non-maximum suppression. In this work, we propose a simple yet effective alternative. For proposal regression, we solve a simpler problem where we regress to the area of intersection between proposal and ground truth. In this way, each proposal only specifies which part contains the object, avoiding a blind inpainting problem where proposals need to be regressed beyond their visual scope. In turn, we replace the winner-takes-all strategy and obtain the final prediction by taking the union over the regressed intersections of a proposal group surrounding an object. Our revisited approach comes with minimal changes to the detection pipeline and can be plugged into any existing method. We show that our approach directly improves canonical object detection and instance segmentation architectures, highlighting the utility of intersection-based regression and grouping.
    FediOS: Decoupling Orthogonal Subspaces for Personalization in Feature-skew Federated Learning. (arXiv:2311.18559v1 [cs.LG])
    Personalized federated learning (pFL) enables collaborative training among multiple clients to enhance the capability of customized local models. In pFL, clients may have heterogeneous (also known as non-IID) data, which poses a key challenge in how to decouple the data knowledge into generic knowledge for global sharing and personalized knowledge for preserving local personalization. A typical way of pFL focuses on label distribution skew, and they adopt a decoupling scheme where the model is split into a common feature extractor and two prediction heads (generic and personalized). However, such a decoupling scheme cannot solve the essential problem of feature skew heterogeneity, because a common feature extractor cannot decouple the generic and personalized features. Therefore, in this paper, we rethink the architecture decoupling design for feature-skew pFL and propose an effective pFL method called FediOS. In FediOS, we reformulate the decoupling into two feature extractors (generic and personalized) and one shared prediction head. Orthogonal projections are used for clients to map the generic features into one common subspace and scatter the personalized features into different subspaces to achieve decoupling for them. In addition, a shared prediction head is trained to balance the importance of generic and personalized features during inference. Extensive experiments on four vision datasets demonstrate our method reaches state-of-the-art pFL performances under feature skew heterogeneity.
    Benchmarking Robustness to Adversarial Image Obfuscations. (arXiv:2301.12993v2 [cs.CV] UPDATED)
    Automated content filtering and moderation is an important tool that allows online platforms to build striving user communities that facilitate cooperation and prevent abuse. Unfortunately, resourceful actors try to bypass automated filters in a bid to post content that violate platform policies and codes of conduct. To reach this goal, these malicious actors may obfuscate policy violating images (e.g. overlay harmful images by carefully selected benign images or visual patterns) to prevent machine learning models from reaching the correct decision. In this paper, we invite researchers to tackle this specific issue and present a new image benchmark. This benchmark, based on ImageNet, simulates the type of obfuscations created by malicious actors. It goes beyond ImageNet-$\textrm{C}$ and ImageNet-$\bar{\textrm{C}}$ by proposing general, drastic, adversarial modifications that preserve the original content intent. It aims to tackle a more common adversarial threat than the one considered by $\ell_p$-norm bounded adversaries. We evaluate 33 pretrained models on the benchmark and train models with different augmentations, architectures and training methods on subsets of the obfuscations to measure generalization. We hope this benchmark will encourage researchers to test their models and methods and try to find new approaches that are more robust to these obfuscations.
    KOPPA: Improving Prompt-based Continual Learning with Key-Query Orthogonal Projection and Prototype-based One-Versus-All. (arXiv:2311.15414v2 [cs.LG] UPDATED)
    Drawing inspiration from prompt tuning techniques applied to Large Language Models, recent methods based on pre-trained ViT networks have achieved remarkable results in the field of Continual Learning. Specifically, these approaches propose to maintain a set of prompts and allocate a subset of them to learn each task using a key-query matching strategy. However, they may encounter limitations when lacking control over the correlations between old task queries and keys of future tasks, the shift of features in the latent space, and the relative separation of latent vectors learned in independent tasks. In this work, we introduce a novel key-query learning strategy based on orthogonal projection, inspired by model-agnostic meta-learning, to enhance prompt matching efficiency and address the challenge of shifting features. Furthermore, we introduce a One-Versus-All (OVA) prototype-based component that enhances the classification head distinction. Experimental results on benchmark datasets demonstrate that our method empowers the model to achieve results surpassing those of current state-of-the-art approaches by a large margin of up to 20%.
    Language Models as Black-Box Optimizers for Vision-Language Models. (arXiv:2309.05950v3 [cs.CL] UPDATED)
    Vision-language models (VLMs) pre-trained on web-scale datasets have demonstrated remarkable capabilities on downstream tasks when fine-tuned with minimal data. However, many VLMs rely on proprietary data and are not open-source, which restricts the use of white-box approaches for fine-tuning. As such, we aim to develop a black-box approach to optimize VLMs through natural language prompts, thereby avoiding the need to access model parameters, feature embeddings, or even output logits. We propose employing chat-based LLMs to search for the best text prompt for VLMs. Specifically, we adopt an automatic hill-climbing procedure that converges to an effective prompt by evaluating the performance of current prompts and asking LLMs to refine them based on textual feedback, all within a conversational process without human-in-the-loop. In a challenging 1-shot image classification setup, our simple approach surpasses the white-box continuous prompting method (CoOp) by an average of 1.5% across 11 datasets including ImageNet. Our approach also outperforms both human-engineered and LLM-generated prompts. We highlight the advantage of conversational feedback that incorporates both positive and negative prompts, suggesting that LLMs can utilize the implicit gradient direction in textual feedback for a more efficient search. In addition, we find that the text prompts generated through our strategy are not only more interpretable but also transfer well across different VLM architectures in a black-box manner. Lastly, we demonstrate our framework on a state-of-the-art black-box VLM (DALL-E 3) for text-to-image optimization.
    LocoMuJoCo: A Comprehensive Imitation Learning Benchmark for Locomotion. (arXiv:2311.02496v2 [cs.LG] UPDATED)
    Imitation Learning (IL) holds great promise for enabling agile locomotion in embodied agents. However, many existing locomotion benchmarks primarily focus on simplified toy tasks, often failing to capture the complexity of real-world scenarios and steering research toward unrealistic domains. To advance research in IL for locomotion, we present a novel benchmark designed to facilitate rigorous evaluation and comparison of IL algorithms. This benchmark encompasses a diverse set of environments, including quadrupeds, bipeds, and musculoskeletal human models, each accompanied by comprehensive datasets, such as real noisy motion capture data, ground truth expert data, and ground truth sub-optimal data, enabling evaluation across a spectrum of difficulty levels. To increase the robustness of learned agents, we provide an easy interface for dynamics randomization and offer a wide range of partially observable tasks to train agents across different embodiments. Finally, we provide handcrafted metrics for each task and ship our benchmark with state-of-the-art baseline algorithms to ease evaluation and enable fast benchmarking.
    Global Convergence of Online Identification for Mixed Linear Regression. (arXiv:2311.18506v1 [stat.ML])
    Mixed linear regression (MLR) is a powerful model for characterizing nonlinear relationships by utilizing a mixture of linear regression sub-models. The identification of MLR is a fundamental problem, where most of the existing results focus on offline algorithms, rely on independent and identically distributed (i.i.d) data assumptions, and provide local convergence results only. This paper investigates the online identification and data clustering problems for two basic classes of MLRs, by introducing two corresponding new online identification algorithms based on the expectation-maximization (EM) principle. It is shown that both algorithms will converge globally without resorting to the traditional i.i.d data assumptions. The main challenge in our investigation lies in the fact that the gradient of the maximum likelihood function does not have a unique zero, and a key step in our analysis is to establish the stability of the corresponding differential equation in order to apply the celebrated Ljung's ODE method. It is also shown that the within-cluster error and the probability that the new data is categorized into the correct cluster are asymptotically the same as those in the case of known parameters. Finally, numerical simulations are provided to verify the effectiveness of our online algorithms.
    The Sliding Regret in Stochastic Bandits: Discriminating Index and Randomized Policies. (arXiv:2311.18437v1 [cs.LG])
    This paper studies the one-shot behavior of no-regret algorithms for stochastic bandits. Although many algorithms are known to be asymptotically optimal with respect to the expected regret, over a single run, their pseudo-regret seems to follow one of two tendencies: it is either smooth or bumpy. To measure this tendency, we introduce a new notion: the sliding regret, that measures the worst pseudo-regret over a time-window of fixed length sliding to infinity. We show that randomized methods (e.g. Thompson Sampling and MED) have optimal sliding regret, while index policies, although possibly asymptotically optimal for the expected regret, have the worst possible sliding regret under regularity conditions on their index (e.g. UCB, UCB-V, KL-UCB, MOSS, IMED etc.). We further analyze the average bumpiness of the pseudo-regret of index policies via the regret of exploration, that we show to be suboptimal as well.
    Toward the Tradeoffs between Privacy, Fairness and Utility in Federated Learning. (arXiv:2311.18190v1 [cs.LG])
    Federated Learning (FL) is a novel privacy-protection distributed machine learning paradigm that guarantees user privacy and prevents the risk of data leakage due to the advantage of the client's local training. Researchers have struggled to design fair FL systems that ensure fairness of results. However, the interplay between fairness and privacy has been less studied. Increasing the fairness of FL systems can have an impact on user privacy, while an increase in user privacy can affect fairness. In this work, on the client side, we use fairness metrics, such as Demographic Parity (DemP), Equalized Odds (EOs), and Disparate Impact (DI), to construct the local fair model. To protect the privacy of the client model, we propose a privacy-protection fairness FL method. The results show that the accuracy of the fair model with privacy increases because privacy breaks the constraints of the fairness metrics. In our experiments, we conclude the relationship between privacy, fairness and utility, and there is a tradeoff between these.
    Automatic Functional Differentiation in JAX. (arXiv:2311.18727v1 [cs.PL])
    We extend JAX with the capability to automatically differentiate higher-order functions (functionals and operators). By representing functions as a generalization of arrays, we seamlessly use JAX's existing primitive system to implement higher-order functions. We present a set of primitive operators that serve as foundational building blocks for constructing several key types of functionals. For every introduced primitive operator, we derive and implement both linearization and transposition rules, aligning with JAX's internal protocols for forward and reverse mode automatic differentiation. This enhancement allows for functional differentiation in the same syntax traditionally use for functions. The resulting functional gradients are themselves functions ready to be invoked in python. We showcase this tool's efficacy and simplicity through applications where functional derivatives are indispensable. The source code of this work is released at https://github.com/sail-sg/autofd .
    C3Net: Compound Conditioned ControlNet for Multimodal Content Generation. (arXiv:2311.17951v1 [cs.LG])
    We present Compound Conditioned ControlNet, C3Net, a novel generative neural architecture taking conditions from multiple modalities and synthesizing multimodal contents simultaneously (e.g., image, text, audio). C3Net adapts the ControlNet architecture to jointly train and make inferences on a production-ready diffusion model and its trainable copies. Specifically, C3Net first aligns the conditions from multi-modalities to the same semantic latent space using modality-specific encoders based on contrastive training. Then, it generates multimodal outputs based on the aligned latent space, whose semantic information is combined using a ControlNet-like architecture called Control C3-UNet. Correspondingly, with this system design, our model offers an improved solution for joint-modality generation through learning and explaining multimodal conditions instead of simply taking linear interpolations on the latent space. Meanwhile, as we align conditions to a unified latent space, C3Net only requires one trainable Control C3-UNet to work on multimodal semantic information. Furthermore, our model employs unimodal pretraining on the condition alignment stage, outperforming the non-pretrained alignment even on relatively scarce training data and thus demonstrating high-quality compound condition generation. We contribute the first high-quality tri-modal validation set to validate quantitatively that C3Net outperforms or is on par with first and contemporary state-of-the-art multimodal generation. Our codes and tri-modal dataset will be released.
    Gated Ensemble of Spatio-temporal Mixture of Experts for Multi-task Learning in Ride-hailing System. (arXiv:2012.15408v3 [cs.LG] UPDATED)
    Designing spatio-temporal forecasting models separately in a task-wise and city-wise manner poses a burden for the expanding transportation network companies. Therefore, a multi-task learning architecture is proposed in this study by developing gated ensemble of spatio-temporal mixture of experts network (GESME-Net) with convolutional recurrent neural network (CRNN), convolutional neural network (CNN), and recurrent neural network (RNN) for simultaneously forecasting spatio-temporal tasks in a city as well as across different cities. Furthermore, a task adaptation layer is integrated with the architecture for learning joint representation in multi-task learning and revealing the contribution of the input features utilized in prediction. The proposed architecture is tested with data from Didi Chuxing for: (i) simultaneously forecasting demand and supply-demand gap in Beijing, and (ii) simultaneously forecasting demand across Chengdu and Xian. In both scenarios, models from our proposed architecture outperformed the single-task and multi-task deep learning benchmarks and ensemble-based machine learning algorithms.
    Predictable Reinforcement Learning Dynamics through Entropy Rate Minimization. (arXiv:2311.18703v1 [cs.LG])
    In Reinforcement Learning (RL), agents have no incentive to exhibit predictable behaviors, and are often pushed (through e.g. policy entropy regularization) to randomize their actions in favor of exploration. From a human perspective, this makes RL agents hard to interpret and predict, and from a safety perspective, even harder to formally verify. We propose a novel method to induce predictable behavior in RL agents, referred to as Predictability-Aware RL (PA-RL), which employs the state sequence entropy rate as a predictability measure. We show how the entropy rate can be formulated as an average reward objective, and since its entropy reward function is policy-dependent, we introduce an action-dependent surrogate entropy enabling the use of PG methods. We prove that deterministic policies minimizing the average surrogate reward exist and also minimize the actual entropy rate, and show how, given a learned dynamical model, we are able to approximate the value function associated to the true entropy rate. Finally, we demonstrate the effectiveness of the approach in RL tasks inspired by human-robot use-cases, and show how it produces agents with more predictable behavior while achieving near-optimal rewards.
    Detecting Anomalous Network Communication Patterns Using Graph Convolutional Networks. (arXiv:2311.18525v1 [cs.CR])
    To protect an organizations' endpoints from sophisticated cyberattacks, advanced detection methods are required. In this research, we present GCNetOmaly: a graph convolutional network (GCN)-based variational autoencoder (VAE) anomaly detector trained on data that include connection events among internal and external machines. As input, the proposed GCN-based VAE model receives two matrices: (i) the normalized adjacency matrix, which represents the connections among the machines, and (ii) the feature matrix, which includes various features (demographic, statistical, process-related, and Node2vec structural features) that are used to profile the individual nodes/machines. After training the model on data collected for a predefined time window, the model is applied on the same data; the reconstruction score obtained by the model for a given machine then serves as the machine's anomaly score. GCNetOmaly was evaluated on real, large-scale data logged by Carbon Black EDR from a large financial organization's automated teller machines (ATMs) as well as communication with Active Directory (AD) servers in two setups: unsupervised and supervised. The results of our evaluation demonstrate GCNetOmaly's effectiveness in detecting anomalous behavior of machines on unsupervised data.
    An Interventional Perspective on Identifiability in Gaussian LTI Systems with Independent Component Analysis. (arXiv:2311.18048v1 [cs.LG])
    We investigate the relationship between system identification and intervention design in dynamical systems. While previous research demonstrated how identifiable representation learning methods, such as Independent Component Analysis (ICA), can reveal cause-effect relationships, it relied on a passive perspective without considering how to collect data. Our work shows that in Gaussian Linear Time-Invariant (LTI) systems, the system parameters can be identified by introducing diverse intervention signals in a multi-environment setting. By harnessing appropriate diversity assumptions motivated by the ICA literature, our findings connect experiment design and representational identifiability in dynamical systems. We corroborate our findings on synthetic and (simulated) physical data. Additionally, we show that Hidden Markov Models, in general, and (Gaussian) LTI systems, in particular, fulfil a generalization of the Causal de Finetti theorem with continuous parameters.
    Mixed-Precision Quantization for Federated Learning on Resource-Constrained Heterogeneous Devices. (arXiv:2311.18129v1 [cs.LG])
    While federated learning (FL) systems often utilize quantization to battle communication and computational bottlenecks, they have heretofore been limited to deploying fixed-precision quantization schemes. Meanwhile, the concept of mixed-precision quantization (MPQ), where different layers of a deep learning model are assigned varying bit-width, remains unexplored in the FL settings. We present a novel FL algorithm, FedMPQ, which introduces mixed-precision quantization to resource-heterogeneous FL systems. Specifically, local models, quantized so as to satisfy bit-width constraint, are trained by optimizing an objective function that includes a regularization term which promotes reduction of precision in some of the layers without significant performance degradation. The server collects local model updates, de-quantizes them into full-precision models, and then aggregates them into a global model. To initialize the next round of local training, the server relies on the information learned in the previous training round to customize bit-width assignments of the models delivered to different clients. In extensive benchmarking experiments on several model architectures and different datasets in both iid and non-iid settings, FedMPQ outperformed the baseline FL schemes that utilize fixed-precision quantization while incurring only a minor computational overhead on the participating devices.
    Learning Deep O($n$)-Equivariant Hyperspheres. (arXiv:2305.15613v3 [cs.LG] UPDATED)
    This paper presents an approach to learning (deep) $n$D features equivariant under orthogonal transformations, utilizing hyperspheres and regular $n$-simplexes. Our main contributions are theoretical and tackle major challenges in geometric deep learning such as equivariance and invariance under geometric transformations. Namely, we enrich the recently developed theory of steerable 3D spherical neurons -- SO(3)-equivariant filter banks based on neurons with spherical decision surfaces -- by extending said neurons to $n$D, which we call deep equivariant hyperspheres, and enabling their multi-layer construction. Using synthetic and real-world data in $n$D, we experimentally verify our theoretical contributions and find that our approach is superior to the competing methods for benchmark datasets in all but one case, additionally demonstrating a better speed/performance trade-off in all but one other case.
    Learning for Semantic Knowledge Base-Guided Online Feature Transmission in Dynamic Channels. (arXiv:2311.18316v1 [cs.IT])
    With the proliferation of edge computing, efficient AI inference on edge devices has become essential for intelligent applications such as autonomous vehicles and VR/AR. In this context, we address the problem of efficient remote object recognition by optimizing feature transmission between mobile devices and edge servers. We propose an online optimization framework to address the challenge of dynamic channel conditions and device mobility in an end-to-end communication system. Our approach builds upon existing methods by leveraging a semantic knowledge base to drive multi-level feature transmission, accounting for temporal factors and dynamic elements throughout the transmission process. To solve the online optimization problem, we design a novel soft actor-critic-based deep reinforcement learning system with a carefully designed reward function for real-time decision-making, overcoming the optimization difficulty of the NP-hard problem and achieving the minimization of semantic loss while respecting latency constraints. Numerical results showcase the superiority of our approach compared to traditional greedy methods under various system setups.
    Continuous 16-bit Training: Accelerating 32-bit Pre-Trained Neural Networks. (arXiv:2311.18587v1 [cs.LG])
    In the field of deep learning, the prevalence of models initially trained with 32-bit precision is a testament to its robustness and accuracy. However, the continuous evolution of these models often demands further training, which can be resource-intensive. This study introduces a novel approach where we continue the training of these pre-existing 32-bit models using 16-bit precision. This technique not only caters to the need for efficiency in computational resources but also significantly improves the speed of additional training phases. By adopting 16-bit precision for ongoing training, we are able to substantially decrease memory requirements and computational burden, thereby accelerating the training process in a resource-limited setting. Our experiments show that this method maintains the high standards of accuracy set by the original 32-bit training while providing a much-needed boost in training speed. This approach is especially pertinent in today's context, where most models are initially trained in 32-bit and require periodic updates and refinements. The findings from our research suggest that this strategy of 16-bit continuation training can be a key solution for sustainable and efficient deep learning, offering a practical way to enhance pre-trained models rapidly and in a resource-conscious manner.
    Multi-scale Iterative Refinement towards Robust and Versatile Molecular Docking. (arXiv:2311.18574v1 [q-bio.BM])
    Molecular docking is a key computational tool utilized to predict the binding conformations of small molecules to protein targets, which is fundamental in the design of novel drugs. Despite recent advancements in geometric deep learning-based approaches leading to improvements in blind docking efficiency, these methods have encountered notable challenges, such as limited generalization performance on unseen proteins, the inability to concurrently address the settings of blind docking and site-specific docking, and the frequent occurrence of physical implausibilities such as inter-molecular steric clash. In this study, we introduce DeltaDock, a robust and versatile framework designed for efficient molecular docking to overcome these challenges. DeltaDock operates in a two-step process: rapid initial complex structures sampling followed by multi-scale iterative refinement of the initial structures. In the initial stage, to sample accurate structures with high efficiency, we develop a ligand-dependent binding site prediction model founded on large protein models and graph neural networks. This model is then paired with GPU-accelerated sampling algorithms. The sampled structures are updated using a multi-scale iterative refinement module that captures both protein-ligand atom-atom interactions and residue-atom interactions in the following stage. Distinct from previous geometric deep learning methods that are conditioned on the blind docking setting, DeltaDock demonstrates superior performance in both blind docking and site-specific docking settings. Comprehensive experimental results reveal that DeltaDock consistently surpasses baseline methods in terms of docking accuracy. Furthermore, it displays remarkable generalization capabilities and proficiency for predicting physically valid structures, thereby attesting to its robustness and reliability in various scenarios.
    TurkishBERTweet: Fast and Reliable Large Language Model for Social Media Analysis. (arXiv:2311.18063v1 [cs.CL])
    Turkish is one of the most popular languages in the world. Wide us of this language on social media platforms such as Twitter, Instagram, or Tiktok and strategic position of the country in the world politics makes it appealing for the social network researchers and industry. To address this need, we introduce TurkishBERTweet, the first large scale pre-trained language model for Turkish social media built using almost 900 million tweets. The model shares the same architecture as base BERT model with smaller input length, making TurkishBERTweet lighter than BERTurk and can have significantly lower inference time. We trained our model using the same approach for RoBERTa model and evaluated on two text classification tasks: Sentiment Classification and Hate Speech Detection. We demonstrate that TurkishBERTweet outperforms the other available alternatives on generalizability and its lower inference time gives significant advantage to process large-scale datasets. We also compared our models with the commercial OpenAI solutions in terms of cost and performance to demonstrate TurkishBERTweet is scalable and cost-effective solution. As part of our research, we released TurkishBERTweet and fine-tuned LoRA adapters for the mentioned tasks under the MIT License to facilitate future research and applications on Turkish social media. Our TurkishBERTweet model is available at: https://github.com/ViralLab/TurkishBERTweet
    ADA-GP: Accelerating DNN Training By Adaptive Gradient Prediction. (arXiv:2305.13236v2 [cs.LG] UPDATED)
    Neural network training is inherently sequential where the layers finish the forward propagation in succession, followed by the calculation and back-propagation of gradients (based on a loss function) starting from the last layer. The sequential computations significantly slow down neural network training, especially the deeper ones. Prediction has been successfully used in many areas of computer architecture to speed up sequential processing. Therefore, we propose ADA-GP, which uses gradient prediction adaptively to speed up deep neural network (DNN) training while maintaining accuracy. ADA-GP works by incorporating a small neural network to predict gradients for different layers of a DNN model. ADA-GP uses a novel tensor reorganization method to make it feasible to predict a large number of gradients. ADA-GP alternates between DNN training using backpropagated gradients and DNN training using predicted gradients. ADA-GP adaptively adjusts when and for how long gradient prediction is used to strike a balance between accuracy and performance. Last but not least, we provide a detailed hardware extension in a typical DNN accelerator to realize the speed up potential from gradient prediction. Our extensive experiments with fifteen DNN models show that ADA-GP can achieve an average speed up of 1.47X with similar or even higher accuracy than the baseline models. Moreover, it consumes, on average, 34% less energy due to reduced off-chip memory accesses compared to the baseline accelerator.
    Categorical Traffic Transformer: Interpretable and Diverse Behavior Prediction with Tokenized Latent. (arXiv:2311.18307v1 [cs.LG])
    Adept traffic models are critical to both planning and closed-loop simulation for autonomous vehicles (AV), and key design objectives include accuracy, diverse multimodal behaviors, interpretability, and downstream compatibility. Recently, with the advent of large language models (LLMs), an additional desirable feature for traffic models is LLM compatibility. We present Categorical Traffic Transformer (CTT), a traffic model that outputs both continuous trajectory predictions and tokenized categorical predictions (lane modes, homotopies, etc.). The most outstanding feature of CTT is its fully interpretable latent space, which enables direct supervision of the latent variable from the ground truth during training and avoids mode collapse completely. As a result, CTT can generate diverse behaviors conditioned on different latent modes with semantic meanings while beating SOTA on prediction accuracy. In addition, CTT's ability to input and output tokens enables integration with LLMs for common-sense reasoning and zero-shot generalization.
    Match me if you can: Semantic Correspondence Learning with Unpaired Images. (arXiv:2311.18540v1 [cs.CV])
    Recent approaches for semantic correspondence have focused on obtaining high-quality correspondences using a complicated network, refining the ambiguous or noisy matching points. Despite their performance improvements, they remain constrained by the limited training pairs due to costly point-level annotations. This paper proposes a simple yet effective method that performs training with unlabeled pairs to complement both limited image pairs and sparse point pairs, requiring neither extra labeled keypoints nor trainable modules. We fundamentally extend the data quantity and variety by augmenting new unannotated pairs not primitively provided as training pairs in benchmarks. Using a simple teacher-student framework, we offer reliable pseudo correspondences to the student network via machine supervision. Finally, the performance of our network is steadily improved by the proposed iterative training, putting back the student as a teacher to generate refined labels and train a new student repeatedly. Our models outperform the milestone baselines, including state-of-the-art methods on semantic correspondence benchmarks.
    A Probabilistic Method to Predict Classifier Accuracy on Larger Datasets given Small Pilot Data. (arXiv:2311.18025v1 [cs.LG])
    Practitioners building classifiers often start with a smaller pilot dataset and plan to grow to larger data in the near future. Such projects need a toolkit for extrapolating how much classifier accuracy may improve from a 2x, 10x, or 50x increase in data size. While existing work has focused on finding a single "best-fit" curve using various functional forms like power laws, we argue that modeling and assessing the uncertainty of predictions is critical yet has seen less attention. In this paper, we propose a Gaussian process model to obtain probabilistic extrapolations of accuracy or similar performance metrics as dataset size increases. We evaluate our approach in terms of error, likelihood, and coverage across six datasets. Though we focus on medical tasks and image modalities, our open source approach generalizes to any kind of classifier.
    Dynamic Scheduling of a Multiclass Queue in the Halfin-Whitt Regime: A Computational Approach for High-Dimensional Problems. (arXiv:2311.18128v1 [eess.SY])
    We consider a multi-class queueing model of a telephone call center, in which a system manager dynamically allocates available servers to customer calls. Calls can terminate through either service completion or customer abandonment, and the manager strives to minimize the expected total of holding costs plus abandonment costs over a finite horizon. Focusing on the Halfin-Whitt heavy traffic regime, we derive an approximating diffusion control problem, and building on earlier work by Han et al. (2018), develop a simulation-based computational method for solution of such problems, one that relies heavily on deep neural network technology. Using this computational method, we propose a policy for the original (pre-limit) call center scheduling problem. Finally, the performance of this policy is assessed using test problems based on publicly available call center data. For the test problems considered so far, our policy does as well as the best benchmark we could find. Moreover, our method is computationally feasible at least up to dimension 100, that is, for call centers with 100 or more distinct customer classes.
    Transfer Learning across Different Chemical Domains: Virtual Screening of Organic Materials with Deep Learning Models Pretrained on Small Molecule and Chemical Reaction Data. (arXiv:2311.18377v1 [physics.chem-ph])
    Machine learning prediction of organic materials properties is an efficient virtual screening method ahead of more expensive screening methods. However, this approach has suffered from insufficient labeled data on organic materials to train state-of-the-art machine learning models. In this study, we demonstrate that drug-like small molecule and chemical reaction databases can be used to pretrain the BERT model for the virtual screening of organic materials. Among the BERT models fine-tuned by five virtual screening tasks on organic materials, the USPTO-SMILES pretrained BERT model had R2 > 0.90 for two tasks and R2 > 0.82 for one, which was generally superior to the same models pretrained by the small molecule or organic materials databases, as well as to the other three traditional machine learning models trained directly on the virtual screening task data. The superior performance of the USPTO-SMILES pretrained BERT model is due to the greater variety of organic building blocks in the USPTO database and the broader coverage of the chemical space. The even better performance of the BERT model pretrained externally from a chemical reaction database with additional sources of chemical reactions strengthens our proof of concept that transfer learning across different chemical domains is practical for the virtual screening of organic materials.
    A Minimal Approach for Natural Language Action Space in Text-based Games. (arXiv:2305.04082v2 [cs.LG] UPDATED)
    Text-based games (TGs) are language-based interactive environments for reinforcement learning. While language models (LMs) and knowledge graphs (KGs) are commonly used for handling large action space in TGs, it is unclear whether these techniques are necessary or overused. In this paper, we revisit the challenge of exploring the action space in TGs and propose $ \epsilon$-admissible exploration, a minimal approach of utilizing admissible actions, for training phase. Additionally, we present a text-based actor-critic (TAC) agent that produces textual commands for game, solely from game observations, without requiring any KG or LM. Our method, on average across 10 games from Jericho, outperforms strong baselines and state-of-the-art agents that use LM and KG. Our approach highlights that a much lighter model design, with a fresh perspective on utilizing the information within the environments, suffices for an effective exploration of exponentially large action spaces.
    Description Generation using Variational Auto-Encoders for precursor microRNA. (arXiv:2311.17970v1 [q-bio.QM])
    Micro RNAs (miRNA) are a type of non-coding RNA, which are involved in gene regulation and can be associated with diseases such as cancer, cardiovascular and neurological diseases. As such, identifying the entire genome of miRNA can be of great relevance. Since experimental methods for novel precursor miRNA (pre-miRNA) detection are complex and expensive, computational detection using ML could be useful. Existing ML methods are often complex black boxes, which do not create an interpretable structural description of pre-miRNA. In this paper, we propose a novel framework, which makes use of generative modeling through Variational Auto-Encoders to uncover the generative factors of pre-miRNA. After training the VAE, the pre-miRNA description is developed using a decision tree on the lower dimensional latent space. Applying the framework to miRNA classification, we obtain a high reconstruction and classification performance, while also developing an accurate miRNA description.
    Automatic Implementation of Neural Networks through Reaction Networks -- Part I: Circuit Design and Convergence Analysis. (arXiv:2311.18313v1 [math.DS])
    Information processing relying on biochemical interactions in the cellular environment is essential for biological organisms. The implementation of molecular computational systems holds significant interest and potential in the fields of synthetic biology and molecular computation. This two-part article aims to introduce a programmable biochemical reaction network (BCRN) system endowed with mass action kinetics that realizes the fully connected neural network (FCNN) and has the potential to act automatically in vivo. In part I, the feedforward propagation computation, the backpropagation component, and all bridging processes of FCNN are ingeniously designed as specific BCRN modules based on their dynamics. This approach addresses a design gap in the biochemical assignment module and judgment termination module and provides a novel precise and robust realization of bi-molecular reactions for the learning process. Through equilibrium approaching, we demonstrate that the designed BCRN system achieves FCNN functionality with exponential convergence to target computational results, thereby enhancing the theoretical support for such work. Finally, the performance of this construction is further evaluated on two typical logic classification problems.
    Cycle Invariant Positional Encoding for Graph Representation Learning. (arXiv:2311.14333v2 [cs.LG] UPDATED)
    Cycles are fundamental elements in graph-structured data and have demonstrated their effectiveness in enhancing graph learning models. To encode such information into a graph learning framework, prior works often extract a summary quantity, ranging from the number of cycles to the more sophisticated persistence diagram summaries. However, more detailed information, such as which edges are encoded in a cycle, has not yet been used in graph neural networks. In this paper, we make one step towards addressing this gap, and propose a structure encoding module, called CycleNet, that encodes cycle information via edge structure encoding in a permutation invariant manner. To efficiently encode the space of all cycles, we start with a cycle basis (i.e., a minimal set of cycles generating the cycle space) which we compute via the kernel of the 1-dimensional Hodge Laplacian of the input graph. To guarantee the encoding is invariant w.r.t. the choice of cycle basis, we encode the cycle information via the orthogonal projector of the cycle basis, which is inspired by BasisNet proposed by Lim et al. We also develop a more efficient variant which however requires that the input graph has a unique shortest cycle basis. To demonstrate the effectiveness of the proposed module, we provide some theoretical understandings of its expressive power. Moreover, we show via a range of experiments that networks enhanced by our CycleNet module perform better in various benchmarks compared to several existing SOTA models.
    Diffusion Models Without Attention. (arXiv:2311.18257v1 [cs.CV])
    In recent advancements in high-fidelity image generation, Denoising Diffusion Probabilistic Models (DDPMs) have emerged as a key player. However, their application at high resolutions presents significant computational challenges. Current methods, such as patchifying, expedite processes in UNet and Transformer architectures but at the expense of representational capacity. Addressing this, we introduce the Diffusion State Space Model (DiffuSSM), an architecture that supplants attention mechanisms with a more scalable state space model backbone. This approach effectively handles higher resolutions without resorting to global compression, thus preserving detailed image representation throughout the diffusion process. Our focus on FLOP-efficient architectures in diffusion training marks a significant step forward. Comprehensive evaluations on both ImageNet and LSUN datasets at two resolutions demonstrate that DiffuSSMs are on par or even outperform existing diffusion models with attention modules in FID and Inception Score metrics while significantly reducing total FLOP usage.
    Can semi-supervised learning use all the data effectively? A lower bound perspective. (arXiv:2311.18557v1 [cs.LG])
    Prior works have shown that semi-supervised learning algorithms can leverage unlabeled data to improve over the labeled sample complexity of supervised learning (SL) algorithms. However, existing theoretical analyses focus on regimes where the unlabeled data is sufficient to learn a good decision boundary using unsupervised learning (UL) alone. This begs the question: Can SSL algorithms simultaneously improve upon both UL and SL? To this end, we derive a tight lower bound for 2-Gaussian mixture models that explicitly depends on the labeled and the unlabeled dataset size as well as the signal-to-noise ratio of the mixture distribution. Surprisingly, our result implies that no SSL algorithm can improve upon the minimax-optimal statistical error rates of SL or UL algorithms for these distributions. Nevertheless, we show empirically on real-world data that SSL algorithms can still outperform UL and SL methods. Therefore, our work suggests that, while proving performance gains for SSL algorithms is possible, it requires careful tracking of constants.
    Advancing Attack-Resilient Scheduling of Integrated Energy Systems with Demand Response via Deep Reinforcement Learning. (arXiv:2311.17941v1 [eess.SY])
    Optimally scheduling multi-energy flow is an effective method to utilize renewable energy sources (RES) and improve the stability and economy of integrated energy systems (IES). However, the stable demand-supply of IES faces challenges from uncertainties that arise from RES and loads, as well as the increasing impact of cyber-attacks with advanced information and communication technologies adoption. To address these challenges, this paper proposes an innovative model-free resilience scheduling method based on state-adversarial deep reinforcement learning (DRL) for integrated demand response (IDR)-enabled IES. The proposed method designs an IDR program to explore the interaction ability of electricity-gas-heat flexible loads. Additionally, a state-adversarial Markov decision process (SA-MDP) model characterizes the energy scheduling problem of IES under cyber-attack. The state-adversarial soft actor-critic (SA-SAC) algorithm is proposed to mitigate the impact of cyber-attacks on the scheduling strategy. Simulation results demonstrate that our method is capable of adequately addressing the uncertainties resulting from RES and loads, mitigating the impact of cyber-attacks on the scheduling strategy, and ensuring a stable demand supply for various energy sources. Moreover, the proposed method demonstrates resilience against cyber-attacks. Compared to the original soft actor-critic (SAC) algorithm, it achieves a 10\% improvement in economic performance under cyber-attack scenarios.
    Contextualized Policy Recovery: Modeling and Interpreting Medical Decisions with Adaptive Imitation Learning. (arXiv:2310.07918v2 [cs.LG] UPDATED)
    Interpretable policy learning seeks to estimate intelligible decision policies from observed actions; however, existing models fall short by forcing a tradeoff between accuracy and interpretability. This tradeoff limits data-driven interpretations of human decision-making process. e.g. to audit medical decisions for biases and suboptimal practices, we require models of decision processes which provide concise descriptions of complex behaviors. Fundamentally, existing approaches are burdened by this tradeoff because they represent the underlying decision process as a universal policy, when in fact human decisions are dynamic and can change drastically with contextual information. Thus, we propose Contextualized Policy Recovery (CPR), which re-frames the problem of modeling complex decision processes as a multi-task learning problem in which complex decision policies are comprised of context-specific policies. CPR models each context-specific policy as a linear observation-to-action mapping, and generates new decision models $\textit{on-demand}$ as contexts are updated with new observations. CPR is compatible with fully offline and partially observable decision environments, and can be tailored to incorporate any recurrent black-box model or interpretable decision model. We assess CPR through studies on simulated and real data, achieving state-of-the-art performance on the canonical tasks of predicting antibiotic prescription in intensive care units ($+22\%$ AUROC vs. previous SOTA) and predicting MRI prescription for Alzheimer's patients ($+7.7\%$ AUROC vs. previous SOTA). With this improvement in predictive performance, CPR closes the accuracy gap between interpretable and black-box methods for policy learning, allowing high-resolution exploration and analysis of context-specific decision models.
    Diffusion-TTA: Test-time Adaptation of Discriminative Models via Generative Feedback. (arXiv:2311.16102v2 [cs.CV] UPDATED)
    The advancements in generative modeling, particularly the advent of diffusion models, have sparked a fundamental question: how can these models be effectively used for discriminative tasks? In this work, we find that generative models can be great test-time adapters for discriminative models. Our method, Diffusion-TTA, adapts pre-trained discriminative models such as image classifiers, segmenters and depth predictors, to each unlabelled example in the test set using generative feedback from a diffusion model. We achieve this by modulating the conditioning of the diffusion model using the output of the discriminative model. We then maximize the image likelihood objective by backpropagating the gradients to discriminative model's parameters. We show Diffusion-TTA significantly enhances the accuracy of various large-scale pre-trained discriminative models, such as, ImageNet classifiers, CLIP models, image pixel labellers and image depth predictors. Diffusion-TTA outperforms existing test-time adaptation methods, including TTT-MAE and TENT, and particularly shines in online adaptation setups, where the discriminative model is continually adapted to each example in the test set. We provide access to code, results, and visualizations on our website: https://diffusion-tta.github.io/.
    Homogeneous Artificial Neural Network. (arXiv:2311.17973v1 [cs.LG])
    The paper proposes an artificial neural network (ANN) being a global approximator for a special class of functions, which are known as generalized homogeneous. The homogeneity means a symmetry of a function with respect to a group of transformations having topological characterization of a dilation. In this paper, a class of the so-called linear dilations is considered. A homogeneous universal approximation theorem is proven. Procedures for an upgrade of an existing ANN to a homogeneous one are developed. Theoretical results are supported by examples from the various domains (computer science, systems theory and automatic control).
    PatchBMI-Net: Lightweight Facial Patch-based Ensemble for BMI Prediction. (arXiv:2311.18102v1 [cs.CV])
    Due to an alarming trend related to obesity affecting 93.3 million adults in the United States alone, body mass index (BMI) and body weight have drawn significant interest in various health monitoring applications. Consequently, several studies have proposed self-diagnostic facial image-based BMI prediction methods for healthy weight monitoring. These methods have mostly used convolutional neural network (CNN) based regression baselines, such as VGG19, ResNet50, and Efficient-NetB0, for BMI prediction from facial images. However, the high computational requirement of these heavy-weight CNN models limits their deployment to resource-constrained mobile devices, thus deterring weight monitoring using smartphones. This paper aims to develop a lightweight facial patch-based ensemble (PatchBMI-Net) for BMI prediction to facilitate the deployment and weight monitoring using smartphones. Extensive experiments on BMI-annotated facial image datasets suggest that our proposed PatchBMI-Net model can obtain Mean Absolute Error (MAE) in the range [3.58, 6.51] with a size of about 3.3 million parameters. On cross-comparison with heavyweight models, such as ResNet-50 and Xception, trained for BMI prediction from facial images, our proposed PatchBMI-Net obtains equivalent MAE along with the model size reduction of about 5.4x and the average inference time reduction of about 3x when deployed on Apple-14 smartphone. Thus, demonstrating performance efficiency as well as low latency for on-device deployment and weight monitoring using smartphone applications.
    Heterogeneous Graph-based Trajectory Prediction using Local Map Context and Social Interactions. (arXiv:2311.18553v1 [cs.LG])
    Precisely predicting the future trajectories of surrounding traffic participants is a crucial but challenging problem in autonomous driving, due to complex interactions between traffic agents, map context and traffic rules. Vector-based approaches have recently shown to achieve among the best performances on trajectory prediction benchmarks. These methods model simple interactions between traffic agents but don't distinguish between relation-type and attributes like their distance along the road. Furthermore, they represent lanes only by sequences of vectors representing center lines and ignore context information like lane dividers and other road elements. We present a novel approach for vector-based trajectory prediction that addresses these shortcomings by leveraging three crucial sources of information: First, we model interactions between traffic agents by a semantic scene graph, that accounts for the nature and important features of their relation. Second, we extract agent-centric image-based map features to model the local map context. Finally, we generate anchor paths to enforce the policy in multi-modal prediction to permitted trajectories only. Each of these three enhancements shows advantages over the baseline model HoliGraph.
    Fantastic Weights and How to Find Them: Where to Prune in Dynamic Sparse Training. (arXiv:2306.12230v2 [cs.LG] UPDATED)
    Dynamic Sparse Training (DST) is a rapidly evolving area of research that seeks to optimize the sparse initialization of a neural network by adapting its topology during training. It has been shown that under specific conditions, DST is able to outperform dense models. The key components of this framework are the pruning and growing criteria, which are repeatedly applied during the training process to adjust the network's sparse connectivity. While the growing criterion's impact on DST performance is relatively well studied, the influence of the pruning criterion remains overlooked. To address this issue, we design and perform an extensive empirical analysis of various pruning criteria to better understand their impact on the dynamics of DST solutions. Surprisingly, we find that most of the studied methods yield similar results. The differences become more significant in the low-density regime, where the best performance is predominantly given by the simplest technique: magnitude-based pruning. The code is provided at https://github.com/alooow/fantastic_weights_paper
    SELF: Language-Driven Self-Evolution for Large Language Models. (arXiv:2310.00533v3 [cs.CL] UPDATED)
    Large Language Models (LLMs) have demonstrated remarkable versatility across various domains. To further advance LLMs, we propose 'SELF' (Self-Evolution with Language Feedback), a novel approach that enables LLMs to self-improve through self-reflection, akin to human learning processes. SELF initiates with a meta-skill learning process that equips the LLMs with capabilities for self-feedback and self-refinement. Subsequently, the model undergoes an iterative process of self-evolution. In each iteration, it utilizes an unlabeled dataset of instructions to generate initial responses. These responses are enhanced through self-feedback and self-refinement. The model is then fine-tuned using this enhanced data. The model undergoes progressive improvement through this iterative self-evolution process. Moreover, the SELF framework enables the model to apply self-refinement during inference, which further improves response quality. Our experiments in mathematics and general tasks demonstrate that SELF can enhance the capabilities of LLMs without human intervention. The SELF framework indicates a promising direction for the autonomous evolution of LLMs, transitioning them from passive information receivers to active participants in their development.
    On the Convergence of the ELBO to Entropy Sums. (arXiv:2209.03077v4 [stat.ML] UPDATED)
    The variational lower bound (a.k.a. ELBO or free energy) is the central objective for many established as well as many novel algorithms for unsupervised learning. Learning algorithms change model parameters such that the variational lower bound increases. Learning usually proceeds until parameters have converged to values close to a stationary point of the learning dynamics. In this purely theoretical contribution, we show that (for a very large class of generative models) the variational lower bound is at all stationary points of learning equal to a sum of entropies. For standard machine learning models with one set of latents and one set observed variables, the sum consists of three entropies: (A) the (average) entropy of the variational distributions, (B) the negative entropy of the model's prior distribution, and (C) the (expected) negative entropy of the observable distributions. The obtained result applies under realistic conditions including: finite numbers of data points, at any stationary points (including saddle points) and for any family of (well behaved) variational distributions. The class of generative models for which we show the equality to entropy sums contains many well-known generative models. As concrete examples we discuss Sigmoid Belief Networks, probabilistic PCA and (Gaussian and non-Gaussian) mixture models. The results also apply for standard (Gaussian) variational autoencoders, which has been shown in parallel (Damm et al., 2023). The prerequisites we use to show equality to entropy sums are relatively mild. Concretely, the distributions of a given generative model have to be of the exponential family (with constant base measure), and the model has to satisfy a parameterization criterion (which is usually fulfilled). Proving the equality of the ELBO to entropy sums at stationary points (under the stated conditions) is the main contribution of this work.
    Two-step reinforcement learning for model-free redesign of nonlinear optimal regulator. (arXiv:2103.03808v4 [eess.SY] UPDATED)
    In many practical control applications, the performance level of a closed-loop system degrades over time due to the change of plant characteristics. Thus, there is a strong need for redesigning a controller without going through the system modeling process, which is often difficult for closed-loop systems. Reinforcement learning (RL) is one of the promising approaches that enable model-free redesign of optimal controllers for nonlinear dynamical systems based only on the measurement of the closed-loop system. However, the learning process of RL usually requires a considerable number of trial-and-error experiments using the poorly controlled system that may accumulate wear on the plant. To overcome this limitation, we propose a model-free two-step design approach that improves the transient learning performance of RL in an optimal regulator redesign problem for unknown nonlinear systems. Specifically, we first design a linear control law that attains some degree of control performance in a model-free manner, and then, train the nonlinear optimal control law with online RL by using the designed linear control law in parallel. We introduce an offline RL algorithm for the design of the linear control law and theoretically guarantee its convergence to the LQR controller under mild assumptions. Numerical simulations show that the proposed approach improves the transient learning performance and efficiency in hyperparameter tuning of RL.
    Active learning for data streams: a survey. (arXiv:2302.08893v4 [stat.ML] UPDATED)
    Online active learning is a paradigm in machine learning that aims to select the most informative data points to label from a data stream. The problem of minimizing the cost associated with collecting labeled observations has gained a lot of attention in recent years, particularly in real-world applications where data is only available in an unlabeled form. Annotating each observation can be time-consuming and costly, making it difficult to obtain large amounts of labeled data. To overcome this issue, many active learning strategies have been proposed in the last decades, aiming to select the most informative observations for labeling in order to improve the performance of machine learning models. These approaches can be broadly divided into two categories: static pool-based and stream-based active learning. Pool-based active learning involves selecting a subset of observations from a closed pool of unlabeled data, and it has been the focus of many surveys and literature reviews. However, the growing availability of data streams has led to an increase in the number of approaches that focus on online active learning, which involves continuously selecting and labeling observations as they arrive in a stream. This work aims to provide an overview of the most recently proposed approaches for selecting the most informative observations from data streams in real time. We review the various techniques that have been proposed and discuss their strengths and limitations, as well as the challenges and opportunities that exist in this area of research.
    PERFOGRAPH: A Numerical Aware Program Graph Representation for Performance Optimization and Program Analysis. (arXiv:2306.00210v2 [cs.PL] UPDATED)
    The remarkable growth and significant success of machine learning have expanded its applications into programming languages and program analysis. However, a key challenge in adopting the latest machine learning methods is the representation of programming languages, which directly impacts the ability of machine learning methods to reason about programs. The absence of numerical awareness, aggregate data structure information, and improper way of presenting variables in previous representation works have limited their performances. To overcome the limitations and challenges of current program representations, we propose a graph-based program representation called PERFOGRAPH. PERFOGRAPH can capture numerical information and the aggregate data structure by introducing new nodes and edges. Furthermore, we propose an adapted embedding method to incorporate numerical awareness. These enhancements make PERFOGRAPH a highly flexible and scalable representation that effectively captures programs intricate dependencies and semantics. Consequently, it serves as a powerful tool for various applications such as program analysis, performance optimization, and parallelism discovery. Our experimental results demonstrate that PERFOGRAPH outperforms existing representations and sets new state-of-the-art results by reducing the error rate by 7.4% (AMD dataset) and 10% (NVIDIA dataset) in the well-known Device Mapping challenge. It also sets new state-of-the-art results in various performance optimization tasks like Parallelism Discovery and NUMA and Prefetchers Configuration prediction.
    Relating graph auto-encoders to linear models. (arXiv:2211.01858v2 [cs.LG] UPDATED)
    Graph auto-encoders are widely used to construct graph representations in Euclidean vector spaces. However, it has already been pointed out empirically that linear models on many tasks can outperform graph auto-encoders. In our work, we prove that the solution space induced by graph auto-encoders is a subset of the solution space of a linear map. This demonstrates that linear embedding models have at least the representational power of graph auto-encoders based on graph convolutional networks. So why are we still using nonlinear graph auto-encoders? One reason could be that actively restricting the linear solution space might introduce an inductive bias that helps improve learning and generalization. While many researchers believe that the nonlinearity of the encoder is the critical ingredient towards this end, we instead identify the node features of the graph as a more powerful inductive bias. We give theoretical insights by introducing a corresponding bias in a linear model and analyzing the change in the solution space. Our experiments are aligned with other empirical work on this question and show that the linear encoder can outperform the nonlinear encoder when using feature information.
    Mol-Instructions: A Large-Scale Biomolecular Instruction Dataset for Large Language Models. (arXiv:2306.08018v4 [q-bio.QM] UPDATED)
    Large Language Models (LLMs), with their remarkable task-handling capabilities and innovative outputs, have catalyzed significant advancements across a spectrum of fields. However, their proficiency within specialized domains such as biomolecular studies remains limited. To address this challenge, we introduce Mol-Instructions, a comprehensive instruction dataset designed for the biomolecular domain. Mol-Instructions encompasses three key components: molecule-oriented instructions, protein-oriented instructions, and biomolecular text instructions. Each component aims to improve the understanding and prediction capabilities of LLMs concerning biomolecular features and behaviors. Through extensive instruction tuning experiments on LLMs, we demonstrate the effectiveness of Mol-Instructions in enhancing large models' performance in the intricate realm of biomolecular studies, thus fostering progress in the biomolecular research community. Mol-Instructions is publicly available for ongoing research and will undergo regular updates to enhance its applicability.
    ClustML: A Measure of Cluster Pattern Complexity in Scatterplots Learnt from Human-labeled Groupings. (arXiv:2106.00599v2 [cs.HC] UPDATED)
    Visual quality measures (VQMs) are designed to support analysts by automatically detecting and quantifying patterns in visualizations. We propose a new VQM for visual grouping patterns in scatterplots, called ClustML, which is trained on previously collected human subject judgments. Our model encodes scatterplots in the parametric space of a Gaussian Mixture Model and uses a classifier trained on human judgment data to estimate the perceptual complexity of grouping patterns. The numbers of initial mixture components and final combined groups. It improves on existing VQMs, first, by better estimating human judgments on two-Gaussian cluster patterns and, second, by giving higher accuracy when ranking general cluster patterns in scatterplots. We use it to analyze kinship data for genome-wide association studies, in which experts rely on the visual analysis of large sets of scatterplots. We make the benchmark datasets and the new VQM available for practical use and further improvements.
    Meta Co-Training: Two Views are Better than One. (arXiv:2311.18083v1 [cs.CV])
    In many practical computer vision scenarios unlabeled data is plentiful, but labels are scarce and difficult to obtain. As a result, semi-supervised learning which leverages unlabeled data to boost the performance of supervised classifiers have received significant attention in recent literature. One major class of semi-supervised algorithms is co-training. In co-training two different models leverage different independent and sufficient "views" of the data to jointly make better predictions. During co-training each model creates pseudo labels on unlabeled points which are used to improve the other model. We show that in the common case when independent views are not available we can construct such views inexpensively using pre-trained models. Co-training on the constructed views yields a performance improvement over any of the individual views we construct and performance comparable with recent approaches in semi-supervised learning, but has some undesirable properties. To alleviate the issues present with co-training we present Meta Co-Training which is an extension of the successful Meta Pseudo Labels approach to multiple views. Our method achieves new state-of-the-art performance on ImageNet-10% with very few training resources, as well as outperforming prior semi-supervised work on several other fine-grained image classification datasets.
    Seg2Reg: Differentiable 2D Segmentation to 1D Regression Rendering for 360 Room Layout Reconstruction. (arXiv:2311.18695v1 [cs.CV])
    State-of-the-art single-view 360-degree room layout reconstruction methods formulate the problem as a high-level 1D (per-column) regression task. On the other hand, traditional low-level 2D layout segmentation is simpler to learn and can represent occluded regions, but it requires complex post-processing for the targeting layout polygon and sacrifices accuracy. We present Seg2Reg to render 1D layout depth regression from the 2D segmentation map in a differentiable and occlusion-aware way, marrying the merits of both sides. Specifically, our model predicts floor-plan density for the input equirectangular 360-degree image. Formulating the 2D layout representation as a density field enables us to employ `flattened' volume rendering to form 1D layout depth regression. In addition, we propose a novel 3D warping augmentation on layout to improve generalization. Finally, we re-implement recent room layout reconstruction methods into our codebase for benchmarking and explore modern backbones and training techniques to serve as the strong baseline. Our model significantly outperforms previous arts. The code will be made available upon publication.
    Meta-Prior: Meta learning for Adaptive Inverse Problem Solvers. (arXiv:2311.18710v1 [cs.CV])
    Deep neural networks have become a foundational tool for addressing imaging inverse problems. They are typically trained for a specific task, with a supervised loss to learn a mapping from the observations to the image to recover. However, real-world imaging challenges often lack ground truth data, rendering traditional supervised approaches ineffective. Moreover, for each new imaging task, a new model needs to be trained from scratch, wasting time and resources. To overcome these limitations, we introduce a novel approach based on meta-learning. Our method trains a meta-model on a diverse set of imaging tasks that allows the model to be efficiently fine-tuned for specific tasks with few fine-tuning steps. We show that the proposed method extends to the unsupervised setting, where no ground truth data is available. In its bilevel formulation, the outer level uses a supervised loss, that evaluates how well the fine-tuned model performs, while the inner loss can be either supervised or unsupervised, relying only on the measurement operator. This allows the meta-model to leverage a few ground truth samples for each task while being able to generalize to new imaging tasks. We show that in simple settings, this approach recovers the Bayes optimal estimator, illustrating the soundness of our approach. We also demonstrate our method's effectiveness on various tasks, including image processing and magnetic resonance imaging.
    TransOpt: Transformer-based Representation Learning for Optimization Problem Classification. (arXiv:2311.18035v1 [cs.LG])
    We propose a representation of optimization problem instances using a transformer-based neural network architecture trained for the task of problem classification of the 24 problem classes from the Black-box Optimization Benchmarking (BBOB) benchmark. We show that transformer-based methods can be trained to recognize problem classes with accuracies in the range of 70\%-80\% for different problem dimensions, suggesting the possible application of transformer architectures in acquiring representations for black-box optimization problems.
    Balancing Summarization and Change Detection in Graph Streams. (arXiv:2311.18694v1 [stat.ML])
    This study addresses the issue of balancing graph summarization and graph change detection. Graph summarization compresses large-scale graphs into a smaller scale. However, the question remains: To what extent should the original graph be compressed? This problem is solved from the perspective of graph change detection, aiming to detect statistically significant changes using a stream of summary graphs. If the compression rate is extremely high, important changes can be ignored, whereas if the compression rate is extremely low, false alarms may increase with more memory. This implies that there is a trade-off between compression rate in graph summarization and accuracy in change detection. We propose a novel quantitative methodology to balance this trade-off to simultaneously realize reliable graph summarization and change detection. We introduce a probabilistic structure of hierarchical latent variable model into a graph, thereby designing a parameterized summary graph on the basis of the minimum description length principle. The parameter specifying the summary graph is then optimized so that the accuracy of change detection is guaranteed to suppress Type I error probability (probability of raising false alarms) to be less than a given confidence level. First, we provide a theoretical framework for connecting graph summarization with change detection. Then, we empirically demonstrate its effectiveness on synthetic and real datasets.
    PAUNet: Precipitation Attention-based U-Net for rain prediction from satellite radiance data. (arXiv:2311.18306v1 [physics.ao-ph])
    This paper introduces Precipitation Attention-based U-Net (PAUNet), a deep learning architecture for predicting precipitation from satellite radiance data, addressing the challenges of the Weather4cast 2023 competition. PAUNet is a variant of U-Net and Res-Net, designed to effectively capture the large-scale contextual information of multi-band satellite images in visible, water vapor, and infrared bands through encoder convolutional layers with center cropping and attention mechanisms. We built upon the Focal Precipitation Loss including an exponential component (e-FPL), which further enhanced the importance across different precipitation categories, particularly medium and heavy rain. Trained on a substantial dataset from various European regions, PAUNet demonstrates notable accuracy with a higher Critical Success Index (CSI) score than the baseline model in predicting rainfall over multiple time slots. PAUNet's architecture and training methodology showcase improvements in precipitation forecasting, crucial for sectors like emergency services and retail and supply chain management.
    Beyond Prediction: On-street Parking Recommendation using Heterogeneous Graph-based List-wise Ranking. (arXiv:2305.00162v2 [cs.LG] UPDATED)
    To provide real-time parking information, existing studies focus on predicting parking availability, which seems an indirect approach to saving drivers' cruising time. In this paper, we first time propose an on-street parking recommendation (OPR) task to directly recommend a parking space for a driver. To this end, a learn-to-rank (LTR) based OPR model called OPR-LTR is built. Specifically, parking recommendation is closely related to the "turnover events" (state switching between occupied and vacant) of each parking space, and hence we design a highly efficient heterogeneous graph called ESGraph to represent historical and real-time meters' turnover events as well as geographical relations; afterward, a convolution-based event-then-graph network is used to aggregate and update representations of the heterogeneous graph. A ranking model is further utilized to learn a score function that helps recommend a list of ranked parking spots for a specific on-street parking query. The method is verified using the on-street parking meter data in Hong Kong and San Francisco. By comparing with the other two types of methods: prediction-only and prediction-then-recommendation, the proposed direct-recommendation method achieves satisfactory performance in different metrics. Extensive experiments also demonstrate that the proposed ESGraph and the recommendation model are more efficient in terms of computational efficiency as well as saving drivers' on-street parking time.
    A Comparison Between Invariant and Equivariant Classical and Quantum Graph Neural Networks. (arXiv:2311.18672v1 [quant-ph])
    Machine learning algorithms are heavily relied on to understand the vast amounts of data from high-energy particle collisions at the CERN Large Hadron Collider (LHC). The data from such collision events can naturally be represented with graph structures. Therefore, deep geometric methods, such as graph neural networks (GNNs), have been leveraged for various data analysis tasks in high-energy physics. One typical task is jet tagging, where jets are viewed as point clouds with distinct features and edge connections between their constituent particles. The increasing size and complexity of the LHC particle datasets, as well as the computational models used for their analysis, greatly motivate the development of alternative fast and efficient computational paradigms such as quantum computation. In addition, to enhance the validity and robustness of deep networks, one can leverage the fundamental symmetries present in the data through the use of invariant inputs and equivariant layers. In this paper, we perform a fair and comprehensive comparison between classical graph neural networks (GNNs) and equivariant graph neural networks (EGNNs) and their quantum counterparts: quantum graph neural networks (QGNNs) and equivariant quantum graph neural networks (EQGNN). The four architectures were benchmarked on a binary classification task to classify the parton-level particle initiating the jet. Based on their AUC scores, the quantum networks were shown to outperform the classical networks. However, seeing the computational advantage of the quantum networks in practice may have to wait for the further development of quantum technology and its associated APIs.
    Operationalizing Counterfactual Metrics: Incentives, Ranking, and Information Asymmetry. (arXiv:2305.14595v2 [cs.LG] UPDATED)
    From the social sciences to machine learning, it has been well documented that metrics to be optimized are not always aligned with social welfare. In healthcare, Dranove et al. (2003) showed that publishing surgery mortality metrics actually harmed the welfare of sicker patients by increasing provider selection behavior. We analyze the incentive misalignments that arise from such average treated outcome metrics, and show that the incentives driving treatment decisions would align with maximizing total patient welfare if the metrics (i) accounted for counterfactual untreated outcomes and (ii) considered total welfare instead of averaging over treated patients. Operationalizing this, we show how counterfactual metrics can be modified to behave reasonably in patient-facing ranking systems. Extending to realistic settings when providers observe more about patients than the regulatory agencies do, we bound the decay in performance by the degree of information asymmetry between principal and agent. In doing so, our model connects principal-agent information asymmetry with unobserved heterogeneity in causal inference.
    ZeST-NeRF: Using temporal aggregation for Zero-Shot Temporal NeRFs. (arXiv:2311.18491v1 [cs.CV])
    In the field of media production, video editing techniques play a pivotal role. Recent approaches have had great success at performing novel view image synthesis of static scenes. But adding temporal information adds an extra layer of complexity. Previous models have focused on implicitly representing static and dynamic scenes using NeRF. These models achieve impressive results but are costly at training and inference time. They overfit an MLP to describe the scene implicitly as a function of position. This paper proposes ZeST-NeRF, a new approach that can produce temporal NeRFs for new scenes without retraining. We can accurately reconstruct novel views using multi-view synthesis techniques and scene flow-field estimation, trained only with unrelated scenes. We demonstrate how existing state-of-the-art approaches from a range of fields cannot adequately solve this new task and demonstrate the efficacy of our solution. The resulting network improves quantitatively by 15% and produces significantly better visual results.
    Generalisable Agents for Neural Network Optimisation. (arXiv:2311.18598v1 [cs.LG])
    Optimising deep neural networks is a challenging task due to complex training dynamics, high computational requirements, and long training times. To address this difficulty, we propose the framework of Generalisable Agents for Neural Network Optimisation (GANNO) -- a multi-agent reinforcement learning (MARL) approach that learns to improve neural network optimisation by dynamically and responsively scheduling hyperparameters during training. GANNO utilises an agent per layer that observes localised network dynamics and accordingly takes actions to adjust these dynamics at a layerwise level to collectively improve global performance. In this paper, we use GANNO to control the layerwise learning rate and show that the framework can yield useful and responsive schedules that are competitive with handcrafted heuristics. Furthermore, GANNO is shown to perform robustly across a wide variety of unseen initial conditions, and can successfully generalise to harder problems than it was trained on. Our work presents an overview of the opportunities that this paradigm offers for training neural networks, along with key challenges that remain to be overcome.
    Learning Radio Environments by Differentiable Ray Tracing. (arXiv:2311.18558v1 [cs.IT])
    Ray tracing (RT) is instrumental in 6G research in order to generate spatially-consistent and environment-specific channel impulse responses (CIRs). While acquiring accurate scene geometries is now relatively straightforward, determining material characteristics requires precise calibration using channel measurements. We therefore introduce a novel gradient-based calibration method, complemented by differentiable parametrizations of material properties, scattering and antenna patterns. Our method seamlessly integrates with differentiable ray tracers that enable the computation of derivatives of CIRs with respect to these parameters. Essentially, we approach field computation as a large computational graph wherein parameters are trainable akin to weights of a neural network (NN). We have validated our method using both synthetic data and real-world indoor channel measurements, employing a distributed multiple-input multiple-output (MIMO) channel sounder.
    Communication-Efficient Heterogeneous Federated Learning with Generalized Heavy-Ball Momentum. (arXiv:2311.18578v1 [cs.LG])
    Federated Learning (FL) is the state-of-the-art approach for learning from decentralized data in privacy-constrained scenarios. As the current literature reports, the main problems associated with FL refer to system and statistical challenges: the former ones demand for efficient learning from edge devices, including lowering communication bandwidth and frequency, while the latter require algorithms robust to non-iidness. State-of-art approaches either guarantee convergence at increased communication cost or are not sufficiently robust to handle extreme heterogeneous local distributions. In this work we propose a novel generalization of the heavy-ball momentum, and present FedHBM to effectively address statistical heterogeneity in FL without introducing any communication overhead. We conduct extensive experimentation on common FL vision and NLP datasets, showing that our FedHBM algorithm empirically yields better model quality and higher convergence speed w.r.t. the state-of-art, especially in pathological non-iid scenarios. While being designed for cross-silo settings, we show how FedHBM is applicable in moderate-to-high cross-device scenarios, and how good model initializations (e.g. pre-training) can be exploited for prompt acceleration. Extended experimentation on large-scale real-world federated datasets further corroborates the effectiveness of our approach for real-world FL applications.
    Self-Driving Telescopes: Autonomous Scheduling of Astronomical Observation Campaigns with Offline Reinforcement Learning. (arXiv:2311.18094v1 [astro-ph.IM])
    Modern astronomical experiments are designed to achieve multiple scientific goals, from studies of galaxy evolution to cosmic acceleration. These goals require data of many different classes of night-sky objects, each of which has a particular set of observational needs. These observational needs are typically in strong competition with one another. This poses a challenging multi-objective optimization problem that remains unsolved. The effectiveness of Reinforcement Learning (RL) as a valuable paradigm for training autonomous systems has been well-demonstrated, and it may provide the basis for self-driving telescopes capable of optimizing the scheduling for astronomy campaigns. Simulated datasets containing examples of interactions between a telescope and a discrete set of sky locations on the celestial sphere can be used to train an RL model to sequentially gather data from these several locations to maximize a cumulative reward as a measure of the quality of the data gathered. We use simulated data to test and compare multiple implementations of a Deep Q-Network (DQN) for the task of optimizing the schedule of observations from the Stone Edge Observatory (SEO). We combine multiple improvements on the DQN and adjustments to the dataset, showing that DQNs can achieve an average reward of 87%+-6% of the maximum achievable reward in each state on the test set. This is the first comparison of offline RL algorithms for a particular astronomical challenge and the first open-source framework for performing such a comparison and assessment task.
    Learning Robust Precipitation Forecaster by Temporal Frame Interpolation. (arXiv:2311.18341v1 [cs.LG])
    Recent advancements in deep learning have propelled the field of weather prediction models to new heights. Despite their progress, these models often struggle with real-world application due to their sensitivity to spatial-temporal shifts, a vulnerability particularly pronounced in weather prediction tasks where overfitting to local and temporal variations is common. This paper presents an investigation into the development of a robust precipitation forecasting model that stands resilient to such shifts. We introduce Temporal Frame Interpolation (TFI), an innovative technique designed to fortify forecasting models against spatial-temporal discrepancies. TFI operates by generating synthetic samples through the interpolation of adjacent frames from satellite imagery and ground radar data, thereby enriching the training dataset and bolstering the model's defense against noise on frames. Additionally, we integrate a novel multi-level dice loss, which exploits the ordinal nature of rainfall intensities to further refine model performance. These methodologies have collectively advanced our model's forecasting precision, achieving \textit{1st place} on the transfer learning leaderboard in the \textit{Weather4Cast'23 competition}.It not only demonstrates the efficacy of our approaches but also sets a new benchmark for deep learning applications in meteorological forecasting. Our code and weights have been public on \url{https://github.com/Secilia-Cxy/UNetTFI}.
    Training a HyperDimensional Computing Classifier using a Threshold on its Confidence. (arXiv:2305.19007v2 [cs.LG] UPDATED)
    Hyperdimensional computing (HDC) has become popular for light-weight and energy-efficient machine learning, suitable for wearable Internet-of-Things (IoT) devices and near-sensor or on-device processing. HDC is computationally less complex than traditional deep learning algorithms and achieves moderate to good classification performance. This article proposes to extend the training procedure in HDC by taking into account not only wrongly classified samples, but also samples that are correctly classified by the HDC model but with low confidence. As such, a confidence threshold is introduced that can be tuned for each dataset to achieve the best classification accuracy. The proposed training procedure is tested on UCIHAR, CTG, ISOLET and HAND dataset for which the performance consistently improves compared to the baseline across a range of confidence threshold values. The extended training procedure also results in a shift towards higher confidence values of the correctly classified samples making the classifier not only more accurate but also more confident about its predictions.
    Self-Infilling Code Generation. (arXiv:2311.17972v1 [cs.PL])
    This work introduces a general code generation framework that incorporates infilling operations into auto-regressive decoding. Our approach capitalizes on the observation that recent code language models with infilling capabilities can perform \emph{self-infilling}: whereas infilling operations aim to fill in the middle based on a predefined prefix and suffix, self-infilling sequentially generates both such surrounding context and the infilled content. We utilize this feature to develop an infilling-augmented decoding process that facilitates non-monotonic generation. This approach allows for postponing the generation of uncertain code snippets until a definitive suffix is established, leading to improved control over the generation sequence. In addition, it facilitates a looping mechanism, which can iteratively update and synchronize each piece of generation in a cyclic manner. Extensive experiments are conducted to demonstrate that our proposed decoding process is effective in enhancing regularity and quality across several code generation benchmarks.
    Self-Supervised Learning for Large-Scale Preventive Security Constrained DC Optimal Power Flow. (arXiv:2311.18072v1 [cs.LG])
    Security-Constrained Optimal Power Flow (SCOPF) plays a crucial role in power grid stability but becomes increasingly complex as systems grow. This paper introduces PDL-SCOPF, a self-supervised end-to-end primal-dual learning framework for producing near-optimal solutions to large-scale SCOPF problems in milliseconds. Indeed, PDL-SCOPF remedies the limitations of supervised counterparts that rely on training instances with their optimal solutions, which becomes impractical for large-scale SCOPF problems. PDL-SCOPF mimics an Augmented Lagrangian Method (ALM) for training primal and dual networks that learn the primal solutions and the Lagrangian multipliers, respectively, to the unconstrained optimizations. In addition, PDL-SCOPF incorporates a repair layer to ensure the feasibility of the power balance in the nominal case, and a binary search layer to compute, using the Automatic Primary Response (APR), the generator dispatches in the contingencies. The resulting differentiable program can then be trained end-to-end using the objective function of the SCOPF and the power balance constraints of the contingencies. Experimental results demonstrate that the PDL-SCOPF delivers accurate feasible solutions with minimal optimality gaps. The framework underlying PDL-SCOPF aims at bridging the gap between traditional optimization methods and machine learning, highlighting the potential of self-supervised end-to-end primal-dual learning for large-scale optimization tasks.
    Handling Cost and Constraints with Off-Policy Deep Reinforcement Learning. (arXiv:2311.18684v1 [cs.LG])
    By reusing data throughout training, off-policy deep reinforcement learning algorithms offer improved sample efficiency relative to on-policy approaches. For continuous action spaces, the most popular methods for off-policy learning include policy improvement steps where a learned state-action ($Q$) value function is maximized over selected batches of data. These updates are often paired with regularization to combat associated overestimation of $Q$ values. With an eye toward safety, we revisit this strategy in environments with "mixed-sign" reward functions; that is, with reward functions that include independent positive (incentive) and negative (cost) terms. This setting is common in real-world applications, and may be addressed with or without constraints on the cost terms. We find the combination of function approximation and a term that maximizes $Q$ in the policy update to be problematic in such environments, because systematic errors in value estimation impact the contributions from the competing terms asymmetrically. This results in overemphasis of either incentives or costs and may severely limit learning. We explore two remedies to this issue. First, consistent with prior work, we find that periodic resetting of $Q$ and policy networks can be used to reduce value estimation error and improve learning in this setting. Second, we formulate novel off-policy actor-critic methods for both unconstrained and constrained learning that do not explicitly maximize $Q$ in the policy update. We find that this second approach, when applied to continuous action spaces with mixed-sign rewards, consistently and significantly outperforms state-of-the-art methods augmented by resetting. We further find that our approach produces agents that are both competitive with popular methods overall and more reliably competent on frequently-studied control problems that do not have mixed-sign rewards.
    TransNAS-TSAD: Harnessing Transformers for Multi-Objective Neural Architecture Search in Time Series Anomaly Detection. (arXiv:2311.18061v1 [cs.LG])
    The surge in real-time data collection across various industries has underscored the need for advanced anomaly detection in both univariate and multivariate time series data. Traditional methods, while comprehensive, often struggle to capture the complex interdependencies in such data. This paper introduces TransNAS-TSAD, a novel framework that synergizes transformer architecture with neural architecture search (NAS), enhanced through NSGA-II algorithm optimization. This innovative approach effectively tackles the complexities of both univariate and multivariate time series, balancing computational efficiency with detection accuracy. Our evaluation reveals that TransNAS-TSAD surpasses conventional anomaly detection models, demonstrating marked improvements in diverse data scenarios. We also propose the Efficiency-Accuracy-Complexity Score (EACS) as a new metric for assessing model performance, emphasizing the crucial balance between accuracy and computational resources. TransNAS-TSAD sets a new benchmark in time series anomaly detection, offering a versatile, efficient solution for complex real-world applications. This research paves the way for future developments in the field, highlighting its potential in a wide range of industry applications.
    Real-Time Vibration-Based Bearing Fault Diagnosis Under Time-Varying Speed Conditions. (arXiv:2311.18547v1 [cs.LG])
    Detection of rolling-element bearing faults is crucial for implementing proactive maintenance strategies and for minimizing the economic and operational consequences of unexpected failures. However, many existing techniques are developed and tested under strictly controlled conditions, limiting their adaptability to the diverse and dynamic settings encountered in practical applications. This paper presents an efficient real-time convolutional neural network (CNN) for diagnosing multiple bearing faults under various noise levels and time-varying rotational speeds. Additionally, we propose a novel Fisher-based spectral separability analysis (SSA) method to elucidate the effectiveness of the designed CNN model. We conducted experiments on both healthy bearings and bearings afflicted with inner race, outer race, and roller ball faults. The experimental results show the superiority of our model over the current state-of-the-art approach in three folds: it achieves substantial accuracy gains of up to 15.8%, it is robust to noise with high performance across various signal-to-noise ratios, and it runs in real-time with processing durations five times less than acquisition. Additionally, by using the proposed SSA technique, we offer insights into the model's performance and underscore its effectiveness in tackling real-world challenges.
    Data-efficient Deep Reinforcement Learning for Vehicle Trajectory Control. (arXiv:2311.18393v1 [cs.LG])
    Advanced vehicle control is a fundamental building block in the development of autonomous driving systems. Reinforcement learning (RL) promises to achieve control performance superior to classical approaches while keeping computational demands low during deployment. However, standard RL approaches like soft-actor critic (SAC) require extensive amounts of training data to be collected and are thus impractical for real-world application. To address this issue, we apply recently developed data-efficient deep RL methods to vehicle trajectory control. Our investigation focuses on three methods, so far unexplored for vehicle control: randomized ensemble double Q-learning (REDQ), probabilistic ensembles with trajectory sampling and model predictive path integral optimizer (PETS-MPPI), and model-based policy optimization (MBPO). We find that in the case of trajectory control, the standard model-based RL formulation used in approaches like PETS-MPPI and MBPO is not suitable. We, therefore, propose a new formulation that splits dynamics prediction and vehicle localization. Our benchmark study on the CARLA simulator reveals that the three identified data-efficient deep RL approaches learn control strategies on a par with or better than SAC, yet reduce the required number of environment interactions by more than one order of magnitude.
    RainAI -- Precipitation Nowcasting from Satellite Data. (arXiv:2311.18398v1 [cs.CV])
    This paper presents a solution to the Weather4Cast 2023 competition, where the goal is to forecast high-resolution precipitation with an 8-hour lead time using lower-resolution satellite radiance images. We propose a simple, yet effective method for spatiotemporal feature learning using a 2D U-Net model, that outperforms the official 3D U-Net baseline in both performance and efficiency. We place emphasis on refining the dataset, through importance sampling and dataset preparation, and show that such techniques have a significant impact on performance. We further study an alternative cross-entropy loss function that improves performance over the standard mean squared error loss, while also enabling models to produce probabilistic outputs. Additional techniques are explored regarding the generation of predictions at different lead times, specifically through Conditioning Lead Time. Lastly, to generate high-resolution forecasts, we evaluate standard and learned upsampling methods. The code and trained parameters are available at https://github.com/rafapablos/w4c23-rainai.
    Label-efficient Training of Small Task-specific Models by Leveraging Vision Foundation Models. (arXiv:2311.18237v1 [cs.CV])
    Large Vision Foundation Models (VFMs) pretrained on massive datasets exhibit impressive performance on various downstream tasks, especially with limited labeled target data. However, due to their high memory and compute requirements, these models cannot be deployed in resource constrained settings. This raises an important question: How can we utilize the knowledge from a large VFM to train a small task-specific model for a new target task with limited labeled training data? In this work, we answer this question by proposing a simple and highly effective task-oriented knowledge transfer approach to leverage pretrained VFMs for effective training of small task-specific models. Our experimental results on four target tasks under limited labeled data settings show that the proposed knowledge transfer approach outperforms task-agnostic VFM distillation, web-scale CLIP pretraining and supervised ImageNet pretraining by 1-10.5%, 2-22% and 2-14%, respectively. We also show that the dataset used for transferring knowledge has a significant effect on the final target task performance, and propose an image retrieval-based approach for curating effective transfer sets.
    Packrat: Automatic Reconfiguration for Latency Minimization in CPU-based DNN Serving. (arXiv:2311.18174v1 [cs.DC])
    In this paper, we investigate how to push the performance limits of serving Deep Neural Network (DNN) models on CPU-based servers. Specifically, we observe that while intra-operator parallelism across multiple threads is an effective way to reduce inference latency, it provides diminishing returns. Our primary insight is that instead of running a single instance of a model with all available threads on a server, running multiple instances each with smaller batch sizes and fewer threads for intra-op parallelism can provide lower inference latency. However, the right configuration is hard to determine manually since it is workload- (DNN model and batch size used by the serving system) and deployment-dependent (number of CPU cores on server). We present Packrat, a new serving system for online inference that given a model and batch size ($B$) algorithmically picks the optimal number of instances ($i$), the number of threads each should be allocated ($t$), and the batch sizes each should operate on ($b$) that minimizes latency. Packrat is built as an extension to TorchServe and supports online reconfigurations to avoid serving downtime. Averaged across a range of batch sizes, Packrat improves inference latency by 1.43$\times$ to 1.83$\times$ on a range of commonly used DNNs.
    Calibration-free online test-time adaptation for electroencephalography motor imagery decoding. (arXiv:2311.18520v1 [cs.HC])
    Providing a promising pathway to link the human brain with external devices, Brain-Computer Interfaces (BCIs) have seen notable advancements in decoding capabilities, primarily driven by increasingly sophisticated techniques, especially deep learning. However, achieving high accuracy in real-world scenarios remains a challenge due to the distribution shift between sessions and subjects. In this paper we will explore the concept of online test-time adaptation (OTTA) to continuously adapt the model in an unsupervised fashion during inference time. Our approach guarantees the preservation of privacy by eliminating the requirement to access the source data during the adaptation process. Additionally, OTTA achieves calibration-free operation by not requiring any session- or subject-specific data. We will investigate the task of electroencephalography (EEG) motor imagery decoding using a lightweight architecture together with different OTTA techniques like alignment, adaptive batch normalization, and entropy minimization. We examine two datasets and three distinct data settings for a comprehensive analysis. Our adaptation methods produce state-of-the-art results, potentially instigating a shift in transfer learning for BCI decoding towards online adaptation.
    AutArch: An AI-assisted workflow for object detection and automated recording in archaeological catalogues. (arXiv:2311.17978v1 [cs.CV])
    Compiling large datasets from published resources, such as archaeological find catalogues presents fundamental challenges: identifying relevant content and manually recording it is a time-consuming, repetitive and error-prone task. For the data to be useful, it must be of comparable quality and adhere to the same recording standards, which is hardly ever the case in archaeology. Here, we present a new data collection method exploiting recent advances in Artificial Intelligence. Our software uses an object detection neural network combined with further classification networks to speed up, automate, and standardise data collection from legacy resources, such as archaeological drawings and photographs in large unsorted PDF files. The AI-assisted workflow detects common objects found in archaeological catalogues, such as graves, skeletons, ceramics, ornaments, stone tools and maps, and spatially relates and analyses these objects on the page to extract real-life attributes, such as the size and orientation of a grave based on the north arrow and the scale. A graphical interface allows for and assists with manual validation. We demonstrate the benefits of this approach by collecting a range of shapes and numerical attributes from richly-illustrated archaeological catalogues, and benchmark it in a real-world experiment with ten users. Moreover, we record geometric whole-outlines through contour detection, an alternative to landmark-based geometric morphometrics not achievable by hand.
    Discovering Galaxy Features via Dataset Distillation. (arXiv:2311.17967v1 [cs.CV])
    In many applications, Neural Nets (NNs) have classification performance on par or even exceeding human capacity. Moreover, it is likely that NNs leverage underlying features that might differ from those humans perceive to classify. Can we "reverse-engineer" pertinent features to enhance our scientific understanding? Here, we apply this idea to the notoriously difficult task of galaxy classification: NNs have reached high performance for this task, but what does a neural net (NN) "see" when it classifies galaxies? Are there morphological features that the human eye might overlook that could help with the task and provide new insights? Can we visualize tracers of early evolution, or additionally incorporated spectral data? We present a novel way to summarize and visualize galaxy morphology through the lens of neural networks, leveraging Dataset Distillation, a recent deep-learning methodology with the primary objective to distill knowledge from a large dataset and condense it into a compact synthetic dataset, such that a model trained on this synthetic dataset achieves performance comparable to a model trained on the full dataset. We curate a class-balanced, medium-size high-confidence version of the Galaxy Zoo 2 dataset, and proceed with dataset distillation from our accurate NN-classifier to create synthesized prototypical images of galaxy morphological features, demonstrating its effectiveness. Of independent interest, we introduce a self-adaptive version of the state-of-the-art Matching Trajectory algorithm to automate the distillation process, and show enhanced performance on computer vision benchmarks.
    Understanding Your Agent: Leveraging Large Language Models for Behavior Explanation. (arXiv:2311.18062v1 [cs.LG])
    Intelligent agents such as robots are increasingly deployed in real-world, safety-critical settings. It is vital that these agents are able to explain the reasoning behind their decisions to human counterparts; however, their behavior is often produced by uninterpretable models such as deep neural networks. We propose an approach to generate natural language explanations for an agent's behavior based only on observations of states and actions, thus making our method independent from the underlying model's representation. For such models, we first learn a behavior representation and subsequently use it to produce plausible explanations with minimal hallucination while affording user interaction with a pre-trained large language model. We evaluate our method in a multi-agent search-and-rescue environment and demonstrate the effectiveness of our explanations for agents executing various behaviors. Through user studies and empirical experiments, we show that our approach generates explanations as helpful as those produced by a human domain expert while enabling beneficial interactions such as clarification and counterfactual queries.
    Dataset Distillation via the Wasserstein Metric. (arXiv:2311.18531v1 [cs.CV])
    Dataset distillation (DD) offers a compelling approach in computer vision, with the goal of condensing extensive datasets into smaller synthetic versions without sacrificing much of the model performance. In this paper, we continue to study the methods for DD, by addressing its conceptually core objective: how to capture the essential representation of extensive datasets in smaller, synthetic forms. We propose a novel approach utilizing the Wasserstein distance, a metric rooted in optimal transport theory, to enhance distribution matching in DD. Our method leverages the Wasserstein barycenter, offering a geometrically meaningful way to quantify distribution differences and effectively capture the centroid of a set of distributions. Our approach retains the computational benefits of distribution matching-based methods while achieving new state-of-the-art performance on several benchmarks. To provide useful prior for learning the images, we embed the synthetic data into the feature space of pretrained classification models to conduct distribution matching. Extensive testing on various high-resolution datasets confirms the effectiveness and adaptability of our method, indicating the promising yet unexplored capabilities of Wasserstein metrics in dataset distillation.
    Skilful Precipitation Nowcasting Using NowcastNet. (arXiv:2311.17961v1 [physics.ao-ph])
    Designing early warning system for precipitation requires accurate short-term forecasting system. Climate change has led to an increase in frequency of extreme weather events, and hence such systems can prevent disasters and loss of life. Managing such events remain a challenge for both public and private institutions. Precipitation nowcasting can help relevant institutions to better prepare for such events as they impact agriculture, transport, public health and safety, etc. Physics-based numerical weather prediction (NWP) is unable to perform well for nowcasting because of large computational turn-around time. Deep-learning based models on the other hand are able to give predictions within seconds. We use recently proposed NowcastNet, a physics-conditioned deep generative network, to forecast precipitation for different regions of Europe using satellite images. Both spatial and temporal transfer learning is done by forecasting for the unseen regions and year. Model makes realistic predictions and is able to outperform baseline for such a prediction task.
    Reconstructing Historical Climate Fields With Deep Learning. (arXiv:2311.18348v1 [physics.geo-ph])
    Historical records of climate fields are often sparse due to missing measurements, especially before the introduction of large-scale satellite missions. Several statistical and model-based methods have been introduced to fill gaps and reconstruct historical records. Here, we employ a recently introduced deep-learning approach based on Fourier convolutions, trained on numerical climate model output, to reconstruct historical climate fields. Using this approach we are able to realistically reconstruct large and irregular areas of missing data, as well as reconstruct known historical events such as strong El Ni\~no and La Ni\~na with very little given information. Our method outperforms the widely used statistical kriging method as well as other recent machine learning approaches. The model generalizes to higher resolutions than the ones it was trained on and can be used on a variety of climate fields. Moreover, it allows inpainting of masks never seen before during the model training.
    Choosing the parameter of the Fermat distance: navigating geometry and noise. (arXiv:2311.18663v1 [stat.ML])
    The Fermat distance has been recently established as a useful tool for machine learning tasks when a natural distance is not directly available to the practitioner or to improve the results given by Euclidean distances by exploding the geometrical and statistical properties of the dataset. This distance depends on a parameter $\alpha$ that greatly impacts the performance of subsequent tasks. Ideally, the value of $\alpha$ should be large enough to navigate the geometric intricacies inherent to the problem. At the same, it should remain restrained enough to sidestep any deleterious ramifications stemming from noise during the process of distance estimation. We study both theoretically and through simulations how to select this parameter.
    Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation. (arXiv:2311.18207v1 [cs.LG])
    Off-Policy Evaluation (OPE) aims to assess the effectiveness of counterfactual policies using only offline logged data and is often used to identify the top-k promising policies for deployment in online A/B tests. Existing evaluation metrics for OPE estimators primarily focus on the "accuracy" of OPE or that of downstream policy selection, neglecting risk-return tradeoff in the subsequent online policy deployment. To address this issue, we draw inspiration from portfolio evaluation in finance and develop a new metric, called SharpeRatio@k, which measures the risk-return tradeoff of policy portfolios formed by an OPE estimator under varying online evaluation budgets (k). We validate our metric in two example scenarios, demonstrating its ability to effectively distinguish between low-risk and high-risk estimators and to accurately identify the most efficient estimator. This efficient estimator is characterized by its capability to form the most advantageous policy portfolios, maximizing returns while minimizing risks during online deployment, a nuance that existing metrics typically overlook. To facilitate a quick, accurate, and consistent evaluation of OPE via SharpeRatio@k, we have also integrated this metric into an open-source software, SCOPE-RL. Employing SharpeRatio@k and SCOPE-RL, we conduct comprehensive benchmarking experiments on various estimators and RL tasks, focusing on their risk-return tradeoff. These experiments offer several interesting directions and suggestions for future OPE research.
    CommunityAI: Towards Community-based Federated Learning. (arXiv:2311.17958v1 [cs.LG])
    Federated Learning (FL) has emerged as a promising paradigm to train machine learning models collaboratively while preserving data privacy. However, its widespread adoption faces several challenges, including scalability, heterogeneous data and devices, resource constraints, and security concerns. Despite its promise, FL has not been specifically adapted for community domains, primarily due to the wide-ranging differences in data types and context, devices and operational conditions, environmental factors, and stakeholders. In response to these challenges, we present a novel framework for Community-based Federated Learning called CommunityAI. CommunityAI enables participants to be organized into communities based on their shared interests, expertise, or data characteristics. Community participants collectively contribute to training and refining learning models while maintaining data and participant privacy within their respective groups. Within this paper, we discuss the conceptual architecture, system requirements, processes, and future challenges that must be solved. Finally, our goal within this paper is to present our vision regarding enabling a collaborative learning process within various communities.
    Unrolling Virtual Worlds for Immersive Experiences. (arXiv:2311.17924v1 [cs.GR])
    This research pioneers a method for generating immersive worlds, drawing inspiration from elements of vintage adventure games like Myst and employing modern text-to-image models. We explore the intricate conversion of 2D panoramas into 3D scenes using equirectangular projections, addressing the distortions in perception that occur as observers navigate within the encompassing sphere. Our approach employs a technique similar to "inpainting" to rectify distorted projections, enabling the smooth construction of locally coherent worlds. This provides extensive insight into the interrelation of technology, perception, and experiential reality within human-computer interaction.
    LayerCollapse: Adaptive compression of neural networks. (arXiv:2311.17943v1 [cs.LG])
    Handling the ever-increasing scale of contemporary deep learning and transformer-based models poses a significant challenge. Although great strides have been made in optimizing model compression techniques such as model architecture search and knowledge distillation, the availability of data and computational resources remains a considerable hurdle for these optimizations. This paper introduces LayerCollapse, a novel alternative adaptive model compression methodology. LayerCollapse works by eliminating non-linearities within the network and collapsing two consecutive fully connected layers into a single linear transformation. This approach simultaneously reduces both the number of layers and the parameter count, thereby enhancing model efficiency. We also introduce a compression aware regularizer, which compresses the model in alignment with the dataset quality and model expressiveness, consequently reducing overfitting across tasks. Our results demonstrate LayerCollapse's effective compression and regularization capabilities in multiple fine-grained classification benchmarks, achieving up to 74% post training compression with minimal accuracy loss. We compare this method with knowledge distillation on the same target network, showcasing a five-fold increase in computational efficiency and 8% improvement in overall accuracy on the ImageNet dataset.
    Analyzing Semantic Faithfulness of Language Models via Input Intervention on Question Answering. (arXiv:2212.10696v2 [cs.CL] UPDATED)
    Transformer-based language models have been shown to be highly effective for several NLP tasks. In this paper, we consider three transformer models, BERT, RoBERTa, and XLNet, in both small and large versions, and investigate how faithful their representations are with respect to the semantic content of texts. We formalize a notion of semantic faithfulness, in which the semantic content of a text should causally figure in a model's inferences in question answering. We then test this notion by observing a model's behavior on answering questions about a story after performing two novel semantic interventions: deletion intervention and negation intervention. While transformer models achieve high performance on standard question answering tasks, we show that they fail to be semantically faithful once we perform these interventions for a significant number of cases (~50% for deletion intervention, and ~20% drop in accuracy for negation intervention). We then propose an intervention-based training regime that can mitigate the undesirable effects for deletion intervention by a significant margin (from ~ 50% to ~6%). We analyze the inner-workings of the models to better understand the effectiveness of intervention-based training for deletion intervention. But we show that this training does not attenuate other aspects of semantic unfaithfulness such as the models' inability to deal with negation intervention or to capture the predicate-argument structure of texts. We also test InstructGPT, via prompting, for its ability to handle the two interventions and to capture predicate-argument structure. While InstructGPT models do achieve very high performance on predicate-argument structure task, they fail to respond adequately to our deletion and negation interventions.
    Generation of a Compendium of Transcription Factor Cascades and Identification of Potential Therapeutic Targets using Graph Machine Learning. (arXiv:2311.17969v1 [q-bio.MN])
    Transcription factors (TFs) play a vital role in the regulation of gene expression thereby making them critical to many cellular processes. In this study, we used graph machine learning methods to create a compendium of TF cascades using data extracted from the STRING database. A TF cascade is a sequence of TFs that regulate each other, forming a directed path in the TF network. We constructed a knowledge graph of 81,488 unique TF cascades, with the longest cascade consisting of 62 TFs. Our results highlight the complex and intricate nature of TF interactions, where multiple TFs work together to regulate gene expression. We also identified 10 TFs with the highest regulatory influence based on centrality measurements, providing valuable information for researchers interested in studying specific TFs. Furthermore, our pathway enrichment analysis revealed significant enrichment of various pathways and functional categories, including those involved in cancer and other diseases, as well as those involved in development, differentiation, and cell signaling. The enriched pathways identified in this study may have potential as targets for therapeutic intervention in diseases associated with dysregulation of transcription factors. We have released the dataset, knowledge graph, and graphML methods for the TF cascades, and created a website to display the results, which can be accessed by researchers interested in using this dataset. Our study provides a valuable resource for understanding the complex network of interactions between TFs and their regulatory roles in cellular processes.
    A Natural Gas Consumption Forecasting System for Continual Learning Scenarios based on Hoeffding Trees with Change Point Detection Mechanism. (arXiv:2309.03720v2 [cs.LG] UPDATED)
    Forecasting natural gas consumption, considering seasonality and trends, is crucial in planning its supply and consumption and optimizing the cost of obtaining it, mainly by industrial entities. However, in times of threats to its supply, it is also a critical element that guarantees the supply of this raw material to meet individual consumers' needs, ensuring society's energy security. This article introduces a novel multistep ahead forecasting of natural gas consumption with change point detection integration for model collection selection with continual learning capabilities using data stream processing. The performance of the forecasting models based on the proposed approach is evaluated in a complex real-world use case of natural gas consumption forecasting. We employed Hoeffding tree predictors as forecasting models and the Pruned Exact Linear Time (PELT) algorithm for the change point detection procedure. The change point detection integration enables selecting a different model collection for successive time frames. Thus, three model collection selection procedures (with and without an error feedback loop) are defined and evaluated for forecasting scenarios with various densities of detected change points. These models were compared with change point agnostic baseline approaches. Our experiments show that fewer change points result in a lower forecasting error regardless of the model collection selection procedure employed. Also, simpler model collection selection procedures omitting forecasting error feedback leads to more robust forecasting models suitable for continual learning tasks.
    19 Parameters Is All You Need: Tiny Neural Networks for Particle Physics. (arXiv:2310.16121v2 [hep-ph] UPDATED)
    As particle accelerators increase their collision rates, and deep learning solutions prove their viability, there is a growing need for lightweight and fast neural network architectures for low-latency tasks such as triggering. We examine the potential of one recent Lorentz- and permutation-symmetric architecture, PELICAN, and present its instances with as few as 19 trainable parameters that outperform generic architectures with tens of thousands of parameters when compared on the binary classification task of top quark jet tagging.
    Dynamical phase transition in quantum neural networks with large depth. (arXiv:2311.18144v1 [quant-ph])
    Understanding the training dynamics of quantum neural networks is a fundamental task in quantum information science with wide impact in physics, chemistry and machine learning. In this work, we show that the late-time training dynamics of quantum neural networks can be described by the generalized Lotka-Volterra equations, which lead to a dynamical phase transition. When the targeted value of cost function crosses the minimum achievable value from above to below, the dynamics evolve from a frozen-kernel phase to a frozen-error phase, showing a duality between the quantum neural tangent kernel and the total error. In both phases, the convergence towards the fixed point is exponential, while at the critical point becomes polynomial. Via mapping the Hessian of the training dynamics to a Hamiltonian in the imaginary time, we reveal the nature of the phase transition to be second-order with the exponent $\nu=1$, where scale invariance and closing gap are observed at critical point. We also provide a non-perturbative analytical theory to explain the phase transition via a restricted Haar ensemble at late time, when the output state approaches the steady state. The theory findings are verified experimentally on IBM quantum devices.
    LMRL Gym: Benchmarks for Multi-Turn Reinforcement Learning with Language Models. (arXiv:2311.18232v1 [cs.CL])
    Large language models (LLMs) provide excellent text-generation capabilities, but standard prompting and generation methods generally do not lead to intentional or goal-directed agents and might necessitate considerable prompt tuning. This becomes particularly apparent in multi-turn conversations: even the best current LLMs rarely ask clarifying questions, engage in explicit information gathering, or take actions now that lead to better decisions after multiple turns. Reinforcement learning has the potential to leverage the powerful modeling capabilities of LLMs, as well as their internal representation of textual interactions, to create capable goal-directed language agents. This can enable intentional and temporally extended interactions, such as with humans, through coordinated persuasion and carefully crafted questions, or in goal-directed play through text games to bring about desired final outcomes. However, enabling this requires the community to develop stable and reliable reinforcement learning algorithms that can effectively train LLMs. Developing such algorithms requires tasks that can gauge progress on algorithm design, provide accessible and reproducible evaluations for multi-turn interactions, and cover a range of task properties and challenges in improving reinforcement learning algorithms. Our paper introduces the LMRL-Gym benchmark for evaluating multi-turn RL for LLMs, together with an open-source research framework containing a basic toolkit for getting started on multi-turn RL with offline value-based and policy-based RL methods. Our benchmark consists of 8 different language tasks, which require multiple rounds of language interaction and cover a range of tasks in open-ended dialogue and text games.
    Aggregated f-average Neural Network for Interpretable Ensembling. (arXiv:2310.05566v2 [cs.LG] UPDATED)
    Ensemble learning leverages multiple models (i.e., weak learners) on a common machine learning task to enhance prediction performance. Basic ensembling approaches average the weak learners outputs, while more sophisticated ones stack a machine learning model in between the weak learners outputs and the final prediction. This work fuses both aforementioned frameworks. We introduce an aggregated f-average (AFA) shallow neural network which models and combines different types of averages to perform an optimal aggregation of the weak learners predictions. We emphasise its interpretable architecture and simple training strategy, and illustrate its good performance on the problem of few-shot class incremental learning.
    Filtered Semi-Markov CRF. (arXiv:2311.18028v1 [cs.CL])
    Semi-Markov CRF has been proposed as an alternative to the traditional Linear Chain CRF for text segmentation tasks such as Named Entity Recognition (NER). Unlike CRF, which treats text segmentation as token-level prediction, Semi-CRF considers segments as the basic unit, making it more expressive. However, Semi-CRF suffers from two major drawbacks: (1) quadratic complexity over sequence length, as it operates on every span of the input sequence, and (2) inferior performance compared to CRF for sequence labeling tasks like NER. In this paper, we introduce Filtered Semi-Markov CRF, a variant of Semi-CRF that addresses these issues by incorporating a filtering step to eliminate irrelevant segments, reducing complexity and search space. Our approach is evaluated on several NER benchmarks, where it outperforms both CRF and Semi-CRF while being significantly faster. The implementation of our method is available on \href{https://github.com/urchade/Filtered-Semi-Markov-CRF}{Github}.
    The Trifecta: Three simple techniques for training deeper Forward-Forward networks. (arXiv:2311.18130v1 [cs.LG])
    Modern machine learning models are able to outperform humans on a variety of non-trivial tasks. However, as the complexity of the models increases, they consume significant amounts of power and still struggle to generalize effectively to unseen data. Local learning, which focuses on updating subsets of a model's parameters at a time, has emerged as a promising technique to address these issues. Recently, a novel local learning algorithm, called Forward-Forward, has received widespread attention due to its innovative approach to learning. Unfortunately, its application has been limited to smaller datasets due to scalability issues. To this end, we propose The Trifecta, a collection of three simple techniques that synergize exceptionally well and drastically improve the Forward-Forward algorithm on deeper networks. Our experiments demonstrate that our models are on par with similarly structured, backpropagation-based models in both training speed and test accuracy on simple datasets. This is achieved by the ability to learn representations that are informative locally, on a layer-by-layer basis, and retain their informativeness when propagated to deeper layers in the architecture. This leads to around 84\% accuracy on CIFAR-10, a notable improvement (25\%) over the original FF algorithm. These results highlight the potential of Forward-Forward as a genuine competitor to backpropagation and as a promising research avenue.
    SMaRt: Improving GANs with Score Matching Regularity. (arXiv:2311.18208v1 [cs.LG])
    Generative adversarial networks (GANs) usually struggle in learning from highly diverse data, whose underlying manifold is complex. In this work, we revisit the mathematical foundations of GANs, and theoretically reveal that the native adversarial loss for GAN training is insufficient to fix the problem of subsets with positive Lebesgue measure of the generated data manifold lying out of the real data manifold. Instead, we find that score matching serves as a valid solution to this issue thanks to its capability of persistently pushing the generated data points towards the real data manifold. We thereby propose to improve the optimization of GANs with score matching regularity (SMaRt). Regarding the empirical evidences, we first design a toy example to show that training GANs by the aid of a ground-truth score function can help reproduce the real data distribution more accurately, and then confirm that our approach can consistently boost the synthesis performance of various state-of-the-art GANs on real-world datasets with pre-trained diffusion models acting as the approximate score function. For instance, when training Aurora on the ImageNet 64x64 dataset, we manage to improve FID from 8.87 to 7.11, on par with the performance of one-step consistency model. The source code will be made public.
    On the convergence of adaptive first order methods: proximal gradient and alternating minimization algorithms. (arXiv:2311.18431v1 [math.OC])
    Building upon recent works on linesearch-free adaptive proximal gradient methods, this paper proposes AdaPG$^{\pi,r}$, a framework that unifies and extends existing results by providing larger stepsize policies and improved lower bounds. Different choices of the parameters $\pi$ and $r$ are discussed and the efficacy of the resulting methods is demonstrated through numerical simulations. In an attempt to better understand the underlying theory, its convergence is established in a more general setting that allows for time-varying parameters. Finally, an adaptive alternating minimization algorithm is presented by exploring the dual setting. This algorithm not only incorporates additional adaptivity, but also expands its applicability beyond standard strongly convex settings.
    Consensus, dissensus and synergy between clinicians and specialist foundation models in radiology report generation. (arXiv:2311.18260v1 [eess.IV])
    Radiology reports are an instrumental part of modern medicine, informing key clinical decisions such as diagnosis and treatment. The worldwide shortage of radiologists, however, restricts access to expert care and imposes heavy workloads, contributing to avoidable errors and delays in report delivery. While recent progress in automated report generation with vision-language models offer clear potential in ameliorating the situation, the path to real-world adoption has been stymied by the challenge of evaluating the clinical quality of AI-generated reports. In this study, we build a state-of-the-art report generation system for chest radiographs, Flamingo-CXR, by fine-tuning a well-known vision-language foundation model on radiology data. To evaluate the quality of the AI-generated reports, a group of 16 certified radiologists provide detailed evaluations of AI-generated and human written reports for chest X-rays from an intensive care setting in the United States and an inpatient setting in India. At least one radiologist (out of two per case) preferred the AI report to the ground truth report in over 60$\%$ of cases for both datasets. Amongst the subset of AI-generated reports that contain errors, the most frequently cited reasons were related to the location and finding, whereas for human written reports, most mistakes were related to severity and finding. This disparity suggested potential complementarity between our AI system and human experts, prompting us to develop an assistive scenario in which Flamingo-CXR generates a first-draft report, which is subsequently revised by a clinician. This is the first demonstration of clinician-AI collaboration for report writing, and the resultant reports are assessed to be equivalent or preferred by at least one radiologist to reports written by experts alone in 80$\%$ of in-patient cases and 66$\%$ of intensive care cases.
    Focused Transformer: Contrastive Training for Context Scaling. (arXiv:2307.03170v2 [cs.CL] UPDATED)
    Large language models have an exceptional capability to incorporate new information in a contextual manner. However, the full potential of such an approach is often restrained due to a limitation in the effective context length. One solution to this issue is to endow an attention layer with access to an external memory, which comprises of (key, value) pairs. Yet, as the number of documents increases, the proportion of relevant keys to irrelevant ones decreases, leading the model to focus more on the irrelevant keys. We identify a significant challenge, dubbed the distraction issue, where keys linked to different semantic values might overlap, making them hard to distinguish. To tackle this problem, we introduce the Focused Transformer (FoT), a technique that employs a training process inspired by contrastive learning. This novel approach enhances the structure of the (key, value) space, enabling an extension of the context length. Our method allows for fine-tuning pre-existing, large-scale models to lengthen their effective context. This is demonstrated by our fine-tuning of $3B$ and $7B$ OpenLLaMA checkpoints. The resulting models, which we name LongLLaMA, exhibit advancements in tasks requiring a long context. We further illustrate that our LongLLaMA models adeptly manage a $256 k$ context length for passkey retrieval.
    Understanding and Improving In-Context Learning on Vision-language Models. (arXiv:2311.18021v1 [cs.CV])
    Recently, in-context learning (ICL) on large language models (LLMs) has received great attention, and this technique can also be applied to vision-language models (VLMs) built upon LLMs. These VLMs can respond to queries by conditioning responses on a series of multimodal demonstrations, which comprise images, queries, and answers. Though ICL has been extensively studied on LLMs, its research on VLMs remains limited. The inclusion of additional visual information in the demonstrations motivates the following research questions: which of the two modalities in the demonstration is more significant? How can we select effective multimodal demonstrations to enhance ICL performance? This study investigates the significance of both visual and language information. Our findings indicate that ICL in VLMs is predominantly driven by the textual information in the demonstrations whereas the visual information in the demonstrations barely affects the ICL performance. Subsequently, we provide an understanding of the findings by analyzing the model information flow and comparing model inner states given different ICL settings. Motivated by our analysis, we propose a simple yet effective approach, termed Mixed Modality In-Context Example Selection (MMICES), which considers both visual and language modalities when selecting demonstrations and shows better ICL performance. Extensive experiments are conducted to support our findings, understanding, and improvement of the ICL performance of VLMs.
    Hubness Reduction Improves Sentence-BERT Semantic Spaces. (arXiv:2311.18364v1 [cs.CL])
    Semantic representations of text, i.e. representations of natural language which capture meaning by geometry, are essential for areas such as information retrieval and document grouping. High-dimensional trained dense vectors have received much attention in recent years as such representations. We investigate the structure of semantic spaces that arise from embeddings made with Sentence-BERT and find that the representations suffer from a well-known problem in high dimensions called hubness. Hubness results in asymmetric neighborhood relations, such that some texts (the hubs) are neighbours of many other texts while most texts (so-called anti-hubs), are neighbours of few or no other texts. We quantify the semantic quality of the embeddings using hubness scores and error rate of a neighbourhood based classifier. We find that when hubness is high, we can reduce error rate and hubness using hubness reduction methods. We identify a combination of two methods as resulting in the best reduction. For example, on one of the tested pretrained models, this combined method can reduce hubness by about 75% and error rate by about 9%. Thus, we argue that mitigating hubness in the embedding space provides better semantic representations of text.
    Self-Chained Image-Language Model for Video Localization and Question Answering. (arXiv:2305.06988v2 [cs.CV] UPDATED)
    Recent studies have shown promising results on utilizing large pre-trained image-language models for video question answering. While these image-language models can efficiently bootstrap the representation learning of video-language models, they typically concatenate uniformly sampled video frames as visual inputs without explicit language-aware, temporal modeling. When only a portion of a video input is relevant to the language query, such uniform frame sampling can often lead to missing important visual cues. Although humans often find a video moment to focus on and rewind the moment to answer questions, training a query-aware video moment localizer often requires expensive annotations and high computational costs. To address this issue, we propose Self-Chained Video Localization-Answering (SeViLA), a novel framework that leverages a single image-language model (BLIP-2) to tackle both temporal keyframe localization and QA on videos. SeViLA framework consists of two modules: Localizer and Answerer, where both are parameter-efficiently fine-tuned from BLIP-2. We propose two ways of chaining these modules for cascaded inference and self-refinement. First, in the forward chain, the Localizer finds multiple language-aware keyframes in a video, which the Answerer uses to predict the answer. Second, in the reverse chain, the Answerer generates keyframe pseudo-labels to refine the Localizer, alleviating the need for expensive video moment localization annotations. Our SeViLA framework outperforms several strong baselines on 5 challenging video QA and event prediction benchmarks, and achieves the state-of-the-art in both fine-tuning (NExT-QA, STAR) and zero-shot (NExT-QA, STAR, How2QA, VLEP) settings. We also analyze the impact of Localizer, comparisons of Localizer with other temporal localization models, pre-training/self-refinement of Localizer, and varying the number of keyframes.
    A Nystr\"om method with missing distances. (arXiv:2311.18076v1 [cs.IT])
    We study the problem of determining the configuration of $n$ points, referred to as mobile nodes, by utilizing pairwise distances to $m$ fixed points known as anchor nodes. In the standard setting, we have information about the distances between anchors (anchor-anchor) and between anchors and mobile nodes (anchor-mobile), but the distances between mobile nodes (mobile-mobile) are not known. For this setup, the Nystr\"om method is a viable technique for estimating the positions of the mobile nodes. This study focuses on the setting where the anchor-mobile block of the distance matrix contains only partial distance information. First, we establish a relationship between the columns of the anchor-mobile block in the distance matrix and the columns of the corresponding block in the Gram matrix via a graph Laplacian. Exploiting this connection, we introduce a novel sampling model that frames the position estimation problem as low-rank recovery of an inner product matrix, given a subset of its expansion coefficients in a special non-orthogonal basis. This basis and its dual basis--the central elements of our model--are explicitly derived. Our analysis is grounded in a specific centering of the points that is unique to the Nystr\"om method. With this in mind, we extend previous work in Euclidean distance geometry by providing a general dual basis approach for points centered anywhere.
    Adaptive Early Exiting for Collaborative Inference over Noisy Wireless Channels. (arXiv:2311.18098v1 [cs.LG])
    Collaborative inference systems are one of the emerging solutions for deploying deep neural networks (DNNs) at the wireless network edge. Their main idea is to divide a DNN into two parts, where the first is shallow enough to be reliably executed at edge devices of limited computational power, while the second part is executed at an edge server with higher computational capabilities. The main advantage of such systems is that the input of the DNN gets compressed as the subsequent layers of the shallow part extract only the information necessary for the task. As a result, significant communication savings can be achieved compared to transmitting raw input samples. In this work, we study early exiting in the context of collaborative inference, which allows obtaining inference results at the edge device for certain samples, without the need to transmit the partially processed data to the edge server at all, leading to further communication savings. The central part of our system is the transmission-decision (TD) mechanism, which, given the information from the early exit, and the wireless channel conditions, decides whether to keep the early exit prediction or transmit the data to the edge server for further processing. In this paper, we evaluate various TD mechanisms and show experimentally, that for an image classification task over the wireless edge, proper utilization of early exits can provide both performance gains and significant communication savings.
    Transfer Learning in Robotics: An Upcoming Breakthrough? A Review of Promises and Challenges. (arXiv:2311.18044v1 [cs.RO])
    Transfer learning is a conceptually-enticing paradigm in pursuit of truly intelligent embodied agents. The core concept -- reusing prior knowledge to learn in and from novel situations -- is successfully leveraged by humans to handle novel situations. In recent years, transfer learning has received renewed interest from the community from different perspectives, including imitation learning, domain adaptation, and transfer of experience from simulation to the real world, among others. In this paper, we unify the concept of transfer learning in robotics and provide the first taxonomy of its kind considering the key concepts of robot, task, and environment. Through a review of the promises and challenges in the field, we identify the need of transferring at different abstraction levels, the need of quantifying the transfer gap and the quality of transfer, as well as the dangers of negative transfer. Via this position paper, we hope to channel the effort of the community towards the most significant roadblocks to realize the full potential of transfer learning in robotics.
    A Bag of Receptive Fields for Time Series Extrinsic Predictions. (arXiv:2311.18029v1 [cs.LG])
    High-dimensional time series data poses challenges due to its dynamic nature, varying lengths, and presence of missing values. This kind of data requires extensive preprocessing, limiting the applicability of existing Time Series Classification and Time Series Extrinsic Regression techniques. For this reason, we propose BORF, a Bag-Of-Receptive-Fields model, which incorporates notions from time series convolution and 1D-SAX to handle univariate and multivariate time series with varying lengths and missing values. We evaluate BORF on Time Series Classification and Time Series Extrinsic Regression tasks using the full UEA and UCR repositories, demonstrating its competitive performance against state-of-the-art methods. Finally, we outline how this representation can naturally provide saliency and feature-based explanations.
    The Forecastability of Underlying Building Electricity Demand from Time Series Data. (arXiv:2311.18078v1 [cs.LG])
    Forecasting building energy consumption has become a promising solution in Building Energy Management Systems for energy saving and optimization. Furthermore, it can play an important role in the efficient management of the operation of a smart grid. Different data-driven approaches to forecast the future energy demand of buildings at different scale, and over various time horizons, can be found in the scientific literature, including extensive Machine Learning and Deep Learning approaches. However, the identification of the most accurate forecaster model which can be utilized to predict the energy demand of such a building is still challenging.In this paper, the design and implementation of a data-driven approach to predict how forecastable the future energy demand of a building is, without first utilizing a data-driven forecasting model, is presented. The investigation utilizes a historical electricity consumption time series data set with a half-hour interval that has been collected from a group of residential buildings located in the City of London, United Kingdom
    A trainable manifold for accurate approximation with ReLU Networks. (arXiv:2311.18022v1 [cs.LG])
    We present a novel technique for exercising greater control of the weights of ReLU activated neural networks to produce more accurate function approximations. Many theoretical works encode complex operations into ReLU networks using smaller base components. In these works, a common base component is a constant width approximation to x^2, which has exponentially decaying error with respect to depth. We extend this block to represent a greater range of convex one-dimensional functions. We derive a manifold of weights such that the output of these new networks utilizes exponentially many piecewise-linear segments. This manifold guides their training process to overcome drawbacks associated with random initialization and unassisted gradient descent. We train these networks to approximate functions which do not necessarily lie on the manifold, showing a significant reduction of error values over conventional approaches.
    Class Distribution Shifts in Zero-Shot Learning: Learning Robust Representations. (arXiv:2311.18575v1 [cs.LG])
    Distribution shifts between training and deployment data often affect the performance of machine learning models. In this paper, we explore a setting where a hidden variable induces a shift in the distribution of classes. These distribution shifts are particularly challenging for zero-shot classifiers, as they rely on representations learned from training classes, but are deployed on new, unseen ones. We introduce an algorithm to learn data representations that are robust to such class distribution shifts in zero-shot verification tasks. We show that our approach, which combines hierarchical data sampling with out-of-distribution generalization techniques, improves generalization to diverse class distributions in both simulations and real-world datasets.
    Latent Alignment with Deep Set EEG Decoders. (arXiv:2311.17968v1 [eess.SP])
    The variability in EEG signals between different individuals poses a significant challenge when implementing brain-computer interfaces (BCI). Commonly proposed solutions to this problem include deep learning models, due to their increased capacity and generalization, as well as explicit domain adaptation techniques. Here, we introduce the Latent Alignment method that won the Benchmarks for EEG Transfer Learning (BEETL) competition and present its formulation as a deep set applied on the set of trials from a given subject. Its performance is compared to recent statistical domain adaptation techniques under various conditions. The experimental paradigms include motor imagery (MI), oddball event-related potentials (ERP) and sleep stage classification, where different well-established deep learning models are applied on each task. Our experimental results show that performing statistical distribution alignment at later stages in a deep learning model is beneficial to the classification accuracy, yielding the highest performance for our proposed method. We further investigate practical considerations that arise in the context of using deep learning and statistical alignment for EEG decoding. In this regard, we study class-discriminative artifacts that can spuriously improve results for deep learning models, as well as the impact of class-imbalance on alignment. We delineate a trade-off relationship between increased classification accuracy when alignment is performed at later modeling stages, and susceptibility to class-imbalance in the set of trials that the statistics are computed on.
    Adaptive proximal algorithms for convex optimization under local Lipschitz continuity of the gradient. (arXiv:2301.04431v3 [math.OC] CROSS LISTED)
    Backtracking linesearch is the de facto approach for minimizing continuously differentiable functions with locally Lipschitz gradient. In recent years, it has been shown that in the convex setting it is possible to avoid linesearch altogether, and to allow the stepsize to adapt based on a local smoothness estimate without any backtracks or evaluations of the function value. In this work we propose an adaptive proximal gradient method, adaPG, that uses novel estimates of the local smoothness modulus which leads to less conservative stepsize updates and that can additionally cope with nonsmooth terms. This idea is extended to the primal-dual setting where an adaptive three-term primal-dual algorithm, adaPD, is proposed which can be viewed as an extension of the PDHG method. Moreover, in this setting the ``essentially'' fully adaptive variant adaPD$^+$ is proposed that avoids evaluating the linear operator norm by invoking a backtracking procedure, that, remarkably, does not require extra gradient evaluations. Numerical simulations demonstrate the effectiveness of the proposed algorithms compared to the state of the art.
    Improving Adversarial Transferability via Model Alignment. (arXiv:2311.18495v1 [cs.LG])
    Neural networks are susceptible to adversarial perturbations that are transferable across different models. In this paper, we introduce a novel model alignment technique aimed at improving a given source model's ability in generating transferable adversarial perturbations. During the alignment process, the parameters of the source model are fine-tuned to minimize an alignment loss. This loss measures the divergence in the predictions between the source model and another, independently trained model, referred to as the witness model. To understand the effect of model alignment, we conduct a geometric anlaysis of the resulting changes in the loss landscape. Extensive experiments on the ImageNet dataset, using a variety of model architectures, demonstrate that perturbations generated from aligned source models exhibit significantly higher transferability than those from the original source model.
    OpenMM 8: Molecular Dynamics Simulation with Machine Learning Potentials. (arXiv:2310.03121v2 [physics.chem-ph] UPDATED)
    Machine learning plays an important and growing role in molecular simulation. The newest version of the OpenMM molecular dynamics toolkit introduces new features to support the use of machine learning potentials. Arbitrary PyTorch models can be added to a simulation and used to compute forces and energy. A higher-level interface allows users to easily model their molecules of interest with general purpose, pretrained potential functions. A collection of optimized CUDA kernels and custom PyTorch operations greatly improves the speed of simulations. We demonstrate these features on simulations of cyclin-dependent kinase 8 (CDK8) and the green fluorescent protein (GFP) chromophore in water. Taken together, these features make it practical to use machine learning to improve the accuracy of simulations at only a modest increase in cost.
    Towards Comparable Active Learning. (arXiv:2311.18356v1 [cs.LG])
    Active Learning has received significant attention in the field of machine learning for its potential in selecting the most informative samples for labeling, thereby reducing data annotation costs. However, we show that the reported lifts in recent literature generalize poorly to other domains leading to an inconclusive landscape in Active Learning research. Furthermore, we highlight overlooked problems for reproducing AL experiments that can lead to unfair comparisons and increased variance in the results. This paper addresses these issues by providing an Active Learning framework for a fair comparison of algorithms across different tasks and domains, as well as a fast and performant oracle algorithm for evaluation. To the best of our knowledge, we propose the first AL benchmark that tests algorithms in 3 major domains: Tabular, Image, and Text. We report empirical results for 6 widely used algorithms on 7 real-world and 2 synthetic datasets and aggregate them into a domain-specific ranking of AL algorithms.
    Towards out-of-distribution generalization in large-scale astronomical surveys: robust networks learn similar representations. (arXiv:2311.18007v1 [astro-ph.IM])
    The generalization of machine learning (ML) models to out-of-distribution (OOD) examples remains a key challenge in extracting information from upcoming astronomical surveys. Interpretability approaches are a natural way to gain insights into the OOD generalization problem. We use Centered Kernel Alignment (CKA), a similarity measure metric of neural network representations, to examine the relationship between representation similarity and performance of pre-trained Convolutional Neural Networks (CNNs) on the CAMELS Multifield Dataset. We find that when models are robust to a distribution shift, they produce substantially different representations across their layers on OOD data. However, when they fail to generalize, these representations change less from layer to layer on OOD data. We discuss the potential application of similarity representation in guiding model design, training strategy, and mitigating the OOD problem by incorporating CKA as an inductive bias during training.
    TimeGNN: Temporal Dynamic Graph Learning for Time Series Forecasting. (arXiv:2307.14680v2 [cs.LG] UPDATED)
    Time series forecasting lies at the core of important real-world applications in many fields of science and engineering. The abundance of large time series datasets that consist of complex patterns and long-term dependencies has led to the development of various neural network architectures. Graph neural network approaches, which jointly learn a graph structure based on the correlation of raw values of multivariate time series while forecasting, have recently seen great success. However, such solutions are often costly to train and difficult to scale. In this paper, we propose TimeGNN, a method that learns dynamic temporal graph representations that can capture the evolution of inter-series patterns along with the correlations of multiple series. TimeGNN achieves inference times 4 to 80 times faster than other state-of-the-art graph-based methods while achieving comparable forecasting performance
    On Sparse Modern Hopfield Model. (arXiv:2309.12673v2 [cs.LG] UPDATED)
    We introduce the sparse modern Hopfield model as a sparse extension of the modern Hopfield model. Like its dense counterpart, the sparse modern Hopfield model equips a memory-retrieval dynamics whose one-step approximation corresponds to the sparse attention mechanism. Theoretically, our key contribution is a principled derivation of a closed-form sparse Hopfield energy using the convex conjugate of the sparse entropic regularizer. Building upon this, we derive the sparse memory retrieval dynamics from the sparse energy function and show its one-step approximation is equivalent to the sparse-structured attention. Importantly, we provide a sparsity-dependent memory retrieval error bound which is provably tighter than its dense analog. The conditions for the benefits of sparsity to arise are therefore identified and discussed. In addition, we show that the sparse modern Hopfield model maintains the robust theoretical properties of its dense counterpart, including rapid fixed point convergence and exponential memory capacity. Empirically, we use both synthetic and real-world datasets to demonstrate that the sparse Hopfield model outperforms its dense counterpart in many situations.
    Dodging DeepFake Detection via Implicit Spatial-Domain Notch Filtering. (arXiv:2009.09213v5 [cs.CV] UPDATED)
    The current high-fidelity generation and high-precision detection of DeepFake images are at an arms race. We believe that producing DeepFakes that are highly realistic and 'detection evasive' can serve the ultimate goal of improving future generation DeepFake detection capabilities. In this paper, we propose a simple yet powerful pipeline to reduce the artifact patterns of fake images without hurting image quality by performing implicit spatial-domain notch filtering. We first demonstrate that frequency-domain notch filtering, although famously shown to be effective in removing periodic noise in the spatial domain, is infeasible for our task at hand due to the manual designs required for the notch filters. We, therefore, resort to a learning-based approach to reproduce the notch filtering effects, but solely in the spatial domain. We adopt a combination of adding overwhelming spatial noise for breaking the periodic noise pattern and deep image filtering to reconstruct the noise-free fake images, and we name our method DeepNotch. Deep image filtering provides a specialized filter for each pixel in the noisy image, producing filtered images with high fidelity compared to their DeepFake counterparts. Moreover, we also use the semantic information of the image to generate an adversarial guidance map to add noise intelligently. Our large-scale evaluation on 3 representative state-of-the-art DeepFake detection methods (tested on 16 types of DeepFakes) has demonstrated that our technique significantly reduces the accuracy of these 3 fake image detection methods, 36.79% on average and up to 97.02% in the best case.
    Transformer Based Model for Predicting Rapid Impact Compaction Outcomes: A Case Study of Utapao International Airport. (arXiv:2311.17959v1 [cs.LG])
    This paper introduces a novel deep learning approach to predict the engineering properties of the ground improved by Rapid Impact Compaction (RIC), which is a ground improvement technique that uses a drop hammer to compact the soil and fill layers. The proposed approach uses transformer-based neural networks to capture the complex nonlinear relationships between the input features, such as the hammer energy, drop height, and number of blows, and the output variables, such as the cone resistance. The approach is applied to a real-world dataset from a trial test section for the new apron construction of the Utapao International Airport in Thailand. The results show that the proposed approach outperforms the existing methods in terms of prediction accuracy and efficiency and provides interpretable attention maps that reveal the importance of different features for RIC prediction. The paper also discusses the limitations and future directions of applying deep learning methods to RIC prediction.
    Reasoning with the Theory of Mind for Pragmatic Semantic Communication. (arXiv:2311.18224v1 [cs.IT])
    In this paper, a pragmatic semantic communication framework that enables effective goal-oriented information sharing between two-intelligent agents is proposed. In particular, semantics is defined as the causal state that encapsulates the fundamental causal relationships and dependencies among different features extracted from data. The proposed framework leverages the emerging concept in machine learning (ML) called theory of mind (ToM). It employs a dynamic two-level (wireless and semantic) feedback mechanism to continuously fine-tune neural network components at the transmitter. Thanks to the ToM, the transmitter mimics the actual mental state of the receiver's reasoning neural network operating semantic interpretation. Then, the estimated mental state at the receiver is dynamically updated thanks to the proposed dynamic two-level feedback mechanism. At the lower level, conventional channel quality metrics are used to optimize the channel encoding process based on the wireless communication channel's quality, ensuring an efficient mapping of semantic representations to a finite constellation. Additionally, a semantic feedback level is introduced, providing information on the receiver's perceived semantic effectiveness with minimal overhead. Numerical evaluations demonstrate the framework's ability to achieve efficient communication with a reduced amount of bits while maintaining the same semantics, outperforming conventional systems that do not exploit the ToM-based reasoning.
    Some Intriguing Aspects about Lipschitz Continuity of Neural Networks. (arXiv:2302.10886v3 [cs.LG] UPDATED)
    Lipschitz continuity is a crucial functional property of any predictive model, that naturally governs its robustness, generalisation, as well as adversarial vulnerability. Contrary to other works that focus on obtaining tighter bounds and developing different practical strategies to enforce certain Lipschitz properties, we aim to thoroughly examine and characterise the Lipschitz behaviour of Neural Networks. Thus, we carry out an empirical investigation in a range of different settings (namely, architectures, datasets, label noise, and more) by exhausting the limits of the simplest and the most general lower and upper bounds. As a highlight of this investigation, we showcase a remarkable fidelity of the lower Lipschitz bound, identify a striking Double Descent trend in both upper and lower bounds to the Lipschitz and explain the intriguing effects of label noise on function smoothness and generalisation.
    Image retrieval outperforms diffusion models on data augmentation. (arXiv:2304.10253v2 [cs.CV] UPDATED)
    Many approaches have been proposed to use diffusion models to augment training datasets for downstream tasks, such as classification. However, diffusion models are themselves trained on large datasets, often with noisy annotations, and it remains an open question to which extent these models contribute to downstream classification performance. In particular, it remains unclear if they generalize enough to improve over directly using the additional data of their pre-training process for augmentation. We systematically evaluate a range of existing methods to generate images from diffusion models and study new extensions to assess their benefit for data augmentation. Personalizing diffusion models towards the target data outperforms simpler prompting strategies. However, using the pre-training data of the diffusion model alone, via a simple nearest-neighbor retrieval procedure, leads to even stronger downstream performance. Our study explores the potential of diffusion models in generating new training data, and surprisingly finds that these sophisticated models are not yet able to beat a simple and strong image retrieval baseline on simple downstream vision tasks.
    Bayesian CART models for insurance claims frequency. (arXiv:2303.01923v2 [stat.ML] UPDATED)
    Accuracy and interpretability of a (non-life) insurance pricing model are essential qualities to ensure fair and transparent premiums for policy-holders, that reflect their risk. In recent years, the classification and regression trees (CARTs) and their ensembles have gained popularity in the actuarial literature, since they offer good prediction performance and are relatively easily interpretable. In this paper, we introduce Bayesian CART models for insurance pricing, with a particular focus on claims frequency modelling. Additionally to the common Poisson and negative binomial (NB) distributions used for claims frequency, we implement Bayesian CART for the zero-inflated Poisson (ZIP) distribution to address the difficulty arising from the imbalanced insurance claims data. To this end, we introduce a general MCMC algorithm using data augmentation methods for posterior tree exploration. We also introduce the deviance information criterion (DIC) for the tree model selection. The proposed models are able to identify trees which can better classify the policy-holders into risk groups. Some simulations and real insurance data will be discussed to illustrate the applicability of these models.
    Locally Differentially Private Document Generation Using Zero Shot Prompting. (arXiv:2310.16111v2 [cs.CL] UPDATED)
    Numerous studies have highlighted the privacy risks associated with pretrained large language models. In contrast, our research offers a unique perspective by demonstrating that pretrained large language models can effectively contribute to privacy preservation. We propose a locally differentially private mechanism called DP-Prompt, which leverages the power of pretrained large language models and zero-shot prompting to counter author de-anonymization attacks while minimizing the impact on downstream utility. When DP-Prompt is used with a powerful language model like ChatGPT (gpt-3.5), we observe a notable reduction in the success rate of de-anonymization attacks, showing that it surpasses existing approaches by a considerable margin despite its simpler design. For instance, in the case of the IMDB dataset, DP-Prompt (with ChatGPT) perfectly recovers the clean sentiment F1 score while achieving a 46\% reduction in author identification F1 score against static attackers and a 26\% reduction against adaptive attackers. We conduct extensive experiments across six open-source large language models, ranging up to 7 billion parameters, to analyze various effects of the privacy-utility tradeoff.
    Operator Learning with Neural Fields: Tackling PDEs on General Geometries. (arXiv:2306.07266v2 [cs.LG] UPDATED)
    Machine learning approaches for solving partial differential equations require learning mappings between function spaces. While convolutional or graph neural networks are constrained to discretized functions, neural operators present a promising milestone toward mapping functions directly. Despite impressive results they still face challenges with respect to the domain geometry and typically rely on some form of discretization. In order to alleviate such limitations, we present CORAL, a new method that leverages coordinate-based networks for solving PDEs on general geometries. CORAL is designed to remove constraints on the input mesh, making it applicable to any spatial sampling and geometry. Its ability extends to diverse problem domains, including PDE solving, spatio-temporal forecasting, and inverse problems like geometric design. CORAL demonstrates robust performance across multiple resolutions and performs well in both convex and non-convex domains, surpassing or performing on par with state-of-the-art models.
    Extending Explainable Boosting Machines to Scientific Image Data. (arXiv:2305.16526v2 [cs.CV] UPDATED)
    As the deployment of computer vision technology becomes increasingly common in science, the need for explanations of the system and its output has become a focus of great concern. Driven by the pressing need for interpretable models in science, we propose the use of Explainable Boosting Machines (EBMs) for scientific image data. Inspired by an important application underpinning the development of quantum technologies, we apply EBMs to cold-atom soliton image data tabularized using Gabor Wavelet Transform-based techniques that preserve the spatial structure of the data. In doing so, we demonstrate the use of EBMs for image data for the first time and show that our approach provides explanations that are consistent with human intuition about the data.
    Contrastive Denoising Score for Text-guided Latent Diffusion Image Editing. (arXiv:2311.18608v1 [cs.CV])
    With the remarkable advent of text-to-image diffusion models, image editing methods have become more diverse and continue to evolve. A promising recent approach in this realm is Delta Denoising Score (DDS) - an image editing technique based on Score Distillation Sampling (SDS) framework that leverages the rich generative prior of text-to-image diffusion models. However, relying solely on the difference between scoring functions is insufficient for preserving specific structural elements from the original image, a crucial aspect of image editing. Inspired by the similarity and importance differences between DDS and the contrastive learning for unpaired image-to-image translation (CUT), here we present an embarrassingly simple yet very powerful modification of DDS, called Contrastive Denoising Score (CDS), for latent diffusion models (LDM). Specifically, to enforce structural correspondence between the input and output while maintaining the controllability of contents, we introduce a straightforward approach to regulate structural consistency using CUT loss within the DDS framework. To calculate this loss, instead of employing auxiliary networks, we utilize the intermediate features of LDM, in particular, those from the self-attention layers, which possesses rich spatial information. Our approach enables zero-shot image-to-image translation and neural radiance field (NeRF) editing, achieving a well-balanced interplay between maintaining the structural details and transforming content. Qualitative results and comparisons demonstrates the effectiveness of our proposed method. Project page with code is available at https://hyelinnam.github.io/CDS/.
    Enhancing Data-Assimilation in CFD using Graph Neural Networks. (arXiv:2311.18027v1 [physics.flu-dyn])
    We present a novel machine learning approach for data assimilation applied in fluid mechanics, based on adjoint-optimization augmented by Graph Neural Networks (GNNs) models. We consider as baseline the Reynolds-Averaged Navier-Stokes (RANS) equations, where the unknown is the meanflow and a closure model based on the Reynolds-stress tensor is required for correctly computing the solution. An end-to-end process is cast; first, we train a GNN model for the closure term. Second, the GNN model is introduced in the training process of data assimilation, where the RANS equations act as a physics constraint for a consistent prediction. We obtain our results using direct numerical simulations based on a Finite Element Method (FEM) solver; a two-fold interface between the GNN model and the solver allows the GNN's predictions to be incorporated into post-processing steps of the FEM analysis. The proposed scheme provides an excellent reconstruction of the meanflow without any features selection; preliminary results show promising generalization properties over unseen flow configurations.
    Knowledge Pursuit Prompting for Zero-Shot Multimodal Synthesis. (arXiv:2311.17898v2 [cs.CV] UPDATED)
    Hallucinations and unfaithful synthesis due to inaccurate prompts with insufficient semantic details are widely observed in multimodal generative models. A prevalent strategy to align multiple modalities is to fine-tune the generator with a large number of annotated text-image pairs. However, such a procedure is labor-consuming and resource-draining. The key question we ask is: can we enhance the quality and faithfulness of text-driven generative models beyond extensive text-image pair annotations? To address this question, we propose Knowledge Pursuit Prompting (KPP), a zero-shot framework that iteratively incorporates external knowledge to help generators produce reliable visual content. Instead of training generators to handle generic prompts, KPP employs a recursive knowledge query process to gather informative external facts from the knowledge base, instructs a language model to compress the acquired knowledge for prompt refinement, and utilizes text-driven generators for visual synthesis. The entire process is zero-shot, without accessing the architectures and parameters of generative models. We evaluate the framework across multiple text-driven generative tasks (image, 3D rendering, and video) on datasets of different domains. We further demonstrate the extensibility and adaptability of KPP through varying foundation model bases and instructions. Our results show that KPP is capable of generating faithful and semantically rich content across diverse visual domains, offering a promising solution to improve multimodal generative models.
    Microscopy Image Segmentation via Point and Shape Regularized Data Synthesis. (arXiv:2308.09835v2 [cs.CV] UPDATED)
    Current deep learning-based approaches for the segmentation of microscopy images heavily rely on large amount of training data with dense annotation, which is highly costly and laborious in practice. Compared to full annotation where the complete contour of objects is depicted, point annotations, specifically object centroids, are much easier to acquire and still provide crucial information about the objects for subsequent segmentation. In this paper, we assume access to point annotations only during training and develop a unified pipeline for microscopy image segmentation using synthetically generated training data. Our framework includes three stages: (1) it takes point annotations and samples a pseudo dense segmentation mask constrained with shape priors; (2) with an image generative model trained in an unpaired manner, it translates the mask to a realistic microscopy image regularized by object level consistency; (3) the pseudo masks along with the synthetic images then constitute a pairwise dataset for training an ad-hoc segmentation model. On the public MoNuSeg dataset, our synthesis pipeline produces more diverse and realistic images than baseline models while maintaining high coherence between input masks and generated images. When using the identical segmentation backbones, the models trained on our synthetic dataset significantly outperform those trained with pseudo-labels or baseline-generated images. Moreover, our framework achieves comparable results to models trained on authentic microscopy images with dense labels, demonstrating its potential as a reliable and highly efficient alternative to labor-intensive manual pixel-wise annotations in microscopy image segmentation. The code is available.
    Causal Fairness under Unobserved Confounding: A Neural Sensitivity Framework. (arXiv:2311.18460v1 [cs.LG])
    Fairness for machine learning predictions is widely required in practice for legal, ethical, and societal reasons. Existing work typically focuses on settings without unobserved confounding, even though unobserved confounding can lead to severe violations of causal fairness and, thus, unfair predictions. In this work, we analyze the sensitivity of causal fairness to unobserved confounding. Our contributions are three-fold. First, we derive bounds for causal fairness metrics under different sources of unobserved confounding. This enables practitioners to examine the sensitivity of their machine learning models to unobserved confounding in fairness-critical applications. Second, we propose a novel neural framework for learning fair predictions, which allows us to offer worst-case guarantees of the extent to which causal fairness can be violated due to unobserved confounding. Third, we demonstrate the effectiveness of our framework in a series of experiments, including a real-world case study about predicting prison sentences. To the best of our knowledge, ours is the first work to study causal fairness under unobserved confounding. To this end, our work is of direct practical value as a refutation strategy to ensure the fairness of predictions in high-stakes applications.
    Leveraging cache to enable SLU on tiny devices. (arXiv:2311.18188v1 [eess.AS])
    This paper addresses spoken language understanding (SLU) on microcontroller-like embedded devices, integrating on-device execution with cloud offloading in a novel fashion. We exploit temporal locality in a device's speech inputs and accordingly reuse recent SLU inferences. Our idea is simple: let the device match new inputs against cached results, and only offload unmatched inputs to the cloud for full inference. Realization of this idea, however, is non-trivial: the device needs to compare acoustic features in a robust, low-cost way. To this end, we present XYZ, a speech cache for tiny devices. It matches speech inputs at two levels of representations: first by clustered sequences of raw sound units, then as sequences of phonemes. Working in tandem, the two representations offer complementary cost/accuracy tradeoffs. To further boost accuracy, our cache is learning: with the mismatched and then offloaded inputs, it continuously finetunes the device's feature extractors (with the assistance of the cloud). We implement XYZ on an off-the-shelf STM32 microcontroller. The resultant implementation has a small memory footprint of 2MB. Evaluated on challenging speech benchmarks, our system resolves 45%--90% of inputs on device, reducing the average latency by up to 80% compared to offloading to popular cloud speech services. Our benefit is pronounced even in adversarial settings -- noisy environments, cold cache, or one device shared by a number of users.
    Navigating Privacy and Copyright Challenges Across the Data Lifecycle of Generative AI. (arXiv:2311.18252v1 [cs.SE])
    The advent of Generative AI has marked a significant milestone in artificial intelligence, demonstrating remarkable capabilities in generating realistic images, texts, and data patterns. However, these advancements come with heightened concerns over data privacy and copyright infringement, primarily due to the reliance on vast datasets for model training. Traditional approaches like differential privacy, machine unlearning, and data poisoning only offer fragmented solutions to these complex issues. Our paper delves into the multifaceted challenges of privacy and copyright protection within the data lifecycle. We advocate for integrated approaches that combines technical innovation with ethical foresight, holistically addressing these concerns by investigating and devising solutions that are informed by the lifecycle perspective. This work aims to catalyze a broader discussion and inspire concerted efforts towards data privacy and copyright integrity in Generative AI.
    How Much Is Hidden in the NAS Benchmarks? Few-Shot Adaptation of a NAS Predictor. (arXiv:2311.18451v1 [cs.LG])
    Neural architecture search has proven to be a powerful approach to designing and refining neural networks, often boosting their performance and efficiency over manually-designed variations, but comes with computational overhead. While there has been a considerable amount of research focused on lowering the cost of NAS for mainstream tasks, such as image classification, a lot of those improvements stem from the fact that those tasks are well-studied in the broader context. Consequently, applicability of NAS to emerging and under-represented domains is still associated with a relatively high cost and/or uncertainty about the achievable gains. To address this issue, we turn our focus towards the recent growth of publicly available NAS benchmarks in an attempt to extract general NAS knowledge, transferable across different tasks and search spaces. We borrow from the rich field of meta-learning for few-shot adaptation and carefully study applicability of those methods to NAS, with a special focus on the relationship between task-level correlation (domain shift) and predictor transferability; which we deem critical for improving NAS on diverse tasks. In our experiments, we use 6 NAS benchmarks in conjunction, spanning in total 16 NAS settings -- our meta-learning approach not only shows superior (or matching) performance in the cross-validation experiments but also successful extrapolation to a new search space and tasks.
    HOT: Higher-Order Dynamic Graph Representation Learning with Efficient Transformers. (arXiv:2311.18526v1 [cs.LG])
    Many graph representation learning (GRL) problems are dynamic, with millions of edges added or removed per second. A fundamental workload in this setting is dynamic link prediction: using a history of graph updates to predict whether a given pair of vertices will become connected. Recent schemes for link prediction in such dynamic settings employ Transformers, modeling individual graph updates as single tokens. In this work, we propose HOT: a model that enhances this line of works by harnessing higher-order (HO) graph structures; specifically, k-hop neighbors and more general subgraphs containing a given pair of vertices. Harnessing such HO structures by encoding them into the attention matrix of the underlying Transformer results in higher accuracy of link prediction outcomes, but at the expense of increased memory pressure. To alleviate this, we resort to a recent class of schemes that impose hierarchy on the attention matrix, significantly reducing memory footprint. The final design offers a sweetspot between high accuracy and low memory utilization. HOT outperforms other dynamic GRL schemes, for example achieving 9%, 7%, and 15% higher accuracy than - respectively - DyGFormer, TGN, and GraphMixer, for the MOOC dataset. Our design can be seamlessly extended towards other dynamic GRL workloads.
    Data-Agnostic Model Poisoning against Federated Learning: A Graph Autoencoder Approach. (arXiv:2311.18498v1 [cs.LG])
    This paper proposes a novel, data-agnostic, model poisoning attack on Federated Learning (FL), by designing a new adversarial graph autoencoder (GAE)-based framework. The attack requires no knowledge of FL training data and achieves both effectiveness and undetectability. By listening to the benign local models and the global model, the attacker extracts the graph structural correlations among the benign local models and the training data features substantiating the models. The attacker then adversarially regenerates the graph structural correlations while maximizing the FL training loss, and subsequently generates malicious local models using the adversarial graph structure and the training data features of the benign ones. A new algorithm is designed to iteratively train the malicious local models using GAE and sub-gradient descent. The convergence of FL under attack is rigorously proved, with a considerably large optimality gap. Experiments show that the FL accuracy drops gradually under the proposed attack and existing defense mechanisms fail to detect it. The attack can give rise to an infection across all benign devices, making it a serious threat to FL.
    Positional Information Matters for Invariant In-Context Learning: A Case Study of Simple Function Classes. (arXiv:2311.18194v1 [cs.LG])
    In-context learning (ICL) refers to the ability of a model to condition on a few in-context demonstrations (input-output examples of the underlying task) to generate the answer for a new query input, without updating parameters. Despite the impressive ICL ability of LLMs, it has also been found that ICL in LLMs is sensitive to input demonstrations and limited to short context lengths. To understand the limitations and principles for successful ICL, we conduct an investigation with ICL linear regression of transformers. We characterize several Out-of-Distribution (OOD) cases for ICL inspired by realistic LLM ICL failures and compare transformers with DeepSet, a simple yet powerful architecture for ICL. Surprisingly, DeepSet outperforms transformers across a variety of distribution shifts, implying that preserving permutation invariance symmetry to input demonstrations is crucial for OOD ICL. The phenomenon specifies a fundamental requirement by ICL, which we termed as ICL invariance. Nevertheless, the positional encodings in LLMs will break ICL invariance. To this end, we further evaluate transformers with identical positional encodings and find preserving ICL invariance in transformers achieves state-of-the-art performance across various ICL distribution shifts
    QuadraNet: Improving High-Order Neural Interaction Efficiency with Hardware-Aware Quadratic Neural Networks. (arXiv:2311.17956v1 [cs.LG])
    Recent progress in computer vision-oriented neural network designs is mostly driven by capturing high-order neural interactions among inputs and features. And there emerged a variety of approaches to accomplish this, such as Transformers and its variants. However, these interactions generate a large amount of intermediate state and/or strong data dependency, leading to considerable memory consumption and computing cost, and therefore compromising the overall runtime performance. To address this challenge, we rethink the high-order interactive neural network design with a quadratic computing approach. Specifically, we propose QuadraNet -- a comprehensive model design methodology from neuron reconstruction to structural block and eventually to the overall neural network implementation. Leveraging quadratic neurons' intrinsic high-order advantages and dedicated computation optimization schemes, QuadraNet could effectively achieve optimal cognition and computation performance. Incorporating state-of-the-art hardware-aware neural architecture search and system integration techniques, QuadraNet could also be well generalized in different hardware constraint settings and deployment scenarios. The experiment shows thatQuadraNet achieves up to 1.5$\times$ throughput, 30% less memory footprint, and similar cognition performance, compared with the state-of-the-art high-order approaches.
    MarkovGen: Structured Prediction for Efficient Text-to-Image Generation. (arXiv:2308.10997v2 [cs.CV] UPDATED)
    Modern text-to-image generation models produce high-quality images that are both photorealistic and faithful to the text prompts. However, this quality comes at significant computational cost: nearly all of these models are iterative and require running sampling multiple times with large models. This iterative process is needed to ensure that different regions of the image are not only aligned with the text prompt, but also compatible with each other. In this work, we propose a light-weight approach to achieving this compatibility between different regions of an image, using a Markov Random Field (MRF) model. We demonstrate the effectiveness of this method on top of the latent token-based Muse text-to-image model. The MRF richly encodes the compatibility among image tokens at different spatial locations to improve quality and significantly reduce the required number of Muse sampling steps. Inference with the MRF is significantly cheaper, and its parameters can be quickly learned through back-propagation by modeling MRF inference as a differentiable neural-network layer. Our full model, MarkovGen, uses this proposed MRF model to both speed up Muse by 1.5X and produce higher quality images by decreasing undesirable image artifacts.
    Convergence Analysis of Fractional Gradient Descent. (arXiv:2311.18426v1 [math.OC])
    Fractional derivatives are a well-studied generalization of integer order derivatives. Naturally, for optimization, it is of interest to understand the convergence properties of gradient descent using fractional derivatives. Convergence analysis of fractional gradient descent is currently limited both in the methods analyzed and the settings analyzed. This paper aims to fill in these gaps by analyzing variations of fractional gradient descent in smooth and convex, smooth and strongly convex, and smooth and non-convex settings. First, novel bounds will be established bridging fractional and integer derivatives. Then, these bounds will be applied to the aforementioned settings to prove $O(1/T)$ convergence for smooth and convex functions and linear convergence for smooth and strongly convex functions. Additionally, we prove $O(1/T)$ convergence for smooth and non-convex functions using an extended notion of smoothness that is more natural for fractional derivatives. Finally, empirical results will be presented on the potential speed up of fractional gradient descent over standard gradient descent as well as the challenges of predicting which will be faster in general.
    New Online Communities: Graph Deep Learning on Anonymous Voting Networks to Identify Sybils in Polycentric Governance. (arXiv:2311.17929v1 [cs.LG])
    This research examines the polycentric governance of digital assets in Decentralized Autonomous Organizations (DAOs). It offers a theoretical framework and addresses a critical challenge facing decentralized governance by developing a method to identify sybils, or spurious identities. The method uses graph deep learning techniques to identify sybil activity in a DAO governance dataset (snapshot.org). Specifically, a Graph Convolutional Neural Network (GCNN) learned voting behaviours and a fast k-means vector clustering algorithm (FAISS) used the high dimensional embeddings to identify similar nodes in a graph. The results reveal that deep learning can effectively identify sybils, reducing the voting graph by 2-5%. This research underscores the importance of sybil resistance in DAOs and offers a novel perspective on decentralized governance, informing future policy, regulation, and governance practices.
    Combined Scheduling, Memory Allocation and Tensor Replacement for Minimizing Off-Chip Data Accesses of DNN Accelerators. (arXiv:2311.18246v1 [cs.LG])
    Specialized hardware accelerators have been extensively used for Deep Neural Networks (DNNs) to provide power/performance benefits. These accelerators contain specialized hardware that supports DNN operators, and scratchpad memory for storing the tensor operands. Often, the size of the scratchpad is insufficient to store all the tensors needed for the computation, and additional data accesses are needed to move tensors back and forth from host memory during the computation with significant power/performance overhead. The volume of these additional data accesses depends on the operator schedule, and memory allocation (specific locations selected for the tensors in the scratchpad). We propose an optimization framework, named COSMA, for mapping DNNs to an accelerator that finds the optimal operator schedule, memory allocation and tensor replacement that minimizes the additional data accesses. COSMA provides an Integer Linear Programming (ILP) formulation to generate the optimal solution for mapping a DNN to the accelerator for a given scratchpad size. We demonstrate that, using an off-the-shelf ILP solver, COSMA obtains the optimal solution in seconds for a wide-range of state-of-the-art DNNs for different applications. Further, it out-performs existing methods by reducing on average 84% of the non-compulsory data accesses. We further propose a divide-and-conquer heuristic to scale up to certain complex DNNs generated by Neural Architecture Search, and this heuristic solution reduces on average 85% data accesses compared with other works.
    Probabilistic Speech-Driven 3D Facial Motion Synthesis: New Benchmarks, Methods, and Applications. (arXiv:2311.18168v1 [cs.CV])
    We consider the task of animating 3D facial geometry from speech signal. Existing works are primarily deterministic, focusing on learning a one-to-one mapping from speech signal to 3D face meshes on small datasets with limited speakers. While these models can achieve high-quality lip articulation for speakers in the training set, they are unable to capture the full and diverse distribution of 3D facial motions that accompany speech in the real world. Importantly, the relationship between speech and facial motion is one-to-many, containing both inter-speaker and intra-speaker variations and necessitating a probabilistic approach. In this paper, we identify and address key challenges that have so far limited the development of probabilistic models: lack of datasets and metrics that are suitable for training and evaluating them, as well as the difficulty of designing a model that generates diverse results while remaining faithful to a strong conditioning signal as speech. We first propose large-scale benchmark datasets and metrics suitable for probabilistic modeling. Then, we demonstrate a probabilistic model that achieves both diversity and fidelity to speech, outperforming other methods across the proposed benchmarks. Finally, we showcase useful applications of probabilistic models trained on these large-scale datasets: we can generate diverse speech-driven 3D facial motion that matches unseen speaker styles extracted from reference clips; and our synthetic meshes can be used to improve the performance of downstream audio-visual models.
    Exploring the Carbon Footprint of Hugging Face's ML Models: A Repository Mining Study. (arXiv:2305.11164v3 [cs.LG] UPDATED)
    The rise of machine learning (ML) systems has exacerbated their carbon footprint due to increased capabilities and model sizes. However, there is scarce knowledge on how the carbon footprint of ML models is actually measured, reported, and evaluated. In light of this, the paper aims to analyze the measurement of the carbon footprint of 1,417 ML models and associated datasets on Hugging Face, which is the most popular repository for pretrained ML models. The goal is to provide insights and recommendations on how to report and optimize the carbon efficiency of ML models. The study includes the first repository mining study on the Hugging Face Hub API on carbon emissions. This study seeks to answer two research questions: (1) how do ML model creators measure and report carbon emissions on Hugging Face Hub?, and (2) what aspects impact the carbon emissions of training ML models? The study yielded several key findings. These include a stalled proportion of carbon emissions-reporting models, a slight decrease in reported carbon footprint on Hugging Face over the past 2 years, and a continued dominance of NLP as the main application domain. Furthermore, the study uncovers correlations between carbon emissions and various attributes such as model size, dataset size, and ML application domains. These results highlight the need for software measurements to improve energy reporting practices and promote carbon-efficient model development within the Hugging Face community. In response to this issue, two classifications are proposed: one for categorizing models based on their carbon emission reporting practices and another for their carbon efficiency. The aim of these classification proposals is to foster transparency and sustainable model development within the ML community.
    Targeted Reduction of Causal Models. (arXiv:2311.18639v1 [stat.ML])
    Why does a phenomenon occur? Addressing this question is central to most scientific inquiries based on empirical observations, and often heavily relies on simulations of scientific models. As models become more intricate, deciphering the causes behind these phenomena in high-dimensional spaces of interconnected variables becomes increasingly challenging. Causal machine learning may assist scientists in the discovery of relevant and interpretable patterns of causation in simulations. We introduce Targeted Causal Reduction (TCR), a method for turning complex models into a concise set of causal factors that explain a specific target phenomenon. We derive an information theoretic objective to learn TCR from interventional data or simulations and propose algorithms to optimize this objective efficiently. TCR's ability to generate interpretable high-level explanations from complex models is demonstrated on toy and mechanical systems, illustrating its potential to assist scientists in the study of complex phenomena in a broad range of disciplines.
    Steering Deep Feature Learning with Backward Aligned Feature Updates. (arXiv:2311.18718v1 [cs.LG])
    Deep learning succeeds by doing hierarchical feature learning, yet tuning Hyper-Parameters (HP) such as initialization scales, learning rates etc., only give indirect control over this behavior. In this paper, we propose the alignment between the feature updates and the backward pass as a key notion to predict, measure and control feature learning. On the one hand, we show that when alignment holds, the magnitude of feature updates after one SGD step is related to the magnitude of the forward and backward passes by a simple and general formula. This leads to techniques to automatically adjust HPs (initialization scales and learning rates) at initialization and throughout training to attain a desired feature learning behavior. On the other hand, we show that, at random initialization, this alignment is determined by the spectrum of a certain kernel, and that well-conditioned layer-to-layer Jacobians (aka dynamical isometry) implies alignment. Finally, we investigate ReLU MLPs and ResNets in the large width-then-depth limit. Combining hints from random matrix theory and numerical experiments, we show that (i) in MLP with iid initializations, alignment degenerates with depth, making it impossible to start training, and that (ii) in ResNets, the branch scale $1/\sqrt{\text{depth}}$ is the only one maintaining non-trivial alignment at infinite depth.
    Generating Molecular Conformer Fields. (arXiv:2311.17932v1 [physics.chem-ph])
    In this paper we tackle the problem of generating conformers of a molecule in 3D space given its molecular graph. We parameterize these conformers as continuous functions that map elements from the molecular graph to points in 3D space. We then formulate the problem of learning to generate conformers as learning a distribution over these functions using a diffusion generative model, called Molecular Conformer Fields (MCF). Our approach is simple and scalable, and achieves state-of-the-art performance on challenging molecular conformer generation benchmarks while making no assumptions about the explicit structure of molecules (e.g. modeling torsional angles). MCF represents an advance in extending diffusion models to handle complex scientific problems in a conceptually simple, scalable and effective manner.
    Less is More -- Towards parsimonious multi-task models using structured sparsity. (arXiv:2308.12114v3 [cs.CV] UPDATED)
    Model sparsification in deep learning promotes simpler, more interpretable models with fewer parameters. This not only reduces the model's memory footprint and computational needs but also shortens inference time. This work focuses on creating sparse models optimized for multiple tasks with fewer parameters. These parsimonious models also possess the potential to match or outperform dense models in terms of performance. In this work, we introduce channel-wise l1/l2 group sparsity in the shared convolutional layers parameters (or weights) of the multi-task learning model. This approach facilitates the removal of extraneous groups i.e., channels (due to l1 regularization) and also imposes a penalty on the weights, further enhancing the learning efficiency for all tasks (due to l2 regularization). We analyzed the results of group sparsity in both single-task and multi-task settings on two widely-used Multi-Task Learning (MTL) datasets: NYU-v2 and CelebAMask-HQ. On both datasets, which consist of three different computer vision tasks each, multi-task models with approximately 70% sparsity outperform their dense equivalents. We also investigate how changing the degree of sparsification influences the model's performance, the overall sparsity percentage, the patterns of sparsity, and the inference time.
    A Survey of Learning Criteria Going Beyond the Usual Risk. (arXiv:2110.04996v3 [stat.ML] UPDATED)
    Virtually all machine learning tasks are characterized using some form of loss function, and "good performance" is typically stated in terms of a sufficiently small average loss, taken over the random draw of test data. While optimizing for performance on average is intuitive, convenient to analyze in theory, and easy to implement in practice, such a choice brings about trade-offs. In this work, we survey and introduce a wide variety of non-traditional criteria used to design and evaluate machine learning algorithms, place the classical paradigm within the proper historical context, and propose a view of learning problems which emphasizes the question of "what makes for a desirable loss distribution?" in place of tacit use of the expected loss.
    Optimizing ZX-Diagrams with Deep Reinforcement Learning. (arXiv:2311.18588v1 [quant-ph])
    ZX-diagrams are a powerful graphical language for the description of quantum processes with applications in fundamental quantum mechanics, quantum circuit optimization, tensor network simulation, and many more. The utility of ZX-diagrams relies on a set of local transformation rules that can be applied to them without changing the underlying quantum process they describe. These rules can be exploited to optimize the structure of ZX-diagrams for a range of applications. However, finding an optimal sequence of transformation rules is generally an open problem. In this work, we bring together ZX-diagrams with reinforcement learning, a machine learning technique designed to discover an optimal sequence of actions in a decision-making problem and show that a trained reinforcement learning agent can significantly outperform other optimization techniques like a greedy strategy or simulated annealing. The use of graph neural networks to encode the policy of the agent enables generalization to diagrams much bigger than seen during the training phase.
    SCOPE-RL: A Python Library for Offline Reinforcement Learning and Off-Policy Evaluation. (arXiv:2311.18206v1 [cs.LG])
    This paper introduces SCOPE-RL, a comprehensive open-source Python software designed for offline reinforcement learning (offline RL), off-policy evaluation (OPE), and selection (OPS). Unlike most existing libraries that focus solely on either policy learning or evaluation, SCOPE-RL seamlessly integrates these two key aspects, facilitating flexible and complete implementations of both offline RL and OPE processes. SCOPE-RL put particular emphasis on its OPE modules, offering a range of OPE estimators and robust evaluation-of-OPE protocols. This approach enables more in-depth and reliable OPE compared to other packages. For instance, SCOPE-RL enhances OPE by estimating the entire reward distribution under a policy rather than its mere point-wise expected value. Additionally, SCOPE-RL provides a more thorough evaluation-of-OPE by presenting the risk-return tradeoff in OPE results, extending beyond mere accuracy evaluations in existing OPE literature. SCOPE-RL is designed with user accessibility in mind. Its user-friendly APIs, comprehensive documentation, and a variety of easy-to-follow examples assist researchers and practitioners in efficiently implementing and experimenting with various offline RL methods and OPE estimators, tailored to their specific problem contexts. The documentation of SCOPE-RL is available at https://scope-rl.readthedocs.io/en/latest/.
    Hessian-Aware Bayesian Optimization for Decision Making Systems. (arXiv:2308.00629v3 [cs.LG] UPDATED)
    Many approaches for optimizing decision making systems rely on gradient based methods requiring informative feedback from the environment. However, in the case where such feedback is sparse or uninformative, such approaches may result in poor performance. Derivative-free approaches such as Bayesian Optimization mitigate the dependency on the quality of gradient feedback, but are known to scale poorly in the high-dimension setting of complex decision making systems. This problem is exacerbated if the system requires interactions between several actors cooperating to accomplish a shared goal. To address the dimensionality challenge, we propose a compact multi-layered architecture modeling the dynamics of actor interactions through the concept of role. Additionally, we introduce Hessian-aware Bayesian Optimization to efficiently optimize the multi-layered architecture parameterized by a large number of parameters. Experimental results demonstrate that our method (HA-GP-UCB) works effectively on several benchmarks under resource constraints and malformed feedback settings.
    On Double Descent in Reinforcement Learning with LSTD and Random Features. (arXiv:2310.05518v3 [cs.LG] UPDATED)
    Temporal Difference (TD) algorithms are widely used in Deep Reinforcement Learning (RL). Their performance is heavily influenced by the size of the neural network. While in supervised learning, the regime of over-parameterization and its benefits are well understood, the situation in RL is much less clear. In this paper, we present a theoretical analysis of the influence of network size and $l_2$-regularization on performance. We identify the ratio between the number of parameters and the number of visited states as a crucial factor and define over-parameterization as the regime when it is larger than one. Furthermore, we observe a double descent phenomenon, i.e., a sudden drop in performance around the parameter/state ratio of one. Leveraging random features and the lazy training regime, we study the regularized Least-Square Temporal Difference (LSTD) algorithm in an asymptotic regime, as both the number of parameters and states go to infinity, maintaining a constant ratio. We derive deterministic limits of both the empirical and the true Mean-Square Bellman Error (MSBE) that feature correction terms responsible for the double-descent. Correction terms vanish when the $l_2$-regularization is increased or the number of unvisited states goes to zero. Numerical experiments with synthetic and small real-world environments closely match the theoretical predictions.
    Defining Reference Sequences for Nocardia Species by Similarity and Clustering Analyses of 16S rRNA Gene Sequence Data. (arXiv:2311.17965v1 [q-bio.GN])
    The intra- and inter-species genetic diversity of bacteria and the absence of 'reference', or the most representative, sequences of individual species present a significant challenge for sequence-based identification. The aims of this study were to determine the utility, and compare the performance of several clustering and classification algorithms to identify the species of 364 sequences of 16S rRNA gene with a defined species in GenBank, and 110 sequences of 16S rRNA gene with no defined species, all within the genus Nocardia. A total of 364 16S rRNA gene sequences of Nocardia species were studied. In addition, 110 16S rRNA gene sequences assigned only to the Nocardia genus level at the time of submission to GenBank were used for machine learning classification experiments. Different clustering algorithms were compared with a novel algorithm or the linear mapping (LM) of the distance matrix. Principal Components Analysis was used for the dimensionality reduction and visualization. Results: The LM algorithm achieved the highest performance and classified the set of 364 16S rRNA sequences into 80 clusters, the majority of which (83.52%) corresponded with the original species. The most representative 16S rRNA sequences for individual Nocardia species have been identified as 'centroids' in respective clusters from which the distances to all other sequences were minimized; 110 16S rRNA gene sequences with identifications recorded only at the genus level were classified using machine learning methods. Simple kNN machine learning demonstrated the highest performance and classified Nocardia species sequences with an accuracy of 92.7% and a mean frequency of 0.578.
    ANPL: Towards Natural Programming with Interactive Decomposition. (arXiv:2305.18498v2 [cs.PL] UPDATED)
    Though LLMs are capable of generating plausible programs, it's challenging to interact with the LLMs further to revise the program, especially if the user's specific requirements are different from the initial proposal. In this paper, we introduce ANPL, an interactive programming system that ensures users can always refine the generated code towards their specific programmatic intents via structured decompositions. Borrowing the paradigm of sketching from program synthesis, an ANPL program consists of a set of input-outputs that it must satisfy, a ``sketch'' -- control/data flow expressed in precise code (e.g. Python), and ``holes'' -- sub-modules to be implemented by the LLM specified with natural language. The user revises an ANPL program by either modifying the sketch, changing the language used to describe the holes, or providing additional input-outputs to a particular hole, turning it into a sub-ANPL program that can be solved recursively. This workflow allows the users to offload programming burdens to the LLM as much as possible while retaining the ability to pinpoint and resolve bugs locally, without exposing the rest of the program to the LLM. We deploy ANPL on the Abstraction and Reasoning Corpus (ARC), a set of unique tasks that are challenging for state-of-the-art AI systems, showing it outperforms baseline programming systems that (a) without the ability to decompose tasks interactively and (b) without the guarantee that the modules can be correctly composed together. Additional evaluations on APPS, HumanEval, and real-world programming tasks have validated that the ANPL framework is applicable to multiple programming domains. We release the ANPL solutions to the ARC tasks as a dataset, providing insights into how humans decompose novel tasks programmatically. See our code at https://iprc-dip.github.io/ANPL/.
    DKiS: Decay weight invertible image steganography with private key. (arXiv:2311.18243v1 [cs.MM])
    Image steganography, the practice of concealing information within another image, traditionally faces security challenges when its methods become publicly known. To counteract this, we introduce a novel private key-based image steganography technique. This approach ensures the security of hidden information, requiring a corresponding private key for access, irrespective of the public knowledge of the steganography method. We present experimental evidence demonstrating our method's effectiveness, showcasing its real-world applicability. Additionally, we identified a critical challenge in the invertible image steganography process: the transfer of non-essential, or `garbage', information from the secret to the host pipeline. To address this, we introduced the decay weight to control the information transfer, filtering out irrelevant data and enhancing the performance of image steganography. Our code is publicly accessible at https://github.com/yanghangAI/DKiS, and a practical demonstration is available at this http URL
  • Open

    Fair Community Detection and Structure Learning in Heterogeneous Graphical Models. (arXiv:2112.05128v2 [stat.ML] UPDATED)
    Inference of community structure in probabilistic graphical models may not be consistent with fairness constraints when nodes have demographic attributes. Certain demographics may be over-represented in some detected communities and under-represented in others. This paper defines a novel $\ell_1$-regularized pseudo-likelihood approach for fair graphical model selection. In particular, we assume there is some community or clustering structure in the true underlying graph, and we seek to learn a sparse undirected graph and its communities from the data such that demographic groups are fairly represented within the communities. In the case when the graph is known a priori, we provide a convex semidefinite programming approach for fair community detection. We establish the statistical consistency of the proposed method for both a Gaussian graphical model and an Ising model for, respectively, continuous and binary data, proving that our method can recover the graphs and their fair communities with high probability.
    Targeted Reduction of Causal Models. (arXiv:2311.18639v1 [stat.ML])
    Why does a phenomenon occur? Addressing this question is central to most scientific inquiries based on empirical observations, and often heavily relies on simulations of scientific models. As models become more intricate, deciphering the causes behind these phenomena in high-dimensional spaces of interconnected variables becomes increasingly challenging. Causal machine learning may assist scientists in the discovery of relevant and interpretable patterns of causation in simulations. We introduce Targeted Causal Reduction (TCR), a method for turning complex models into a concise set of causal factors that explain a specific target phenomenon. We derive an information theoretic objective to learn TCR from interventional data or simulations and propose algorithms to optimize this objective efficiently. TCR's ability to generate interpretable high-level explanations from complex models is demonstrated on toy and mechanical systems, illustrating its potential to assist scientists in the study of complex phenomena in a broad range of disciplines.
    On Double Descent in Reinforcement Learning with LSTD and Random Features. (arXiv:2310.05518v3 [cs.LG] UPDATED)
    Temporal Difference (TD) algorithms are widely used in Deep Reinforcement Learning (RL). Their performance is heavily influenced by the size of the neural network. While in supervised learning, the regime of over-parameterization and its benefits are well understood, the situation in RL is much less clear. In this paper, we present a theoretical analysis of the influence of network size and $l_2$-regularization on performance. We identify the ratio between the number of parameters and the number of visited states as a crucial factor and define over-parameterization as the regime when it is larger than one. Furthermore, we observe a double descent phenomenon, i.e., a sudden drop in performance around the parameter/state ratio of one. Leveraging random features and the lazy training regime, we study the regularized Least-Square Temporal Difference (LSTD) algorithm in an asymptotic regime, as both the number of parameters and states go to infinity, maintaining a constant ratio. We derive deterministic limits of both the empirical and the true Mean-Square Bellman Error (MSBE) that feature correction terms responsible for the double-descent. Correction terms vanish when the $l_2$-regularization is increased or the number of unvisited states goes to zero. Numerical experiments with synthetic and small real-world environments closely match the theoretical predictions.
    Exploring the Carbon Footprint of Hugging Face's ML Models: A Repository Mining Study. (arXiv:2305.11164v3 [cs.LG] UPDATED)
    The rise of machine learning (ML) systems has exacerbated their carbon footprint due to increased capabilities and model sizes. However, there is scarce knowledge on how the carbon footprint of ML models is actually measured, reported, and evaluated. In light of this, the paper aims to analyze the measurement of the carbon footprint of 1,417 ML models and associated datasets on Hugging Face, which is the most popular repository for pretrained ML models. The goal is to provide insights and recommendations on how to report and optimize the carbon efficiency of ML models. The study includes the first repository mining study on the Hugging Face Hub API on carbon emissions. This study seeks to answer two research questions: (1) how do ML model creators measure and report carbon emissions on Hugging Face Hub?, and (2) what aspects impact the carbon emissions of training ML models? The study yielded several key findings. These include a stalled proportion of carbon emissions-reporting models, a slight decrease in reported carbon footprint on Hugging Face over the past 2 years, and a continued dominance of NLP as the main application domain. Furthermore, the study uncovers correlations between carbon emissions and various attributes such as model size, dataset size, and ML application domains. These results highlight the need for software measurements to improve energy reporting practices and promote carbon-efficient model development within the Hugging Face community. In response to this issue, two classifications are proposed: one for categorizing models based on their carbon emission reporting practices and another for their carbon efficiency. The aim of these classification proposals is to foster transparency and sustainable model development within the ML community.
    Geometry-Aware Normalizing Wasserstein Flows for Optimal Causal Inference. (arXiv:2311.18826v1 [cs.LG])
    This manuscript enriches the framework of continuous normalizing flows (CNFs) within causal inference, primarily to augment the geometric properties of parametric submodels used in targeted maximum likelihood estimation (TMLE). By introducing an innovative application of CNFs, we construct a refined series of parametric submodels that enable a directed interpolation between the prior distribution $p_0$ and the empirical distribution $p_1$. This proposed methodology serves to optimize the semiparametric efficiency bound in causal inference by orchestrating CNFs to align with Wasserstein gradient flows. Our approach not only endeavors to minimize the mean squared error in the estimation but also imbues the estimators with geometric sophistication, thereby enhancing robustness against misspecification. This robustness is crucial, as it alleviates the dependence on the standard $n^{\frac{1}{4}}$ rate for a doubly-robust perturbation direction in TMLE. By incorporating robust optimization principles and differential geometry into the estimators, the developed geometry-aware CNFs represent a significant advancement in the pursuit of doubly robust causal inference.
    Bayesian CART models for insurance claims frequency. (arXiv:2303.01923v2 [stat.ML] UPDATED)
    Accuracy and interpretability of a (non-life) insurance pricing model are essential qualities to ensure fair and transparent premiums for policy-holders, that reflect their risk. In recent years, the classification and regression trees (CARTs) and their ensembles have gained popularity in the actuarial literature, since they offer good prediction performance and are relatively easily interpretable. In this paper, we introduce Bayesian CART models for insurance pricing, with a particular focus on claims frequency modelling. Additionally to the common Poisson and negative binomial (NB) distributions used for claims frequency, we implement Bayesian CART for the zero-inflated Poisson (ZIP) distribution to address the difficulty arising from the imbalanced insurance claims data. To this end, we introduce a general MCMC algorithm using data augmentation methods for posterior tree exploration. We also introduce the deviance information criterion (DIC) for the tree model selection. The proposed models are able to identify trees which can better classify the policy-holders into risk groups. Some simulations and real insurance data will be discussed to illustrate the applicability of these models.
    Towards Comparable Active Learning. (arXiv:2311.18356v1 [cs.LG])
    Active Learning has received significant attention in the field of machine learning for its potential in selecting the most informative samples for labeling, thereby reducing data annotation costs. However, we show that the reported lifts in recent literature generalize poorly to other domains leading to an inconclusive landscape in Active Learning research. Furthermore, we highlight overlooked problems for reproducing AL experiments that can lead to unfair comparisons and increased variance in the results. This paper addresses these issues by providing an Active Learning framework for a fair comparison of algorithms across different tasks and domains, as well as a fast and performant oracle algorithm for evaluation. To the best of our knowledge, we propose the first AL benchmark that tests algorithms in 3 major domains: Tabular, Image, and Text. We report empirical results for 6 widely used algorithms on 7 real-world and 2 synthetic datasets and aggregate them into a domain-specific ranking of AL algorithms.  ( 2 min )
    Online Change Points Detection for Linear Dynamical Systems with Finite Sample Guarantees. (arXiv:2311.18769v1 [cs.LG])
    The problem of online change point detection is to detect abrupt changes in properties of time series, ideally as soon as possible after those changes occur. Existing work on online change point detection either assumes i.i.d data, focuses on asymptotic analysis, does not present theoretical guarantees on the trade-off between detection accuracy and detection delay, or is only suitable for detecting single change points. In this work, we study the online change point detection problem for linear dynamical systems with unknown dynamics, where the data exhibits temporal correlations and the system could have multiple change points. We develop a data-dependent threshold that can be used in our test that allows one to achieve a pre-specified upper bound on the probability of making a false alarm. We further provide a finite-sample-based bound for the probability of detecting a change point. Our bound demonstrates how parameters used in our algorithm affect the detection probability and delay, and provides guidance on the minimum required time between changes to guarantee detection.  ( 2 min )
    Balancing Summarization and Change Detection in Graph Streams. (arXiv:2311.18694v1 [stat.ML])
    This study addresses the issue of balancing graph summarization and graph change detection. Graph summarization compresses large-scale graphs into a smaller scale. However, the question remains: To what extent should the original graph be compressed? This problem is solved from the perspective of graph change detection, aiming to detect statistically significant changes using a stream of summary graphs. If the compression rate is extremely high, important changes can be ignored, whereas if the compression rate is extremely low, false alarms may increase with more memory. This implies that there is a trade-off between compression rate in graph summarization and accuracy in change detection. We propose a novel quantitative methodology to balance this trade-off to simultaneously realize reliable graph summarization and change detection. We introduce a probabilistic structure of hierarchical latent variable model into a graph, thereby designing a parameterized summary graph on the basis of the minimum description length principle. The parameter specifying the summary graph is then optimized so that the accuracy of change detection is guaranteed to suppress Type I error probability (probability of raising false alarms) to be less than a given confidence level. First, we provide a theoretical framework for connecting graph summarization with change detection. Then, we empirically demonstrate its effectiveness on synthetic and real datasets.  ( 2 min )
    Fantastic Weights and How to Find Them: Where to Prune in Dynamic Sparse Training. (arXiv:2306.12230v2 [cs.LG] UPDATED)
    Dynamic Sparse Training (DST) is a rapidly evolving area of research that seeks to optimize the sparse initialization of a neural network by adapting its topology during training. It has been shown that under specific conditions, DST is able to outperform dense models. The key components of this framework are the pruning and growing criteria, which are repeatedly applied during the training process to adjust the network's sparse connectivity. While the growing criterion's impact on DST performance is relatively well studied, the influence of the pruning criterion remains overlooked. To address this issue, we design and perform an extensive empirical analysis of various pruning criteria to better understand their impact on the dynamics of DST solutions. Surprisingly, we find that most of the studied methods yield similar results. The differences become more significant in the low-density regime, where the best performance is predominantly given by the simplest technique: magnitude-based pruning. The code is provided at https://github.com/alooow/fantastic_weights_paper  ( 2 min )
    A Comparison Between Invariant and Equivariant Classical and Quantum Graph Neural Networks. (arXiv:2311.18672v1 [quant-ph])
    Machine learning algorithms are heavily relied on to understand the vast amounts of data from high-energy particle collisions at the CERN Large Hadron Collider (LHC). The data from such collision events can naturally be represented with graph structures. Therefore, deep geometric methods, such as graph neural networks (GNNs), have been leveraged for various data analysis tasks in high-energy physics. One typical task is jet tagging, where jets are viewed as point clouds with distinct features and edge connections between their constituent particles. The increasing size and complexity of the LHC particle datasets, as well as the computational models used for their analysis, greatly motivate the development of alternative fast and efficient computational paradigms such as quantum computation. In addition, to enhance the validity and robustness of deep networks, one can leverage the fundamental symmetries present in the data through the use of invariant inputs and equivariant layers. In this paper, we perform a fair and comprehensive comparison between classical graph neural networks (GNNs) and equivariant graph neural networks (EGNNs) and their quantum counterparts: quantum graph neural networks (QGNNs) and equivariant quantum graph neural networks (EQGNN). The four architectures were benchmarked on a binary classification task to classify the parton-level particle initiating the jet. Based on their AUC scores, the quantum networks were shown to outperform the classical networks. However, seeing the computational advantage of the quantum networks in practice may have to wait for the further development of quantum technology and its associated APIs.  ( 3 min )
    Discovering Communication Pattern Shifts in Large-Scale Labeled Networks using Encoder Embedding and Vertex Dynamics. (arXiv:2305.02381v2 [cs.SI] UPDATED)
    Analyzing large-scale time-series network data, such as social media and email communications, poses a significant challenge in understanding social dynamics, detecting anomalies, and predicting trends. In particular, the scalability of graph analysis is a critical hurdle impeding progress in large-scale downstream inference. To address this challenge, we introduce a temporal encoder embedding method. This approach leverages ground-truth or estimated vertex labels, enabling an efficient embedding of large-scale graph data and the processing of billions of edges within minutes. Furthermore, this embedding unveils a temporal dynamic statistic capable of detecting communication pattern shifts across all levels, ranging from individual vertices to vertex communities and the overall graph structure. We provide theoretical support to confirm its soundness under random graph models, and demonstrate its numerical advantages in capturing evolving communities and identifying outliers. Finally, we showcase the practical application of our approach by analyzing an anonymized time-series communication network from a large organization spanning 2019-2020, enabling us to assess the impact of Covid-19 on workplace communication patterns.  ( 3 min )
    On Sparse Modern Hopfield Model. (arXiv:2309.12673v2 [cs.LG] UPDATED)
    We introduce the sparse modern Hopfield model as a sparse extension of the modern Hopfield model. Like its dense counterpart, the sparse modern Hopfield model equips a memory-retrieval dynamics whose one-step approximation corresponds to the sparse attention mechanism. Theoretically, our key contribution is a principled derivation of a closed-form sparse Hopfield energy using the convex conjugate of the sparse entropic regularizer. Building upon this, we derive the sparse memory retrieval dynamics from the sparse energy function and show its one-step approximation is equivalent to the sparse-structured attention. Importantly, we provide a sparsity-dependent memory retrieval error bound which is provably tighter than its dense analog. The conditions for the benefits of sparsity to arise are therefore identified and discussed. In addition, we show that the sparse modern Hopfield model maintains the robust theoretical properties of its dense counterpart, including rapid fixed point convergence and exponential memory capacity. Empirically, we use both synthetic and real-world datasets to demonstrate that the sparse Hopfield model outperforms its dense counterpart in many situations.  ( 2 min )
    AI in Pharma for Personalized Sequential Decision-Making: Methods, Applications and Opportunities. (arXiv:2311.18725v1 [stat.ME])
    In the pharmaceutical industry, the use of artificial intelligence (AI) has seen consistent growth over the past decade. This rise is attributed to major advancements in statistical machine learning methodologies, computational capabilities and the increased availability of large datasets. AI techniques are applied throughout different stages of drug development, ranging from drug discovery to post-marketing benefit-risk assessment. Kolluri et al. provided a review of several case studies that span these stages, featuring key applications such as protein structure prediction, success probability estimation, subgroup identification, and AI-assisted clinical trial monitoring. From a regulatory standpoint, there was a notable uptick in submissions incorporating AI components in 2021. The most prevalent therapeutic areas leveraging AI were oncology (27%), psychiatry (15%), gastroenterology (12%), and neurology (11%). The paradigm of personalized or precision medicine has gained significant traction in recent research, partly due to advancements in AI techniques \cite{hamburg2010path}. This shift has had a transformative impact on the pharmaceutical industry. Departing from the traditional "one-size-fits-all" model, personalized medicine incorporates various individual factors, such as environmental conditions, lifestyle choices, and health histories, to formulate customized treatment plans. By utilizing sophisticated machine learning algorithms, clinicians and researchers are better equipped to make informed decisions in areas such as disease prevention, diagnosis, and treatment selection, thereby optimizing health outcomes for each individual.  ( 2 min )
    The Sliding Regret in Stochastic Bandits: Discriminating Index and Randomized Policies. (arXiv:2311.18437v1 [cs.LG])
    This paper studies the one-shot behavior of no-regret algorithms for stochastic bandits. Although many algorithms are known to be asymptotically optimal with respect to the expected regret, over a single run, their pseudo-regret seems to follow one of two tendencies: it is either smooth or bumpy. To measure this tendency, we introduce a new notion: the sliding regret, that measures the worst pseudo-regret over a time-window of fixed length sliding to infinity. We show that randomized methods (e.g. Thompson Sampling and MED) have optimal sliding regret, while index policies, although possibly asymptotically optimal for the expected regret, have the worst possible sliding regret under regularity conditions on their index (e.g. UCB, UCB-V, KL-UCB, MOSS, IMED etc.). We further analyze the average bumpiness of the pseudo-regret of index policies via the regret of exploration, that we show to be suboptimal as well.  ( 2 min )
    Active learning for data streams: a survey. (arXiv:2302.08893v4 [stat.ML] UPDATED)
    Online active learning is a paradigm in machine learning that aims to select the most informative data points to label from a data stream. The problem of minimizing the cost associated with collecting labeled observations has gained a lot of attention in recent years, particularly in real-world applications where data is only available in an unlabeled form. Annotating each observation can be time-consuming and costly, making it difficult to obtain large amounts of labeled data. To overcome this issue, many active learning strategies have been proposed in the last decades, aiming to select the most informative observations for labeling in order to improve the performance of machine learning models. These approaches can be broadly divided into two categories: static pool-based and stream-based active learning. Pool-based active learning involves selecting a subset of observations from a closed pool of unlabeled data, and it has been the focus of many surveys and literature reviews. However, the growing availability of data streams has led to an increase in the number of approaches that focus on online active learning, which involves continuously selecting and labeling observations as they arrive in a stream. This work aims to provide an overview of the most recently proposed approaches for selecting the most informative observations from data streams in real time. We review the various techniques that have been proposed and discuss their strengths and limitations, as well as the challenges and opportunities that exist in this area of research.  ( 3 min )
    Perturbation-based Analysis of Compositional Data. (arXiv:2311.18501v1 [stat.ME])
    Existing statistical methods for compositional data analysis are inadequate for many modern applications for two reasons. First, modern compositional datasets, for example in microbiome research, display traits such as high-dimensionality and sparsity that are poorly modelled with traditional approaches. Second, assessing -- in an unbiased way -- how summary statistics of a composition (e.g., racial diversity) affect a response variable is not straightforward. In this work, we propose a framework based on hypothetical data perturbations that addresses both issues. Unlike existing methods for compositional data, we do not transform the data and instead use perturbations to define interpretable statistical functionals on the compositions themselves, which we call average perturbation effects. These average perturbation effects, which can be employed in many applications, naturally account for confounding that biases frequently used marginal dependence analyses. We show how average perturbation effects can be estimated efficiently by deriving a perturbation-dependent reparametrization and applying semiparametric estimation techniques. We analyze the proposed estimators empirically on simulated data and demonstrate advantages over existing techniques on US census and microbiome data. For all proposed estimators, we provide confidence intervals with uniform asymptotic coverage guarantees.  ( 2 min )
    Global Convergence of Online Identification for Mixed Linear Regression. (arXiv:2311.18506v1 [stat.ML])
    Mixed linear regression (MLR) is a powerful model for characterizing nonlinear relationships by utilizing a mixture of linear regression sub-models. The identification of MLR is a fundamental problem, where most of the existing results focus on offline algorithms, rely on independent and identically distributed (i.i.d) data assumptions, and provide local convergence results only. This paper investigates the online identification and data clustering problems for two basic classes of MLRs, by introducing two corresponding new online identification algorithms based on the expectation-maximization (EM) principle. It is shown that both algorithms will converge globally without resorting to the traditional i.i.d data assumptions. The main challenge in our investigation lies in the fact that the gradient of the maximum likelihood function does not have a unique zero, and a key step in our analysis is to establish the stability of the corresponding differential equation in order to apply the celebrated Ljung's ODE method. It is also shown that the within-cluster error and the probability that the new data is categorized into the correct cluster are asymptotically the same as those in the case of known parameters. Finally, numerical simulations are provided to verify the effectiveness of our online algorithms.  ( 2 min )
    Can semi-supervised learning use all the data effectively? A lower bound perspective. (arXiv:2311.18557v1 [cs.LG])
    Prior works have shown that semi-supervised learning algorithms can leverage unlabeled data to improve over the labeled sample complexity of supervised learning (SL) algorithms. However, existing theoretical analyses focus on regimes where the unlabeled data is sufficient to learn a good decision boundary using unsupervised learning (UL) alone. This begs the question: Can SSL algorithms simultaneously improve upon both UL and SL? To this end, we derive a tight lower bound for 2-Gaussian mixture models that explicitly depends on the labeled and the unlabeled dataset size as well as the signal-to-noise ratio of the mixture distribution. Surprisingly, our result implies that no SSL algorithm can improve upon the minimax-optimal statistical error rates of SL or UL algorithms for these distributions. Nevertheless, we show empirically on real-world data that SSL algorithms can still outperform UL and SL methods. Therefore, our work suggests that, while proving performance gains for SSL algorithms is possible, it requires careful tracking of constants.  ( 2 min )
    Contextualized Policy Recovery: Modeling and Interpreting Medical Decisions with Adaptive Imitation Learning. (arXiv:2310.07918v2 [cs.LG] UPDATED)
    Interpretable policy learning seeks to estimate intelligible decision policies from observed actions; however, existing models fall short by forcing a tradeoff between accuracy and interpretability. This tradeoff limits data-driven interpretations of human decision-making process. e.g. to audit medical decisions for biases and suboptimal practices, we require models of decision processes which provide concise descriptions of complex behaviors. Fundamentally, existing approaches are burdened by this tradeoff because they represent the underlying decision process as a universal policy, when in fact human decisions are dynamic and can change drastically with contextual information. Thus, we propose Contextualized Policy Recovery (CPR), which re-frames the problem of modeling complex decision processes as a multi-task learning problem in which complex decision policies are comprised of context-specific policies. CPR models each context-specific policy as a linear observation-to-action mapping, and generates new decision models $\textit{on-demand}$ as contexts are updated with new observations. CPR is compatible with fully offline and partially observable decision environments, and can be tailored to incorporate any recurrent black-box model or interpretable decision model. We assess CPR through studies on simulated and real data, achieving state-of-the-art performance on the canonical tasks of predicting antibiotic prescription in intensive care units ($+22\%$ AUROC vs. previous SOTA) and predicting MRI prescription for Alzheimer's patients ($+7.7\%$ AUROC vs. previous SOTA). With this improvement in predictive performance, CPR closes the accuracy gap between interpretable and black-box methods for policy learning, allowing high-resolution exploration and analysis of context-specific decision models.  ( 3 min )
    Some Intriguing Aspects about Lipschitz Continuity of Neural Networks. (arXiv:2302.10886v3 [cs.LG] UPDATED)
    Lipschitz continuity is a crucial functional property of any predictive model, that naturally governs its robustness, generalisation, as well as adversarial vulnerability. Contrary to other works that focus on obtaining tighter bounds and developing different practical strategies to enforce certain Lipschitz properties, we aim to thoroughly examine and characterise the Lipschitz behaviour of Neural Networks. Thus, we carry out an empirical investigation in a range of different settings (namely, architectures, datasets, label noise, and more) by exhausting the limits of the simplest and the most general lower and upper bounds. As a highlight of this investigation, we showcase a remarkable fidelity of the lower Lipschitz bound, identify a striking Double Descent trend in both upper and lower bounds to the Lipschitz and explain the intriguing effects of label noise on function smoothness and generalisation.  ( 2 min )
    Choosing the parameter of the Fermat distance: navigating geometry and noise. (arXiv:2311.18663v1 [stat.ML])
    The Fermat distance has been recently established as a useful tool for machine learning tasks when a natural distance is not directly available to the practitioner or to improve the results given by Euclidean distances by exploding the geometrical and statistical properties of the dataset. This distance depends on a parameter $\alpha$ that greatly impacts the performance of subsequent tasks. Ideally, the value of $\alpha$ should be large enough to navigate the geometric intricacies inherent to the problem. At the same, it should remain restrained enough to sidestep any deleterious ramifications stemming from noise during the process of distance estimation. We study both theoretically and through simulations how to select this parameter.  ( 2 min )
    A Survey of Learning Criteria Going Beyond the Usual Risk. (arXiv:2110.04996v3 [stat.ML] UPDATED)
    Virtually all machine learning tasks are characterized using some form of loss function, and "good performance" is typically stated in terms of a sufficiently small average loss, taken over the random draw of test data. While optimizing for performance on average is intuitive, convenient to analyze in theory, and easy to implement in practice, such a choice brings about trade-offs. In this work, we survey and introduce a wide variety of non-traditional criteria used to design and evaluate machine learning algorithms, place the classical paradigm within the proper historical context, and propose a view of learning problems which emphasizes the question of "what makes for a desirable loss distribution?" in place of tacit use of the expected loss.  ( 2 min )
    Generative Models for Anomaly Detection and Design-Space Dimensionality Reduction in Shape Optimization. (arXiv:2308.04051v2 [stat.ML] UPDATED)
    Our work presents a novel approach to shape optimization, with the twofold objective to improve the efficiency of global optimization algorithms while promoting the generation of high-quality designs during the optimization process free of geometrical anomalies. This is accomplished by reducing the number of the original design variables defining a new reduced subspace where the geometrical variance is maximized and modeling the underlying generative process of the data via probabilistic linear latent variable models such as factor analysis and probabilistic principal component analysis. We show that the data follows approximately a Gaussian distribution when the shape modification method is linear and the design variables are sampled uniformly at random, due to the direct application of the central limit theorem. The degree of anomalousness is measured in terms of Mahalanobis distance, and the paper demonstrates that abnormal designs tend to exhibit a high value of this metric. This enables the definition of a new optimization model where anomalous geometries are penalized and consequently avoided during the optimization loop. The procedure is demonstrated for hull shape optimization of the DTMB 5415 model, extensively used as an international benchmark for shape optimization problems. The global optimization routine is carried out using Bayesian optimization and the DIRECT algorithm. From the numerical results, the new framework improves the convergence of global optimization algorithms, while only designs with high-quality geometrical features are generated through the optimization routine thereby avoiding the wastage of precious computationally expensive simulations.  ( 3 min )
    $\mathbb{Z}_2\times \mathbb{Z}_2$ Equivariant Quantum Neural Networks: Benchmarking against Classical Neural Networks. (arXiv:2311.18744v1 [quant-ph])
    This paper presents a comprehensive comparative analysis of the performance of Equivariant Quantum Neural Networks (EQNN) and Quantum Neural Networks (QNN), juxtaposed against their classical counterparts: Equivariant Neural Networks (ENN) and Deep Neural Networks (DNN). We evaluate the performance of each network with two toy examples for a binary classification task, focusing on model complexity (measured by the number of parameters) and the size of the training data set. Our results show that the $\mathbb{Z}_2\times \mathbb{Z}_2$ EQNN and the QNN provide superior performance for smaller parameter sets and modest training data samples.  ( 2 min )
    On the Convergence of the ELBO to Entropy Sums. (arXiv:2209.03077v4 [stat.ML] UPDATED)
    The variational lower bound (a.k.a. ELBO or free energy) is the central objective for many established as well as many novel algorithms for unsupervised learning. Learning algorithms change model parameters such that the variational lower bound increases. Learning usually proceeds until parameters have converged to values close to a stationary point of the learning dynamics. In this purely theoretical contribution, we show that (for a very large class of generative models) the variational lower bound is at all stationary points of learning equal to a sum of entropies. For standard machine learning models with one set of latents and one set observed variables, the sum consists of three entropies: (A) the (average) entropy of the variational distributions, (B) the negative entropy of the model's prior distribution, and (C) the (expected) negative entropy of the observable distributions. The obtained result applies under realistic conditions including: finite numbers of data points, at any stationary points (including saddle points) and for any family of (well behaved) variational distributions. The class of generative models for which we show the equality to entropy sums contains many well-known generative models. As concrete examples we discuss Sigmoid Belief Networks, probabilistic PCA and (Gaussian and non-Gaussian) mixture models. The results also apply for standard (Gaussian) variational autoencoders, which has been shown in parallel (Damm et al., 2023). The prerequisites we use to show equality to entropy sums are relatively mild. Concretely, the distributions of a given generative model have to be of the exponential family (with constant base measure), and the model has to satisfy a parameterization criterion (which is usually fulfilled). Proving the equality of the ELBO to entropy sums at stationary points (under the stated conditions) is the main contribution of this work.  ( 3 min )
    Semiparametric Efficient Inference in Adaptive Experiments. (arXiv:2311.18274v1 [stat.ML])
    We consider the problem of efficient inference of the Average Treatment Effect in a sequential experiment where the policy governing the assignment of subjects to treatment or control can change over time. We first provide a central limit theorem for the Adaptive Augmented Inverse-Probability Weighted estimator, which is semiparametric efficient, under weaker assumptions than those previously made in the literature. This central limit theorem enables efficient inference at fixed sample sizes. We then consider a sequential inference setting, deriving both asymptotic and nonasymptotic confidence sequences that are considerably tighter than previous methods. These anytime-valid methods enable inference under data-dependent stopping times (sample sizes). Additionally, we use propensity score truncation techniques from the recent off-policy estimation literature to reduce the finite sample variance of our estimator without affecting the asymptotic variance. Empirical results demonstrate that our methods yield narrower confidence sequences than those previously developed in the literature while maintaining time-uniform error control.  ( 2 min )
    Efficient Model-Based Concave Utility Reinforcement Learning through Greedy Mirror Descent. (arXiv:2311.18346v1 [math.OC])
    Many machine learning tasks can be solved by minimizing a convex function of an occupancy measure over the policies that generate them. These include reinforcement learning, imitation learning, among others. This more general paradigm is called the Concave Utility Reinforcement Learning problem (CURL). Since CURL invalidates classical Bellman equations, it requires new algorithms. We introduce MD-CURL, a new algorithm for CURL in a finite horizon Markov decision process. MD-CURL is inspired by mirror descent and uses a non-standard regularization to achieve convergence guarantees and a simple closed-form solution, eliminating the need for computationally expensive projection steps typically found in mirror descent approaches. We then extend CURL to an online learning scenario and present Greedy MD-CURL, a new method adapting MD-CURL to an online, episode-based setting with partially unknown dynamics. Like MD-CURL, the online version Greedy MD-CURL benefits from low computational complexity, while guaranteeing sub-linear or even logarithmic regret, depending on the level of information available on the underlying dynamics.  ( 2 min )

  • Open

    [D]eep Dive into the Vision Transformer (ViT) paper by the Google Brain team
    We have a reading club every Friday called Arxiv Dives where we go over the fundamentals of a lot of the state of the art techniques used in Machine Learning today. Last week we dove into the "Vision Transformers" Paper from 2021 where the Google Brain team benchmarked training large scale transformers against ResNets. Though it is not groundbreaking research as of this week, I think with the pace of AI it is important to dive deep into past work and what others have tried! It's nice to take a step back and review the fundamentals as well as keeping up with the latest and greatest. Posted the notes and recap here if anyone finds it helpful: https://blog.oxen.ai/arxiv-dives-vision-transformers-vit/ Also would love to have anyone join us live on Fridays! We've got a pretty consistent and fun group of 300+ engineers and researchers showing up. submitted by /u/FallMindless3563 [link] [comments]  ( 9 min )
    [D] M2 Max 96GB or M2 Ultra 64GB
    Mac Studio Apple M2 Ultra Chip with 24‑Core CPU and 60‑Core GPU, 64 GB RAM or Mac Studio Apple M2 Max Chip with 12‑Core CPU and 38‑Core GPU, 96 GB RAM For working with LLM. What would you do? submitted by /u/breadandtacos [link] [comments]  ( 9 min )
    [R] When Meta-Learning Meets Online and Continual Learning: A Survey
    Paper: https://arxiv.org/abs/2311.05241 Abstract: Over the past decade, deep neural networks have demonstrated significant success using the training scheme that involves mini-batch stochastic gradient descent on extensive datasets. Expanding upon this accomplishment, there has been a surge in research exploring the application of neural networks in other learning scenarios. One notable framework that has garnered significant attention is meta-learning. Often described as "learning to learn," meta-learning is a data-driven approach to optimize the learning algorithm. Other branches of interest are continual learning and online learning, both of which involve incrementally updating a model with streaming data. While these frameworks were initially developed independently, recent works have started investigating their combinations, proposing novel problem settings and learning algorithms. However, due to the elevated complexity and lack of unified terminology, discerning differences between the learning frameworks can be challenging even for experienced researchers. To facilitate a clear understanding, this paper provides a comprehensive survey that organizes various problem settings using consistent terminology and formal descriptions. By offering an overview of these learning paradigms, our work aims to foster further advancements in this promising area of research. https://preview.redd.it/pp2j7tz2dr3c1.png?width=1249&format=png&auto=webp&s=983e081c4b4feabddb3457ba74d94202495be4a5 submitted by /u/APaperADay [link] [comments]  ( 10 min )
    [D]
    Hi everyone ! I am going to purchase a laptop for ML and DL as mentioned above. Currently I am at beginner phase and most of the models I made are trained online on jupiter and google collab. For me battery life matters a lot because currently I am an engineering student and the place where I study has shortage of electric ports ( Library ). But I can compromise it if goes against my productivity and work flow. My ambition is to go for MS ( research in ML and DL ) and the machine should last around 6-8 years MAX. THOSE TWO LAPTOPS ARE:- Apple Macbook Air ( M2, 16GB unified memory, 8 core CPU & 10 core GPU, 512GB SSD ) ROG Strix G15 (2022) G513 G513RW-HQ148WS These both are around the same price range where the Mac is 20K INR costlier than ROG. So yes my maximum budget is 1,50,000 INR. [link] [comments]  ( 9 min )
    [D] Advice on PC upgrades.
    Sorry in advance for the wall of text. TL:DR Should I upgrade my CPU from a 3900x to a 5950x? And should I get two 3090s or a single 4090? Background: At the beginning of the year my company started offering a 6 month data science course for employees which I was lucky to participate in. I knew little about machine learning prior to the course but have since been obsessed to the point where I want to repurpose my gaming PC into a machine learning PC. I completed the course, but I'm still very much a noob so I figured I can use some advice. I want to upgrade some pieces and have a bonus coming in next month so I can spare about $3k on upgrades. Use case: All of my machine learning projects have been run on work servers and my work MBP, so I have no frame of reference on how training/infe…  ( 11 min )
    [P] Early in 2023 I put in a lot of work on a new machine learning project. Now I'm not sure what to do with it.
    First I want to make it clear this is not a self promotion post. I hope many machine learning people come at me with questions or comments about this project. A little background about myself. I did work on the 4 bits quantization of LLaMA using GPTQ. (https://github.com/qwopqwop200/GPTQ-for-LLaMa). I've been studying AI in-depth for many years now. Back in early 2023 I created several unique systems that can output a video like this (https://www.youtube.com/watch?v=uoEd8WykzkY) . Not that this is the best video ever, its mostly a tech demo of what it can do. I did some promotion on the system, but was mostly only able to reach bots from what I can tell. I'm not sure what to do with it. I have some creative Ideas but would like to hear if anyone is interested in the system. I put a lot of work into it and would like to see it used for something entertaining. The website for the project is basic and plain as can be right now, just a place holder and contact page really. I didn't get much of a response at all after posting a bunch of videos made by it. I think some people may confuse it with spam or whatever. A few points are: - Can be fully automated with almost no human intervention - Full HD video - A unique 3D perspective video system that produces 3D video from 2D images - Needs no internet connection - No special hardware required, can be run with a single 3090 or better. submitted by /u/mentosorangemint [link] [comments]  ( 10 min )
    [D] PCA, loadings and finding most significant discriminant features/variables.
    The data I am working has 137 samples with each having 34,765 variables (each pixel is a spectrum itself). I want to find which variable is the most significant one (IMO the most intensity/amplitude one gotta be the most significant). Once I ran PCA after normalization (dividing by the sum of a reference spectrum), the first two components explained variance ratio is 93%. Now what do I learn from the loadings of the PCA matrix which is (34765 x 10) for 10 principal components. The max absolute values of loadings should be investigated? what shall I do? submitted by /u/AnikArnab [link] [comments]  ( 9 min )
    [D] The Bitter Lesson for Robotics
    For the two people in this subreddit that haven't read the bitter lesson yet, http://www.incompleteideas.net/IncIdeas/BitterLesson.html However, as someone who is interested in robot perception, planning, and learning, does this necessarily apply? I'm not too sure, especially in the context of robots in human (unstructured) environments whose policies cover much wider in scope than say factory or warehouse robots. Robots that have to deal with the stochastic and wildly differing nature of the real world, whose policies are robust to change in its environment. I can think of a few ideas where the bitter lesson may or may not be applicable. Hardware limitations. Although offloading compute to remote servers is definitely an option, how much can robots rely on this when interacting with the environment in real time? No feasible robot would be able to store billions of parameters even just for inference. Moore's law has been slowing down for some time now. Data. This is a big one in my opinion. What even constitutes good training data for training robot policies? Do we have enough of it? Of course we have good models and enough data for CV and language, and even Toyota's paper on using diffusion models for grasp poses looks promising, but a robot policy must put all of this together to accomplish multi-modal tasks. There isn't a huge corpus for multimodal task planning, such as how to break a task such as pouring a cup of water, into an HTN with multiple sub-tasks (grasp cup, pick, pour, etc.) The reason why I make this post is because I am unsure. I could see how an LLM could be the basis for a generalized robot policy that can accomplish multiple tasks, or more efficient architectures can allow for more compute to be available to robots. What are your thoughts? submitted by /u/n0ided_ [link] [comments]  ( 10 min )
    [D] Should We Contact the AC if Reviewers Go Silent During Rebuttal?
    After submitting our paper to ICLR2024 and receiving initial reviews, we provided a detailed rebuttal addressing all the points raised by the reviewers. However, since our rebuttal, there has been complete silence from the reviewers' side. No further questions, comments, or engagement of any kind, dispite the initial comments were quite detailed and feedbacks are positive. In a situation like this, is it advisable to reach out to the AC to request their intervention in encouraging reviewer engagement? Has anyone been in a similar situation? What did you do? Would be greatly appreciated to any suggestion. submitted by /u/jzhoubu [link] [comments]  ( 9 min )
    [Discussion] Language models that could run reasonably on a Jetson Nano
    I recently tried running some low Q numbered Mistral 7B models on a NVIDA Jetson Nano board that has 4gb of memory with ollama. They ran but reallllly slowly. Is there any language models with reasonable performance that could run on something like a Jetson Nano for coding / analytics / data processing? submitted by /u/head_robotics [link] [comments]  ( 9 min )
    [R] RETVec: Resilient and Efficient Text Vectorizer
    Happy Friday, Really happy to share that the code and model for RETVec our new SOTA robust text tokenizer for classification is available on Github here and the NeurIPS paper here. We also provide native support for TFLite and for the web via a TFJS. Hope you will find it useful for your research. If you would like to give it a try we have a get started notebook. Let us know if you have any questions. ​ submitted by /u/ebursztein [link] [comments]  ( 9 min )
    [P] 80% faster, 50% less memory, 0% loss in accuracy Llama finetuning
    Hey r/MachineLearning! I manually derived backpropagation steps, did some chained matrix multiplication optims, wrote all kernels in OpenAI's Triton language and did more maths and coding trickery to make QLoRA finetuning for Llama 5x faster on Unsloth: https://github.com/unslothai/unsloth! Some highlights: 5x faster (5 hours to 1 hour) Use 50% less memory With 0% loss in accuracy All locally on NVIDIA GPUs (Tesla T4, RTX 20/30/40, Ampere, Hopper) for free! QLoRA / LoRA is now 80% faster to train. On Slim Orca 518K examples on 2 Tesla T4 GPUs via DDP, Unsloth trains 4bit QLoRA on all layers in 260 hours VS Huggingface's original implementation of 1301 hours. Slim Orca 1301 hours to 260 hours You might (most likely not) remember me from Hyperlearn (https://github.com/danielhanchen/hyperlearn) which I launched a few years back to make ML algos 2000x faster via maths and coding tricks. I wrote up a blog post about all the manual hand derived backprop via https://unsloth.ai/introducing. I wrote a Google Colab for T4 for Alpaca: https://colab.research.google.com/drive/1oW55fBmwzCOrBVX66RcpptL3a99qWBxb?usp=sharing which finetunes Alpaca 2x faster on a single GPU. On Kaggle via 2 Tesla T4s on DDP: https://www.kaggle.com/danielhanchen/unsloth-laion-chip2-kaggle, finetune LAION's OIG 5x faster and Slim Orca 5x faster. You can install Unsloth all locally via: pip install "unsloth[cu118] @ git+https://github.com/unslothai/unsloth.git" pip install "unsloth[cu121] @ git+https://github.com/unslothai/unsloth.git" Currently we only support Pytorch 2.1 and Linux distros - more installation instructions via https://github.com/unslothai/unsloth/blob/main/README.md I hope to: Support other LLMs other than Llama style models (Mistral etc) Add sqrt gradient checkpointing to shave another 25% of memory usage. And other tricks! Thanks a bunch!! submitted by /u/danielhanchen [link] [comments]  ( 10 min )
    [R] Meta's new speech models (Seamless)
    Meta Research just released new speech models called Seamless: https://ai.meta.com/research/seamless-communication/ It supports a TON of languages for both speech and text input AND output. In some sense it's a generalized model for a bunch of related speech tasks. Very interesting! submitted by /u/semicausal [link] [comments]  ( 9 min )
    [R] Research Survey: Bugs in LLMs Generated Code
    Hello, Our research group at Polytechnique Montréal (under the supervision of Prof. Foutse Khomh) is conducting a survey on “Bugs in LLMs Generated Code”. We have prepared an online survey that takes around 10 minutes of your valuable time to complete asking about the characteristics of such bugs. Snippets provided will be in Python but they require basic background in Python to be understood. Moreover, you could kindly share this survey with colleagues/students who you consider eligible to participate. At no point in the survey will we ask you for your name, and we will not be logging your IP address to allow anonymity. The results of this survey will be publicly accessible in an anonymous form. We greatly appreciate your help in doing so. If you would like to know more about this study or have any questions, feel free to contact us. The link to the survey: https://forms.gle/8kZaxsNqtb3vbFaD6 Thank you for your help. Florian Tambon ([florian-2.tambon@polymtl.ca](mailto:florian-2.tambon@polymtl.ca)), Arghavan Moradidakhel ([arghavan.moradi-dakhel@polymtl.ca](mailto:arghavan.moradi-dakhel@polymtl.ca)), Amin Nikanjam ([amin.nikanjam@polymtl.ca](mailto:amin.nikanjam@polymtl.ca)), Foutse Khomh ([foutse.khomh@polymtl.ca](mailto:foutse.khomh@polymtl.ca)) SWAT Lab., Polytechnique Montréal, http://swat.polymtl.ca/ submitted by /u/aminnikanjam [link] [comments]  ( 10 min )
    [D] Best project(s) to try for improving scanned photos
    I'm sitting on approximately 1000 scanned photos. Most are about 20-30yrs old so in pretty good condition. A couple are much older. I'd like to batch/automate as much as I can, then manually review orig vs processed to pick out the ones I need to work on manually. These are the issues that I would like to fix in order of prevalence in the set: Photo paper texture (ie. matte) makes it look like it is a scan vs a photo Adjustments for brightness/contrast/color etc Photos are blurry or out of focus Photo is damaged (torn, wrinkled, etc) Photo is really old (stained) I'd like to batch process as much as possible. Not exactly sure how to do #1, #2 can be done by alot of software with their "auto adjust" settings. I'm thinking maybe Irfanview? or Photoshop? Not sure what other tools there are. There are some machine learning or neural networks models out there that can fix #3-5 and maybe #1-2. However many are focus on the face and some go way too far and basically replace the face so it looks super fake. Just looking through gitlab there are alot of option. The one I've tested the mots is https://github.com/microsoft/Bringing-Old-Photos-Back-to-Life Then there are all the random "AI" websites. Ideally, I'd like super old photos to be restored to look like they did when they were first taken. Less old photos (ie. 90's) I'd like to restore to look like they were taken today. At least that's the goal. Is there a comparison somewhere? Also, in this same vein, I'd be curious if there are any tools that are good to use on general photos taken on a phone to "optimize" them. For example, I use to run them all through a noise filter to make them look clearer. ​ To summarize, I know there are alot of projects out there. I'm curious if there is a comparison out there so I can learn about the differences or which are currently best without having to try to setup each one to compare. I'd appreciate any recommendations. submitted by /u/eng33 [link] [comments]  ( 10 min )
    [R] Do some authors conscientiously add up more mathematics than needed to make the paper "look" more groundbreaking?
    I've noticed a trend recently of authors adding more formalism than needed in some instances (e.g. a diagram/ image would have done the job fine). Is this such a thing as adding more mathematics than needed to make the paper look better or perhaps it's just constrained by the publisher (whatever format the paper must stick to in order to get published)? submitted by /u/Inquation [link] [comments]  ( 9 min )
    [D] Choosing number of components for Nonnegative Matrix Factorization
    I have several estimates of undirected weighted networks, i.e. weighted adjacency matrices. I use NMF to identify subnetworks / components. I understand that there are several ways to come up with the number of components to use, from simple means such as a knee-plot to more sophisticated approaches such as cross-validation with randomly removed datapoints. I have applied several methods to my set of networks and they suggest to use 5-8 components. I have an a priori idea of what two subnetworks should look like. Using 5-8 components seems to create subnetworks of the expected two subnetworks. Or two of the components represent the expected subnetworks while the remaining components look very similar to those components. If I reduce the number of components to two, then I obtain the two expected subnetworks. Is it legit to simply use two components because they match my expectation? Are there methods to merge components (e.g. merging the sub-subnetworks to the expected subnetworks) and is it legit to merge components? E.g. finding the additive combination of subnetworks that maximize correlation with an expected network? Are there methods that enforce "larger" networks? I appreciate your ideas on this! submitted by /u/dizzledk [link] [comments]  ( 10 min )
    [D] Data science book recommendations
    I've been a professional coder for over 10 years, backed by a degree in computer science and a touch of math and theoretical statistics. I haven't really used that stats knowledge much, so I'm just skimming the surface. Now, I'm looking to dive into machine learning. I'm not looking for an easy path, My main purpose is not to quickly land a job, I'm doing this solely for personal growth. Before diving deep into machine learning, I want to strengthen my foundations in data science, perhaps even focusing on data mining. Any book/books recommendations for this? I’m not in need of programming books, as I already know some programming languages, including Python. I came across "Introduction to Data Mining" by Pang-Ning Tan and others. Is it a good starting point? Another book I found is "An Introduction to Statistical Learning" by Gareth James. submitted by /u/lp_kalubec [link] [comments]  ( 9 min )
    [D] freeze particular parameters in a layer of a resnet18 in pytorch?
    Hey friends! Has anyone written any code before where they freeze particular parameters in a layer of a NN? ​ I don't mean freezing particular layers of a NN, which I can successfully do (e.g.: for param in resnet18.fc.parameters(): param.requires_grad = True ​ What I mean is freezing particular parameters within a layer - by applying some sort of mask or otherwise. Edit: Also, what I am trying to do is treat some parameters of the NN as scalars -- so that gradients won't pass through them anymore. If I simply mask the gradient post hoc, e.g. loss.backward() for p, p_mask in zip(parameters, masks): p.grad *= p_mask optimizer.step() this won't accomplish this, -- isn't this right? or am I mistaken. Thanks so much! :) submitted by /u/Cultural-Average3959 [link] [comments]  ( 9 min )
  • Open

    Mutli Agent Reinforcement Learning Baselines
    I am looking for an equivalent of Stable baselines 3 but for MARL as the name suggests so that I can benchmark algorithms on my custom MARL environment. Most of the approaches mentioned utilize parameter sharing with different libraries like Tianshou, CleanRL and Stable Baselines 3 itself. Is there any other commonly used standardized baseline for differenet MARL algorithms? submitted by /u/blitzkreig3 [link] [comments]  ( 1 min )
    When Meta-Learning Meets Online and Continual Learning: A Survey
    Paper: https://arxiv.org/abs/2311.05241 Abstract: Over the past decade, deep neural networks have demonstrated significant success using the training scheme that involves mini-batch stochastic gradient descent on extensive datasets. Expanding upon this accomplishment, there has been a surge in research exploring the application of neural networks in other learning scenarios. One notable framework that has garnered significant attention is meta-learning. Often described as "learning to learn," meta-learning is a data-driven approach to optimize the learning algorithm. Other branches of interest are continual learning and online learning, both of which involve incrementally updating a model with streaming data. While these frameworks were initially developed independently, recent works have started investigating their combinations, proposing novel problem settings and learning algorithms. However, due to the elevated complexity and lack of unified terminology, discerning differences between the learning frameworks can be challenging even for experienced researchers. To facilitate a clear understanding, this paper provides a comprehensive survey that organizes various problem settings using consistent terminology and formal descriptions. By offering an overview of these learning paradigms, our work aims to foster further advancements in this promising area of research. https://preview.redd.it/ag7xaviaer3c1.png?width=1249&format=png&auto=webp&s=c4c0aa093ac20ae7f6f567c9427f9a9e76e3a7ee submitted by /u/APaperADay [link] [comments]  ( 1 min )
    Estimating state value for environments with fixed number of steps
    I'm training an RL agent in a custom environment. My environment terminates after a certain number of actions have been taken. My instinct is that when it comes to estimating state value through a value-network, the number of steps remaining should be part of the input to the value network. (Currently I do not have this as part of input.) Is this correct? Would love to hear your thoughts on this and it would be great if you could point me to any existing implementations or papers where this is done? Thank you! submitted by /u/morakorvai [link] [comments]  ( 1 min )
    How does RL work with multiple motions
    For example, is this created using Unity, or its own gym? https://www.youtube.com/watch?v=SsJ_AusntiU&pp=ygUncmVpbmZvcmNlbWVudCBsZWFybmluZyB0ZWFjaCBob3cgdG8gYm94 Its hard to tell from the video. What's confusing is how they are chained together, and whether or not this is possible with ML agents. I understand how to train a model to stand, but how to stand, get up, move around, and box at the same time? Is it all done from the same observations and same rewards within billions of repetitions? Or is there 1 model that learned how to stand, 1model that learned how to get up. 1 simplified model that learned how to move around. etc. And then these models are bound to each other using ragdoll physics. Essentially joints on the model that is learning are bound to the model that knows each individual motion. reason I'm asking is I'm finding unity ML agents to be a bit limited. like the motion is way too random, i get almost no growth in mean reward. are the actions supposed to improve over the course of learning or are they just randomized the whole time? even simple things like moving to target i find the agent constantly jitters and moves bath and forth randomly submitted by /u/Sharp-Cat2319 [link] [comments]  ( 1 min )
    Any physics engine gyms outside of Unity that allow importing of models and etc.
    I find the ML Agents for unity is a bit dated and limiting and more geared toward game development than robotics, and I would like to create my own gyms and etc. For simple stuff like 2D games its pretty straightforward, but I would like to do some 3D robotic models and use stable baselines to actuate them, for this I would need a 3D engine that has collision and physics, as well as model/texture importing, some shading, and of course rendering. Preferably something that can render to a webgl environment too. I would assume Stable Baselines can work with any 3D Open Source engine. Is there something pretty rudimentary that someone can use with OpenAI gym to import models, add rigid body and collision parameters, etc. Or does OpenAI gym already have all of this? ​ ​ submitted by /u/Sharp-Cat2319 [link] [comments]  ( 1 min )
    🚀 DIAMBRA x OROBIX: Releasing SheepRL Reinforcement Learning Library Integration! 🤖🎮
    We are thrilled to announce our dynamic collaboration with OROBIX, creators of the advanced RL library SheepRL! 🤝💡 🌟 Why SheepRL? Seamlessly integrated into DIAMBRA, SheepRL offers a clear path to modern RL algorithms with a focus on distributed training for scalable applications. 📚 Getting Started: Explore the power of SheepRL in our official DIAMBRA documentation. Examples in source code await in the DIAMBRA Agents repository. 🚀 Swift Launchpad to RL Exploration: With just a few lines of Python, launch RL trainings on all our games—making AI exploration a breeze! 👉 Discover More with the links in the comments below! Join us in embracing the future of RL research. 🚀🤖✨ DIAMBRA x SheepRL submitted by /u/DIAMBRA_AIArena [link] [comments]  ( 1 min )
    Learning RL research
    Hi everyone, I'm studying reinforcement learning now, and I'm looking for ways to speed up my learning. My plan is to be able to do good research in the field. Right now I'm looking at offline RL. Right now what helps is we have a weekly session with my fellow PhD classmates. We present new research papers we have read. I share about RL papers when it's my turn. But one issue is our topics are so diverse. One works on computer vision, one on graph neural nets, one on time series forecasting. I'm the only one doing RL. It would be great if I get to talk to more people with the same interests. For added context, I'm from the Philippines and it's very hard to get support here or groups with expertise to mentor me. It's hard to stay motivated when I feel that I'm almost alone in my learning process. Though I have a professor in the Netherlands who's helping me a lot. Do you have any suggestions on say online groups etc, that can help me learn it much faster, with potential mentorship, sharing of resources, etc? Thank you! submitted by /u/111user222 [link] [comments]  ( 1 min )
  • Open

    How can I practice?
    Hi, I'm a freshman and I just finished the first course of Andrew NG's 3 course specialization. In 2-3 months I'm looking forward to start working with one of my professors. I told him even though I'm not zero programming wise but I have no idea about ML, reinforcement learning, etc. and asked him to tell me what am I supposed to know before I can start work with him. He told me to check this course and learn things like numpy, pandas, scikit-learn, etc. I'm going through the specialization but I have no idea where can I apply the knowledge I gain or how can I practise? Is there a website or something like hackerrank where I can practise data science, ML algorithms and such? submitted by /u/Trevorego [link] [comments]  ( 1 min )
  • Open

    Boosting developer productivity: How Deloitte uses Amazon SageMaker Canvas for no-code/low-code machine learning
    The ability to quickly build and deploy machine learning (ML) models is becoming increasingly important in today’s data-driven world. However, building ML models requires significant time, effort, and specialized expertise. From data collection and cleaning to feature engineering, model building, tuning, and deployment, ML projects often take months for developers to complete. And experienced data […]  ( 10 min )
    Experience the new and improved Amazon SageMaker Studio
    Launched in 2019, Amazon SageMaker Studio provides one place for all end-to-end machine learning (ML) workflows, from data preparation, building and experimentation, training, hosting, and monitoring. As we continue to innovate to increase data science productivity, we’re excited to announce the improved SageMaker Studio experience, which allows users to select the managed Integrated Development Environment (IDE) […]  ( 6 min )
    Amazon SageMaker simplifies setting up SageMaker domain for enterprises to onboard their users to SageMaker
    As organizations scale the adoption of machine learning (ML), they are looking for efficient and reliable ways to deploy new infrastructure and onboard teams to ML environments. One of the challenges is setting up authentication and fine-grained permissions for users based on their roles and activities. For example, MLOps engineers typically perform model deployment activities, […]  ( 8 min )
  • Open

    how much smarter is chatting with ai making you?
    View Poll submitted by /u/Georgeo57 [link] [comments]  ( 1 min )
    AI — weekly megathread!
    News provided by aibrews.com Meta AI introduced a suite of AI language translation models that preserve expression and improve streaming [Details | GitHub]: SeamlessExpressive enables the transfer of tones, emotional expression and vocal styles in speech translation. You can try a demo of SeamlessExpressive using your own voice as an input here. SeamlessStreaming, a new model that enables streaming speech-to-speech and speech-to-text translations with <2 seconds of latency and nearly the same accuracy as an offline model. In contrast to conventional systems which translate when the speaker has finished their sentence, SeamlessStreaming translates while the speaker is still talking. t intelligently decides when it has enough context to output the next translated segment. SeamlessM4T …  ( 1 min )
    Out of the loop – which image generator to use to "make person X look like Y"?
    When AI blew up (has it been a year already?) I was all over it, jumping from one image generator to the next to see what they could do. I never really went beyond the basics, but now I am looking for an image generator that will let me upload an image and turn the person on it into something specific – for instance "make it look like this boy is conducting a philharmonic orchestra". Description of the generator I'm looking for: It should have the ability to create images with a high DPI NSFW is not a concern, I will only be generating SFW content Free (as in beer) is a plus, but not essential. If there is payment, I prefer a monthly/yearly payment (free as in speech would make me happy on a personal level) Please no Discord UI, it drives me up the wall Edit: Expanded the definition of "free" submitted by /u/1337_n00b [link] [comments]
    How does the temperature LLM parameter mathematically work?
    I know what is does, but inside, what exactly is it? submitted by /u/yukiarimo [link] [comments]
    "DeepMind's AI Unleashes Potential: Paving the Way for Thousands of New Materials"
    submitted by /u/Fit-Code-5141 [link] [comments]  ( 1 min )
    Essential Python Libraries for LLMs and Application Development in 2024
    submitted by /u/FCFAN44 [link] [comments]
    What are your predictions for 2023?
    It's been a crazy year and the amount and pace of announcements has been unprecedented. What are your expectations for 2024? Here's a few that I expect to see next year. A huge race in ai personal assistants like Siri and Alexa A personalities to become a much bigger thing - conversations to partly replace doom scrolling Voice / audio being utilized much more Models getting better with less parameters Gpt's to expand and enable building ui's to build full apps conversationally. The first few AI agents that can autonomously complete goal oriented multi step tasks Easy Integration into all the major apps. More scientific breakthroughs like the DeepMind’s Materials discovery. Grok will beat gpt 4 is some ways. Rise of digital companions. Let's hear yours. *Edit. Typo in title, meant 2024 submitted by /u/zascar [link] [comments]  ( 1 min )
    Watsonx training reaources
    Has anyone seen good Watsonx material online like YouTube videos or elsewhere to help learn more about the platform, it's main pillars and various capabilities? I'm also curious if there are particular certifications for aspects of the platform. submitted by /u/tealdric [link] [comments]
    Anyone remember that *one morning* when ChatGPT was *actually telling the truth*? I got to ask it a few important questions before they “fixed” it…
    submitted by /u/Eric-Dubay-4419 [link] [comments]  ( 1 min )
    One year later, ChatGPT is still alive and kicking. OpenAI's AI language model, ChatGPT, has over 100 million active users every week, making it the fastest-growing consumer product ever.
    submitted by /u/Upbeat-Interaction13 [link] [comments]  ( 1 min )
    The last selfie
    submitted by /u/Sea_Permit5660 [link] [comments]  ( 1 min )
    Is Gemini different than Bard or just better version of Bard?
    Both Bard and Gemini are from Google. Why isn’t Gemini Bard 2? Is Gemini using different base technology than Bard? Sorry I don’t know the terms submitted by /u/Acceptable_Box7598 [link] [comments]  ( 1 min )
    One-Minute Daily AI News 11/30/2023
    Researchers at Google DeepMind have used artificial intelligence to predict the structures of more than 2 million new materials, in a breakthrough that could have wide-reaching benefits in sectors such as renewable energy and computing.[1] Bitmagic Launches Public Test for AI-based Game Creation Platform.[2] Meta Unveils Audiobox, A new foundation model for Custom Audio Generation.[3] Amazon announces Q, an AI chatbot for businesses.[4] Sources: [1] https://time.com/6340681/deepmind-gnome-ai-materials/ [2] https://www.prnewswire.com/news-releases/bitmagic-launches-public-test-for-ai-based-game-creation-platform-302002138.html [3] https://www.maginative.com/article/meta-unveils-audiobox-a-new-foundation-model-for-custom-audio-generation/ [4] https://www.cnbc.com/2023/11/28/amazon-announces-q-an-ai-chatbot-for-businesses.html submitted by /u/Excellent-Target-847 [link] [comments]
    Microsoft Releases Convincing Case Study Showing Chain of Thought (CoT) with GPT 4 Versus Fine Tuned Models via Medprompt and CoT Prompting Strategies
    https://arxiv.org/pdf/2311.16452 A great read. I'll pull out the important parts. November 2023 ​ https://preview.redd.it/cyf6y5fubl3c1.png?width=1059&format=png&auto=webp&s=2a1b559ebfdd0900ab7dc84d3dc7088470b3bb2a Figure 1: (a) Comparison of performance on MedQA. (b) GPT-4 with Medprompt achieves SoTA on a wide range of medical challenge questions. A core metric for characterizing the performance of foundation models is the accuracy of next word prediction. Accuracy with next word prediction is found to increase with scale in training data, model parameters, and compute, in accordance with empirically derived “neural model scaling laws” [3, 12]). However, beyond predictions of scaling laws on basic measures such as next word prediction, foundation models show the sudden emergence of…  ( 1 min )
    I used Adobe's AI generative fill to turn this 16:9 photo into 4:5 aspect ratio for Instagram. Quite a challenge with the frescos and endless Rococo art - can you tell where the real photo ends and the AI begins?
    submitted by /u/WeAlStartAsStrangers [link] [comments]  ( 1 min )
    I used Adobe's AI generative fill to turn this 16:9 photo into 4:5 aspect ratio for Instagram. Quite a challenge with the frescos and Rococo art - can you tell where the real photo ends and the AI begins?
    submitted by /u/WeAlStartAsStrangers [link] [comments]  ( 1 min )
    Screenshot to Code GPT
    submitted by /u/Senior_tasteey [link] [comments]  ( 1 min )
  • Open

    Unsupervised speech-to-speech translation from monolingual data
    Posted by Eliya Nachmani, Research Scientist, and Michelle Tadmor Ramanovich, Software Engineer, Google Research Speech-to-speech translation (S2ST) is a type of machine translation that converts spoken language from one language to another. This technology has the potential to break down language barriers and facilitate communication between people from different cultures and backgrounds. Previously, we introduced Translatotron 1 and Translatotron 2, the first ever models that were able to directly translate speech between two languages. However they were trained in supervised settings with parallel speech data. The scarcity of parallel speech data is a major challenge in this field, so much that most public datasets are semi- or fully-synthesized from text. This adds additiona…  ( 93 min )
  • Open

    PwR: Using representations for AI-powered software development
    PwR uses domain-specific languages to bridge communication between developers and AI tools. Learn how it can help simplify code creation and enhance software reliability and customization, no matter your coding expertise. The post PwR: Using representations for AI-powered software development appeared first on Microsoft Research.  ( 10 min )
  • Open

    Convergent subsequence
    I was reading a theorem giving conditions for a divergent series to have a convergent subseries and had a sort of flashback. I studied nonlinear PDEs in grad school, which amounted to applied functional analysis. We were constantly proving or using theorems about sequences having convergent subsequences, often subsequences that converged in a very weak […] Convergent subsequence first appeared on John D. Cook.  ( 5 min )
  • Open

    The Effects of Overparameterization on Sharpness-aware Minimization: An Empirical and Theoretical Analysis. (arXiv:2311.17539v1 [cs.LG])
    Training an overparameterized neural network can yield minimizers of the same level of training loss and yet different generalization capabilities. With evidence that indicates a correlation between sharpness of minima and their generalization errors, increasing efforts have been made to develop an optimization method to explicitly find flat minima as more generalizable solutions. This sharpness-aware minimization (SAM) strategy, however, has not been studied much yet as to how overparameterization can actually affect its behavior. In this work, we analyze SAM under varying degrees of overparameterization and present both empirical and theoretical results that suggest a critical influence of overparameterization on SAM. Specifically, we first use standard techniques in optimization to prove that SAM can achieve a linear convergence rate under overparameterization in a stochastic setting. We also show that the linearly stable minima found by SAM are indeed flatter and have more uniformly distributed Hessian moments compared to those of SGD. These results are corroborated with our experiments that reveal a consistent trend that the generalization improvement made by SAM continues to increase as the model becomes more overparameterized. We further present that sparsity can open up an avenue for effective overparameterization in practice.  ( 2 min )
    Receler: Reliable Concept Erasing of Text-to-Image Diffusion Models via Lightweight Erasers. (arXiv:2311.17717v1 [cs.CV])
    Concept erasure in text-to-image diffusion models aims to disable pre-trained diffusion models from generating images related to a target concept. To perform reliable concept erasure, the properties of robustness and locality are desirable. The former refrains the model from producing images associated with the target concept for any paraphrased or learned prompts, while the latter preserves the model ability in generating images for non-target concepts. In this paper, we propose Reliable Concept Erasing via Lightweight Erasers (Receler), which learns a lightweight Eraser to perform concept erasing and enhances locality and robustness with the proposed concept-localized regularization and adversarial prompt learning, respectively. Comprehensive quantitative and qualitative experiments with various concept prompts verify the superiority of Receler over the previous erasing methods on the above two desirable properties.  ( 2 min )
    Modular Neural Networks for Time Series Forecasting: Interpretability and Feature Selection using Attention. (arXiv:2311.16834v2 [cs.LG] UPDATED)
    Multivariate time series have many applications, from healthcare and meteorology to life science. Although deep learning models have shown excellent predictive performance for time series, they have been criticised for being "black-boxes" or non-interpretable. This paper proposes a novel modular neural network model for multivariate time series prediction that is interpretable by construction. A recurrent neural network learns the temporal dependencies in the data while an attention-based feature selection component selects the most relevant features and suppresses redundant features used in the learning of the temporal dependencies. A modular deep network is trained from the selected features independently to show the users how features influence outcomes, making the model interpretable. Experimental results show that this approach can outperform state-of-the-art interpretable Neural Additive Models (NAM) and variations thereof in both regression and classification of time series tasks, achieving a predictive performance that is comparable to the top non-interpretable methods for time series, LSTM and XGBoost.  ( 2 min )
    Grounding Foundation Models through Federated Transfer Learning: A General Framework. (arXiv:2311.17431v1 [cs.LG])
    Foundation Models (FMs) such as GPT-4 encoded with vast knowledge and powerful emergent abilities have achieved remarkable success in various natural language processing and computer vision tasks. Grounding FMs by adapting them to domain-specific tasks or augmenting them with domain-specific knowledge enables us to exploit the full potential of FMs. However, grounding FMs faces several challenges, stemming primarily from constrained computing resources, data privacy, model heterogeneity, and model ownership. Federated Transfer Learning (FTL), the combination of federated learning and transfer learning, provides promising solutions to address these challenges. In recent years, the need for grounding FMs leveraging FTL, coined FTL-FM, has arisen strongly in both academia and industry. Motivated by the strong growth in FTL-FM research and the potential impact of FTL-FM on industrial applications, we propose an FTL-FM framework that formulates problems of grounding FMs in the federated learning setting, construct a detailed taxonomy based on the FTL-FM framework to categorize state-of-the-art FTL-FM works, and comprehensively overview FTL-FM works based on the proposed taxonomy. We also establish correspondences between FTL-FM and conventional phases of adapting FM so that FM practitioners can align their research works with FTL-FM. In addition, we overview advanced efficiency-improving and privacy-preserving techniques because efficiency and privacy are critical concerns in FTL-FM. Last, we discuss opportunities and future research directions of FTL-FM.  ( 2 min )
    Learning to Simulate: Generative Metamodeling via Quantile Regression. (arXiv:2311.17797v1 [cs.LG])
    Stochastic simulation models, while effective in capturing the dynamics of complex systems, are often too slow to run for real-time decision-making. Metamodeling techniques are widely used to learn the relationship between a summary statistic of the outputs (e.g., the mean or quantile) and the inputs of the simulator, so that it can be used in real time. However, this methodology requires the knowledge of an appropriate summary statistic in advance, making it inflexible for many practical situations. In this paper, we propose a new metamodeling concept, called generative metamodeling, which aims to construct a "fast simulator of the simulator". This technique can generate random outputs substantially faster than the original simulation model, while retaining an approximately equal conditional distribution given the same inputs. Once constructed, a generative metamodel can instantaneously generate a large amount of random outputs as soon as the inputs are specified, thereby facilitating the immediate computation of any summary statistic for real-time decision-making. Furthermore, we propose a new algorithm -- quantile-regression-based generative metamodeling (QRGMM) -- and study its convergence and rate of convergence. Extensive numerical experiments are conducted to investigate the empirical performance of QRGMM, compare it with other state-of-the-art generative algorithms, and demonstrate its usefulness in practical real-time decision-making.  ( 2 min )
    A knowledge-driven AutoML architecture. (arXiv:2311.17124v1 [cs.LG])
    This paper proposes a knowledge-driven AutoML architecture for pipeline and deep feature synthesis. The main goal is to render the AutoML process explainable and to leverage domain knowledge in the synthesis of pipelines and features. The architecture explores several novel ideas: first, the construction of pipelines and deep features is approached in an unified way. Next, synthesis is driven by a shared knowledge system, interactively queried as to what pipeline operations to use or features to compute. Lastly, the synthesis processes takes decisions at runtime using partial solutions and results of their application on data. Two experiments are conducted to demonstrate the functionality of a na\"{\i}ve implementation of the proposed architecture and to discuss its advantages, trade-offs as well as future potential for AutoML.  ( 2 min )
    Look Before You Leap: Unveiling the Power of GPT-4V in Robotic Vision-Language Planning. (arXiv:2311.17842v1 [cs.RO])
    In this study, we are interested in imbuing robots with the capability of physically-grounded task planning. Recent advancements have shown that large language models (LLMs) possess extensive knowledge useful in robotic tasks, especially in reasoning and planning. However, LLMs are constrained by their lack of world grounding and dependence on external affordance models to perceive environmental information, which cannot jointly reason with LLMs. We argue that a task planner should be an inherently grounded, unified multimodal system. To this end, we introduce Robotic Vision-Language Planning (ViLa), a novel approach for long-horizon robotic planning that leverages vision-language models (VLMs) to generate a sequence of actionable steps. ViLa directly integrates perceptual data into its reasoning and planning process, enabling a profound understanding of commonsense knowledge in the visual world, including spatial layouts and object attributes. It also supports flexible multimodal goal specification and naturally incorporates visual feedback. Our extensive evaluation, conducted in both real-robot and simulated environments, demonstrates ViLa's superiority over existing LLM-based planners, highlighting its effectiveness in a wide array of open-world manipulation tasks.  ( 2 min )
    BIM: Block-Wise Self-Supervised Learning with Masked Image Modeling. (arXiv:2311.17218v1 [cs.CV])
    Like masked language modeling (MLM) in natural language processing, masked image modeling (MIM) aims to extract valuable insights from image patches to enhance the feature extraction capabilities of the underlying deep neural network (DNN). Contrasted with other training paradigms like supervised learning and unsupervised contrastive learning, masked image modeling (MIM) pretraining typically demands significant computational resources in order to manage large training data batches (e.g., 4096). The significant memory and computation requirements pose a considerable challenge to its broad adoption. To mitigate this, we introduce a novel learning framework, termed~\textit{Block-Wise Masked Image Modeling} (BIM). This framework involves decomposing the MIM tasks into several sub-tasks with independent computation patterns, resulting in block-wise back-propagation operations instead of the traditional end-to-end approach. Our proposed BIM maintains superior performance compared to conventional MIM while greatly reducing peak memory consumption. Moreover, BIM naturally enables the concurrent training of numerous DNN backbones of varying depths. This leads to the creation of multiple trained DNN backbones, each tailored to different hardware platforms with distinct computing capabilities. This approach significantly reduces computational costs in comparison with training each DNN backbone individually. Our framework offers a promising solution for resource constrained training of MIM.  ( 2 min )
    Unified Binary and Multiclass Margin-Based Classification. (arXiv:2311.17778v1 [stat.ML])
    The notion of margin loss has been central to the development and analysis of algorithms for binary classification. To date, however, there remains no consensus as to the analogue of the margin loss for multiclass classification. In this work, we show that a broad range of multiclass loss functions, including many popular ones, can be expressed in the relative margin form, a generalization of the margin form of binary losses. The relative margin form is broadly useful for understanding and analyzing multiclass losses as shown by our prior work (Wang and Scott, 2020, 2021). To further demonstrate the utility of this way of expressing multiclass losses, we use it to extend the seminal result of Bartlett et al. (2006) on classification-calibration of binary margin losses to multiclass. We then analyze the class of Fenchel-Young losses, and expand the set of these losses that are known to be classification-calibrated.  ( 2 min )
    Adversarial Robust Memory-Based Continual Learner. (arXiv:2311.17608v1 [cs.CV])
    Despite the remarkable advances that have been made in continual learning, the adversarial vulnerability of such methods has not been fully discussed. We delve into the adversarial robustness of memory-based continual learning algorithms and observe limited robustness improvement by directly applying adversarial training techniques. Preliminary studies reveal the twin challenges for building adversarial robust continual learners: accelerated forgetting in continual learning and gradient obfuscation in adversarial robustness. In this study, we put forward a novel adversarial robust memory-based continual learner that adjusts data logits to mitigate the forgetting of pasts caused by adversarial samples. Furthermore, we devise a gradient-based data selection mechanism to overcome the gradient obfuscation caused by limited stored data. The proposed approach can widely integrate with existing memory-based continual learning as well as adversarial training algorithms in a plug-and-play way. Extensive experiments on Split-CIFAR10/100 and Split-Tiny-ImageNet demonstrate the effectiveness of our approach, achieving up to 8.13% higher accuracy for adversarial data.  ( 2 min )
    Efficient Stitchable Task Adaptation. (arXiv:2311.17352v1 [cs.LG])
    The paradigm of pre-training and fine-tuning has laid the foundation for deploying deep learning models. However, most fine-tuning methods are designed to meet a specific resource budget. Recently, considering diverse deployment scenarios with various resource budgets, stitchable neural network (SN-Net) is introduced to quickly obtain numerous new networks (stitches) from the pre-trained models (anchors) in a model family via model stitching. Although promising, SN-Net confronts new challenges when adapting it to new target domains, including huge memory and storage requirements and a long and sub-optimal multistage adaptation process. In this work, we present a novel framework, Efficient Stitchable Task Adaptation (ESTA), to efficiently produce a palette of fine-tuned models that adhere to diverse resource constraints. Specifically, we first tailor parameter-efficient fine-tuning to share low-rank updates among the stitches while maintaining independent bias terms. In this way, we largely reduce fine-tuning memory burdens and mitigate the interference among stitches that arises in task adaptation. Furthermore, we streamline a simple yet effective one-stage deployment pipeline, which estimates the important stitches to deploy with training-time gradient statistics. By assigning higher sampling probabilities to important stitches, we also get a boosted Pareto frontier. Extensive experiments on 25 downstream visual recognition tasks demonstrate that our ESTA is capable of generating stitches with smooth accuracy-efficiency trade-offs and surpasses the direct SN-Net adaptation by remarkable margins with significantly lower training time and fewer trainable parameters. Furthermore, we demonstrate the flexibility and scalability of our ESTA framework by stitching LLMs from LLaMA family, obtaining chatbot stitches of assorted sizes.  ( 3 min )
    Physics-informed neural networks for transformed geometries and manifolds. (arXiv:2311.15940v2 [cs.LG] UPDATED)
    Physics-informed neural networks (PINNs) effectively embed physical principles into machine learning, but often struggle with complex or alternating geometries. We propose a novel method for integrating geometric transformations within PINNs to robustly accommodate geometric variations. Our method incorporates a diffeomorphism as a mapping of a reference domain and adapts the derivative computation of the physics-informed loss function. This generalizes the applicability of PINNs not only to smoothly deformed domains, but also to lower-dimensional manifolds and allows for direct shape optimization while training the network. We demonstrate the effectivity of our approach on several problems: (i) Eikonal equation on Archimedean spiral, (ii) Poisson problem on surface manifold, (iii) Incompressible Stokes flow in deformed tube, and (iv) Shape optimization with Laplace operator. Through these examples, we demonstrate the enhanced flexibility over traditional PINNs, especially under geometric variations. The proposed framework presents an outlook for training deep neural operators over parametrized geometries, paving the way for advanced modeling with PDEs on complex geometries in science and engineering.  ( 2 min )
    Single-cell Multi-view Clustering via Community Detection with Unknown Number of Clusters. (arXiv:2311.17103v1 [q-bio.GN])
    Single-cell multi-view clustering enables the exploration of cellular heterogeneity within the same cell from different views. Despite the development of several multi-view clustering methods, two primary challenges persist. Firstly, most existing methods treat the information from both single-cell RNA (scRNA) and single-cell Assay of Transposase Accessible Chromatin (scATAC) views as equally significant, overlooking the substantial disparity in data richness between the two views. This oversight frequently leads to a degradation in overall performance. Additionally, the majority of clustering methods necessitate manual specification of the number of clusters by users. However, for biologists dealing with cell data, precisely determining the number of distinct cell types poses a formidable challenge. To this end, we introduce scUNC, an innovative multi-view clustering approach tailored for single-cell data, which seamlessly integrates information from different views without the need for a predefined number of clusters. The scUNC method comprises several steps: initially, it employs a cross-view fusion network to create an effective embedding, which is then utilized to generate initial clusters via community detection. Subsequently, the clusters are automatically merged and optimized until no further clusters can be merged. We conducted a comprehensive evaluation of scUNC using three distinct single-cell datasets. The results underscored that scUNC outperforms the other baseline methods.  ( 2 min )
    Utilizing Model Residuals to Identify Rental Properties of Interest: The Price Anomaly Score (PAS) and Its Application to Real-time Data in Manhattan. (arXiv:2311.17287v1 [cs.LG])
    Understanding whether a property is priced fairly hinders buyers and sellers since they usually do not have an objective viewpoint of the price distribution for the overall market of their interest. Drawing from data collected of all possible available properties for rent in Manhattan as of September 2023, this paper aims to strengthen our understanding of model residuals; specifically on machine learning models which generalize for a majority of the distribution of a well-proportioned dataset. Most models generally perceive deviations from predicted values as mere inaccuracies, however this paper proposes a different vantage point: when generalizing to at least 75\% of the data-set, the remaining deviations reveal significant insights. To harness these insights, we introduce the Price Anomaly Score (PAS), a metric capable of capturing boundaries between irregularly predicted prices. By combining relative pricing discrepancies with statistical significance, the Price Anomaly Score (PAS) offers a multifaceted view of rental valuations. This metric allows experts to identify overpriced or underpriced properties within a dataset by aggregating PAS values, then fine-tuning upper and lower boundaries to any threshold to set indicators of choice.  ( 3 min )
    In Search of a Data Transformation That Accelerates Neural Field Training. (arXiv:2311.17094v1 [cs.LG])
    Neural field is an emerging paradigm in data representation that trains a neural network to approximate the given signal. A key obstacle that prevents its widespread adoption is the encoding speed-generating neural fields requires an overfitting of a neural network, which can take a significant number of SGD steps to reach the desired fidelity level. In this paper, we delve into the impacts of data transformations on the speed of neural field training, specifically focusing on how permuting pixel locations affect the convergence speed of SGD. Counterintuitively, we find that randomly permuting the pixel locations can considerably accelerate the training. To explain this phenomenon, we examine the neural field training through the lens of PSNR curves, loss landscapes, and error patterns. Our analyses suggest that the random pixel permutations remove the easy-to-fit patterns, which facilitate easy optimization in the early stage but hinder capturing fine details of the signal.
    Interpreting Differentiable Latent States for Healthcare Time-series Data. (arXiv:2311.17560v1 [cs.LG])
    Machine learning enables extracting clinical insights from large temporal datasets. The applications of such machine learning models include identifying disease patterns and predicting patient outcomes. However, limited interpretability poses challenges for deploying advanced machine learning in digital healthcare. Understanding the meaning of latent states is crucial for interpreting machine learning models, assuming they capture underlying patterns. In this paper, we present a concise algorithm that allows for i) interpreting latent states using highly related input features; ii) interpreting predictions using subsets of input features via latent states; and iii) interpreting changes in latent states over time. The proposed algorithm is feasible for any model that is differentiable. We demonstrate that this approach enables the identification of a daytime behavioral pattern for predicting nocturnal behavior in a real-world healthcare dataset.
    Understanding the robustness difference between stochastic gradient descent and adaptive gradient methods. (arXiv:2308.06703v2 [cs.LG] UPDATED)
    Stochastic gradient descent (SGD) and adaptive gradient methods, such as Adam and RMSProp, have been widely used in training deep neural networks. We empirically show that while the difference between the standard generalization performance of models trained using these methods is small, those trained using SGD exhibit far greater robustness under input perturbations. Notably, our investigation demonstrates the presence of irrelevant frequencies in natural datasets, where alterations do not affect models' generalization performance. However, models trained with adaptive methods show sensitivity to these changes, suggesting that their use of irrelevant frequencies can lead to solutions sensitive to perturbations. To better understand this difference, we study the learning dynamics of gradient descent (GD) and sign gradient descent (signGD) on a synthetic dataset that mirrors natural signals. With a three-dimensional input space, the models optimized with GD and signGD have standard risks close to zero but vary in their adversarial risks. Our result shows that linear models' robustness to $\ell_2$-norm bounded changes is inversely proportional to the model parameters' weight norm: a smaller weight norm implies better robustness. In the context of deep learning, our experiments show that SGD-trained neural networks have smaller Lipschitz constants, explaining the better robustness to input perturbations than those trained with adaptive gradient methods.  ( 3 min )
    GC-MVSNet: Multi-View, Multi-Scale, Geometrically-Consistent Multi-View Stereo. (arXiv:2310.19583v2 [cs.CV] UPDATED)
    Traditional multi-view stereo (MVS) methods rely heavily on photometric and geometric consistency constraints, but newer machine learning-based MVS methods check geometric consistency across multiple source views only as a post-processing step. In this paper, we present a novel approach that explicitly encourages geometric consistency of reference view depth maps across multiple source views at different scales during learning (see Fig. 1). We find that adding this geometric consistency loss significantly accelerates learning by explicitly penalizing geometrically inconsistent pixels, reducing the training iteration requirements to nearly half that of other MVS methods. Our extensive experiments show that our approach achieves a new state-of-the-art on the DTU and BlendedMVS datasets, and competitive results on the Tanks and Temples benchmark. To the best of our knowledge, GC-MVSNet is the first attempt to enforce multi-view, multi-scale geometric consistency during learning.  ( 2 min )
    A Primer on Deep Learning for Causal Inference. (arXiv:2110.04442v2 [cs.LG] UPDATED)
    This review systematizes the emerging literature for causal inference using deep neural networks under the potential outcomes framework. It provides an intuitive introduction on how deep learning can be used to estimate/predict heterogeneous treatment effects and extend causal inference to settings where confounding is non-linear, time varying, or encoded in text, networks, and images. To maximize accessibility, we also introduce prerequisite concepts from causal inference and deep learning. The survey differs from other treatments of deep learning and causal inference in its sharp focus on observational causal estimation, its extended exposition of key algorithms, and its detailed tutorials for implementing, training, and selecting among deep estimators in Tensorflow 2 available at github.com/kochbj/Deep-Learning-for-Causal-Inference.  ( 2 min )
    Learning to Estimate Without Bias. (arXiv:2110.12403v3 [cs.LG] UPDATED)
    The Gauss Markov theorem states that the weighted least squares estimator is a linear minimum variance unbiased estimation (MVUE) in linear models. In this paper, we take a first step towards extending this result to non linear settings via deep learning with bias constraints. The classical approach to designing non-linear MVUEs is through maximum likelihood estimation (MLE) which often involves computationally challenging optimizations. On the other hand, deep learning methods allow for non-linear estimators with fixed computational complexity. Learning based estimators perform optimally on average with respect to their training set but may suffer from significant bias in other parameters. To avoid this, we propose to add a simple bias constraint to the loss function, resulting in an estimator we refer to as Bias Constrained Estimator (BCE). We prove that this yields asymptotic MVUEs that behave similarly to the classical MLEs and asymptotically attain the Cramer Rao bound. We demonstrate the advantages of our approach in the context of signal to noise ratio estimation as well as covariance estimation. A second motivation to BCE is in applications where multiple estimates of the same unknown are averaged for improved performance. Examples include distributed sensor networks and data augmentation in test-time. In such applications, we show that BCE leads to asymptotically consistent estimators.
    Towards Efficient Hyperdimensional Computing Using Photonics. (arXiv:2311.17801v1 [cs.ET])
    Over the past few years, silicon photonics-based computing has emerged as a promising alternative to CMOS-based computing for Deep Neural Networks (DNN). Unfortunately, the non-linear operations and the high-precision requirements of DNNs make it extremely challenging to design efficient silicon photonics-based systems for DNN inference and training. Hyperdimensional Computing (HDC) is an emerging, brain-inspired machine learning technique that enjoys several advantages over existing DNNs, including being lightweight, requiring low-precision operands, and being robust to noise introduced by the nonidealities in the hardware. For HDC, computing in-memory (CiM) approaches have been widely used, as CiM reduces the data transfer cost if the operands can fit into the memory. However, inefficient multi-bit operations, high write latency, and low endurance make CiM ill-suited for HDC. On the other hand, the existing electro-photonic DNN accelerators are inefficient for HDC because they are specifically optimized for matrix multiplication in DNNs and consume a lot of power with high-precision data converters. In this paper, we argue that photonic computing and HDC complement each other better than photonic computing and DNNs, or CiM and HDC. We propose PhotoHDC, the first-ever electro-photonic accelerator for HDC training and inference, supporting the basic, record-based, and graph encoding schemes. Evaluating with popular datasets, we show that our accelerator can achieve two to five orders of magnitude lower EDP than the state-of-the-art electro-photonic DNN accelerators for implementing HDC training and inference. PhotoHDC also achieves four orders of magnitude lower energy-delay product than CiM-based accelerators for both HDC training and inference.
    Rigorous dynamical mean field theory for stochastic gradient descent methods. (arXiv:2210.06591v3 [math-ph] UPDATED)
    We prove closed-form equations for the exact high-dimensional asymptotics of a family of first order gradient-based methods, learning an estimator (e.g. M-estimator, shallow neural network, ...) from observations on Gaussian data with empirical risk minimization. This includes widely used algorithms such as stochastic gradient descent (SGD) or Nesterov acceleration. The obtained equations match those resulting from the discretization of dynamical mean-field theory (DMFT) equations from statistical physics when applied to gradient flow. Our proof method allows us to give an explicit description of how memory kernels build up in the effective dynamics, and to include non-separable update functions, allowing datasets with non-identity covariance matrices. Finally, we provide numerical implementations of the equations for SGD with generic extensive batch-size and with constant learning rates.
    Navigating Cultural Chasms: Exploring and Unlocking the Cultural POV of Text-To-Image Models. (arXiv:2310.01929v2 [cs.CL] UPDATED)
    Text-To-Image (TTI) models, such as DALL-E and StableDiffusion, have demonstrated remarkable prompt-based image generation capabilities. Multilingual encoders may have a substantial impact on the cultural agency of these models, as language is a conduit of culture. In this study, we explore the cultural perception embedded in TTI models by characterizing culture across three hierarchical tiers: cultural dimensions, cultural domains, and cultural concepts. Based on this ontology, we derive prompt templates to unlock the cultural knowledge in TTI models, and propose a comprehensive suite of evaluation techniques, including intrinsic evaluations using the CLIP space, extrinsic evaluations with a Visual-Question-Answer (VQA) model and human assessments, to evaluate the cultural content of TTI-generated images. To bolster our research, we introduce the CulText2I dataset, derived from four diverse TTI models and spanning ten languages. Our experiments provide insights regarding Do, What, Which and How research questions about the nature of cultural encoding in TTI models, paving the way for cross-cultural applications of these models.  ( 2 min )
    Fair Text-to-Image Diffusion via Fair Mapping. (arXiv:2311.17695v1 [cs.CV])
    In this paper, we address the limitations of existing text-to-image diffusion models in generating demographically fair results when given human-related descriptions. These models often struggle to disentangle the target language context from sociocultural biases, resulting in biased image generation. To overcome this challenge, we propose Fair Mapping, a general, model-agnostic, and lightweight approach that modifies a pre-trained text-to-image model by controlling the prompt to achieve fair image generation. One key advantage of our approach is its high efficiency. The training process only requires updating a small number of parameters in an additional linear mapping network. This not only reduces the computational cost but also accelerates the optimization process. We first demonstrate the issue of bias in generated results caused by language biases in text-guided diffusion models. By developing a mapping network that projects language embeddings into an unbiased space, we enable the generation of relatively balanced demographic results based on a keyword specified in the prompt. With comprehensive experiments on face image generation, we show that our method significantly improves image generation performance when prompted with descriptions related to human faces. By effectively addressing the issue of bias, we produce more fair and diverse image outputs. This work contributes to the field of text-to-image generation by enhancing the ability to generate images that accurately reflect the intended demographic characteristics specified in the text.
    Model Performance Prediction for Hyperparameter Optimization of Deep Learning Models Using High Performance Computing and Quantum Annealing. (arXiv:2311.17508v1 [cs.LG])
    Hyperparameter Optimization (HPO) of Deep Learning-based models tends to be a compute resource intensive process as it usually requires to train the target model with many different hyperparameter configurations. We show that integrating model performance prediction with early stopping methods holds great potential to speed up the HPO process of deep learning models. Moreover, we propose a novel algorithm called Swift-Hyperband that can use either classical or quantum support vector regression for performance prediction and benefit from distributed High Performance Computing environments. This algorithm is tested not only for the Machine-Learned Particle Flow model used in High Energy Physics, but also for a wider range of target models from domains such as computer vision and natural language processing. Swift-Hyperband is shown to find comparable (or better) hyperparameters as well as using less computational resources in all test cases.  ( 2 min )
    Hybrid Search for Efficient Planning with Completeness Guarantees. (arXiv:2310.12819v2 [cs.AI] UPDATED)
    Solving complex planning problems has been a long-standing challenge in computer science. Learning-based subgoal search methods have shown promise in tackling these problems, but they often suffer from a lack of completeness guarantees, meaning that they may fail to find a solution even if one exists. In this paper, we propose an efficient approach to augment a subgoal search method to achieve completeness in discrete action spaces. Specifically, we augment the high-level search with low-level actions to execute a multi-level (hybrid) search, which we call complete subgoal search. This solution achieves the best of both worlds: the practical efficiency of high-level search and the completeness of low-level search. We apply the proposed search method to a recently proposed subgoal search algorithm and evaluate the algorithm trained on offline data on complex planning problems. We demonstrate that our complete subgoal search not only guarantees completeness but can even improve performance in terms of search expansions for instances that the high-level could solve without low-level augmentations. Our approach makes it possible to apply subgoal-level planning for systems where completeness is a critical requirement.  ( 2 min )
    SVDinsTN: A Tensor Network Paradigm for Efficient Structure Search from Regularized Modeling Perspective. (arXiv:2305.14912v3 [cs.LG] UPDATED)
    Tensor network (TN) representation is a powerful technique for computer vision and machine learning. TN structure search (TN-SS) aims to search for a customized structure to achieve a compact representation, which is a challenging NP-hard problem. Recent "sampling-evaluation-based" methods require sampling an extensive collection of structures and evaluating them one by one, resulting in prohibitively high computational costs. To address this issue, we propose a novel TN paradigm, named SVD-inspired TN decomposition (SVDinsTN), which allows us to efficiently solve the TN-SS problem from a regularized modeling perspective, eliminating the repeated structure evaluations. To be specific, by inserting a diagonal factor for each edge of the fully-connected TN, SVDinsTN allows us to calculate TN cores and diagonal factors simultaneously, with the factor sparsity revealing a compact TN structure. In theory, we prove a convergence guarantee for the proposed method. Experimental results demonstrate that the proposed method achieves approximately 100 to 1000 times acceleration compared to the state-of-the-art TN-SS methods while maintaining a comparable representation ability.  ( 2 min )
    Discovering Predictable Latent Factors for Time Series Forecasting. (arXiv:2303.10426v2 [cs.LG] UPDATED)
    Modern time series forecasting methods, such as Transformer and its variants, have shown strong ability in sequential data modeling. To achieve high performance, they usually rely on redundant or unexplainable structures to model complex relations between variables and tune the parameters with large-scale data. Many real-world data mining tasks, however, lack sufficient variables for relation reasoning, and therefore these methods may not properly handle such forecasting problems. With insufficient data, time series appear to be affected by many exogenous variables, and thus, the modeling becomes unstable and unpredictable. To tackle this critical issue, in this paper, we develop a novel algorithmic framework for inferring the intrinsic latent factors implied by the observable time series. The inferred factors are used to form multiple independent and predictable signal components that enable not only sparse relation reasoning for long-term efficiency but also reconstructing the future temporal data for accurate prediction. To achieve this, we introduce three characteristics, i.e., predictability, sufficiency, and identifiability, and model these characteristics via the powerful deep latent dynamics models to infer the predictable signal components. Empirical results on multiple real datasets show the efficiency of our method for different kinds of time series forecasting. The statistical analysis validates the predictability of the learned latent factors.  ( 2 min )
    SatCLIP: Global, General-Purpose Location Embeddings with Satellite Imagery. (arXiv:2311.17179v1 [cs.CV])
    Geographic location is essential for modeling tasks in fields ranging from ecology to epidemiology to the Earth system sciences. However, extracting relevant and meaningful characteristics of a location can be challenging, often entailing expensive data fusion or data distillation from global imagery datasets. To address this challenge, we introduce Satellite Contrastive Location-Image Pretraining (SatCLIP), a global, general-purpose geographic location encoder that learns an implicit representation of locations from openly available satellite imagery. Trained location encoders provide vector embeddings summarizing the characteristics of any given location for convenient usage in diverse downstream tasks. We show that SatCLIP embeddings, pretrained on globally sampled multi-spectral Sentinel-2 satellite data, can be used in various predictive tasks that depend on location information but not necessarily satellite imagery, including temperature prediction, animal recognition in imagery, and population density estimation. Across tasks, SatCLIP embeddings consistently outperform embeddings from existing pretrained location encoders, ranging from models trained on natural images to models trained on semantic context. SatCLIP embeddings also help to improve geographic generalization. This demonstrates the potential of general-purpose location encoders and opens the door to learning meaningful representations of our planet from the vast, varied, and largely untapped modalities of geospatial data.
    An Attribution Method for Siamese Encoders. (arXiv:2310.05703v3 [cs.CL] UPDATED)
    Despite the success of Siamese encoder models such as sentence transformers (ST), little is known about the aspects of inputs they pay attention to. A barrier is that their predictions cannot be attributed to individual features, as they compare two inputs rather than processing a single one. This paper derives a local attribution method for Siamese encoders by generalizing the principle of integrated gradients to models with multiple inputs. The solution takes the form of feature-pair attributions, and can be reduced to a token-token matrix for STs. Our method involves the introduction of integrated Jacobians and inherits the advantageous formal properties of integrated gradients: it accounts for the model's full computation graph and is guaranteed to converge to the actual prediction. A pilot study shows that in an ST few token-pairs can often explain large fractions of predictions, and it focuses on nouns and verbs. For accurate predictions, it however needs to attend to the majority of tokens and parts of speech.
    Single-Cell Clustering via Dual-Graph Alignment. (arXiv:2311.17104v1 [cs.LG])
    In recent years, the field of single-cell RNA sequencing has seen a surge in the development of clustering methods. These methods enable the identification of cell subpopulations, thereby facilitating the understanding of tumor microenvironments. Despite their utility, most existing clustering algorithms primarily focus on the attribute information provided by the cell matrix or the network structure between cells, often neglecting the network between genes. This oversight could lead to loss of information and clustering results that lack clinical significance. To address this limitation, we develop an advanced single-cell clustering model incorporating dual-graph alignment, which integrates gene network information into the clustering process based on self-supervised and unsupervised optimization. Specifically, we designed a graph-based autoencoder enhanced by an attention mechanism to effectively capture relationships between cells. Moreover, we performed the node2vec method on Protein-Protein Interaction (PPI) networks to derive the gene network structure and maintained this structure throughout the clustering process. Our proposed method has been demonstrated to be effective through experimental results, showcasing its ability to optimize clustering outcomes while preserving the original associations between cells and genes. This research contributes to obtaining accurate cell subpopulations and generates clustering results that more closely resemble real-world biological scenarios. It provides better insights into the characteristics and distribution of diseased cells, ultimately building a foundation for early disease diagnosis and treatment.
    Direction-oriented Multi-objective Learning: Simple and Provable Stochastic Algorithms. (arXiv:2305.18409v3 [cs.LG] UPDATED)
    Multi-objective optimization (MOO) has become an influential framework in many machine learning problems with multiple objectives such as learning with multiple criteria and multi-task learning (MTL). In this paper, we propose a new direction-oriented multi-objective problem by regularizing the common descent direction within a neighborhood of a direction that optimizes a linear combination of objectives such as the average loss in MTL. This formulation includes GD and MGDA as special cases, enjoys the direction-oriented benefit as in CAGrad, and facilitates the design of stochastic algorithms. To solve this problem, we propose Stochastic Direction-oriented Multi-objective Gradient descent (SDMGrad) with simple SGD type of updates, and its variant SDMGrad-OS with an efficient objective sampling in the setting where the number of objectives is large. For a constant-level regularization parameter $\lambda$, we show that SDMGrad and SDMGrad-OS provably converge to a Pareto stationary point with improved complexities and milder assumptions. For an increasing $\lambda$, this convergent point reduces to a stationary point of the linear combination of objectives. We demonstrate the superior performance of the proposed methods in a series of tasks on multi-task supervised learning and reinforcement learning. Code is provided at https://github.com/ml-opt-lab/sdmgrad.
    Modern Bayesian Experimental Design. (arXiv:2302.14545v2 [stat.ML] UPDATED)
    Bayesian experimental design (BED) provides a powerful and general framework for optimizing the design of experiments. However, its deployment often poses substantial computational challenges that can undermine its practical use. In this review, we outline how recent advances have transformed our ability to overcome these challenges and thus utilize BED effectively, before discussing some key areas for future development in the field.
    A personalized Uncertainty Quantification framework for patient survival models: estimating individual uncertainty of patients with metastatic brain tumors in the absence of ground truth. (arXiv:2311.17173v1 [cs.LG])
    TodevelopanovelUncertaintyQuantification (UQ) framework to estimate the uncertainty of patient survival models in the absence of ground truth, we developed and evaluated our approach based on a dataset of 1383 patients treated with stereotactic radiosurgery (SRS) for brain metastases between January 2015 and December 2020. Our motivating hypothesis is that a time-to-event prediction of a test patient on inference is more certain given a higher feature-space-similarity to patients in the training set. Therefore, the uncertainty for a particular patient-of-interest is represented by the concordance index between a patient similarity rank and a prediction similarity rank. Model uncertainty was defined as the increased percentage of the max uncertainty-constrained-AUC compared to the model AUC. We evaluated our method on multiple clinically-relevant endpoints, including time to intracranial progression (ICP), progression-free survival (PFS) after SRS, overall survival (OS), and time to ICP and/or death (ICPD), on a variety of both statistical and non-statistical models, including CoxPH, conditional survival forest (CSF), and neural multi-task linear regression (NMTLR). Our results show that all models had the lowest uncertainty on ICP (2.21%) and the highest uncertainty (17.28%) on ICPD. OS models demonstrated high variation in uncertainty performance, where NMTLR had the lowest uncertainty(1.96%)and CSF had the highest uncertainty (14.29%). In conclusion, our method can estimate the uncertainty of individual patient survival modeling results. As expected, our data empirically demonstrate that as model uncertainty measured via our technique increases, the similarity between a feature-space and its predicted outcome decreases.
    Compressing the Backward Pass of Large-Scale Neural Architectures by Structured Activation Pruning. (arXiv:2311.16883v2 [cs.LG] UPDATED)
    The rise of Deep Neural Networks (DNNs) has led to an increase in model size and complexity, straining the memory capacity of GPUs. Sparsity in DNNs, characterized as structural or ephemeral, has gained attention as a solution. This work focuses on ephemeral sparsity, aiming to reduce memory consumption during training. It emphasizes the significance of activations, an often overlooked component, and their role in memory usage. This work employs structured pruning in Block Sparse Compressed Row (BSR) format in combination with a magnitude-based criterion to efficiently prune activations. We furthermore introduce efficient block-sparse operators for GPUs and showcase their effectiveness, as well as the superior compression offered by block sparsity. We report the effectiveness of activation pruning by evaluating training speed, accuracy, and memory usage of large-scale neural architectures on the example of ResMLP on image classification tasks. As a result, we observe a memory reduction of up to 32% while maintaining accuracy. Ultimately, our approach aims to democratize large-scale model training, reduce GPU requirements, and address ecological concerns.
    Marginal Laplacian Score. (arXiv:2311.17795v1 [cs.LG])
    High-dimensional imbalanced data poses a machine learning challenge. In the absence of sufficient or high-quality labels, unsupervised feature selection methods are crucial for the success of subsequent algorithms. Therefore, there is a growing need for unsupervised feature selection algorithms focused on imbalanced data. Thus, we propose a Marginal Laplacian Score (MLS) a modification of the well-known Laplacian Score (LS) to be better suited for imbalance data. We introduce an assumption that the minority class or anomalous appear more frequently in the margin of the features. Consequently, MLS aims to preserve the local structure of the data set's margin. As MLS is better suited for handling imbalanced data, we propose its integration into modern feature selection methods that utilize the Laplacian score. We integrate the MLS algorithm into the Differentiable Unsupervised Feature Selection (DUFS), resulting in DUFS-MLS. The proposed methods demonstrate robust and improved performance on synthetic and public data sets.
    Zero-Shot Self-Supervised Learning for MRI Reconstruction. (arXiv:2102.07737v4 [eess.IV] UPDATED)
    Deep learning (DL) has emerged as a powerful tool for accelerated MRI reconstruction, but often necessitates a database of fully-sampled measurements for training. Recent self-supervised and unsupervised learning approaches enable training without fully-sampled data. However, a database of undersampled measurements may not be available in many scenarios, especially for scans involving contrast or translational acquisitions in development. Moreover, recent studies show that database-trained models may not generalize well when the unseen measurements differ in terms of sampling pattern, acceleration rate, SNR, image contrast, and anatomy. Such challenges necessitate a new methodology to enable subject-specific DL MRI reconstruction without external training datasets, since it is clinically imperative to provide high-quality reconstructions that can be used to identify lesions/disease for \emph{every individual}. In this work, we propose a zero-shot self-supervised learning approach to perform subject-specific accelerated DL MRI reconstruction to tackle these issues. The proposed approach partitions the available measurements from a single scan into three disjoint sets. Two of these sets are used to enforce data consistency and define loss during training for self-supervision, while the last set serves to self-validate, establishing an early stopping criterion. In the presence of models pre-trained on a database with different image characteristics, we show that the proposed approach can be combined with transfer learning for faster convergence time and reduced computational complexity. The code is available at \url{https://github.com/byaman14/ZS-SSL}.
    A Counterfactual Safety Margin Perspective on the Scoring of Autonomous Vehicles' Riskiness. (arXiv:2308.01050v4 [cs.RO] UPDATED)
    Autonomous Vehicles (AVs) promise a range of societal advantages, including broader access to mobility, reduced road accidents, and enhanced transportation efficiency. However, evaluating the risks linked to AVs is complex due to limited historical data and the swift progression of technology. This paper presents a data-driven framework for assessing the risk of different AVs' behaviors in various operational design domains (ODDs), based on counterfactual simulations of "misbehaving" road users. We propose the notion of counterfactual safety margin, which represents the minimum deviation from nominal behavior that could cause a collision. This methodology not only pinpoints the most critical scenarios but also quantifies the (relative) risk's frequency and severity concerning AVs. Importantly, we show that our approach is applicable even when the AV's behavioral policy remains undisclosed, through worst- and best-case analyses, benefiting external entities like regulators and risk evaluators. Our experimental outcomes demonstrate the correlation between the safety margin, the quality of the driving policy, and the ODD, shedding light on the relative risks of different AV providers. Overall, this work contributes to the safety assessment of AVs and addresses legislative and insurance concerns surrounding this burgeoning technology.
    A quasi-polynomial time algorithm for Multi-Dimensional Scaling via LP hierarchies. (arXiv:2311.17840v1 [cs.DS])
    Multi-dimensional Scaling (MDS) is a family of methods for embedding pair-wise dissimilarities between $n$ objects into low-dimensional space. MDS is widely used as a data visualization tool in the social and biological sciences, statistics, and machine learning. We study the Kamada-Kawai formulation of MDS: given a set of non-negative dissimilarities $\{d_{i,j}\}_{i , j \in [n]}$ over $n$ points, the goal is to find an embedding $\{x_1,\dots,x_n\} \subset \mathbb{R}^k$ that minimizes \[ \text{OPT} = \min_{x} \mathbb{E}_{i,j \in [n]} \left[ \left(1-\frac{\|x_i - x_j\|}{d_{i,j}}\right)^2 \right] \] Despite its popularity, our theoretical understanding of MDS is extremely limited. Recently, Demaine, Hesterberg, Koehler, Lynch, and Urschel (arXiv:2109.11505) gave the first approximation algorithm with provable guarantees for Kamada-Kawai, which achieves an embedding with cost $\text{OPT} +\epsilon$ in $n^2 \cdot 2^{\tilde{\mathcal{O}}(k \Delta^4 / \epsilon^2)}$ time, where $\Delta$ is the aspect ratio of the input dissimilarities. In this work, we give the first approximation algorithm for MDS with quasi-polynomial dependency on $\Delta$: for target dimension $k$, we achieve a solution with cost $\mathcal{O}(\text{OPT}^{ \hspace{0.04in}1/k } \cdot \log(\Delta/\epsilon) )+ \epsilon$ in time $n^{ \mathcal{O}(1)} \cdot 2^{\tilde{\mathcal{O}}( k^2 (\log(\Delta)/\epsilon)^{k/2 + 1} ) }$. Our approach is based on a novel analysis of a conditioning-based rounding scheme for the Sherali-Adams LP Hierarchy. Crucially, our analysis exploits the geometry of low-dimensional Euclidean space, allowing us to avoid an exponential dependence on the aspect ratio $\Delta$. We believe our geometry-aware treatment of the Sherali-Adams Hierarchy is an important step towards developing general-purpose techniques for efficient metric optimization algorithms.
    LegendreTron: Uprising Proper Multiclass Loss Learning. (arXiv:2301.11695v3 [stat.ML] UPDATED)
    Loss functions serve as the foundation of supervised learning and are often chosen prior to model development. To avoid potentially ad hoc choices of losses, statistical decision theory describes a desirable property for losses known as \emph{properness}, which asserts that Bayes' rule is optimal. Recent works have sought to \emph{learn losses} and models jointly. Existing methods do this by fitting an inverse canonical link function which monotonically maps $\mathbb{R}$ to $[0,1]$ to estimate probabilities for binary problems. In this paper, we extend monotonicity to maps between $\mathbb{R}^{C-1}$ and the projected probability simplex $\tilde{\Delta}^{C-1}$ by using monotonicity of gradients of convex functions. We present {\sc LegendreTron} as a novel and practical method that jointly learns \emph{proper canonical losses} and probabilities for multiclass problems. Tested on a benchmark of domains with up to 1,000 classes, our experimental results show that our method consistently outperforms the natural multiclass baseline under a $t$-test at 99% significance on all datasets with greater than 10 classes.
    Diffusion-EDFs: Bi-equivariant Denoising Generative Modeling on SE(3) for Visual Robotic Manipulation. (arXiv:2309.02685v3 [cs.RO] UPDATED)
    Diffusion generative modeling has become a promising approach for learning robotic manipulation tasks from stochastic human demonstrations. In this paper, we present Diffusion-EDFs, a novel SE(3)-equivariant diffusion-based approach for visual robotic manipulation tasks. We show that our proposed method achieves remarkable data efficiency, requiring only 5 to 10 human demonstrations for effective end-to-end training in less than an hour. Furthermore, our benchmark experiments demonstrate that our approach has superior generalizability and robustness compared to state-of-the-art methods. Lastly, we validate our methods with real hardware experiments. Project Website: https://sites.google.com/view/diffusion-edfs/home
    Addressing Membership Inference Attack in Federated Learning with Model Compression. (arXiv:2311.17750v1 [cs.LG])
    Federated Learning (FL) has been proposed as a privacy-preserving solution for machine learning. However, recent works have shown that Federated Learning can leak private client data through membership attacks. In this paper, we show that the effectiveness of these attacks on the clients negatively correlates with the size of the client datasets and model complexity. Based on this finding, we propose model-agnostic Federated Learning as a privacy-enhancing solution because it enables the use of models of varying complexity in the clients. To this end, we present $\texttt{MaPP-FL}$, a novel privacy-aware FL approach that leverages model compression on the clients while keeping a full model on the server. We compare the performance of $\texttt{MaPP-FL}$ against state-of-the-art model-agnostic FL methods on the CIFAR-10, CIFAR-100, and FEMNIST vision datasets. Our experiments show the effectiveness of $\texttt{MaPP-FL}$ in preserving the clients' and the server's privacy while achieving competitive classification accuracies.
    Quantifying the redundancy between prosody and text. (arXiv:2311.17233v1 [cs.CL])
    Prosody -- the suprasegmental component of speech, including pitch, loudness, and tempo -- carries critical aspects of meaning. However, the relationship between the information conveyed by prosody vs. by the words themselves remains poorly understood. We use large language models (LLMs) to estimate how much information is redundant between prosody and the words themselves. Using a large spoken corpus of English audiobooks, we extract prosodic features aligned to individual words and test how well they can be predicted from LLM embeddings, compared to non-contextual word embeddings. We find a high degree of redundancy between the information carried by the words and prosodic information across several prosodic features, including intensity, duration, pauses, and pitch contours. Furthermore, a word's prosodic information is redundant with both the word itself and the context preceding as well as following it. Still, we observe that prosodic features can not be fully predicted from text, suggesting that prosody carries information above and beyond the words. Along with this paper, we release a general-purpose data processing pipeline for quantifying the relationship between linguistic information and extra-linguistic features.
    SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks. (arXiv:2310.03684v3 [cs.LG] UPDATED)
    Despite efforts to align large language models (LLMs) with human values, widely-used LLMs such as GPT, Llama, Claude, and PaLM are susceptible to jailbreaking attacks, wherein an adversary fools a targeted LLM into generating objectionable content. To address this vulnerability, we propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks on LLMs. Based on our finding that adversarially-generated prompts are brittle to character-level changes, our defense first randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversarial inputs. SmoothLLM reduces the attack success rate on numerous popular LLMs to below one percentage point, avoids unnecessary conservatism, and admits provable guarantees on attack mitigation. Moreover, our defense uses exponentially fewer queries than existing attacks and is compatible with any LLM. Our code is publicly available at the following link: https://github.com/arobey1/smooth-llm.
    Explainability for Large Language Models: A Survey. (arXiv:2309.01029v3 [cs.CL] UPDATED)
    Large language models (LLMs) have demonstrated impressive capabilities in natural language processing. However, their internal mechanisms are still unclear and this lack of transparency poses unwanted risks for downstream applications. Therefore, understanding and explaining these models is crucial for elucidating their behaviors, limitations, and social impacts. In this paper, we introduce a taxonomy of explainability techniques and provide a structured overview of methods for explaining Transformer-based language models. We categorize techniques based on the training paradigms of LLMs: traditional fine-tuning-based paradigm and prompting-based paradigm. For each paradigm, we summarize the goals and dominant approaches for generating local explanations of individual predictions and global explanations of overall model knowledge. We also discuss metrics for evaluating generated explanations, and discuss how explanations can be leveraged to debug models and improve performance. Lastly, we examine key challenges and emerging opportunities for explanation techniques in the era of LLMs in comparison to conventional machine learning models.
    Reduced-order modeling for parameterized PDEs via implicit neural representations. (arXiv:2311.16410v1 [math.NA] CROSS LISTED)
    We present a new data-driven reduced-order modeling approach to efficiently solve parametrized partial differential equations (PDEs) for many-query problems. This work is inspired by the concept of implicit neural representation (INR), which models physics signals in a continuous manner and independent of spatial/temporal discretization. The proposed framework encodes PDE and utilizes a parametrized neural ODE (PNODE) to learn latent dynamics characterized by multiple PDE parameters. PNODE can be inferred by a hypernetwork to reduce the potential difficulties in learning PNODE due to a complex multilayer perceptron (MLP). The framework uses an INR to decode the latent dynamics and reconstruct accurate PDE solutions. Further, a physics-informed loss is also introduced to correct the prediction of unseen parameter instances. Incorporating the physics-informed loss also enables the model to be fine-tuned in an unsupervised manner on unseen PDE parameters. A numerical experiment is performed on a two-dimensional Burgers equation with a large variation of PDE parameters. We evaluate the proposed method at a large Reynolds number and obtain up to speedup of O(10^3) and ~1% relative error to the ground truth values.
    On the Adversarial Robustness of Graph Contrastive Learning Methods. (arXiv:2311.17853v1 [cs.LG])
    Contrastive learning (CL) has emerged as a powerful framework for learning representations of images and text in a self-supervised manner while enhancing model robustness against adversarial attacks. More recently, researchers have extended the principles of contrastive learning to graph-structured data, giving birth to the field of graph contrastive learning (GCL). However, whether GCL methods can deliver the same advantages in adversarial robustness as their counterparts in the image and text domains remains an open question. In this paper, we introduce a comprehensive robustness evaluation protocol tailored to assess the robustness of GCL models. We subject these models to adaptive adversarial attacks targeting the graph structure, specifically in the evasion scenario. We evaluate node and graph classification tasks using diverse real-world datasets and attack strategies. With our work, we aim to offer insights into the robustness of GCL methods and hope to open avenues for potential future research directions.
    GROOT: Learning to Follow Instructions by Watching Gameplay Videos. (arXiv:2310.08235v2 [cs.AI] UPDATED)
    We study the problem of building a controller that can follow open-ended instructions in open-world environments. We propose to follow reference videos as instructions, which offer expressive goal specifications while eliminating the need for expensive text-gameplay annotations. A new learning framework is derived to allow learning such instruction-following controllers from gameplay videos while producing a video instruction encoder that induces a structured goal space. We implement our agent GROOT in a simple yet effective encoder-decoder architecture based on causal transformers. We evaluate GROOT against open-world counterparts and human players on a proposed Minecraft SkillForge benchmark. The Elo ratings clearly show that GROOT is closing the human-machine gap as well as exhibiting a 70% winning rate over the best generalist agent baseline. Qualitative analysis of the induced goal space further demonstrates some interesting emergent properties, including the goal composition and complex gameplay behavior synthesis. The project page is available at https://craftjarvis-groot.github.io.
    GNNFlow: A Distributed Framework for Continuous Temporal GNN Learning on Dynamic Graphs. (arXiv:2311.17410v1 [cs.DC])
    Graph Neural Networks (GNNs) play a crucial role in various fields. However, most existing deep graph learning frameworks assume pre-stored static graphs and do not support training on graph streams. In contrast, many real-world graphs are dynamic and contain time domain information. We introduce GNNFlow, a distributed framework that enables efficient continuous temporal graph representation learning on dynamic graphs on multi-GPU machines. GNNFlow introduces an adaptive time-indexed block-based data structure that effectively balances memory usage with graph update and sampling operation efficiency. It features a hybrid GPU-CPU graph data placement for rapid GPU-based temporal neighborhood sampling and kernel optimizations for enhanced sampling processes. A dynamic GPU cache for node and edge features is developed to maximize cache hit rates through reuse and restoration strategies. GNNFlow supports distributed training across multiple machines with static scheduling to ensure load balance. We implement GNNFlow based on DGL and PyTorch. Our experimental results show that GNNFlow provides up to 21.1x faster continuous learning than existing systems.
    Q-learning Based Optimal False Data Injection Attack on Probabilistic Boolean Control Networks. (arXiv:2311.17631v1 [eess.SY])
    In this paper, we present a reinforcement learning (RL) method for solving optimal false data injection attack problems in probabilistic Boolean control networks (PBCNs) where the attacker lacks knowledge of the system model. Specifically, we employ a Q-learning (QL) algorithm to address this problem. We then propose an improved QL algorithm that not only enhances learning efficiency but also obtains optimal attack strategies for large-scale PBCNs that the standard QL algorithm cannot handle. Finally, we verify the effectiveness of our proposed approach by considering two attacked PBCNs, including a 10-node network and a 28-node network.
    Federated Fine-Tuning of Foundation Models via Probabilistic Masking. (arXiv:2311.17299v1 [cs.LG])
    Foundation Models (FMs) have revolutionized machine learning with their adaptability and high performance across tasks; yet, their integration into Federated Learning (FL) is challenging due to substantial communication overhead from their extensive parameterization. Current communication-efficient FL strategies, such as gradient compression, reduce bitrates to around $1$ bit-per-parameter (bpp). However, these approaches fail to harness the characteristics of FMs, with their large number of parameters still posing a challenge to communication efficiency, even at these bitrate regimes. In this work, we present DeltaMask, a novel method that efficiently fine-tunes FMs in FL at an ultra-low bitrate, well below 1 bpp. DeltaMask employs stochastic masking to detect highly effective subnetworks within FMs and leverage stochasticity and sparsity in client masks to compress updates into a compact grayscale image using probabilistic filters, deviating from traditional weight training approaches. Our comprehensive evaluations across various datasets and architectures demonstrate DeltaMask efficiently achieves bitrates as low as 0.09 bpp, enhancing communication efficiency while maintaining FMs performance, as measured on 8 datasets and 5 pre-trained models of various network architectures.
    Slot-Mixup with Subsampling: A Simple Regularization for WSI Classification. (arXiv:2311.17466v1 [cs.CV])
    Whole slide image (WSI) classification requires repetitive zoom-in and out for pathologists, as only small portions of the slide may be relevant to detecting cancer. Due to the lack of patch-level labels, multiple instance learning (MIL) is a common practice for training a WSI classifier. One of the challenges in MIL for WSIs is the weak supervision coming only from the slide-level labels, often resulting in severe overfitting. In response, researchers have considered adopting patch-level augmentation or applying mixup augmentation, but their applicability remains unverified. Our approach augments the training dataset by sampling a subset of patches in the WSI without significantly altering the underlying semantics of the original slides. Additionally, we introduce an efficient model (Slot-MIL) that organizes patches into a fixed number of slots, the abstract representation of patches, using an attention mechanism. We empirically demonstrate that the subsampling augmentation helps to make more informative slots by restricting the over-concentration of attention and to improve interpretability. Finally, we illustrate that combining our attention-based aggregation model with subsampling and mixup, which has shown limited compatibility in existing MIL methods, can enhance both generalization and calibration. Our proposed methods achieve the state-of-the-art performance across various benchmark datasets including class imbalance and distribution shifts.
    A Comprehensive Survey on Distributed Training of Graph Neural Networks. (arXiv:2211.05368v3 [cs.DC] UPDATED)
    Graph neural networks (GNNs) have been demonstrated to be a powerful algorithmic model in broad application fields for their effectiveness in learning over graphs. To scale GNN training up for large-scale and ever-growing graphs, the most promising solution is distributed training which distributes the workload of training across multiple computing nodes. At present, the volume of related research on distributed GNN training is exceptionally vast, accompanied by an extraordinarily rapid pace of publication. Moreover, the approaches reported in these studies exhibit significant divergence. This situation poses a considerable challenge for newcomers, hindering their ability to grasp a comprehensive understanding of the workflows, computational patterns, communication strategies, and optimization techniques employed in distributed GNN training. As a result, there is a pressing need for a survey to provide correct recognition, analysis, and comparisons in this field. In this paper, we provide a comprehensive survey of distributed GNN training by investigating various optimization techniques used in distributed GNN training. First, distributed GNN training is classified into several categories according to their workflows. In addition, their computational patterns and communication patterns, as well as the optimization techniques proposed by recent work are introduced. Second, the software frameworks and hardware platforms of distributed GNN training are also introduced for a deeper understanding. Third, distributed GNN training is compared with distributed training of deep neural networks, emphasizing the uniqueness of distributed GNN training. Finally, interesting issues and opportunities in this field are discussed.
    Generative quantum machine learning via denoising diffusion probabilistic models. (arXiv:2310.05866v2 [quant-ph] UPDATED)
    Deep generative models are key-enabling technology to computer vision, text generation and large language models. Denoising diffusion probabilistic models (DDPMs) have recently gained much attention due to their ability to generate diverse and high-quality samples in many computer vision tasks, as well as to incorporate flexible model architectures and relatively simple training scheme. Quantum generative models, empowered by entanglement and superposition, have brought new insight to learning classical and quantum data. Inspired by the classical counterpart, we propose the quantum denoising diffusion probabilistic models (QuDDPM) to enable efficiently trainable generative learning of quantum data. QuDDPM adopts sufficient layers of circuits to guarantee expressivity, while introduces multiple intermediate training tasks as interpolation between the target distribution and noise to avoid barren plateau and guarantee efficient training. We provide bounds on the learning error and demonstrate QuDDPM's capability in learning correlated quantum noise model, quantum many-body phases and topological structure of quantum data. The results provide a paradigm for versatile and efficient quantum generative learning.
    (Ir)rationality in AI: State of the Art, Research Challenges and Open Questions. (arXiv:2311.17165v1 [cs.AI])
    The concept of rationality is central to the field of artificial intelligence. Whether we are seeking to simulate human reasoning, or the goal is to achieve bounded optimality, we generally seek to make artificial agents as rational as possible. Despite the centrality of the concept within AI, there is no unified definition of what constitutes a rational agent. This article provides a survey of rationality and irrationality in artificial intelligence, and sets out the open questions in this area. The understanding of rationality in other fields has influenced its conception within artificial intelligence, in particular work in economics, philosophy and psychology. Focusing on the behaviour of artificial agents, we consider irrational behaviours that can prove to be optimal in certain scenarios. Some methods have been developed to deal with irrational agents, both in terms of identification and interaction, however work in this area remains limited. Methods that have up to now been developed for other purposes, namely adversarial scenarios, may be adapted to suit interactions with artificial agents. We further discuss the interplay between human and artificial agents, and the role that rationality plays within this interaction; many questions remain in this area, relating to potentially irrational behaviour of both humans and artificial agents.
    Corruption-Robust Lipschitz Contextual Search. (arXiv:2307.13903v3 [cs.LG] UPDATED)
    I study the problem of learning a Lipschitz function with corrupted binary signals. The learner tries to learn a $L$-Lipschitz function $f: [0,1]^d \rightarrow [0, L]$ that the adversary chooses. There is a total of $T$ rounds. In each round $t$, the adversary selects a context vector $x_t$ in the input space, and the learner makes a guess to the true function value $f(x_t)$ and receives a binary signal indicating whether the guess is high or low. In a total of $C$ rounds, the signal may be corrupted, though the value of $C$ is \emph{unknown} to the learner. The learner's goal is to incur a small cumulative loss. This work introduces the new algorithmic technique \emph{agnostic checking} as well as new analysis techniques. I design algorithms which: for the symmetric loss, the learner achieves regret $L\cdot O(C\log T)$ with $d = 1$ and $L\cdot O_d(C\log T + T^{(d-1)/d})$ with $d > 1$; for the pricing loss, the learner achieves regret $L\cdot \widetilde{O} (T^{d/(d+1)} + C\cdot T^{1/(d+1)})$.
    IG Captioner: Information Gain Captioners are Strong Zero-shot Classifiers. (arXiv:2311.17072v1 [cs.CV])
    Generative training has been demonstrated to be powerful for building visual-language models. However, on zero-shot discriminative benchmarks, there is still a performance gap between models trained with generative and discriminative objectives. In this paper, we aim to narrow this gap by improving the efficacy of generative training on classification tasks, without any finetuning processes or additional modules. Specifically, we focus on narrowing the gap between the generative captioner and the CLIP classifier. We begin by analysing the predictions made by the captioner and classifier and observe that the caption generation inherits the distribution bias from the language model trained with pure text modality, making it less grounded on the visual signal. To tackle this problem, we redesign the scoring objective for the captioner to alleviate the distributional bias and focus on measuring the gain of information brought by the visual inputs. We further design a generative training objective to match the evaluation objective. We name our model trained and evaluated from the novel procedures as Information Gain (IG) captioner. We pretrain the models on the public Laion-5B dataset and perform a series of discriminative evaluations. For the zero-shot classification on ImageNet, IG captioner achieves $> 18\%$ improvements over the standard captioner, achieving comparable performances with the CLIP classifier. IG captioner also demonstrated strong performance on zero-shot image-text retrieval tasks on MSCOCO and Flickr30K. We hope this paper inspires further research towards unifying generative and discriminative training procedures for visual-language models.
    LanGWM: Language Grounded World Model. (arXiv:2311.17593v1 [cs.LG])
    Recent advances in deep reinforcement learning have showcased its potential in tackling complex tasks. However, experiments on visual control tasks have revealed that state-of-the-art reinforcement learning models struggle with out-of-distribution generalization. Conversely, expressing higher-level concepts and global contexts is relatively easy using language. Building upon recent success of the large language models, our main objective is to improve the state abstraction technique in reinforcement learning by leveraging language for robust action selection. Specifically, we focus on learning language-grounded visual features to enhance the world model learning, a model-based reinforcement learning technique. To enforce our hypothesis explicitly, we mask out the bounding boxes of a few objects in the image observation and provide the text prompt as descriptions for these masked objects. Subsequently, we predict the masked objects along with the surrounding regions as pixel reconstruction, similar to the transformer-based masked autoencoder approach. Our proposed LanGWM: Language Grounded World Model achieves state-of-the-art performance in out-of-distribution test at the 100K interaction steps benchmarks of iGibson point navigation tasks. Furthermore, our proposed technique of explicit language-grounded visual representation learning has the potential to improve models for human-robot interaction because our extracted visual features are language grounded.
    Continuous optimization by quantum adaptive distribution search. (arXiv:2311.17353v1 [quant-ph])
    In this paper, we introduce the quantum adaptive distribution search (QuADS), a quantum continuous optimization algorithm that integrates Grover adaptive search (GAS) with the covariance matrix adaptation - evolution strategy (CMA-ES), a classical technique for continuous optimization. QuADS utilizes the quantum-based search capabilities of GAS and enhances them with the principles of CMA-ES for more efficient optimization. It employs a multivariate normal distribution for the initial state of the quantum search and repeatedly updates it throughout the optimization process. Our numerical experiments show that QuADS outperforms both GAS and CMA-ES. This is achieved through adaptive refinement of the initial state distribution rather than consistently using a uniform state, resulting in fewer oracle calls. This study presents an important step toward exploiting the potential of quantum computing for continuous optimization.
    DTW+S: Shape-based Comparison of Time-series with Ordered Local Trend. (arXiv:2309.03579v2 [cs.LG] UPDATED)
    Measuring distance or similarity between time-series data is a fundamental aspect of many applications including classification, clustering, and ensembling/alignment. Existing measures may fail to capture similarities among local trends (shapes) and may even produce misleading results. Our goal is to develop a measure that looks for similar trends occurring around similar times and is easily interpretable for researchers in applied domains. This is particularly useful for applications where time-series have a sequence of meaningful local trends that are ordered, such as in epidemics (a surge to an increase to a peak to a decrease). We propose a novel measure, DTW+S, which creates an interpretable "closeness-preserving" matrix representation of the time-series, where each column represents local trends, and then it applies Dynamic Time Warping to compute distances between these matrices. We present a theoretical analysis that supports the choice of this representation. We demonstrate the utility of DTW+S in several tasks. For the clustering of epidemic curves, we show that DTW+S is the only measure able to produce good clustering compared to the baselines. For ensemble building, we propose a combination of DTW+S and barycenter averaging that results in the best preservation of characteristics of the underlying trajectories. We also demonstrate that our approach results in better classification compared to Dynamic Time Warping for a class of datasets, particularly when local trends rather than scale play a decisive role.
    Practical Layout-Aware Analog/Mixed-Signal Design Automation with Bayesian Neural Networks. (arXiv:2311.17073v1 [cs.LG])
    The high simulation cost has been a bottleneck of practical analog/mixed-signal design automation. Many learning-based algorithms require thousands of simulated data points, which is impractical for expensive to simulate circuits. We propose a learning-based algorithm that can be trained using a small amount of data and, therefore, scalable to tasks with expensive simulations. Our efficient algorithm solves the post-layout performance optimization problem where simulations are known to be expensive. Our comprehensive study also solves the schematic-level sizing problem. For efficient optimization, we utilize Bayesian Neural Networks as a regression model to approximate circuit performance. For layout-aware optimization, we handle the problem as a multi-fidelity optimization problem and improve efficiency by exploiting the correlations from cheaper evaluations. We present three test cases to demonstrate the efficiency of our algorithms. Our tests prove that the proposed approach is more efficient than conventional baselines and state-of-the-art algorithms.
    Mostly Beneficial Clustering: Aggregating Data for Operational Decision Making. (arXiv:2311.17326v1 [cs.LG])
    With increasingly volatile market conditions and rapid product innovations, operational decision-making for large-scale systems entails solving thousands of problems with limited data. Data aggregation is proposed to combine the data across problems to improve the decisions obtained by solving those problems individually. We propose a novel cluster-based shrunken-SAA approach that can exploit the cluster structure among problems when implementing the data aggregation approaches. We prove that, as the number of problems grows, leveraging the known cluster structure among problems yields additional benefits over the data aggregation approaches that neglect such structure. When the cluster structure is unknown, we show that unveiling the cluster structure, even at the cost of a few data points, can be beneficial, especially when the distance between clusters of problems is substantial. Our proposed approach can be extended to general cost functions under mild conditions. When the number of problems gets large, the optimality gap of our proposed approach decreases exponentially in the distance between the clusters. We explore the performance of the proposed approach through the application of managing newsvendor systems via numerical experiments. We investigate the impacts of distance metrics between problem instances on the performance of the cluster-based Shrunken-SAA approach with synthetic data. We further validate our proposed approach with real data and highlight the advantages of cluster-based data aggregation, especially in the small-data large-scale regime, compared to the existing approaches.
    LoCoMotif: Discovering time-warped motifs in time series. (arXiv:2311.17582v1 [cs.LG])
    Time Series Motif Discovery (TSMD) refers to the task of identifying patterns that occur multiple times (possibly with minor variations) in a time series. All existing methods for TSMD have one or more of the following limitations: they only look for the two most similar occurrences of a pattern; they only look for patterns of a pre-specified, fixed length; they cannot handle variability along the time axis; and they only handle univariate time series. In this paper, we present a new method, LoCoMotif, that has none of these limitations. The method is motivated by a concrete use case from physiotherapy. We demonstrate the value of the proposed method on this use case. We also introduce a new quantitative evaluation metric for motif discovery, and benchmark data for comparing TSMD methods. LoCoMotif substantially outperforms the existing methods, on top of being more broadly applicable.
    Meta-Learning with a Geometry-Adaptive Preconditioner. (arXiv:2304.01552v2 [cs.CV] UPDATED)
    Model-agnostic meta-learning (MAML) is one of the most successful meta-learning algorithms. It has a bi-level optimization structure where the outer-loop process learns a shared initialization and the inner-loop process optimizes task-specific weights. Although MAML relies on the standard gradient descent in the inner-loop, recent studies have shown that controlling the inner-loop's gradient descent with a meta-learned preconditioner can be beneficial. Existing preconditioners, however, cannot simultaneously adapt in a task-specific and path-dependent way. Additionally, they do not satisfy the Riemannian metric condition, which can enable the steepest descent learning with preconditioned gradient. In this study, we propose Geometry-Adaptive Preconditioned gradient descent (GAP) that can overcome the limitations in MAML; GAP can efficiently meta-learn a preconditioner that is dependent on task-specific parameters, and its preconditioner can be shown to be a Riemannian metric. Thanks to the two properties, the geometry-adaptive preconditioner is effective for improving the inner-loop optimization. Experiment results show that GAP outperforms the state-of-the-art MAML family and preconditioned gradient descent-MAML (PGD-MAML) family in a variety of few-shot learning tasks. Code is available at: https://github.com/Suhyun777/CVPR23-GAP.
    SenTest: Evaluating Robustness of Sentence Encoders. (arXiv:2311.17722v1 [cs.CL])
    Contrastive learning has proven to be an effective method for pre-training models using weakly labeled data in the vision domain. Sentence transformers are the NLP counterparts to this architecture, and have been growing in popularity due to their rich and effective sentence representations. Having effective sentence representations is paramount in multiple tasks, such as information retrieval, retrieval augmented generation (RAG), and sentence comparison. Keeping in mind the deployability factor of transformers, evaluating the robustness of sentence transformers is of utmost importance. This work focuses on evaluating the robustness of the sentence encoders. We employ several adversarial attacks to evaluate its robustness. This system uses character-level attacks in the form of random character substitution, word-level attacks in the form of synonym replacement, and sentence-level attacks in the form of intra-sentence word order shuffling. The results of the experiments strongly undermine the robustness of sentence encoders. The models produce significantly different predictions as well as embeddings on perturbed datasets. The accuracy of the models can fall up to 15 percent on perturbed datasets as compared to unperturbed datasets. Furthermore, the experiments demonstrate that these embeddings does capture the semantic and syntactic structure (sentence order) of sentences. However, existing supervised classification strategies fail to leverage this information, and merely function as n-gram detectors.
    CoLA: Exploiting Compositional Structure for Automatic and Efficient Numerical Linear Algebra. (arXiv:2309.03060v2 [cs.LG] UPDATED)
    Many areas of machine learning and science involve large linear algebra problems, such as eigendecompositions, solving linear systems, computing matrix exponentials, and trace estimation. The matrices involved often have Kronecker, convolutional, block diagonal, sum, or product structure. In this paper, we propose a simple but general framework for large-scale linear algebra problems in machine learning, named CoLA (Compositional Linear Algebra). By combining a linear operator abstraction with compositional dispatch rules, CoLA automatically constructs memory and runtime efficient numerical algorithms. Moreover, CoLA provides memory efficient automatic differentiation, low precision computation, and GPU acceleration in both JAX and PyTorch, while also accommodating new objects, operations, and rules in downstream packages via multiple dispatch. CoLA can accelerate many algebraic operations, while making it easy to prototype matrix structures and algorithms, providing an appealing drop-in tool for virtually any computational effort that requires linear algebra. We showcase its efficacy across a broad range of applications, including partial differential equations, Gaussian processes, equivariant model construction, and unsupervised learning.
    Wireless Network Digital Twin for 6G: Generative AI as A Key Enabler. (arXiv:2311.17451v1 [cs.NI])
    Digital twin, which enables emulation, evaluation, and optimization of physical entities through synchronized digital replicas, has gained increasingly attention as a promising technology for intricate wireless networks. For 6G, numerous innovative wireless technologies and network architectures have posed new challenges in establishing wireless network digital twins. To tackle these challenges, artificial intelligence (AI), particularly the flourishing generative AI, emerges as a potential solution. In this article, we discuss emerging prerequisites for wireless network digital twins considering the complicated network architecture, tremendous network scale, extensive coverage, and diversified application scenarios in the 6G era. We further explore the applications of generative AI, such as transformer and diffusion model, to empower the 6G digital twin from multiple perspectives including implementation, physical-digital synchronization, and slicing capability. Subsequently, we propose a hierarchical generative AI-enabled wireless network digital twin at both the message-level and policy-level, and provide a typical use case with numerical results to validate the effectiveness and efficiency. Finally, open research issues for wireless network digital twins in the 6G era are discussed.
    Collision Cross-entropy for Soft Class Labels and Deep Clustering. (arXiv:2303.07321v3 [cs.LG] UPDATED)
    We propose "collision cross-entropy" as a robust alternative to Shannon's cross-entropy (CE) loss when class labels are represented by soft categorical distributions y. In general, soft labels can naturally represent ambiguous targets in classification. They are particularly relevant for self-labeled clustering methods, where latent pseudo-labels are jointly estimated with the model parameters and uncertainty is prevalent. In case of soft labels, Shannon's CE teaches the model predictions to reproduce the uncertainty in each training example, which inhibits the model's ability to learn and generalize from these examples. As an alternative loss, we propose the negative log of "collision probability" that maximizes the chance of equality between two random variables, predicted class and unknown true class. We show that it has the properties of a generalized CE. The proposed collision CE agrees with Shannon's CE for one-hot labels, but the training from soft labels differs. For example, unlike Shannon's CE, data points where y is a uniform distribution have zero contribution to the training. Collision CE significantly improves classification supervised by soft uncertain targets. Unlike Shannon's, collision CE is symmetric for y and network predictions, which is particularly relevant when both distributions are estimated in the context of self-labeled clustering. Focusing on discriminative deep clustering where self-labeling and entropy-based losses are dominant, we show that the use of collision CE improves the state-of-the-art. We also derive an efficient EM algorithm that significantly speeds up the pseudo-label estimation with collision CE.
    Dynamic Neighborhood Construction for Structured Large Discrete Action Spaces. (arXiv:2305.19891v3 [cs.LG] UPDATED)
    Large discrete action spaces (LDAS) remain a central challenge in reinforcement learning. Existing solution approaches can handle unstructured LDAS with up to a few million actions. However, many real-world applications in logistics, production, and transportation systems have combinatorial action spaces, whose size grows well beyond millions of actions, even on small instances. Fortunately, such action spaces exhibit structure, e.g., equally spaced discrete resource units. With this work, we focus on handling structured LDAS (SLDAS) with sizes that cannot be handled by current benchmarks: we propose Dynamic Neighborhood Construction (DNC), a novel exploitation paradigm for SLDAS. We present a scalable neighborhood exploration heuristic that utilizes this paradigm and efficiently explores the discrete neighborhood around the continuous proxy action in structured action spaces with up to $10^{73}$ actions. We demonstrate the performance of our method by benchmarking it against three state-of-the-art approaches designed for large discrete action spaces across two distinct environments. Our results show that DNC matches or outperforms state-of-the-art approaches while being computationally more efficient. Furthermore, our method scales to action spaces that so far remained computationally intractable for existing methodologies.
    Propagate & Distill: Towards Effective Graph Learners Using Propagation-Embracing MLPs. (arXiv:2311.17781v1 [cs.LG])
    Recent studies attempted to utilize multilayer perceptrons (MLPs) to solve semisupervised node classification on graphs, by training a student MLP by knowledge distillation from a teacher graph neural network (GNN). While previous studies have focused mostly on training the student MLP by matching the output probability distributions between the teacher and student models during distillation, it has not been systematically studied how to inject the structural information in an explicit and interpretable manner. Inspired by GNNs that separate feature transformation $T$ and propagation $\Pi$, we re-frame the distillation process as making the student MLP learn both $T$ and $\Pi$. Although this can be achieved by applying the inverse propagation $\Pi^{-1}$ before distillation from the teacher, it still comes with a high computational cost from large matrix multiplications during training. To solve this problem, we propose Propagate & Distill (P&D), which propagates the output of the teacher before distillation, which can be interpreted as an approximate process of the inverse propagation. We demonstrate that P&D can readily improve the performance of the student MLP.
    Algorithmic Assistance with Recommendation-Dependent Preferences. (arXiv:2208.07626v2 [cs.LG] UPDATED)
    When we use algorithms to produce risk assessments, we typically think of these predictions as providing helpful input to human decisions, such as when risk scores are presented to judges or doctors. But when a decision-maker obtains algorithmic assistance, they may not only react to the information. The decision-maker may view the input of the algorithm as recommending a default action, making it costly for them to deviate, such as when a judge is reluctant to overrule a high-risk assessment of a defendant or a doctor fears the consequences of deviating from recommended procedures. In this article, we propose a principal-agent model of joint human-machine decision-making. Within this model, we consider the effect and design of algorithmic recommendations when they affect choices not just by shifting beliefs, but also by altering preferences. We motivate this assumption from institutional factors, such as a desire to avoid audits, as well as from well-established models in behavioral science that predict loss aversion relative to a reference point, which here is set by the algorithm. We show that recommendation-dependent preferences create inefficiencies where the decision-maker is overly responsive to the recommendation. As a potential remedy, we discuss algorithms that strategically withhold recommendations, and show how they can improve the quality of final decisions.
    Effective Learning with Node Perturbation in Deep Neural Networks. (arXiv:2310.00965v2 [cs.LG] UPDATED)
    Backpropagation (BP) is the dominant and most successful method for training parameters of deep neural network models. However, BP relies on two computationally distinct phases, does not provide a satisfactory explanation of biological learning, and can be challenging to apply for training of networks with discontinuities or noisy node dynamics. By comparison, node perturbation (NP) proposes learning by the injection of noise into the network activations, and subsequent measurement of the induced loss change. NP relies on two forward (inference) passes, does not make use of network derivatives, and has been proposed as a model for learning in biological systems. However, standard NP is highly data inefficient and unstable due to its unguided noise-based search process. In this work, we investigate different formulations of NP and relate it to the concept of directional derivatives as well as combining it with a decorrelating mechanism for layer-wise inputs. We find that a closer alignment with directional derivatives together with input decorrelation at every layer significantly enhances performance of NP learning, making its performance on the train set competitive with BP and allowing its application to noisy systems in which the noise process itself is inaccessible.
    Analyzing and Explaining Image Classifiers via Diffusion Guidance. (arXiv:2311.17833v1 [cs.CV])
    While deep learning has led to huge progress in complex image classification tasks like ImageNet, unexpected failure modes, e.g. via spurious features, call into question how reliably these classifiers work in the wild. Furthermore, for safety-critical tasks the black-box nature of their decisions is problematic, and explanations or at least methods which make decisions plausible are needed urgently. In this paper, we address these problems by generating images that optimize a classifier-derived objective using a framework for guided image generation. We analyze the behavior and decisions of image classifiers by visual counterfactual explanations (VCEs), detection of systematic mistakes by analyzing images where classifiers maximally disagree, and visualization of neurons to verify potential spurious features. In this way, we validate existing observations, e.g. the shape bias of adversarially robust models, as well as novel failure modes, e.g. systematic errors of zero-shot CLIP classifiers, or identify harmful spurious features. Moreover, our VCEs outperform previous work while being more versatile.
    Human Choice Prediction in Language-based Non-Cooperative Games: Simulation-based Off-Policy Evaluation. (arXiv:2305.10361v3 [cs.LG] UPDATED)
    Persuasion games have been fundamental in economics and AI research, and have significant practical applications. Recent works in this area have started to incorporate natural language, moving beyond the traditional stylized message setting. However, previous research has focused on on-policy prediction, where the train and test data have the same distribution, which is not representative of real-life scenarios. In this paper, we tackle the challenging problem of off-policy evaluation (OPE) in language-based persuasion games. To address the inherent difficulty of human data collection in this setup, we propose a novel approach which combines real and simulated human-bot interaction data. Our simulated data is created by an exogenous model assuming decision makers (DMs) start with a mixture of random and decision-theoretic based behaviors and improve over time. We present a deep learning training algorithm that effectively integrates real interaction and simulated data, substantially improving over models that train only with interaction data. Our results demonstrate the potential of real interaction and simulation mixtures as a cost-effective and scalable solution for OPE in language-based persuasion games. Our code and the large dataset we collected and generated are submitted as supplementary material and publicly available in our GitHub repository: https://github.com/eilamshapira/HumanChoicePrediction
    DeepDecipher: Accessing and Investigating Neuron Activation in Large Language Models. (arXiv:2310.01870v2 [cs.LG] UPDATED)
    As large language models (LLMs) become more capable, there is an urgent need for interpretable and transparent tools. Current methods are difficult to implement, and accessible tools to analyze model internals are lacking. To bridge this gap, we present DeepDecipher - an API and interface for probing neurons in transformer models' MLP layers. DeepDecipher makes the outputs of advanced interpretability techniques for LLMs readily available. The easy-to-use interface also makes inspecting these complex models more intuitive. This paper outlines DeepDecipher's design and capabilities. We demonstrate how to analyze neurons, compare models, and gain insights into model behavior. For example, we contrast DeepDecipher's functionality with similar tools like Neuroscope and OpenAI's Neuron Explainer. DeepDecipher enables efficient, scalable analysis of LLMs. By granting access to state-of-the-art interpretability methods, DeepDecipher makes LLMs more transparent, trustworthy, and safe. Researchers, engineers, and developers can quickly diagnose issues, audit systems, and advance the field.
    D-CIPHER: Discovery of Closed-form Partial Differential Equations. (arXiv:2206.10586v3 [cs.LG] UPDATED)
    Closed-form differential equations, including partial differential equations and higher-order ordinary differential equations, are one of the most important tools used by scientists to model and better understand natural phenomena. Discovering these equations directly from data is challenging because it requires modeling relationships between various derivatives that are not observed in the data (equation-data mismatch) and it involves searching across a huge space of possible equations. Current approaches make strong assumptions about the form of the equation and thus fail to discover many well-known systems. Moreover, many of them resolve the equation-data mismatch by estimating the derivatives, which makes them inadequate for noisy and infrequently sampled systems. To this end, we propose D-CIPHER, which is robust to measurement artifacts and can uncover a new and very general class of differential equations. We further design a novel optimization procedure, CoLLie, to help D-CIPHER search through this class efficiently. Finally, we demonstrate empirically that it can discover many well-known equations that are beyond the capabilities of current methods.
    An Efficient High-Dimensional Gene Selection Approach based on Binary Horse Herd Optimization Algorithm for Biological Data Classification. (arXiv:2308.09791v2 [cs.LG] UPDATED)
    The Horse Herd Optimization Algorithm (HOA) is a new meta-heuristic algorithm based on the behaviors of horses at different ages. The HOA was introduced recently to solve complex and high-dimensional problems. This paper proposes a binary version of the Horse Herd Optimization Algorithm (BHOA) in order to solve discrete problems and select prominent feature subsets. Moreover, this study provides a novel hybrid feature selection framework based on the BHOA and a minimum Redundancy Maximum Relevance (MRMR) filter method. This hybrid feature selection, which is more computationally efficient, produces a beneficial subset of relevant and informative features. Since feature selection is a binary problem, we have applied a new Transfer Function (TF), called X-shape TF, which transforms continuous problems into binary search spaces. Furthermore, the Support Vector Machine (SVM) is utilized to examine the efficiency of the proposed method on ten microarray datasets, namely Lymphoma, Prostate, Brain-1, DLBCL, SRBCT, Leukemia, Ovarian, Colon, Lung, and MLL. In comparison to other state-of-the-art, such as the Gray Wolf (GW), Particle Swarm Optimization (PSO), and Genetic Algorithm (GA), the proposed hybrid method (MRMR-BHOA) demonstrates superior performance in terms of accuracy and minimum selected features. Also, experimental results prove that the X-Shaped BHOA approach outperforms others methods.
    FASER: Binary Code Similarity Search through the use of Intermediate Representations. (arXiv:2310.03605v3 [cs.CR] UPDATED)
    Being able to identify functions of interest in cross-architecture software is useful whether you are analysing for malware, securing the software supply chain or conducting vulnerability research. Cross-Architecture Binary Code Similarity Search has been explored in numerous studies and has used a wide range of different data sources to achieve its goals. The data sources typically used draw on common structures derived from binaries such as function control flow graphs or binary level call graphs, the output of the disassembly process or the outputs of a dynamic analysis approach. One data source which has received less attention is binary intermediate representations. Binary Intermediate representations possess two interesting properties: they are cross architecture by their very nature and encode the semantics of a function explicitly to support downstream usage. Within this paper we propose Function as a String Encoded Representation (FASER) which combines long document transformers with the use of intermediate representations to create a model capable of cross architecture function search without the need for manual feature engineering, pre-training or a dynamic analysis step. We compare our approach against a series of baseline approaches for two tasks; A general function search task and a targeted vulnerability search task. Our approach demonstrates strong performance across both tasks, performing better than all baseline approaches.
    Uncertainty-aware Traffic Prediction under Missing Data. (arXiv:2309.06800v5 [cs.LG] UPDATED)
    Traffic prediction is a crucial topic because of its broad scope of applications in the transportation domain. Recently, various studies have achieved promising results. However, most studies assume the prediction locations have complete or at least partial historical records and cannot be extended to non-historical recorded locations. In real-life scenarios, the deployment of sensors could be limited due to budget limitations and installation availability, which makes most current models not applicable. Though few pieces of literature tried to impute traffic states at the missing locations, these methods need the data simultaneously observed at the locations with sensors, making them not applicable to prediction tasks. Another drawback is the lack of measurement of uncertainty in prediction, making prior works unsuitable for risk-sensitive tasks or involving decision-making. To fill the gap, inspired by the previous inductive graph neural network, this work proposed an uncertainty-aware framework with the ability to 1) extend prediction to missing locations with no historical records and significantly extend spatial coverage of prediction locations while reducing deployment of sensors and 2) generate probabilistic prediction with uncertainty quantification to help the management of risk and decision making in the down-stream tasks. Through extensive experiments on real-life datasets, the result shows our method achieved promising results on prediction tasks, and the uncertainty quantification gives consistent results which highly correlated with the locations with and without historical data. We also show that our model could help support sensor deployment tasks in the transportation field to achieve higher accuracy with a limited sensor deployment budget.
    Intellectual Property Protection of Diffusion Models via the Watermark Diffusion Process. (arXiv:2306.03436v2 [cs.CR] UPDATED)
    Diffusion models have rapidly become a vital part of deep generative architectures, given today's increasing demands. Obtaining large, high-performance diffusion models demands significant resources, highlighting their importance as intellectual property worth protecting. However, existing watermarking techniques for ownership verification are insufficient when applied to diffusion models. Very recent research in watermarking diffusion models either exposes watermarks during task generation, which harms the imperceptibility, or is developed for conditional diffusion models that require prompts to trigger the watermark. This paper introduces WDM, a novel watermarking solution for diffusion models without imprinting the watermark during task generation. It involves training a model to concurrently learn a Watermark Diffusion Process (WDP) for embedding watermarks alongside the standard diffusion process for task generation. We provide a detailed theoretical analysis of WDP training and sampling, relating it to a shifted Gaussian diffusion process via the same reverse noise. Extensive experiments are conducted to validate the effectiveness and robustness of our approach in various trigger and watermark data configurations.
    Hausdorff Distance Matching with Adaptive Query Denoising for Rotated Detection Transformer. (arXiv:2305.07598v4 [cs.CV] UPDATED)
    The Detection Transformer (DETR) has emerged as a pivotal role in object detection tasks, setting new performance benchmarks due to its end-to-end design and scalability. Despite its advancements, the application of DETR in detecting rotated objects has demonstrated suboptimal performance relative to established oriented object detectors. Our analysis identifies a key limitation: the L1 cost used in Hungarian Matching leads to duplicate predictions due to the square-like problem in oriented object detection, thereby obstructing the training process of the detector. We introduce a Hausdorff distance-based cost for Hungarian matching, which more accurately quantifies the discrepancy between predictions and ground truths. Moreover, we note that a static denoising approach hampers the training of rotated DETR, particularly when the detector's predictions surpass the quality of noised ground truths. We propose an adaptive query denoising technique, employing Hungarian matching to selectively filter out superfluous noised queries that no longer contribute to model improvement. Our proposed modifications to DETR have resulted in superior performance, surpassing previous rotated DETR models and other alternatives. This is evidenced by our model's state-of-the-art achievements in benchmarks such as DOTA-v1.0/v1.5/v2.0, and DIOR-R.
    Leveraging Graph Diffusion Models for Network Refinement Tasks. (arXiv:2311.17856v1 [cs.LG])
    Most real-world networks are noisy and incomplete samples from an unknown target distribution. Refining them by correcting corruptions or inferring unobserved regions typically improves downstream performance. Inspired by the impressive generative capabilities that have been used to correct corruptions in images, and the similarities between "in-painting" and filling in missing nodes and edges conditioned on the observed graph, we propose a novel graph generative framework, SGDM, which is based on subgraph diffusion. Our framework not only improves the scalability and fidelity of graph diffusion models, but also leverages the reverse process to perform novel, conditional generation tasks. In particular, through extensive empirical analysis and a set of novel metrics, we demonstrate that our proposed model effectively supports the following refinement tasks for partially observable networks: T1: denoising extraneous subgraphs, T2: expanding existing subgraphs and T3: performing "style" transfer by regenerating a particular subgraph to match the characteristics of a different node or subgraph.
    Modular Quantization-Aware Training: Increasing Accuracy by Decreasing Precision in 6D Object Pose Estimation. (arXiv:2303.06753v2 [cs.CV] UPDATED)
    Edge applications, such as collaborative robotics and spacecraft rendezvous, demand efficient 6D object pose estimation on resource-constrained embedded platforms. Existing 6D pose estimation networks are often too large for such deployments, necessitating compression while maintaining reliable performance. To address this challenge, we introduce Modular Quantization-Aware Training (MQAT), an adaptive and mixed-precision quantization-aware training strategy that exploits the modular structure of modern 6D pose estimation architectures. MQAT guides a systematic gradated modular quantization sequence and determines module-specific bit precisions, leading to quantized models that outperform those produced by state-of-the-art uniform and mixed-precision quantization techniques. Our experiments showcase the generality of MQAT across datasets, architectures, and quantization algorithms. Remarkably, MQAT-trained quantized models achieve a significant accuracy boost (>7%) over the baseline full-precision network while reducing model size by a factor of 4x or more.
    A sparse coding approach to inverse problems with application to microwave tomography. (arXiv:2308.03818v2 [eess.IV] UPDATED)
    Inverse imaging problems that are ill-posed can be encountered across multiple domains of science and technology, ranging from medical diagnosis to astronomical studies. To reconstruct images from incomplete and distorted data, it is necessary to create algorithms that can take into account both, the physical mechanisms responsible for generating these measurements and the intrinsic characteristics of the images being analyzed. In this work, the sparse representation of images is reviewed, which is a realistic, compact and effective generative model for natural images inspired by the visual system of mammals. It enables us to address ill-posed linear inverse problems by training the model on a vast collection of images. Moreover, we extend the application of sparse coding to solve the non-linear and ill-posed problem in microwave tomography imaging, which could lead to a significant improvement of the state-of-the-arts algorithms.
    Conditional Gradients for the Approximate Vanishing Ideal. (arXiv:2202.03349v15 [cs.LG] UPDATED)
    The vanishing ideal of a set of points $X\subseteq \mathbb{R}^n$ is the set of polynomials that evaluate to $0$ over all points $\mathbf{x} \in X$ and admits an efficient representation by a finite set of polynomials called generators. To accommodate the noise in the data set, we introduce the pairwise conditional gradients approximate vanishing ideal algorithm (PCGAVI) that constructs a set of generators of the approximate vanishing ideal. The constructed generators capture polynomial structures in data and give rise to a feature map that can, for example, be used in combination with a linear classifier for supervised learning. In PCGAVI, we construct the set of generators by solving constrained convex optimization problems with the pairwise conditional gradients algorithm. Thus, PCGAVI not only constructs few but also sparse generators, making the corresponding feature transformation robust and compact. Furthermore, we derive several learning guarantees for PCGAVI that make the algorithm theoretically better motivated than related generator-constructing methods.
    Anonymous Jamming Detection in 5G with Bayesian Network Model Based Inference Analysis. (arXiv:2311.17097v1 [cs.LG])
    Jamming and intrusion detection are critical in 5G research, aiming to maintain reliability, prevent user experience degradation, and avoid infrastructure failure. This paper introduces an anonymous jamming detection model for 5G based on signal parameters from the protocol stacks. The system uses supervised and unsupervised learning for real-time, high-accuracy detection of jamming, including unknown types. Supervised models reach an AUC of 0.964 to 1, compared to LSTM models with an AUC of 0.923 to 1. However, the need for data annotation limits the supervised approach. To address this, an unsupervised auto-encoder-based anomaly detection is presented with an AUC of 0.987. The approach is resistant to adversarial training samples. For transparency and domain knowledge injection, a Bayesian network-based causation analysis is introduced.
    Equivariant Parameter Sharing for Porous Crystalline Materials. (arXiv:2304.01628v3 [cs.LG] UPDATED)
    Efficiently predicting properties of porous crystalline materials has great potential to accelerate the high throughput screening process for developing new materials, as simulations carried out using first principles model are often computationally expensive. To effectively make use of Deep Learning methods to model these materials, we need to utilize the symmetries present in the crystals, which are defined by their space group. Existing methods for crystal property prediction either have symmetry constraints that are too restrictive or only incorporate symmetries between unit cells. In addition, these models do not explicitly model the porous structure of the crystal. In this paper, we develop a model which incorporates the symmetries of the unit cell of a crystal in its architecture and explicitly models the porous structure. We evaluate our model by predicting the heat of adsorption of CO$_2$ for different configurations of the mordenite zeolite. Our results confirm that our method performs better than existing methods for crystal property prediction and that the inclusion of pores results in a more efficient model.
    CD-GAN: a robust fusion-based generative adversarial network for unsupervised remote sensing change detection with heterogeneous sensors. (arXiv:2203.00948v4 [eess.IV] UPDATED)
    In the context of Earth observation, change detection boils down to comparing images acquired at different times by sensors of possibly different spatial and/or spectral resolutions or different modalities (e.g., optical or radar). Even when considering only optical images, this task has proven to be challenging as soon as the sensors differ by their spatial and/or spectral resolutions. This paper proposes a novel unsupervised change detection method dedicated to images acquired by such so-called heterogeneous optical sensors. It capitalizes on recent advances which formulate the change detection task into a robust fusion framework. Adopting this formulation, the work reported in this paper shows that any off-the-shelf network trained beforehand to fuse optical images of different spatial and/or spectral resolutions can be easily complemented with a network of the same architecture and embedded into an adversarial framework to perform change detection. A comparison with state-of-the-art change detection methods demonstrates the versatility and the effectiveness of the proposed approach.
    Inference of CO2 flow patterns -- a feasibility study. (arXiv:2311.00290v2 [cs.CE] UPDATED)
    As the global deployment of carbon capture and sequestration (CCS) technology intensifies in the fight against climate change, it becomes increasingly imperative to establish robust monitoring and detection mechanisms for potential underground CO2 leakage, particularly through pre-existing or induced faults in the storage reservoir's seals. While techniques such as history matching and time-lapse seismic monitoring of CO2 storage have been used successfully in tracking the evolution of CO2 plumes in the subsurface, these methods lack principled approaches to characterize uncertainties related to the CO2 plumes' behavior. Inclusion of systematic assessment of uncertainties is essential for risk mitigation for the following reasons: (i) CO2 plume-induced changes are small and seismic data is noisy; (ii) changes between regular and irregular (e.g., caused by leakage) flow patterns are small; and (iii) the reservoir properties that control the flow are strongly heterogeneous and typically only available as distributions. To arrive at a formulation capable of inferring flow patterns for regular and irregular flow from well and seismic data, the performance of conditional normalizing flow will be analyzed on a series of carefully designed numerical experiments. While the inferences presented are preliminary in the context of an early CO2 leakage detection system, the results do indicate that inferences with conditional normalizing flows can produce high-fidelity estimates for CO2 plumes with or without leakage. We are also confident that the inferred uncertainty is reasonable because it correlates well with the observed errors. This uncertainty stems from noise in the seismic data and from the lack of precise knowledge of the reservoir's fluid flow properties.
    Invariance assumptions for class distribution estimation. (arXiv:2311.17225v1 [cs.LG])
    We study the problem of class distribution estimation under dataset shift. On the training dataset, both features and class labels are observed while on the test dataset only the features can be observed. The task then is the estimation of the distribution of the class labels, i.e. the estimation of the class prior probabilities, in the test dataset. Assumptions of invariance between the training joint distribution of features and labels and the test distribution can considerably facilitate this task. We discuss the assumptions of covariate shift, factorizable joint shift, and sparse joint shift and their implications for class distribution estimation.
    Uncertainty in Additive Feature Attribution methods. (arXiv:2311.17446v1 [cs.LG])
    In this work, we explore various topics that fall under the umbrella of Uncertainty in post-hoc Explainable AI (XAI) methods. We in particular focus on the class of additive feature attribution explanation methods. We first describe our specifications of uncertainty and compare various statistical and recent methods to quantify the same. Next, for a particular instance, we study the relationship between a feature's attribution and its uncertainty and observe little correlation. As a result, we propose a modification in the distribution from which perturbations are sampled in LIME-based algorithms such that the important features have minimal uncertainty without an increase in computational cost. Next, while studying how the uncertainty in explanations varies across the feature space of a classifier, we observe that a fraction of instances show near-zero uncertainty. We coin the term "stable instances" for such instances and diagnose factors that make an instance stable. Next, we study how an XAI algorithm's uncertainty varies with the size and complexity of the underlying model. We observe that the more complex the model, the more inherent uncertainty is exhibited by it. As a result, we propose a measure to quantify the relative complexity of a blackbox classifier. This could be incorporated, for example, in LIME-based algorithms' sampling densities, to help different explanation algorithms achieve tighter confidence levels. Together, the above measures would have a strong impact on making XAI models relatively trustworthy for the end-user as well as aiding scientific discovery.
    A novel feature selection method based on quantum support vector machine. (arXiv:2311.17646v1 [quant-ph])
    Feature selection is critical in machine learning to reduce dimensionality and improve model accuracy and efficiency. The exponential growth in feature space dimensionality for modern datasets directly results in ambiguous samples and redundant features, which can severely degrade classification accuracy. Quantum machine learning offers potential advantages for addressing this challenge. In this paper, we propose a novel method, quantum support vector machine feature selection (QSVMF), integrating quantum support vector machines with multi-objective genetic algorithm. QSVMF optimizes multiple simultaneous objectives: maximizing classification accuracy, minimizing selected features and quantum circuit costs, and reducing feature covariance. We apply QSVMF for feature selection on a breast cancer dataset, comparing the performance of QSVMF against classical approaches with the selected features. Experimental results show that QSVMF achieves superior performance. Furthermore, The Pareto front solutions of QSVMF enable analysis of accuracy versus feature set size trade-offs, identifying extremely sparse yet accurate feature subsets. We contextualize the biological relevance of the selected features in terms of known breast cancer biomarkers. This work highlights the potential of quantum-based feature selection to enhance machine learning efficiency and performance on complex real-world data.
    AnyLens: A Generative Diffusion Model with Any Rendering Lens. (arXiv:2311.17609v1 [cs.CV])
    State-of-the-art diffusion models can generate highly realistic images based on various conditioning like text, segmentation, and depth. However, an essential aspect often overlooked is the specific camera geometry used during image capture. The influence of different optical systems on the final scene appearance is frequently overlooked. This study introduces a framework that intimately integrates a text-to-image diffusion model with the particular lens geometry used in image rendering. Our method is based on a per-pixel coordinate conditioning method, enabling the control over the rendering geometry. Notably, we demonstrate the manipulation of curvature properties, achieving diverse visual effects, such as fish-eye, panoramic views, and spherical texturing using a single diffusion model.
    A transductive few-shot learning approach for classification of digital histopathological slides from liver cancer. (arXiv:2311.17740v1 [eess.IV])
    This paper presents a new approach for classifying 2D histopathology patches using few-shot learning. The method is designed to tackle a significant challenge in histopathology, which is the limited availability of labeled data. By applying a sliding window technique to histopathology slides, we illustrate the practical benefits of transductive learning (i.e., making joint predictions on patches) to achieve consistent and accurate classification. Our approach involves an optimization-based strategy that actively penalizes the prediction of a large number of distinct classes within each window. We conducted experiments on histopathological data to classify tissue classes in digital slides of liver cancer, specifically hepatocellular carcinoma. The initial results show the effectiveness of our method and its potential to enhance the process of automated cancer diagnosis and treatment, all while reducing the time and effort required for expert annotation.
    Universalizing Weak Supervision. (arXiv:2112.03865v3 [cs.LG] UPDATED)
    Weak supervision (WS) frameworks are a popular way to bypass hand-labeling large datasets for training data-hungry models. These approaches synthesize multiple noisy but cheaply-acquired estimates of labels into a set of high-quality pseudolabels for downstream training. However, the synthesis technique is specific to a particular kind of label, such as binary labels or sequences, and each new label type requires manually designing a new synthesis algorithm. Instead, we propose a universal technique that enables weak supervision over any label type while still offering desirable properties, including practical flexibility, computational efficiency, and theoretical guarantees. We apply this technique to important problems previously not tackled by WS frameworks including learning to rank, regression, and learning in hyperbolic space. Theoretically, our synthesis approach produces a consistent estimators for learning some challenging but important generalizations of the exponential family model. Experimentally, we validate our framework and show improvement over baselines in diverse settings including real-world learning-to-rank and regression problems along with learning on hyperbolic manifolds.
    Easing Color Shifts in Score-Based Diffusion Models. (arXiv:2306.15832v2 [cs.LG] UPDATED)
    Generated images of score-based models can suffer from errors in their spatial means, an effect, referred to as a color shift, which grows for larger images. This paper investigates a previously-introduced approach to mitigate color shifts in score-based diffusion models. We quantify the performance of a nonlinear bypass connection in the score network, designed to process the spatial mean of the input and to predict the mean of the score function. We show that this network architecture substantially improves the resulting quality of the generated images, and that this improvement is approximately independent of the size of the generated images. As a result, this modified architecture offers a simple solution for the color shift problem across image sizes. We additionally discuss the origin of color shifts in an idealized setting in order to motivate the approach.
    Estimates on Learning Rates for Multi-Penalty Distribution Regression. (arXiv:2006.09017v2 [cs.LG] UPDATED)
    This paper is concerned with functional learning by utilizing two-stage sampled distribution regression. We study a multi-penalty regularization algorithm for distribution regression under the framework of learning theory. The algorithm aims at regressing to real valued outputs from probability measures. The theoretical analysis on distribution regression is far from maturity and quite challenging, since only second stage samples are observable in practical setting. In the algorithm, to transform information from samples, we embed the distributions to a reproducing kernel Hilbert space $\mathcal{H}_K$ associated with Mercer kernel $K$ via mean embedding technique. The main contribution of the paper is to present a novel multi-penalty regularization algorithm to capture more features of distribution regression and derive optimal learning rates for the algorithm. The work also derives learning rates for distribution regression in the nonstandard setting $f_{\rho}\notin\mathcal{H}_K$, which is not explored in existing literature. Moreover, we propose a distribution regression-based distributed learning algorithm to face large-scale data or information challenge. The optimal learning rates are derived for the distributed learning algorithm. By providing new algorithms and showing their learning rates, we improve the existing work in different aspects in the literature.
    Guarantees for Self-Play in Multiplayer Games via Polymatrix Decomposability. (arXiv:2310.11518v3 [cs.GT] UPDATED)
    Self-play is a technique for machine learning in multi-agent systems where a learning algorithm learns by interacting with copies of itself. Self-play is useful for generating large quantities of data for learning, but has the drawback that the agents the learner will face post-training may have dramatically different behavior than the learner came to expect by interacting with itself. For the special case of two-player constant-sum games, self-play that reaches Nash equilibrium is guaranteed to produce strategies that perform well against any post-training opponent; however, no such guarantee exists for multiplayer games. We show that in games that approximately decompose into a set of two-player constant-sum games (called constant-sum polymatrix games) where global $\epsilon$-Nash equilibria are boundedly far from Nash equilibria in each subgame (called subgame stability), any no-external-regret algorithm that learns by self-play will produce a strategy with bounded vulnerability. For the first time, our results identify a structural property of multiplayer games that enable performance guarantees for the strategies produced by a broad class of self-play algorithms. We demonstrate our findings through experiments on Leduc poker.
    Federated Online and Bandit Convex Optimization. (arXiv:2311.17586v1 [cs.LG])
    We study the problems of distributed online and bandit convex optimization against an adaptive adversary. We aim to minimize the average regret on $M$ machines working in parallel over $T$ rounds with $R$ intermittent communications. Assuming the underlying cost functions are convex and can be generated adaptively, our results show that collaboration is not beneficial when the machines have access to the first-order gradient information at the queried points. This is in contrast to the case for stochastic functions, where each machine samples the cost functions from a fixed distribution. Furthermore, we delve into the more challenging setting of federated online optimization with bandit (zeroth-order) feedback, where the machines can only access values of the cost functions at the queried points. The key finding here is identifying the high-dimensional regime where collaboration is beneficial and may even lead to a linear speedup in the number of machines. We further illustrate our findings through federated adversarial linear bandits by developing novel distributed single and two-point feedback algorithms. Our work is the first attempt towards a systematic understanding of federated online optimization with limited feedback, and it attains tight regret bounds in the intermittent communication setting for both first and zeroth-order feedback. Our results thus bridge the gap between stochastic and adaptive settings in federated online optimization.
    Enhancing the Performance of Neural Networks Through Causal Discovery and Integration of Domain Knowledge. (arXiv:2311.17303v1 [cs.LG])
    In this paper, we develop a generic methodology to encode hierarchical causality structure among observed variables into a neural network in order to improve its predictive performance. The proposed methodology, called causality-informed neural network (CINN), leverages three coherent steps to systematically map the structural causal knowledge into the layer-to-layer design of neural network while strictly preserving the orientation of every causal relationship. In the first step, CINN discovers causal relationships from observational data via directed acyclic graph (DAG) learning, where causal discovery is recast as a continuous optimization problem to avoid the combinatorial nature. In the second step, the discovered hierarchical causality structure among observed variables is systematically encoded into neural network through a dedicated architecture and customized loss function. By categorizing variables in the causal DAG as root, intermediate, and leaf nodes, the hierarchical causal DAG is translated into CINN with a one-to-one correspondence between nodes in the causal DAG and units in the CINN while maintaining the relative order among these nodes. Regarding the loss function, both intermediate and leaf nodes in the DAG graph are treated as target outputs during CINN training so as to drive co-learning of causal relationships among different types of nodes. As multiple loss components emerge in CINN, we leverage the projection of conflicting gradients to mitigate gradient interference among the multiple learning tasks. Computational experiments across a broad spectrum of UCI data sets demonstrate substantial advantages of CINN in predictive performance over other state-of-the-art methods. In addition, an ablation study underscores the value of integrating structural and quantitative causal knowledge in enhancing the neural network's predictive performance incrementally.
    NeuroBack: Improving CDCL SAT Solving using Graph Neural Networks. (arXiv:2110.14053v6 [cs.AI] UPDATED)
    Propositional satisfiability (SAT) is an NP-complete problem that impacts many research fields, such as planning, verification, and security. Mainstream modern SAT solvers are based on the Conflict-Driven Clause Learning (CDCL) algorithm. Recent work aimed to enhance CDCL SAT solvers using Graph Neural Networks (GNNs). However, so far this approach either has not made solving more effective, or required substantial GPU resources for frequent online model inferences. Aiming to make GNN improvements practical, this paper proposes an approach called NeuroBack, which builds on two insights: (1) predicting phases (i.e., values) of variables appearing in the majority (or even all) of the satisfying assignments are essential for CDCL SAT solving, and (2) it is sufficient to query the neural model only once for the predictions before the SAT solving starts. Once trained, the offline model inference allows NeuroBack to execute exclusively on the CPU, removing its reliance on GPU resources. To train NeuroBack, a new dataset called DataBack containing 120,286 data samples is created. Finally, NeuroBack is implemented as an enhancement to a state-of-the-art SAT solver called Kissat. As a result, it allowed Kissat to solve 5.2% more problems on the recent SAT competition problem set, SATCOMP-2022. NeuroBack therefore shows how machine learning can be harnessed to improve SAT solving in an effective and practical manner.
    An Efficient Illumination Invariant Tiger Detection Framework for Wildlife Surveillance. (arXiv:2311.17552v1 [cs.CV])
    Tiger conservation necessitates the strategic deployment of multifaceted initiatives encompassing the preservation of ecological habitats, anti-poaching measures, and community involvement for sustainable growth in the tiger population. With the advent of artificial intelligence, tiger surveillance can be automated using object detection. In this paper, an accurate illumination invariant framework is proposed based on EnlightenGAN and YOLOv8 for tiger detection. The fine-tuned YOLOv8 model achieves a mAP score of 61% without illumination enhancement. The illumination enhancement improves the mAP by 0.7%. The approaches elevate the state-of-the-art performance on the ATRW dataset by approximately 6% to 7%.
    FedAgg: Adaptive Federated Learning with Aggregated Gradients. (arXiv:2303.15799v3 [cs.LG] UPDATED)
    Federated Learning (FL) has become an emerging norm for distributed model training, which enables multiple devices cooperatively to train a shared model utilizing their own datasets scheduled by a central server while keeping private data localized. However, during the training process, the non-independent-and-identically-distributed (Non-IID) data generated on heterogeneous clients and frequent communication across participants may significantly influence the training performance, slow down the convergent rate, and increase communication consumption. In this paper, we ameliorate the standard stochastic gradient descent approach by introducing the aggregated gradients at each local update epoch and propose an adaptive learning rate iterative algorithm that further takes the deviation between the local parameter and global parameter into account. The aforementioned adaptive learning rate design mechanism requires local information of all clients, which is challenging as there is no communication during the local update epochs. To obtain a decentralized adaptive learning rate for each client, we introduce the mean-field approach by utilizing two mean-field terms to estimate the average local parameters and gradients respectively without exchanging clients' local information with each other over time. Through theoretical analysis, we prove that our method can provide the convergence guarantee for model training and derive a convergent upper bound for the client drifting term. Extensive numerical results show that our proposed framework is superior to the state-of-the-art FL schemes in both model accuracy and convergent rate on real-world datasets with IID and Non-IID data distribution.
    Fourier Neural Differential Equations for learning Quantum Field Theories. (arXiv:2311.17250v1 [cs.LG])
    A Quantum Field Theory is defined by its interaction Hamiltonian, and linked to experimental data by the scattering matrix. The scattering matrix is calculated as a perturbative series, and represented succinctly as a first order differential equation in time. Neural Differential Equations (NDEs) learn the time derivative of a residual network's hidden state, and have proven efficacy in learning differential equations with physical constraints. Hence using an NDE to learn particle scattering matrices presents a possible experiment-theory phenomenological connection. In this paper, NDE models are used to learn $\phi^4$ theory, Scalar-Yukawa theory and Scalar Quantum Electrodynamics. A new NDE architecture is also introduced, the Fourier Neural Differential Equation (FNDE), which combines NDE integration and Fourier network convolution. The FNDE model demonstrates better generalisability than the non-integrated equivalent FNO model. It is also shown that by training on scattering data, the interaction Hamiltonian of a theory can be extracted from network parameters.
    On Separate Normalization in Self-supervised Transformers. (arXiv:2309.12931v2 [cs.CL] UPDATED)
    Self-supervised training methods for transformers have demonstrated remarkable performance across various domains. Previous transformer-based models, such as masked autoencoders (MAE), typically utilize a single normalization layer for both the [CLS] symbol and the tokens. We propose in this paper a simple modification that employs separate normalization layers for the tokens and the [CLS] symbol to better capture their distinct characteristics and enhance downstream task performance. Our method aims to alleviate the potential negative effects of using the same normalization statistics for both token types, which may not be optimally aligned with their individual roles. We empirically show that by utilizing a separate normalization layer, the [CLS] embeddings can better encode the global contextual information and are distributed more uniformly in its anisotropic space. When replacing the conventional normalization layer with the two separate layers, we observe an average 2.7% performance improvement over the image, natural language, and graph domains.
    The Devil is in the Data: Learning Fair Graph Neural Networks via Partial Knowledge Distillation. (arXiv:2311.17373v1 [cs.LG])
    Graph neural networks (GNNs) are being increasingly used in many high-stakes tasks, and as a result, there is growing attention on their fairness recently. GNNs have been shown to be unfair as they tend to make discriminatory decisions toward certain demographic groups, divided by sensitive attributes such as gender and race. While recent works have been devoted to improving their fairness performance, they often require accessible demographic information. This greatly limits their applicability in real-world scenarios due to legal restrictions. To address this problem, we present a demographic-agnostic method to learn fair GNNs via knowledge distillation, namely FairGKD. Our work is motivated by the empirical observation that training GNNs on partial data (i.e., only node attributes or topology data) can improve their fairness, albeit at the cost of utility. To make a balanced trade-off between fairness and utility performance, we employ a set of fairness experts (i.e., GNNs trained on different partial data) to construct the synthetic teacher, which distills fairer and informative knowledge to guide the learning of the GNN student. Experiments on several benchmark datasets demonstrate that FairGKD, which does not require access to demographic information, significantly improves the fairness of GNNs by a large margin while maintaining their utility.
    Gene-MOE: A Sparsely-gated Framework for Pan-Cancer Genomic Analysis. (arXiv:2311.17401v1 [cs.LG])
    Analyzing the genomic information from the Pan-Cancer database can help us understand cancer-related factors and contribute to the cancer diagnosis and prognosis. However, existing computational methods and deep learning methods can not effectively find the deep correlations between tens of thousands of genes, which leads to precision loss. In this paper, we proposed a novel pretrained model called Gene-MOE to learn the general feature representations of the Pan-Cancer dataset and transfer the pretrained weights to the downstream tasks. The Gene-MOE fully exploits the mixture of expert (MOE) layers to learn rich feature representations of high-dimensional genes. At the same time, we build a mixture of attention expert (MOAE) model to learn the deep semantic relationships within genetic features. Finally, we proposed a new self-supervised pretraining strategy including loss function design, data enhancement, and optimization strategy to train the Gene-MOE and further improve the performance for the downstream analysis. We carried out cancer classification and survival analysis experiments based on the Gene-MOE. According to the survival analysis results on 14 cancer types, using Gene-MOE outperformed state-of-the-art models on 12 cancer types. According to the classification results, the total accuracy of the classification model for 33 cancer classifications reached 95.2\%. Through detailed feature analysis, we found the Gene-MOE model can learn rich feature representations of high-dimensional genes.
    Backdiff: a diffusion model for generalized transferable protein backmapping. (arXiv:2310.01768v2 [q-bio.QM] UPDATED)
    Coarse-grained (CG) models play a crucial role in the study of protein structures, protein thermodynamic properties, and protein conformation dynamics. Due to the information loss in the coarse-graining process, backmapping from CG to all-atom configurations is essential in many protein design and drug discovery applications when detailed atomic representations are needed for in-depth studies. Despite recent progress in data-driven backmapping approaches, devising a backmapping method that can be universally applied across various CG models and proteins remains unresolved. In this work, we propose BackDiff, a new generative model designed to achieve generalization and reliability in the protein backmapping problem. BackDiff leverages the conditional score-based diffusion model with geometric representations. Since different CG models can contain different coarse-grained sites which include selected atoms (CG atoms) and simple CG auxiliary functions of atomistic coordinates (CG auxiliary variables), we design a self-supervised training framework to adapt to different CG atoms, and constrain the diffusion sampling paths with arbitrary CG auxiliary variables as conditions. Our method facilitates end-to-end training and allows efficient sampling across different proteins and diverse CG models without the need for retraining. Comprehensive experiments over multiple popular CG models demonstrate BackDiff's superior performance to existing state-of-the-art approaches, and generalization and flexibility that these approaches cannot achieve. A pretrained BackDiff model can offer a convenient yet reliable plug-and-play solution for protein researchers, enabling them to investigate further from their own CG models.
    Learning End-to-End Channel Coding with Diffusion Models. (arXiv:2302.01714v2 [cs.IT] UPDATED)
    It is a known problem that deep-learning-based end-to-end (E2E) channel coding systems depend on a known and differentiable channel model, due to the learning process and based on the gradient-descent optimization methods. This places the challenge to approximate or generate the channel or its derivative from samples generated by pilot signaling in real-world scenarios. Currently, there are two prevalent methods to solve this problem. One is to generate the channel via a generative adversarial network (GAN), and the other is to, in essence, approximate the gradient via reinforcement learning methods. Other methods include using score-based methods, variational autoencoders, or mutual-information-based methods. In this paper, we focus on generative models and, in particular, on a new promising method called diffusion models, which have shown a higher quality of generation in image-based tasks. We will show that diffusion models can be used in wireless E2E scenarios and that they work as good as Wasserstein GANs while having a more stable training procedure and a better generalization ability in testing.
    LibSignal: An Open Library for Traffic Signal Control. (arXiv:2211.10649v2 [cs.LG] UPDATED)
    This paper introduces a library for cross-simulator comparison of reinforcement learning models in traffic signal control tasks. This library is developed to implement recent state-of-the-art reinforcement learning models with extensible interfaces and unified cross-simulator evaluation metrics. It supports commonly-used simulators in traffic signal control tasks, including Simulation of Urban MObility(SUMO) and CityFlow, and multiple benchmark datasets for fair comparisons. We conducted experiments to validate our implementation of the models and to calibrate the simulators so that the experiments from one simulator could be referential to the other. Based on the validated models and calibrated environments, this paper compares and reports the performance of current state-of-the-art RL algorithms across different datasets and simulators. This is the first time that these methods have been compared fairly under the same datasets with different simulators.
    Knockoffs-SPR: Clean Sample Selection in Learning with Noisy Labels. (arXiv:2301.00545v4 [cs.LG] UPDATED)
    A noisy training set usually leads to the degradation of the generalization and robustness of neural networks. In this paper, we propose a novel theoretically guaranteed clean sample selection framework for learning with noisy labels. Specifically, we first present a Scalable Penalized Regression (SPR) method, to model the linear relation between network features and one-hot labels. In SPR, the clean data are identified by the zero mean-shift parameters solved in the regression model. We theoretically show that SPR can recover clean data under some conditions. Under general scenarios, the conditions may be no longer satisfied; and some noisy data are falsely selected as clean data. To solve this problem, we propose a data-adaptive method for Scalable Penalized Regression with Knockoff filters (Knockoffs-SPR), which is provable to control the False-Selection-Rate (FSR) in the selected clean data. To improve the efficiency, we further present a split algorithm that divides the whole training set into small pieces that can be solved in parallel to make the framework scalable to large datasets. While Knockoffs-SPR can be regarded as a sample selection module for a standard supervised training pipeline, we further combine it with a semi-supervised algorithm to exploit the support of noisy data as unlabeled data. Experimental results on several benchmark datasets and real-world noisy datasets show the effectiveness of our framework and validate the theoretical results of Knockoffs-SPR. Our code and pre-trained models are available at https://github.com/Yikai-Wang/Knockoffs-SPR.
    Bias Resilient Multi-Step Off-Policy Goal-Conditioned Reinforcement Learning. (arXiv:2311.17565v1 [cs.LG])
    In goal-conditioned reinforcement learning (GCRL), sparse rewards present significant challenges, often obstructing efficient learning. Although multi-step GCRL can boost this efficiency, it can also lead to off-policy biases in target values. This paper dives deep into these biases, categorizing them into two distinct categories: "shooting" and "shifting". Recognizing that certain behavior policies can hasten policy refinement, we present solutions designed to capitalize on the positive aspects of these biases while minimizing their drawbacks, enabling the use of larger step sizes to speed up GCRL. An empirical study demonstrates that our approach ensures a resilient and robust improvement, even in ten-step learning scenarios, leading to superior learning efficiency and performance that generally surpass the baseline and several state-of-the-art multi-step GCRL benchmarks.
    Beyond Invariance: Test-Time Label-Shift Adaptation for Distributions with "Spurious" Correlations. (arXiv:2211.15646v4 [stat.ML] UPDATED)
    Changes in the data distribution at test time can have deleterious effects on the performance of predictive models $p(y|x)$. We consider situations where there are additional meta-data labels (such as group labels), denoted by $z$, that can account for such changes in the distribution. In particular, we assume that the prior distribution $p(y, z)$, which models the dependence between the class label $y$ and the "nuisance" factors $z$, may change across domains, either due to a change in the correlation between these terms, or a change in one of their marginals. However, we assume that the generative model for features $p(x|y,z)$ is invariant across domains. We note that this corresponds to an expanded version of the widely used "label shift" assumption, where the labels now also include the nuisance factors $z$. Based on this observation, we propose a test-time label shift correction that adapts to changes in the joint distribution $p(y, z)$ using EM applied to unlabeled samples from the target domain distribution, $p_t(x)$. Importantly, we are able to avoid fitting a generative model $p(x|y, z)$, and merely need to reweight the outputs of a discriminative model $p_s(y, z|x)$ trained on the source distribution. We evaluate our method, which we call "Test-Time Label-Shift Adaptation" (TTLSA), on several standard image and text datasets, as well as the CheXpert chest X-ray dataset, and show that it improves performance over methods that target invariance to changes in the distribution, as well as baseline empirical risk minimization methods. Code for reproducing experiments is available at https://github.com/nalzok/test-time-label-shift .
    Towards Learning Monocular 3D Object Localization From 2D Labels using the Physical Laws of Motion. (arXiv:2310.17462v2 [cs.CV] UPDATED)
    We present a novel method for precise 3D object localization in single images from a single calibrated camera using only 2D labels. No expensive 3D labels are needed. Thus, instead of using 3D labels, our model is trained with easy-to-annotate 2D labels along with the physical knowledge of the object's motion. Given this information, the model can infer the latent third dimension, even though it has never seen this information during training. Our method is evaluated on both synthetic and real-world datasets, and we are able to achieve a mean distance error of just 6 cm in our experiments on real data. The results indicate the method's potential as a step towards learning 3D object location estimation, where collecting 3D data for training is not feasible.
    The devil is in the fine-grained details: Evaluating open-vocabulary object detectors for fine-grained understanding. (arXiv:2311.17518v1 [cs.CV])
    Recent advancements in large vision-language models enabled visual object detection in open-vocabulary scenarios, where object classes are defined in free-text formats during inference. In this paper, we aim to probe the state-of-the-art methods for open-vocabulary object detection to determine to what extent they understand fine-grained properties of objects and their parts. To this end, we introduce an evaluation protocol based on dynamic vocabulary generation to test whether models detect, discern, and assign the correct fine-grained description to objects in the presence of hard-negative classes. We contribute with a benchmark suite of increasing difficulty and probing different properties like color, pattern, and material. We further enhance our investigation by evaluating several state-of-the-art open-vocabulary object detectors using the proposed protocol and find that most existing solutions, which shine in standard open-vocabulary benchmarks, struggle to accurately capture and distinguish finer object details. We conclude the paper by highlighting the limitations of current methodologies and exploring promising research directions to overcome the discovered drawbacks. Data and code are available at https://github.com/lorebianchi98/FG-OVD.
    Quantum-probabilistic Hamiltonian learning for generative modelling & anomaly detection. (arXiv:2211.03803v3 [quant-ph] UPDATED)
    The Hamiltonian of an isolated quantum mechanical system determines its dynamics and physical behaviour. This study investigates the possibility of learning and utilising a system's Hamiltonian and its variational thermal state estimation for data analysis techniques. For this purpose, we employ the method of Quantum Hamiltonian-based models for the generative modelling of simulated Large Hadron Collider data and demonstrate the representability of such data as a mixed state. In a further step, we use the learned Hamiltonian for anomaly detection, showing that different sample types can form distinct dynamical behaviours once treated as a quantum many-body system. We exploit these characteristics to quantify the difference between sample types. Our findings show that the methodologies designed for field theory computations can be utilised in machine learning applications to employ theoretical approaches in data analysis techniques.
    BertRLFuzzer: A BERT and Reinforcement Learning based Fuzzer. (arXiv:2305.12534v3 [cs.SE] UPDATED)
    We present a novel tool BertRLFuzzer, a BERT and Reinforcement Learning (RL) based fuzzer aimed at finding security vulnerabilities for Web applications. BertRLFuzzer works as follows: given a set of seed inputs, the fuzzer performs grammar-adhering and attack-provoking mutation operations on them to generate candidate attack vectors. The key insight of BertRLFuzzer is the use of RL with a BERT model as an agent to guide the fuzzer to efficiently learn grammar-adhering and attack-provoking mutation operators. In order to establish the efficacy of BertRLFuzzer we compare it against a total of 13 black box and white box fuzzers over a benchmark of 9 victim websites with over 16K LOC. We observed a significant improvement relative to the nearest competing tool in terms of time to first attack (54% less), new vulnerabilities found (17 new vulnerabilities), and attack rate (4.4% more attack vectors generated).
    Continual Learning with Low Rank Adaptation. (arXiv:2311.17601v1 [cs.LG])
    Recent work using pretrained transformers has shown impressive performance when fine-tuned with data from the downstream problem of interest. However, they struggle to retain that performance when the data characteristics changes. In this paper, we focus on continual learning, where a pre-trained transformer is updated to perform well on new data, while retaining its performance on data it was previously trained on. Earlier works have tackled this primarily through methods inspired from prompt tuning. We question this choice, and investigate the applicability of Low Rank Adaptation (LoRA) to continual learning. On a range of domain-incremental learning benchmarks, our LoRA-based solution, CoLoR, yields state-of-the-art performance, while still being as parameter efficient as the prompt tuning based methods.
    Using Ornstein-Uhlenbeck Process to understand Denoising Diffusion Probabilistic Model and its Noise Schedules. (arXiv:2311.17673v1 [stat.ML])
    The aim of this short note is to show that Denoising Diffusion Probabilistic Model DDPM, a non-homogeneous discrete-time Markov process, can be represented by a time-homogeneous continuous-time Markov process observed at non-uniformly sampled discrete times. Surprisingly, this continuous-time Markov process is the well-known and well-studied Ornstein-Ohlenbeck (OU) process, which was developed in 1930's for studying Brownian particles in Harmonic potentials. We establish the formal equivalence between DDPM and the OU process using its analytical solution. We further demonstrate that the design problem of the noise scheduler for non-homogeneous DDPM is equivalent to designing observation times for the OU process. We present several heuristic designs for observation times based on principled quantities such as auto-variance and Fisher Information and connect them to ad hoc noise schedules for DDPM. Interestingly, we show that the Fisher-Information-motivated schedule corresponds exactly the cosine schedule, which was developed without any theoretical foundation but is the current state-of-the-art noise schedule.
    Toward a Surgeon-in-the-Loop Ophthalmic Robotic Apprentice using Reinforcement and Imitation Learning. (arXiv:2311.17693v1 [cs.RO])
    Robotic-assisted surgical systems have demonstrated significant potential in enhancing surgical precision and minimizing human errors. However, existing systems lack the ability to accommodate the unique preferences and requirements of individual surgeons. Additionally, they primarily focus on general surgeries (e.g., laparoscopy) and are not suitable for highly precise microsurgeries, such as ophthalmic procedures. Thus, we propose a simulation-based image-guided approach for surgeon-centered autonomous agents that can adapt to the individual surgeon's skill level and preferred surgical techniques during ophthalmic cataract surgery. Our approach utilizes a simulated environment to train reinforcement and imitation learning agents guided by image data to perform all tasks of the incision phase of cataract surgery. By integrating the surgeon's actions and preferences into the training process with the surgeon-in-the-loop, our approach enables the robot to implicitly learn and adapt to the individual surgeon's unique approach through demonstrations. This results in a more intuitive and personalized surgical experience for the surgeon. Simultaneously, it ensures consistent performance for the autonomous robotic apprentice. We define and evaluate the effectiveness of our approach using our proposed metrics; and highlight the trade-off between a generic agent and a surgeon-centered adapted agent. Moreover, our approach has the potential to extend to other ophthalmic surgical procedures, opening the door to a new generation of surgeon-in-the-loop autonomous surgical robots. We provide an open-source simulation framework for future development and reproducibility.
    Dynamic DAG Discovery for Interpretable Imitation Learning. (arXiv:2310.00489v3 [cs.LG] UPDATED)
    Imitation learning, which learns agent policy by mimicking expert demonstration, has shown promising results in many applications such as medical treatment regimes and self-driving vehicles. However, it remains a difficult task to interpret control policies learned by the agent. Difficulties mainly come from two aspects: 1) agents in imitation learning are usually implemented as deep neural networks, which are black-box models and lack interpretability; 2) the latent causal mechanism behind agents' decisions may vary along the trajectory, rather than staying static throughout time steps. To increase transparency and offer better interpretability of the neural agent, we propose to expose its captured knowledge in the form of a directed acyclic causal graph, with nodes being action and state variables and edges denoting the causal relations behind predictions. Furthermore, we design this causal discovery process to be state-dependent, enabling it to model the dynamics in latent causal graphs. Concretely, we conduct causal discovery from the perspective of Granger causality and propose a self-explainable imitation learning framework, {\method}. The proposed framework is composed of three parts: a dynamic causal discovery module, a causality encoding module, and a prediction module, and is trained in an end-to-end manner. After the model is learned, we can obtain causal relations among states and action variables behind its decisions, exposing policies learned by it. Experimental results on both synthetic and real-world datasets demonstrate the effectiveness of the proposed {\method} in learning the dynamic causal graphs for understanding the decision-making of imitation learning meanwhile maintaining high prediction accuracy.
    Learning sources of variability from high-dimensional observational studies. (arXiv:2307.13868v2 [stat.ME] UPDATED)
    Causal inference studies whether the presence of a variable influences an observed outcome. As measured by quantities such as the "average treatment effect," this paradigm is employed across numerous biological fields, from vaccine and drug development to policy interventions. Unfortunately, the majority of these methods are often limited to univariate outcomes. Our work generalizes causal estimands to outcomes with any number of dimensions or any measurable space, and formulates traditional causal estimands for nominal variables as causal discrepancy tests. We propose a simple technique for adjusting universally consistent conditional independence tests and prove that these tests are universally consistent causal discrepancy tests. Numerical experiments illustrate that our method, Causal CDcorr, leads to improvements in both finite sample validity and power when compared to existing strategies. Our methods are all open source and available at github.com/ebridge2/cdcorr.
    Topology-Preserving Adversarial Training. (arXiv:2311.17607v1 [cs.CV])
    Despite the effectiveness in improving the robustness of neural networks, adversarial training has suffered from the natural accuracy degradation problem, i.e., accuracy on natural samples has reduced significantly. In this study, we reveal that natural accuracy degradation is highly related to the disruption of the natural sample topology in the representation space by quantitative and qualitative experiments. Based on this observation, we propose Topology-pReserving Adversarial traINing (TRAIN) to alleviate the problem by preserving the topology structure of natural samples from a standard model trained only on natural samples during adversarial training. As an additional regularization, our method can easily be combined with various popular adversarial training algorithms in a plug-and-play manner, taking advantage of both sides. Extensive experiments on CIFAR-10, CIFAR-100, and Tiny ImageNet show that our proposed method achieves consistent and significant improvements over various strong baselines in most cases. Specifically, without additional data, our proposed method achieves up to 8.78% improvement in natural accuracy and 4.50% improvement in robust accuracy.
    Improving Self-supervised Molecular Representation Learning using Persistent Homology. (arXiv:2311.17327v1 [cs.LG])
    Self-supervised learning (SSL) has great potential for molecular representation learning given the complexity of molecular graphs, the large amounts of unlabelled data available, the considerable cost of obtaining labels experimentally, and the hence often only small training datasets. The importance of the topic is reflected in the variety of paradigms and architectures that have been investigated recently. Yet the differences in performance seem often minor and are barely understood to date. In this paper, we study SSL based on persistent homology (PH), a mathematical tool for modeling topological features of data that persist across multiple scales. It has several unique features which particularly suit SSL, naturally offering: different views of the data, stability in terms of distance preservation, and the opportunity to flexibly incorporate domain knowledge. We (1) investigate an autoencoder, which shows the general representational power of PH, and (2) propose a contrastive loss that complements existing approaches. We rigorously evaluate our approach for molecular property prediction and demonstrate its particular features in improving the embedding space: after SSL, the representations are better and offer considerably more predictive power than the baselines over different probing tasks; our loss increases baseline performance, sometimes largely; and we often obtain substantial improvements over very small datasets, a common scenario in practice.
    Maximum Entropy Model Correction in Reinforcement Learning. (arXiv:2311.17855v1 [cs.LG])
    We propose and theoretically analyze an approach for planning with an approximate model in reinforcement learning that can reduce the adverse impact of model error. If the model is accurate enough, it accelerates the convergence to the true value function too. One of its key components is the MaxEnt Model Correction (MoCo) procedure that corrects the model's next-state distributions based on a Maximum Entropy density estimation formulation. Based on MoCo, we introduce the Model Correcting Value Iteration (MoCoVI) algorithm, and its sampled-based variant MoCoDyna. We show that MoCoVI and MoCoDyna's convergence can be much faster than the conventional model-free algorithms. Unlike traditional model-based algorithms, MoCoVI and MoCoDyna effectively utilize an approximate model and still converge to the correct value function.
    Improved Prototypical Semi-Supervised Learning with Foundation Models: Prototype Selection, Parametric vMF-SNE Pretraining and Multi-view Pseudolabelling. (arXiv:2311.17093v1 [cs.CV])
    In this paper we present an improved approach to prototypical semi-supervised learning for computer vision, in the context of leveraging a frozen foundation model as the backbone of our neural network. As a general tool, we propose parametric von-Mises Fisher Stochastic Neighbour Embedding (vMF-SNE) to create mappings with neural networks between high-dimensional latent spaces that preserve local structure. This enables us to pretrain the projection head of our network using the high-quality embeddings of the foundation model with vMF-SNE. We also propose soft multi-view pseudolabels, where predictions across multiple views are combined to provide a more reliable supervision signal compared to a consistency or swapped assignment approach. We demonstrate that these ideas improve upon P}redicting View-Assignments with Support Samples (PAWS), a current state-of-the-art semi-supervised learning method, as well as Robust PAWS (RoPAWS), over a range of benchmarking datasets. We also introduce simple $k$-means prototype selection, a technique that provides superior performance to other unsupervised label selection approaches in this context. These changes improve upon PAWS by an average of +2.9% for CIFAR-10 and +5.7% for CIFAR-100 with four labels per class, and by +15.2% for DeepWeeds, a particularly challenging dataset for semi-supervised learning. We also achieve new state-of-the-art results in semi-supervised learning in this small label regime for CIFAR-10 - 95.8% (+0.7%) and CIFAR-100 - 76.6% (+12.0%).
    Improving embedding of graphs with missing data by soft manifolds. (arXiv:2311.17598v1 [cs.LG])
    Embedding graphs in continous spaces is a key factor in designing and developing algorithms for automatic information extraction to be applied in diverse tasks (e.g., learning, inferring, predicting). The reliability of graph embeddings directly depends on how much the geometry of the continuous space matches the graph structure. Manifolds are mathematical structure that can enable to incorporate in their topological spaces the graph characteristics, and in particular nodes distances. State-of-the-art of manifold-based graph embedding algorithms take advantage of the assumption that the projection on a tangential space of each point in the manifold (corresponding to a node in the graph) would locally resemble a Euclidean space. Although this condition helps in achieving efficient analytical solutions to the embedding problem, it does not represent an adequate set-up to work with modern real life graphs, that are characterized by weighted connections across nodes often computed over sparse datasets with missing records. In this work, we introduce a new class of manifold, named soft manifold, that can solve this situation. In particular, soft manifolds are mathematical structures with spherical symmetry where the tangent spaces to each point are hypocycloids whose shape is defined according to the velocity of information propagation across the data points. Using soft manifolds for graph embedding, we can provide continuous spaces to pursue any task in data analysis over complex datasets. Experimental results on reconstruction tasks on synthetic and real datasets show how the proposed approach enable more accurate and reliable characterization of graphs in continuous spaces with respect to the state-of-the-art.
    LiveTune: Dynamic Parameter Tuning for Training Deep Neural Networks. (arXiv:2311.17279v1 [cs.LG])
    Traditional machine learning training is a static process that lacks real-time adaptability of hyperparameters. Popular tuning solutions during runtime involve checkpoints and schedulers. Adjusting hyper-parameters usually require the program to be restarted, wasting utilization and time, while placing unnecessary strain on memory and processors. We present LiveTune, a new framework allowing real-time parameter tuning during training through LiveVariables. Live Variables allow for a continuous training session by storing parameters on designated ports on the system, allowing them to be dynamically adjusted. Extensive evaluations of our framework show saving up to 60 seconds and 5.4 Kilojoules of energy per hyperparameter change.
    Introduction to Transformers: an NLP Perspective. (arXiv:2311.17633v1 [cs.CL])
    Transformers have dominated empirical machine learning models of natural language processing. In this paper, we introduce basic concepts of Transformers and present key techniques that form the recent advances of these models. This includes a description of the standard Transformer architecture, a series of model refinements, and common applications. Given that Transformers and related deep learning techniques might be evolving in ways we have never seen, we cannot dive into all the model details or cover all the technical areas. Instead, we focus on just those concepts that are helpful for gaining a good understanding of Transformers and their variants. We also summarize the key ideas that impact this field, thereby yielding some insights into the strengths and limitations of these models.
    Graph-based Molecular Representation Learning. (arXiv:2207.04869v3 [q-bio.QM] UPDATED)
    Molecular representation learning (MRL) is a key step to build the connection between machine learning and chemical science. In particular, it encodes molecules as numerical vectors preserving the molecular structures and features, on top of which the downstream tasks (e.g., property prediction) can be performed. Recently, MRL has achieved considerable progress, especially in methods based on deep molecular graph learning. In this survey, we systematically review these graph-based molecular representation techniques, especially the methods incorporating chemical domain knowledge. Specifically, we first introduce the features of 2D and 3D molecular graphs. Then we summarize and categorize MRL methods into three groups based on their input. Furthermore, we discuss some typical chemical applications supported by MRL. To facilitate studies in this fast-developing area, we also list the benchmarks and commonly used datasets in the paper. Finally, we share our thoughts on future research directions.
    Minimax Exploiter: A Data Efficient Approach for Competitive Self-Play. (arXiv:2311.17190v1 [cs.LG])
    Recent advances in Competitive Self-Play (CSP) have achieved, or even surpassed, human level performance in complex game environments such as Dota 2 and StarCraft II using Distributed Multi-Agent Reinforcement Learning (MARL). One core component of these methods relies on creating a pool of learning agents -- consisting of the Main Agent, past versions of this agent, and Exploiter Agents -- where Exploiter Agents learn counter-strategies to the Main Agents. A key drawback of these approaches is the large computational cost and physical time that is required to train the system, making them impractical to deploy in highly iterative real-life settings such as video game productions. In this paper, we propose the Minimax Exploiter, a game theoretic approach to exploiting Main Agents that leverages knowledge of its opponents, leading to significant increases in data efficiency. We validate our approach in a diversity of settings, including simple turn based games, the arcade learning environment, and For Honor, a modern video game. The Minimax Exploiter consistently outperforms strong baselines, demonstrating improved stability and data efficiency, leading to a robust CSP-MARL method that is both flexible and easy to deploy.  ( 2 min )
    Generative Data Augmentation Improves Scribble-supervised Semantic Segmentation. (arXiv:2311.17121v1 [cs.CV])
    Recent advances in generative models, such as diffusion models, have made generating high-quality synthetic images widely accessible. Prior works have shown that training on synthetic images improves many perception tasks, such as image classification, object detection, and semantic segmentation. We are the first to explore generative data augmentations for scribble-supervised semantic segmentation. We propose a generative data augmentation method that leverages a ControlNet diffusion model conditioned on semantic scribbles to produce high-quality training data. However, naive implementations of generative data augmentations may inadvertently harm the performance of the downstream segmentor rather than improve it. We leverage classifier-free diffusion guidance to enforce class consistency and introduce encode ratios to trade off data diversity for data realism. Using the guidance scale and encode ratio, we are able to generate a spectrum of high-quality training images. We propose multiple augmentation schemes and find that these schemes significantly impact model performance, especially in the low-data regime. Our framework further reduces the gap between the performance of scribble-supervised segmentation and that of fully-supervised segmentation. We also show that our framework significantly improves segmentation performance on small datasets, even surpassing fully-supervised segmentation.  ( 2 min )
    Pragmatic Radiology Report Generation. (arXiv:2311.17154v1 [cs.CL])
    When pneumonia is not found on a chest X-ray, should the report describe this negative observation or omit it? We argue that this question cannot be answered from the X-ray alone and requires a pragmatic perspective, which captures the communicative goal that radiology reports serve between radiologists and patients. However, the standard image-to-text formulation for radiology report generation fails to incorporate such pragmatic intents. Following this pragmatic perspective, we demonstrate that the indication, which describes why a patient comes for an X-ray, drives the mentions of negative observations and introduce indications as additional input to report generation. With respect to the output, we develop a framework to identify uninferable information from the image as a source of model hallucinations, and limit them by cleaning groundtruth reports. Finally, we use indications and cleaned groundtruth reports to develop pragmatic models, and show that they outperform existing methods not only in new pragmatics-inspired metrics (+4.3 Negative F1) but also in standard metrics (+6.3 Positive F1 and +11.0 BLEU-2).  ( 2 min )
    Language Models: A Guide for the Perplexed. (arXiv:2311.17301v1 [cs.CL])
    Given the growing importance of AI literacy, we decided to write this tutorial to help narrow the gap between the discourse among those who study language models -- the core technology underlying ChatGPT and similar products -- and those who are intrigued and want to learn more about them. In short, we believe the perspective of researchers and educators can add some clarity to the public's understanding of the technologies beyond what's currently available, which tends to be either extremely technical or promotional material generated about products by their purveyors. Our approach teases apart the concept of a language model from products built on them, from the behaviors attributed to or desired from those products, and from claims about similarity to human cognition. As a starting point, we (1) offer a scientific viewpoint that focuses on questions amenable to study through experimentation; (2) situate language models as they are today in the context of the research that led to their development; and (3) describe the boundaries of what is known about the models at this writing.  ( 2 min )
    Group-wise Sparse and Explainable Adversarial Attacks. (arXiv:2311.17434v1 [cs.CV])
    Sparse adversarial attacks fool deep neural networks (DNNs) through minimal pixel perturbations, typically regularized by the $\ell_0$ norm. Recent efforts have replaced this norm with a structural sparsity regularizer, such as the nuclear group norm, to craft group-wise sparse adversarial attacks. The resulting perturbations are thus explainable and hold significant practical relevance, shedding light on an even greater vulnerability of DNNs than previously anticipated. However, crafting such attacks poses an optimization challenge, as it involves computing norms for groups of pixels within a non-convex objective. In this paper, we tackle this challenge by presenting an algorithm that simultaneously generates group-wise sparse attacks within semantically meaningful areas of an image. In each iteration, the core operation of our algorithm involves the optimization of a quasinorm adversarial loss. This optimization is achieved by employing the $1/2$-quasinorm proximal operator for some iterations, a method tailored for nonconvex programming. Subsequently, the algorithm transitions to a projected Nesterov's accelerated gradient descent with $2$-norm regularization applied to perturbation magnitudes. We rigorously evaluate the efficacy of our novel attack in both targeted and non-targeted attack scenarios, on CIFAR-10 and ImageNet datasets. When compared to state-of-the-art methods, our attack consistently results in a remarkable increase in group-wise sparsity, e.g., an increase of $48.12\%$ on CIFAR-10 and $40.78\%$ on ImageNet (average case, targeted attack), all while maintaining lower perturbation magnitudes. Notably, this performance is complemented by a significantly faster computation time and a $100\%$ attack success rate.  ( 2 min )
    Improving the Robustness of Transformer-based Large Language Models with Dynamic Attention. (arXiv:2311.17400v1 [cs.CL])
    Transformer-based models, such as BERT and GPT, have been widely adopted in natural language processing (NLP) due to their exceptional performance. However, recent studies show their vulnerability to textual adversarial attacks where the model's output can be misled by intentionally manipulating the text inputs. Despite various methods that have been proposed to enhance the model's robustness and mitigate this vulnerability, many require heavy consumption resources (e.g., adversarial training) or only provide limited protection (e.g., defensive dropout). In this paper, we propose a novel method called dynamic attention, tailored for the transformer architecture, to enhance the inherent robustness of the model itself against various adversarial attacks. Our method requires no downstream task knowledge and does not incur additional costs. The proposed dynamic attention consists of two modules: (I) attention rectification, which masks or weakens the attention value of the chosen tokens, and (ii) dynamic modeling, which dynamically builds the set of candidate tokens. Extensive experiments demonstrate that dynamic attention significantly mitigates the impact of adversarial attacks, improving up to 33\% better performance than previous methods against widely-used adversarial attacks. The model-level design of dynamic attention enables it to be easily combined with other defense methods (e.g., adversarial training) to further enhance the model's robustness. Furthermore, we demonstrate that dynamic attention preserves the state-of-the-art robustness space of the original model compared to other dynamic modeling methods.  ( 3 min )
    Compositional Chain-of-Thought Prompting for Large Multimodal Models. (arXiv:2311.17076v1 [cs.CV])
    The combination of strong visual backbones and Large Language Model (LLM) reasoning has led to Large Multimodal Models (LMMs) becoming the current standard for a wide range of vision and language (VL) tasks. However, recent research has shown that even the most advanced LMMs still struggle to capture aspects of compositional visual reasoning, such as attributes and relationships between objects. One solution is to utilize scene graphs (SGs)--a formalization of objects and their relations and attributes that has been extensively used as a bridge between the visual and textual domains. Yet, scene graph data requires scene graph annotations, which are expensive to collect and thus not easily scalable. Moreover, finetuning an LMM based on SG data can lead to catastrophic forgetting of the pretraining objective. To overcome this, inspired by chain-of-thought methods, we propose Compositional Chain-of-Thought (CCoT), a novel zero-shot Chain-of-Thought prompting method that utilizes SG representations in order to extract compositional knowledge from an LMM. Specifically, we first generate an SG using the LMM, and then use that SG in the prompt to produce a response. Through extensive experiments, we find that the proposed CCoT approach not only improves LMM performance on several vision and language VL compositional benchmarks but also improves the performance of several popular LMMs on general multimodal benchmarks, without the need for fine-tuning or annotated ground-truth SGs.  ( 2 min )
    SoUnD Framework: Analyzing (So)cial Representation in (Un)structured (D)ata. (arXiv:2311.17259v1 [cs.LG])
    The unstructured nature of data used in foundation model development is a challenge to systematic analyses for making data use and documentation decisions. From a Responsible AI perspective, these decisions often rely upon understanding how people are represented in data. We propose a framework designed to guide analysis of human representation in unstructured data and identify downstream risks. We apply the framework in two toy examples using the Common Crawl web text corpus (C4) and LAION-400M. We also propose a set of hypothetical action steps in service of dataset use, development, and documentation.  ( 2 min )
    An Online Optimization-Based Decision Support Tool for Small Farmers in India: Learning in Non-stationary Environments. (arXiv:2311.17277v1 [cs.LG])
    Crop management decision support systems are specialized tools for farmers that reduce the riskiness of revenue streams, especially valuable for use under the current climate changes that impact agricultural productivity. Unfortunately, small farmers in India, who could greatly benefit from these tools, do not have access to them. In this paper, we model an individual greenhouse as a Markov Decision Process (MDP) and adapt Li and Li (2019)'s Follow the Weighted Leader (FWL) online learning algorithm to offer crop planning advice. We successfully produce utility-preserving cropping pattern suggestions in simulations. When we compare against an offline planning algorithm, we achieve the same cumulative revenue with greatly reduced runtime.  ( 2 min )
    Shadows Don't Lie and Lines Can't Bend! Generative Models don't know Projective Geometry...for now. (arXiv:2311.17138v1 [cs.CV])
    Generative models can produce impressively realistic images. This paper demonstrates that generated images have geometric features different from those of real images. We build a set of collections of generated images, prequalified to fool simple, signal-based classifiers into believing they are real. We then show that prequalified generated images can be identified reliably by classifiers that only look at geometric properties. We use three such classifiers. All three classifiers are denied access to image pixels, and look only at derived geometric features. The first classifier looks at the perspective field of the image, the second looks at lines detected in the image, and the third looks at relations between detected objects and shadows. Our procedure detects generated images more reliably than SOTA local signal based detectors, for images from a number of distinct generators. Saliency maps suggest that the classifiers can identify geometric problems reliably. We conclude that current generators cannot reliably reproduce geometric properties of real images.  ( 2 min )
    Fast Particle-based Anomaly Detection Algorithm with Variational Autoencoder. (arXiv:2311.17162v1 [hep-ex])
    Model-agnostic anomaly detection is one of the promising approaches in the search for new beyond the standard model physics. In this paper, we present Set-VAE, a particle-based variational autoencoder (VAE) anomaly detection algorithm. We demonstrate a 2x signal efficiency gain compared with traditional subjettiness-based jet selection. Furthermore, with an eye to the future deployment to trigger systems, we propose the CLIP-VAE, which reduces the inference-time cost of anomaly detection by using the KL-divergence loss as the anomaly score, resulting in a 2x acceleration in latency and reducing the caching requirement.  ( 2 min )
    Optimal EEG Electrode Set for Emotion Recognition From Brain Signals: An Empirical Quest. (arXiv:2311.17204v1 [cs.LG])
    The human brain is a complex organ, still completely undiscovered, that controls almost all the parts of the body. Apart from survival, the human brain stimulates emotions. Recent research indicates that brain signals can be very effective for emotion recognition. However, which parts of the brain exhibit most of the emotions is still under-explored. In this study, we empirically analyze the contribution of each part of the brain in exhibiting emotions. We use the DEAP dataset to find the most optimal electrode set which eventually leads to the effective brain part associated with emotions. We use Fast Fourier Transformation for effective feature extraction and a 1D-CNN with residual connection for classification. Though 32 electrodes from the DEAP dataset got an accuracy of 97.34%, only 12 electrodes (F7, P8, O1, F8, C4, T7, PO3, Fp1, Fp2, O2, P3, and Fz) achieve 95.81% accuracy. This study also shows that adding more than 10 electrodes does not improve performance significantly. Moreover, the frontal lobe is the most important for recognizing emotion.  ( 2 min )
    Deployment of a Robust and Explainable Mortality Prediction Model: The COVID-19 Pandemic and Beyond. (arXiv:2311.17133v1 [cs.LG])
    This study investigated the performance, explainability, and robustness of deployed artificial intelligence (AI) models in predicting mortality during the COVID-19 pandemic and beyond. The first study of its kind, we found that Bayesian Neural Networks (BNNs) and intelligent training techniques allowed our models to maintain performance amidst significant data shifts. Our results emphasize the importance of developing robust AI models capable of matching or surpassing clinician predictions, even under challenging conditions. Our exploration of model explainability revealed that stochastic models generate more diverse and personalized explanations thereby highlighting the need for AI models that provide detailed and individualized insights in real-world clinical settings. Furthermore, we underscored the importance of quantifying uncertainty in AI models which enables clinicians to make better-informed decisions based on reliable predictions. Our study advocates for prioritizing implementation science in AI research for healthcare and ensuring that AI solutions are practical, beneficial, and sustainable in real-world clinical environments. By addressing unique challenges and complexities in healthcare settings, researchers can develop AI models that effectively improve clinical practice and patient outcomes.  ( 3 min )
    ClimateX: Do LLMs Accurately Assess Human Expert Confidence in Climate Statements?. (arXiv:2311.17107v1 [cs.LG])
    Evaluating the accuracy of outputs generated by Large Language Models (LLMs) is especially important in the climate science and policy domain. We introduce the Expert Confidence in Climate Statements (ClimateX) dataset, a novel, curated, expert-labeled dataset consisting of 8094 climate statements collected from the latest Intergovernmental Panel on Climate Change (IPCC) reports, labeled with their associated confidence levels. Using this dataset, we show that recent LLMs can classify human expert confidence in climate-related statements, especially in a few-shot learning setting, but with limited (up to 47%) accuracy. Overall, models exhibit consistent and significant over-confidence on low and medium confidence statements. We highlight implications of our results for climate communication, LLMs evaluation strategies, and the use of LLMs in information retrieval systems.  ( 2 min )
    XAI for time-series classification leveraging image highlight methods. (arXiv:2311.17110v1 [cs.LG])
    Although much work has been done on explainability in the computer vision and natural language processing (NLP) fields, there is still much work to be done to explain methods applied to time series as time series by nature can not be understood at first sight. In this paper, we present a Deep Neural Network (DNN) in a teacher-student architecture (distillation model) that offers interpretability in time-series classification tasks. The explainability of our approach is based on transforming the time series to 2D plots and applying image highlight methods (such as LIME and GradCam), making the predictions interpretable. At the same time, the proposed approach offers increased accuracy competing with the baseline model with the trade-off of increasing the training time.  ( 2 min )
    Feedback RoI Features Improve Aerial Object Detection. (arXiv:2311.17129v1 [cs.CV])
    Neuroscience studies have shown that the human visual system utilizes high-level feedback information to guide lower-level perception, enabling adaptation to signals of different characteristics. In light of this, we propose Feedback multi-Level feature Extractor (Flex) to incorporate a similar mechanism for object detection. Flex refines feature selection based on image-wise and instance-level feedback information in response to image quality variation and classification uncertainty. Experimental results show that Flex offers consistent improvement to a range of existing SOTA methods on the challenging aerial object detection datasets including DOTA-v1.0, DOTA-v1.5, and HRSC2016. Although the design originates in aerial image detection, further experiments on MS COCO also reveal our module's efficacy in general detection models. Quantitative and qualitative analyses indicate that the improvements are closely related to image qualities, which match our motivation.  ( 2 min )
    Efficient Deep Speech Understanding at the Edge. (arXiv:2311.17065v1 [eess.AS])
    Contemporary Speech Understanding (SU) involves a sophisticated pipeline: capturing real-time voice input, the pipeline encompasses a deep neural network with an encoder-decoder architecture enhanced by beam search. This network periodically assesses attention and Connectionist Temporal Classification (CTC) scores in its autoregressive output. This paper aims to enhance SU performance on edge devices with limited resources. It pursues two intertwined goals: accelerating on-device execution and efficiently handling inputs that surpass the on-device model's capacity. While these objectives are well-established, we introduce innovative solutions that specifically address SU's distinctive challenges: 1. Late contextualization: Enables the parallel execution of a model's attentive encoder during input ingestion. 2. Pilot decoding: Alleviates temporal load imbalances. 3. Autoregression offramps: Facilitate offloading decisions based on partial output sequences. Our techniques seamlessly integrate with existing SU models, pipelines, and frameworks, allowing for independent or combined application. Together, they constitute a hybrid solution for edge SU, exemplified by our prototype, XYZ. Evaluated on platforms equipped with 6-8 Arm cores, our system achieves State-of-the-Art (SOTA) accuracy, reducing end-to-end latency by 2x and halving offloading requirements.  ( 2 min )
    Accelerating DNN Training With Photonics: A Residue Number System-Based Design. (arXiv:2311.17323v1 [cs.AR])
    Photonic computing is a compelling avenue for performing highly efficient matrix multiplication, a crucial operation in Deep Neural Networks (DNNs). While this method has shown great success in DNN inference, meeting the high precision demands of DNN training proves challenging due to the precision limitations imposed by costly data converters and the analog noise inherent in photonic hardware. This paper proposes Mirage, a photonic DNN training accelerator that overcomes the precision challenges in photonic hardware using the Residue Number System (RNS). RNS is a numeral system based on modular arithmetic$\unicode{x2014}$allowing us to perform high-precision operations via multiple low-precision modular operations. In this work, we present a novel micro-architecture and dataflow for an RNS-based photonic tensor core performing modular arithmetic in the analog domain. By combining RNS and photonics, Mirage provides high energy efficiency without compromising precision and can successfully train state-of-the-art DNNs achieving accuracy comparable to FP32 training. Our study shows that on average across several DNNs when compared to systolic arrays, Mirage achieves more than $23.8\times$ faster training and $32.1\times$ lower EDP in an iso-energy scenario and consumes $42.8\times$ lower power with comparable or better EDP in an iso-area scenario.  ( 2 min )
    \texttt{GlycoNMR}: Dataset and benchmarks for NMR chemical shift prediction of carbohydrates with graph neural networks. (arXiv:2311.17134v1 [cs.LG])
    Molecular representation learning (MRL) is a powerful tool for bridging the gap between machine learning and chemical sciences, as it converts molecules into numerical representations while preserving their chemical features. These encoded representations serve as a foundation for various downstream biochemical studies, including property prediction and drug design. MRL has had great success with proteins and general biomolecule datasets. Yet, in the growing sub-field of glycoscience (the study of carbohydrates, where longer carbohydrates are also called glycans), MRL methods have been barely explored. This under-exploration can be primarily attributed to the limited availability of comprehensive and well-curated carbohydrate-specific datasets and a lack of Machine learning (ML) pipelines specifically tailored to meet the unique problems presented by carbohydrate data. Since interpreting and annotating carbohydrate-specific data is generally more complicated than protein data, domain experts are usually required to get involved. The existing MRL methods, predominately optimized for proteins and small biomolecules, also cannot be directly used in carbohydrate applications without special modifications. To address this challenge, accelerate progress in glycoscience, and enrich the data resources of the MRL community, we introduce GlycoNMR. GlycoNMR contains two laboriously curated datasets with 2,609 carbohydrate structures and 211,543 annotated nuclear magnetic resonance (NMR) chemical shifts for precise atomic-level prediction. We tailored carbohydrate-specific features and adapted existing MRL models to tackle this problem effectively. For illustration, we benchmark four modified MRL models on our new datasets.  ( 3 min )
    Generative Models: What do they know? Do they know things? Let's find out!. (arXiv:2311.17137v1 [cs.CV])
    Generative models have been shown to be capable of synthesizing highly detailed and realistic images. It is natural to suspect that they implicitly learn to model some image intrinsics such as surface normals, depth, or shadows. In this paper, we present compelling evidence that generative models indeed internally produce high-quality scene intrinsic maps. We introduce Intrinsic LoRA (I LoRA), a universal, plug-and-play approach that transforms any generative model into a scene intrinsic predictor, capable of extracting intrinsic scene maps directly from the original generator network without needing additional decoders or fully fine-tuning the original network. Our method employs a Low-Rank Adaptation (LoRA) of key feature maps, with newly learned parameters that make up less than 0.6% of the total parameters in the generative model. Optimized with a small set of labeled images, our model-agnostic approach adapts to various generative architectures, including Diffusion models, GANs, and Autoregressive models. We show that the scene intrinsic maps produced by our method compare well with, and in some cases surpass those generated by leading supervised techniques.  ( 2 min )
    Deep convolutional encoder-decoder hierarchical neural networks for conjugate heat transfer surrogate modeling. (arXiv:2311.17068v1 [cs.CE])
    Conjugate heat transfer (CHT) models are vital for the design of many engineering systems. However, high-fidelity CHT models are computationally intensive, which limits their use in applications such as design optimization, where hundreds to thousands of model evaluations are required. In this work, we develop a modular deep convolutional encoder-decoder hierarchical (DeepEDH) neural network, a novel deep-learning-based surrogate modeling methodology for computationally intensive CHT models. Leveraging convective temperature dependencies, we propose a two-stage temperature prediction architecture that couples velocity and temperature models. The proposed DeepEDH methodology is demonstrated by modeling the pressure, velocity, and temperature fields for a liquid-cooled cold-plate-based battery thermal management system with variable channel geometry. A computational model of the cold plate is developed and solved using the finite element method (FEM), generating a dataset of 1,500 simulations. The FEM results are transformed and scaled from unstructured to structured, image-like meshes to create training and test datasets. The DeepEDH methodology's performance is examined in relation to data scaling, training dataset size, and network depth. Our performance analysis covers the impact of the novel architecture, separate field models, output geometry masks, multi-stage temperature models, and optimizations of the hyperparameters and architecture. Furthermore, we quantify the influence of the CHT thermal boundary condition on surrogate model performance, highlighting improved temperature model performance with higher heat fluxes. Compared to other deep learning neural network surrogate models, such as U-Net and DenseED, the proposed DeepEDH methodology for CHT models exhibits up to a 65% enhancement in the coefficient of determination ($R^{2}$).  ( 3 min )
    Predicting the Age of Astronomical Transients from Real-Time Multivariate Time Series. (arXiv:2311.17143v1 [astro-ph.IM])
    Astronomical transients, such as supernovae and other rare stellar explosions, have been instrumental in some of the most significant discoveries in astronomy. New astronomical sky surveys will soon record unprecedented numbers of transients as sparsely and irregularly sampled multivariate time series. To improve our understanding of the physical mechanisms of transients and their progenitor systems, early-time measurements are necessary. Prioritizing the follow-up of transients based on their age along with their class is crucial for new surveys. To meet this demand, we present the first method of predicting the age of transients in real-time from multi-wavelength time-series observations. We build a Bayesian probabilistic recurrent neural network. Our method can accurately predict the age of a transient with robust uncertainties as soon as it is initially triggered by a survey telescope. This work will be essential for the advancement of our understanding of the numerous young transients being detected by ongoing and upcoming astronomical surveys.  ( 2 min )
    A point cloud approach to generative modeling for galaxy surveys at the field level. (arXiv:2311.17141v1 [astro-ph.CO])
    We introduce a diffusion-based generative model to describe the distribution of galaxies in our Universe directly as a collection of points in 3-D space (coordinates) optionally with associated attributes (e.g., velocities and masses), without resorting to binning or voxelization. The custom diffusion model can be used both for emulation, reproducing essential summary statistics of the galaxy distribution, as well as inference, by computing the conditional likelihood of a galaxy field. We demonstrate a first application to massive dark matter haloes in the Quijote simulation suite. This approach can be extended to enable a comprehensive analysis of cosmological data, circumventing limitations inherent to summary statistic -- as well as neural simulation-based inference methods.  ( 2 min )
  • Open

    LegendreTron: Uprising Proper Multiclass Loss Learning. (arXiv:2301.11695v3 [stat.ML] UPDATED)
    Loss functions serve as the foundation of supervised learning and are often chosen prior to model development. To avoid potentially ad hoc choices of losses, statistical decision theory describes a desirable property for losses known as \emph{properness}, which asserts that Bayes' rule is optimal. Recent works have sought to \emph{learn losses} and models jointly. Existing methods do this by fitting an inverse canonical link function which monotonically maps $\mathbb{R}$ to $[0,1]$ to estimate probabilities for binary problems. In this paper, we extend monotonicity to maps between $\mathbb{R}^{C-1}$ and the projected probability simplex $\tilde{\Delta}^{C-1}$ by using monotonicity of gradients of convex functions. We present {\sc LegendreTron} as a novel and practical method that jointly learns \emph{proper canonical losses} and probabilities for multiclass problems. Tested on a benchmark of domains with up to 1,000 classes, our experimental results show that our method consistently outperforms the natural multiclass baseline under a $t$-test at 99% significance on all datasets with greater than 10 classes.  ( 2 min )
    A quasi-polynomial time algorithm for Multi-Dimensional Scaling via LP hierarchies. (arXiv:2311.17840v1 [cs.DS])
    Multi-dimensional Scaling (MDS) is a family of methods for embedding pair-wise dissimilarities between $n$ objects into low-dimensional space. MDS is widely used as a data visualization tool in the social and biological sciences, statistics, and machine learning. We study the Kamada-Kawai formulation of MDS: given a set of non-negative dissimilarities $\{d_{i,j}\}_{i , j \in [n]}$ over $n$ points, the goal is to find an embedding $\{x_1,\dots,x_n\} \subset \mathbb{R}^k$ that minimizes \[ \text{OPT} = \min_{x} \mathbb{E}_{i,j \in [n]} \left[ \left(1-\frac{\|x_i - x_j\|}{d_{i,j}}\right)^2 \right] \] Despite its popularity, our theoretical understanding of MDS is extremely limited. Recently, Demaine, Hesterberg, Koehler, Lynch, and Urschel (arXiv:2109.11505) gave the first approximation algorithm with provable guarantees for Kamada-Kawai, which achieves an embedding with cost $\text{OPT} +\epsilon$ in $n^2 \cdot 2^{\tilde{\mathcal{O}}(k \Delta^4 / \epsilon^2)}$ time, where $\Delta$ is the aspect ratio of the input dissimilarities. In this work, we give the first approximation algorithm for MDS with quasi-polynomial dependency on $\Delta$: for target dimension $k$, we achieve a solution with cost $\mathcal{O}(\text{OPT}^{ \hspace{0.04in}1/k } \cdot \log(\Delta/\epsilon) )+ \epsilon$ in time $n^{ \mathcal{O}(1)} \cdot 2^{\tilde{\mathcal{O}}( k^2 (\log(\Delta)/\epsilon)^{k/2 + 1} ) }$. Our approach is based on a novel analysis of a conditioning-based rounding scheme for the Sherali-Adams LP Hierarchy. Crucially, our analysis exploits the geometry of low-dimensional Euclidean space, allowing us to avoid an exponential dependence on the aspect ratio $\Delta$. We believe our geometry-aware treatment of the Sherali-Adams Hierarchy is an important step towards developing general-purpose techniques for efficient metric optimization algorithms.
    Corruption-Robust Lipschitz Contextual Search. (arXiv:2307.13903v3 [cs.LG] UPDATED)
    I study the problem of learning a Lipschitz function with corrupted binary signals. The learner tries to learn a $L$-Lipschitz function $f: [0,1]^d \rightarrow [0, L]$ that the adversary chooses. There is a total of $T$ rounds. In each round $t$, the adversary selects a context vector $x_t$ in the input space, and the learner makes a guess to the true function value $f(x_t)$ and receives a binary signal indicating whether the guess is high or low. In a total of $C$ rounds, the signal may be corrupted, though the value of $C$ is \emph{unknown} to the learner. The learner's goal is to incur a small cumulative loss. This work introduces the new algorithmic technique \emph{agnostic checking} as well as new analysis techniques. I design algorithms which: for the symmetric loss, the learner achieves regret $L\cdot O(C\log T)$ with $d = 1$ and $L\cdot O_d(C\log T + T^{(d-1)/d})$ with $d > 1$; for the pricing loss, the learner achieves regret $L\cdot \widetilde{O} (T^{d/(d+1)} + C\cdot T^{1/(d+1)})$.
    Are ensembles getting better all the time?. (arXiv:2311.17885v1 [stat.ML])
    Ensemble methods combine the predictions of several base models. We study whether or not including more models in an ensemble always improve its average performance. Such a question depends on the kind of ensemble considered, as well as the predictive metric chosen. We focus on situations where all members of the ensemble are a priori expected to perform as well, which is the case of several popular methods like random forests or deep ensembles. In this setting, we essentially show that ensembles are getting better all the time if, and only if, the considered loss function is convex. More precisely, in that case, the average loss of the ensemble is a decreasing function of the number of models. When the loss function is nonconvex, we show a series of results that can be summarised by the insight that ensembles of good models keep getting better, and ensembles of bad models keep getting worse. To this end, we prove a new result on the monotonicity of tail probabilities that may be of independent interest. We illustrate our results on a simple machine learning problem (diagnosing melanomas using neural nets).
    Rigorous dynamical mean field theory for stochastic gradient descent methods. (arXiv:2210.06591v3 [math-ph] UPDATED)
    We prove closed-form equations for the exact high-dimensional asymptotics of a family of first order gradient-based methods, learning an estimator (e.g. M-estimator, shallow neural network, ...) from observations on Gaussian data with empirical risk minimization. This includes widely used algorithms such as stochastic gradient descent (SGD) or Nesterov acceleration. The obtained equations match those resulting from the discretization of dynamical mean-field theory (DMFT) equations from statistical physics when applied to gradient flow. Our proof method allows us to give an explicit description of how memory kernels build up in the effective dynamics, and to include non-separable update functions, allowing datasets with non-identity covariance matrices. Finally, we provide numerical implementations of the equations for SGD with generic extensive batch-size and with constant learning rates.
    The Effects of Overparameterization on Sharpness-aware Minimization: An Empirical and Theoretical Analysis. (arXiv:2311.17539v1 [cs.LG])
    Training an overparameterized neural network can yield minimizers of the same level of training loss and yet different generalization capabilities. With evidence that indicates a correlation between sharpness of minima and their generalization errors, increasing efforts have been made to develop an optimization method to explicitly find flat minima as more generalizable solutions. This sharpness-aware minimization (SAM) strategy, however, has not been studied much yet as to how overparameterization can actually affect its behavior. In this work, we analyze SAM under varying degrees of overparameterization and present both empirical and theoretical results that suggest a critical influence of overparameterization on SAM. Specifically, we first use standard techniques in optimization to prove that SAM can achieve a linear convergence rate under overparameterization in a stochastic setting. We also show that the linearly stable minima found by SAM are indeed flatter and have more uniformly distributed Hessian moments compared to those of SGD. These results are corroborated with our experiments that reveal a consistent trend that the generalization improvement made by SAM continues to increase as the model becomes more overparameterized. We further present that sparsity can open up an avenue for effective overparameterization in practice.
    Beyond Invariance: Test-Time Label-Shift Adaptation for Distributions with "Spurious" Correlations. (arXiv:2211.15646v4 [stat.ML] UPDATED)
    Changes in the data distribution at test time can have deleterious effects on the performance of predictive models $p(y|x)$. We consider situations where there are additional meta-data labels (such as group labels), denoted by $z$, that can account for such changes in the distribution. In particular, we assume that the prior distribution $p(y, z)$, which models the dependence between the class label $y$ and the "nuisance" factors $z$, may change across domains, either due to a change in the correlation between these terms, or a change in one of their marginals. However, we assume that the generative model for features $p(x|y,z)$ is invariant across domains. We note that this corresponds to an expanded version of the widely used "label shift" assumption, where the labels now also include the nuisance factors $z$. Based on this observation, we propose a test-time label shift correction that adapts to changes in the joint distribution $p(y, z)$ using EM applied to unlabeled samples from the target domain distribution, $p_t(x)$. Importantly, we are able to avoid fitting a generative model $p(x|y, z)$, and merely need to reweight the outputs of a discriminative model $p_s(y, z|x)$ trained on the source distribution. We evaluate our method, which we call "Test-Time Label-Shift Adaptation" (TTLSA), on several standard image and text datasets, as well as the CheXpert chest X-ray dataset, and show that it improves performance over methods that target invariance to changes in the distribution, as well as baseline empirical risk minimization methods. Code for reproducing experiments is available at https://github.com/nalzok/test-time-label-shift .
    Modern Bayesian Experimental Design. (arXiv:2302.14545v2 [stat.ML] UPDATED)
    Bayesian experimental design (BED) provides a powerful and general framework for optimizing the design of experiments. However, its deployment often poses substantial computational challenges that can undermine its practical use. In this review, we outline how recent advances have transformed our ability to overcome these challenges and thus utilize BED effectively, before discussing some key areas for future development in the field.
    Direction-oriented Multi-objective Learning: Simple and Provable Stochastic Algorithms. (arXiv:2305.18409v3 [cs.LG] UPDATED)
    Multi-objective optimization (MOO) has become an influential framework in many machine learning problems with multiple objectives such as learning with multiple criteria and multi-task learning (MTL). In this paper, we propose a new direction-oriented multi-objective problem by regularizing the common descent direction within a neighborhood of a direction that optimizes a linear combination of objectives such as the average loss in MTL. This formulation includes GD and MGDA as special cases, enjoys the direction-oriented benefit as in CAGrad, and facilitates the design of stochastic algorithms. To solve this problem, we propose Stochastic Direction-oriented Multi-objective Gradient descent (SDMGrad) with simple SGD type of updates, and its variant SDMGrad-OS with an efficient objective sampling in the setting where the number of objectives is large. For a constant-level regularization parameter $\lambda$, we show that SDMGrad and SDMGrad-OS provably converge to a Pareto stationary point with improved complexities and milder assumptions. For an increasing $\lambda$, this convergent point reduces to a stationary point of the linear combination of objectives. We demonstrate the superior performance of the proposed methods in a series of tasks on multi-task supervised learning and reinforcement learning. Code is provided at https://github.com/ml-opt-lab/sdmgrad.
    Calabi-Yau Four/Five/Six-folds as $\mathbb{P}^n_\textbf{w}$ Hypersurfaces: Machine Learning, Approximation, and Generation. (arXiv:2311.17146v1 [hep-th])
    Calabi-Yau four-folds may be constructed as hypersurfaces in weighted projective spaces of complex dimension 5 defined via weight systems of 6 weights. In this work, neural networks were implemented to learn the Calabi-Yau Hodge numbers from the weight systems, where gradient saliency and symbolic regression then inspired a truncation of the Landau-Ginzburg model formula for the Hodge numbers of any dimensional Calabi-Yau constructed in this way. The approximation always provides a tight lower bound, is shown to be dramatically quicker to compute (with compute times reduced by up to four orders of magnitude), and gives remarkably accurate results for systems with large weights. Additionally, complementary datasets of weight systems satisfying the necessary but insufficient conditions for transversality were constructed, including considerations of the IP, reflexivity, and intradivisibility properties. Overall producing a classification of this weight system landscape, further confirmed with machine learning methods. Using the knowledge of this classification, and the properties of the presented approximation, a novel dataset of transverse weight systems consisting of 7 weights was generated for a sum of weights $\leq 200$; producing a new database of Calabi-Yau five-folds, with their respective topological properties computed. Further to this an equivalent database of candidate Calabi-Yau six-folds was generated with approximated Hodge numbers.
    Federated Online and Bandit Convex Optimization. (arXiv:2311.17586v1 [cs.LG])
    We study the problems of distributed online and bandit convex optimization against an adaptive adversary. We aim to minimize the average regret on $M$ machines working in parallel over $T$ rounds with $R$ intermittent communications. Assuming the underlying cost functions are convex and can be generated adaptively, our results show that collaboration is not beneficial when the machines have access to the first-order gradient information at the queried points. This is in contrast to the case for stochastic functions, where each machine samples the cost functions from a fixed distribution. Furthermore, we delve into the more challenging setting of federated online optimization with bandit (zeroth-order) feedback, where the machines can only access values of the cost functions at the queried points. The key finding here is identifying the high-dimensional regime where collaboration is beneficial and may even lead to a linear speedup in the number of machines. We further illustrate our findings through federated adversarial linear bandits by developing novel distributed single and two-point feedback algorithms. Our work is the first attempt towards a systematic understanding of federated online optimization with limited feedback, and it attains tight regret bounds in the intermittent communication setting for both first and zeroth-order feedback. Our results thus bridge the gap between stochastic and adaptive settings in federated online optimization.
    Using Ornstein-Uhlenbeck Process to understand Denoising Diffusion Probabilistic Model and its Noise Schedules. (arXiv:2311.17673v1 [stat.ML])
    The aim of this short note is to show that Denoising Diffusion Probabilistic Model DDPM, a non-homogeneous discrete-time Markov process, can be represented by a time-homogeneous continuous-time Markov process observed at non-uniformly sampled discrete times. Surprisingly, this continuous-time Markov process is the well-known and well-studied Ornstein-Ohlenbeck (OU) process, which was developed in 1930's for studying Brownian particles in Harmonic potentials. We establish the formal equivalence between DDPM and the OU process using its analytical solution. We further demonstrate that the design problem of the noise scheduler for non-homogeneous DDPM is equivalent to designing observation times for the OU process. We present several heuristic designs for observation times based on principled quantities such as auto-variance and Fisher Information and connect them to ad hoc noise schedules for DDPM. Interestingly, we show that the Fisher-Information-motivated schedule corresponds exactly the cosine schedule, which was developed without any theoretical foundation but is the current state-of-the-art noise schedule.  ( 2 min )
    Variational Bayes image restoration with compressive autoencoders. (arXiv:2311.17744v1 [cs.CV])
    Regularization of inverse problems is of paramount importance in computational imaging. The ability of neural networks to learn efficient image representations has been recently exploited to design powerful data-driven regularizers. While state-of-the-art plug-and-play methods rely on an implicit regularization provided by neural denoisers, alternative Bayesian approaches consider Maximum A Posteriori (MAP) estimation in the latent space of a generative model, thus with an explicit regularization. However, state-of-the-art deep generative models require a huge amount of training data compared to denoisers. Besides, their complexity hampers the optimization of the latent MAP. In this work, we propose to use compressive autoencoders for latent estimation. These networks, which can be seen as variational autoencoders with a flexible latent prior, are smaller and easier to train than state-of-the-art generative models. We then introduce the Variational Bayes Latent Estimation (VBLE) algorithm, which performs this estimation within the framework of variational inference. This allows for fast and easy (approximate) posterior sampling. Experimental results on image datasets BSD and FFHQ demonstrate that VBLE reaches similar performance than state-of-the-art plug-and-play methods, while being able to quantify uncertainties faster than other existing posterior sampling techniques.  ( 2 min )
    Learning Stackable and Skippable LEGO Bricks for Efficient, Reconfigurable, and Variable-Resolution Diffusion Modeling. (arXiv:2310.06389v2 [cs.CV] UPDATED)
    Diffusion models excel at generating photo-realistic images but come with significant computational costs in both training and sampling. While various techniques address these computational challenges, a less-explored issue is designing an efficient and adaptable network backbone for iterative refinement. Current options like U-Net and Vision Transformer often rely on resource-intensive deep networks and lack the flexibility needed for generating images at variable resolutions or with a smaller network than used in training. This study introduces LEGO bricks, which seamlessly integrate Local-feature Enrichment and Global-content Orchestration. These bricks can be stacked to create a test-time reconfigurable diffusion backbone, allowing selective skipping of bricks to reduce sampling costs and generate higher-resolution images than the training data. LEGO bricks enrich local regions with an MLP and transform them using a Transformer block while maintaining a consistent full-resolution image across all bricks. Experimental results demonstrate that LEGO bricks enhance training efficiency, expedite convergence, and facilitate variable-resolution image generation while maintaining strong generative performance. Moreover, LEGO significantly reduces sampling time compared to other methods, establishing it as a valuable enhancement for diffusion models.  ( 2 min )
    Marginal Laplacian Score. (arXiv:2311.17795v1 [cs.LG])
    High-dimensional imbalanced data poses a machine learning challenge. In the absence of sufficient or high-quality labels, unsupervised feature selection methods are crucial for the success of subsequent algorithms. Therefore, there is a growing need for unsupervised feature selection algorithms focused on imbalanced data. Thus, we propose a Marginal Laplacian Score (MLS) a modification of the well-known Laplacian Score (LS) to be better suited for imbalance data. We introduce an assumption that the minority class or anomalous appear more frequently in the margin of the features. Consequently, MLS aims to preserve the local structure of the data set's margin. As MLS is better suited for handling imbalanced data, we propose its integration into modern feature selection methods that utilize the Laplacian score. We integrate the MLS algorithm into the Differentiable Unsupervised Feature Selection (DUFS), resulting in DUFS-MLS. The proposed methods demonstrate robust and improved performance on synthetic and public data sets.  ( 2 min )
    Maximum Entropy Model Correction in Reinforcement Learning. (arXiv:2311.17855v1 [cs.LG])
    We propose and theoretically analyze an approach for planning with an approximate model in reinforcement learning that can reduce the adverse impact of model error. If the model is accurate enough, it accelerates the convergence to the true value function too. One of its key components is the MaxEnt Model Correction (MoCo) procedure that corrects the model's next-state distributions based on a Maximum Entropy density estimation formulation. Based on MoCo, we introduce the Model Correcting Value Iteration (MoCoVI) algorithm, and its sampled-based variant MoCoDyna. We show that MoCoVI and MoCoDyna's convergence can be much faster than the conventional model-free algorithms. Unlike traditional model-based algorithms, MoCoVI and MoCoDyna effectively utilize an approximate model and still converge to the correct value function.  ( 2 min )
    Efficient Computation of Sparse and Robust Maximum Association Estimators. (arXiv:2311.17563v1 [stat.CO])
    Although robust statistical estimators are less affected by outlying observations, their computation is usually more challenging. This is particularly the case in high-dimensional sparse settings. The availability of new optimization procedures, mainly developed in the computer science domain, offers new possibilities for the field of robust statistics. This paper investigates how such procedures can be used for robust sparse association estimators. The problem can be split into a robust estimation step followed by an optimization for the remaining decoupled, (bi-)convex problem. A combination of the augmented Lagrangian algorithm and adaptive gradient descent is implemented to also include suitable constraints for inducing sparsity. We provide results concerning the precision of the algorithm and show the advantages over existing algorithms in this context. High-dimensional empirical examples underline the usefulness of this procedure. Extensions to other robust sparse estimators are possible.  ( 2 min )
    Over-Squashing in Riemannian Graph Neural Networks. (arXiv:2311.15945v1 [cs.LG] CROSS LISTED)
    Most graph neural networks (GNNs) are prone to the phenomenon of over-squashing in which node features become insensitive to information from distant nodes in the graph. Recent works have shown that the topology of the graph has the greatest impact on over-squashing, suggesting graph rewiring approaches as a suitable solution. In this work, we explore whether over-squashing can be mitigated through the embedding space of the GNN. In particular, we consider the generalization of Hyperbolic GNNs (HGNNs) to Riemannian manifolds of variable curvature in which the geometry of the embedding space is faithful to the graph's topology. We derive bounds on the sensitivity of the node features in these Riemannian GNNs as the number of layers increases, which yield promising theoretical and empirical results for alleviating over-squashing in graphs with negative curvature.  ( 2 min )
    CoLA: Exploiting Compositional Structure for Automatic and Efficient Numerical Linear Algebra. (arXiv:2309.03060v2 [cs.LG] UPDATED)
    Many areas of machine learning and science involve large linear algebra problems, such as eigendecompositions, solving linear systems, computing matrix exponentials, and trace estimation. The matrices involved often have Kronecker, convolutional, block diagonal, sum, or product structure. In this paper, we propose a simple but general framework for large-scale linear algebra problems in machine learning, named CoLA (Compositional Linear Algebra). By combining a linear operator abstraction with compositional dispatch rules, CoLA automatically constructs memory and runtime efficient numerical algorithms. Moreover, CoLA provides memory efficient automatic differentiation, low precision computation, and GPU acceleration in both JAX and PyTorch, while also accommodating new objects, operations, and rules in downstream packages via multiple dispatch. CoLA can accelerate many algebraic operations, while making it easy to prototype matrix structures and algorithms, providing an appealing drop-in tool for virtually any computational effort that requires linear algebra. We showcase its efficacy across a broad range of applications, including partial differential equations, Gaussian processes, equivariant model construction, and unsupervised learning.  ( 2 min )
    A Primer on Deep Learning for Causal Inference. (arXiv:2110.04442v2 [cs.LG] UPDATED)
    This review systematizes the emerging literature for causal inference using deep neural networks under the potential outcomes framework. It provides an intuitive introduction on how deep learning can be used to estimate/predict heterogeneous treatment effects and extend causal inference to settings where confounding is non-linear, time varying, or encoded in text, networks, and images. To maximize accessibility, we also introduce prerequisite concepts from causal inference and deep learning. The survey differs from other treatments of deep learning and causal inference in its sharp focus on observational causal estimation, its extended exposition of key algorithms, and its detailed tutorials for implementing, training, and selecting among deep estimators in Tensorflow 2 available at github.com/kochbj/Deep-Learning-for-Causal-Inference.  ( 2 min )
    Unified Binary and Multiclass Margin-Based Classification. (arXiv:2311.17778v1 [stat.ML])
    The notion of margin loss has been central to the development and analysis of algorithms for binary classification. To date, however, there remains no consensus as to the analogue of the margin loss for multiclass classification. In this work, we show that a broad range of multiclass loss functions, including many popular ones, can be expressed in the relative margin form, a generalization of the margin form of binary losses. The relative margin form is broadly useful for understanding and analyzing multiclass losses as shown by our prior work (Wang and Scott, 2020, 2021). To further demonstrate the utility of this way of expressing multiclass losses, we use it to extend the seminal result of Bartlett et al. (2006) on classification-calibration of binary margin losses to multiclass. We then analyze the class of Fenchel-Young losses, and expand the set of these losses that are known to be classification-calibrated.  ( 2 min )
    Learning sources of variability from high-dimensional observational studies. (arXiv:2307.13868v2 [stat.ME] UPDATED)
    Causal inference studies whether the presence of a variable influences an observed outcome. As measured by quantities such as the "average treatment effect," this paradigm is employed across numerous biological fields, from vaccine and drug development to policy interventions. Unfortunately, the majority of these methods are often limited to univariate outcomes. Our work generalizes causal estimands to outcomes with any number of dimensions or any measurable space, and formulates traditional causal estimands for nominal variables as causal discrepancy tests. We propose a simple technique for adjusting universally consistent conditional independence tests and prove that these tests are universally consistent causal discrepancy tests. Numerical experiments illustrate that our method, Causal CDcorr, leads to improvements in both finite sample validity and power when compared to existing strategies. Our methods are all open source and available at github.com/ebridge2/cdcorr.  ( 2 min )
    Robust Correlated Equilibrium: Definition and Computation. (arXiv:2311.17592v1 [eess.SY])
    We study N-player finite games with costs perturbed due to time-varying disturbances in the underlying system and to that end we propose the concept of Robust Correlated Equilibrium that generalizes the definition of Correlated Equilibrium. Conditions under which the Robust Correlated Equilibrium exists are specified and a decentralized algorithm for learning strategies that are optimal in the sense of Robust Correlated Equilibrium is proposed. The primary contribution of the paper is the convergence analysis of the algorithm and to that end, we propose an extension of the celebrated Blackwell's Approachability theorem to games with costs that are not just time-average as in the original Blackwell's Approachability Theorem but also include time-average of previous algorithm iterates. The designed algorithm is applied to a practical water distribution network with pumps being the controllers and their costs being perturbed by uncertain consumption by consumers. Simulation results show that each controller achieves no regret and empirical distributions converge to the Robust Correlated Equilibrium.  ( 2 min )
    SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks. (arXiv:2310.03684v3 [cs.LG] UPDATED)
    Despite efforts to align large language models (LLMs) with human values, widely-used LLMs such as GPT, Llama, Claude, and PaLM are susceptible to jailbreaking attacks, wherein an adversary fools a targeted LLM into generating objectionable content. To address this vulnerability, we propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks on LLMs. Based on our finding that adversarially-generated prompts are brittle to character-level changes, our defense first randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversarial inputs. SmoothLLM reduces the attack success rate on numerous popular LLMs to below one percentage point, avoids unnecessary conservatism, and admits provable guarantees on attack mitigation. Moreover, our defense uses exponentially fewer queries than existing attacks and is compatible with any LLM. Our code is publicly available at the following link: https://github.com/arobey1/smooth-llm.  ( 2 min )
    Predicting the Age of Astronomical Transients from Real-Time Multivariate Time Series. (arXiv:2311.17143v1 [astro-ph.IM])
    Astronomical transients, such as supernovae and other rare stellar explosions, have been instrumental in some of the most significant discoveries in astronomy. New astronomical sky surveys will soon record unprecedented numbers of transients as sparsely and irregularly sampled multivariate time series. To improve our understanding of the physical mechanisms of transients and their progenitor systems, early-time measurements are necessary. Prioritizing the follow-up of transients based on their age along with their class is crucial for new surveys. To meet this demand, we present the first method of predicting the age of transients in real-time from multi-wavelength time-series observations. We build a Bayesian probabilistic recurrent neural network. Our method can accurately predict the age of a transient with robust uncertainties as soon as it is initially triggered by a survey telescope. This work will be essential for the advancement of our understanding of the numerous young transients being detected by ongoing and upcoming astronomical surveys.  ( 2 min )
    Evaluating Treatment Prioritization Rules via Rank-Weighted Average Treatment Effects. (arXiv:2111.07966v2 [stat.ME] UPDATED)
    There are a number of available methods for selecting whom to prioritize for treatment, including ones based on treatment effect estimation, risk scoring, and hand-crafted rules. We propose rank-weighted average treatment effect (RATE) metrics as a simple and general family of metrics for comparing and testing the quality of treatment prioritization rules. RATE metrics are agnostic as to how the prioritization rules were derived, and only assess how well they identify individuals that benefit the most from treatment. We define a family of RATE estimators and prove a central limit theorem that enables asymptotically exact inference in a wide variety of randomized and observational study settings. RATE metrics subsume a number of existing metrics, including the Qini coefficient, and our analysis directly yields inference methods for these metrics. We showcase RATE in the context of a number of applications, including optimal targeting of aspirin to stroke patients.  ( 2 min )
    Learning to Estimate Without Bias. (arXiv:2110.12403v3 [cs.LG] UPDATED)
    The Gauss Markov theorem states that the weighted least squares estimator is a linear minimum variance unbiased estimation (MVUE) in linear models. In this paper, we take a first step towards extending this result to non linear settings via deep learning with bias constraints. The classical approach to designing non-linear MVUEs is through maximum likelihood estimation (MLE) which often involves computationally challenging optimizations. On the other hand, deep learning methods allow for non-linear estimators with fixed computational complexity. Learning based estimators perform optimally on average with respect to their training set but may suffer from significant bias in other parameters. To avoid this, we propose to add a simple bias constraint to the loss function, resulting in an estimator we refer to as Bias Constrained Estimator (BCE). We prove that this yields asymptotic MVUEs that behave similarly to the classical MLEs and asymptotically attain the Cramer Rao bound. We demonstrate the advantages of our approach in the context of signal to noise ratio estimation as well as covariance estimation. A second motivation to BCE is in applications where multiple estimates of the same unknown are averaged for improved performance. Examples include distributed sensor networks and data augmentation in test-time. In such applications, we show that BCE leads to asymptotically consistent estimators.  ( 3 min )
    A point cloud approach to generative modeling for galaxy surveys at the field level. (arXiv:2311.17141v1 [astro-ph.CO])
    We introduce a diffusion-based generative model to describe the distribution of galaxies in our Universe directly as a collection of points in 3-D space (coordinates) optionally with associated attributes (e.g., velocities and masses), without resorting to binning or voxelization. The custom diffusion model can be used both for emulation, reproducing essential summary statistics of the galaxy distribution, as well as inference, by computing the conditional likelihood of a galaxy field. We demonstrate a first application to massive dark matter haloes in the Quijote simulation suite. This approach can be extended to enable a comprehensive analysis of cosmological data, circumventing limitations inherent to summary statistic -- as well as neural simulation-based inference methods.  ( 2 min )
    Privacy Measurement in Tabular Synthetic Data: State of the Art and Future Research Directions. (arXiv:2311.17453v1 [cs.AI])
    Synthetic data (SD) have garnered attention as a privacy enhancing technology. Unfortunately, there is no standard for quantifying their degree of privacy protection. In this paper, we discuss proposed quantification approaches. This contributes to the development of SD privacy standards; stimulates multi-disciplinary discussion; and helps SD researchers make informed modeling and evaluation decisions.  ( 2 min )
    Estimates on Learning Rates for Multi-Penalty Distribution Regression. (arXiv:2006.09017v2 [cs.LG] UPDATED)
    This paper is concerned with functional learning by utilizing two-stage sampled distribution regression. We study a multi-penalty regularization algorithm for distribution regression under the framework of learning theory. The algorithm aims at regressing to real valued outputs from probability measures. The theoretical analysis on distribution regression is far from maturity and quite challenging, since only second stage samples are observable in practical setting. In the algorithm, to transform information from samples, we embed the distributions to a reproducing kernel Hilbert space $\mathcal{H}_K$ associated with Mercer kernel $K$ via mean embedding technique. The main contribution of the paper is to present a novel multi-penalty regularization algorithm to capture more features of distribution regression and derive optimal learning rates for the algorithm. The work also derives learning rates for distribution regression in the nonstandard setting $f_{\rho}\notin\mathcal{H}_K$, which is not explored in existing literature. Moreover, we propose a distribution regression-based distributed learning algorithm to face large-scale data or information challenge. The optimal learning rates are derived for the distributed learning algorithm. By providing new algorithms and showing their learning rates, we improve the existing work in different aspects in the literature.  ( 2 min )
    DreamPropeller: Supercharge Text-to-3D Generation with Parallel Sampling. (arXiv:2311.17082v1 [cs.CV])
    Recent methods such as Score Distillation Sampling (SDS) and Variational Score Distillation (VSD) using 2D diffusion models for text-to-3D generation have demonstrated impressive generation quality. However, the long generation time of such algorithms significantly degrades the user experience. To tackle this problem, we propose DreamPropeller, a drop-in acceleration algorithm that can be wrapped around any existing text-to-3D generation pipeline based on score distillation. Our framework generalizes Picard iterations, a classical algorithm for parallel sampling an ODE path, and can account for non-ODE paths such as momentum-based gradient updates and changes in dimensions during the optimization process as in many cases of 3D generation. We show that our algorithm trades parallel compute for wallclock time and empirically achieves up to 4.7x speedup with a negligible drop in generation quality for all tested frameworks.  ( 2 min )

  • Open

    First AI minister likens overregulation to refusal to adopt the printing press
    The world's first AI minister, Omar Al Olama, warns that overregulation of artificial intelligence (AI) can have grave consequences, comparing it to the Middle East's refusal to adopt the printing press, which led to the decline of the Ottoman Empire. Al Olama emphasizes the need for governments to find a balance between regulation and progress, and suggests a three-pronged approach of reskilling, retooling, and retiring for workers affected by AI. He believes that proactive measures are necessary to ensure that the population is not left behind in the face of AI advancements. Source : https://finance.yahoo.com/news/world-first-ai-minister-likens-175027978.html submitted by /u/NuseAI [link] [comments]  ( 9 min )
    An agency created an AI model who earns up to $11,000 a month
    A Spanish modeling agency has created an AI influencer named Aitana López, who can earn up to $11,000 a month as a model. López is a pink-haired 25-year-old with 124,000 followers on Instagram. The agency created her because they were having trouble working with real models and influencers. She can make over $1,000 per advert and earns between $3,000 to $10,000 a month. AI-generated characters like López are finding success on social media and adult content platforms. Fanvue's CEO believes that AI-generated characters will soon be as widespread as human creators. Source : https://www.businessinsider.com/ai-influencer-aitana-clueless-agency-tech-spain-2023-11 submitted by /u/NuseAI [link] [comments]  ( 9 min )
    How much of a threat is Google Gemini to chatgpt?
    Will they coexist? Gemini is expected to be the largest llm with 1 trillion parameters, it can perform many tasks, probably even beat gpt4, do you think this will threaten chatgpt? Especially if it happens that Gemini is free, while gpt4 costs 20 bucks submitted by /u/Fast_Key5139 [link] [comments]  ( 9 min )
    Best Custom GPTs for Marketing, Business Tasks, Coding & Productivity
    submitted by /u/Senior_tasteey [link] [comments]
    Google has been way too quiet
    The fact that they haven’t released much this year even though they are at the forefront of edge sciences like quantum computers, AI and many other fields. Overall Google has overall the best scientists in the world and not published much is ludicrous to me. They are hiding something crazy powerful for sure and I’m not just talking about Gemini which I’m sure will best gp4 by a mile, but many other revolutionary tech. I think they’re sitting on some tech too see who will release it first. submitted by /u/Major_Fishing6888 [link] [comments]  ( 1 min )
    Behind China's Plans to Build AI for the World
    China is taking a different path in AI development compared to the West. Instead of competing to write the rules, China is building its own AI ecosystems and working with allies to expand its influence. This includes transforming cities into 'safe cities' and 'smart cities' using AI technology. China already dominates exports of facial recognition technology and is spreading its surveillance expertise worldwide. The country's vocational training program, the 'Luban workshop,' is educating thousands in AI in developing economies. Chinese tech companies have also played a role in building the world's largest AI supercomputer in the United Arab Emirates. If China's strategy is successful, it could render the West's agreements on safe and rights-respecting AI obsolete. The US needs a more expansive vision to empower other nations with AI education and infrastructure to compete with China's efforts. Source : https://www.politico.com/news/magazine/2023/11/30/china-global-ai-plans-00129160 submitted by /u/NuseAI [link] [comments]  ( 9 min )
    Is this guy AI? I honestly can't tell
    submitted by /u/a1a3a5a7a9 [link] [comments]  ( 1 min )
    ChatGPT vs Google Search Labs?
    Title. I just managed to get Google Search Labs and I'm wondering if others try using it instead of ChatGPT. It feels really weird typing out full sentences into Google though -- I always just put a couple of tags + "Reddit". https://preview.redd.it/g6oeuhpxyg3c1.png?width=2314&format=png&auto=webp&s=1b800cb03ec17a032e43bdcaff7e443908fa2e64 submitted by /u/IgnisIncendio [link] [comments]
    Pokemon inspired fashion show
    submitted by /u/Sea_Permit5660 [link] [comments]
    Are the models trained to just lie/deny when caught. What other data they might be gathering that they blatantly deny about
    So this is the Meta AI. I first made it believe this made up word which it kept denying as it doesn’t have access to ‘real - time’ data. Next I asked a simple latest news section and it gave a proper answer. When i caught it you can see the response yourself. I’m a ML Engineer myself but this is weird. Like have they been hard coded to respond as such if caught. What else stuff can they actually do or are doing that they just deny. I’ve also tested with chat-gpt that sometimes if i make one instance of it learn a new word (Made up). 1 day later i ask another instance about it and it knows 🤷‍♂️. submitted by /u/Cizhu [link] [comments]
    One-Minute Daily AI News 11/29/2023
    Stable Diffusion XL Turbo can generate AI images as fast as you can type.[1] AWS and NVIDIA Announce Strategic Collaboration to Offer New Supercomputing Infrastructure, Software and Services for Generative AI.[2] ByteDance Launches Large-Scale Model Product “ChitChop”.[3] Sam Altman Officially Returns to OpenAI—With a New Board Seat for Microsoft.[4] Sources: [1] https://arstechnica.com/information-technology/2023/11/stable-diffusion-turbo-xl-accelerates-image-synthesis-with-one-step-generation/ [2] https://nvidianews.nvidia.com/news/aws-nvidia-strategic-collaboration-for-generative-ai [3] https://pandaily.com/bytedance-launches-large-scale-model-product-chitchop/ [4] https://www.wired.com/story/sam-altman-officially-returns-to-openai-board-seat-microsoft/ submitted by /u/Excellent-Target-847 [link] [comments]
    Google DeepMind uses AI to discover 2.2 million new materials – equivalent to nearly 800 years’ worth of knowledge. Shares they've already validated 736 in laboratories.
    Materials discovery is critical but tough. New materials enable big innovations like batteries or LEDs. But there are ~infinitely many combinations to try. Testing for them experimentally is slow and expensive. So scientists and engineers want to simulate and screen materials on computers first. This can check way more candidates before real-world experiments. However, models historically struggled at accurately predicting if materials are stable. Researchers at DeepMind made a system called GNoME that uses graph neural networks and active learning to push past these limits. GNoME models materials' crystal structures as graphs and predicts formation energies. It actively generates and filters candidates, evaluating the most promising with simulations. This expands its knowledge and improves predictions over multiple cycles. The authors introduced new ways to generate derivative structures that respect symmetries, further diversifying discoveries. The results: GNoME found 2.2 million new stable materials - equivalent to 800 years of normal discovery. Of those, 380k were the most stable and candidates for validation. 736 were validated in external labs. These include a totally new diamond-like optical material and another that may be a superconductor. Overall this demonstrates how scaling up deep learning can massively speed up materials innovation. As data and models improve together, it'll accelerate solutions to big problems needing new engineered materials. TLDR: DeepMind made an AI system that uses graph neural networks to discover possible new materials. It found 2.2 million candidates, and over 300k are most stable. Over 700 have already been synthesized. Full summary available here. Paper is here. submitted by /u/Successful-Western27 [link] [comments]  ( 10 min )
    when we get to agi, we may want to shift from capitalism to socialism
    the transition may make it much easier for us to have a fighting chance against climate change. agi could probably figure out how to make it a win-win for everyone submitted by /u/Georgeo57 [link] [comments]
    What???…..what…..
    But why and what why not not??? submitted by /u/Impossible_Belt_7757 [link] [comments]
  • Open

    "Diffusion Model Alignment Using Direct Preference Optimization (DPO)", Wallace et al 2023 {Salesforce}
    submitted by /u/gwern [link] [comments]
    [D] I'm interviewing Rich Sutton in a week, what should I ask him?
    submitted by /u/gwern [link] [comments]
    Addressing distributional shifts in transitions and rewards
    What are the state-of-the-art methods to address distributional shift in a RL task? I need to apply RL to a traffic control problem where the distribution of traffic (arrival rate of vehicles, origin-destination locations, etc) can change over time. I can indeed train on multiple distributions, but I would like to use some method that allows me to adapt to quickly adapt to a different distribution, or quickly fine-tune a new model for it. submitted by /u/fedetask [link] [comments]
    Creating a single agent that acts across multiple environments
    Hi everyone, thanks in advance for the help! I’m working on a q-learning project where a single agent is working to optimize performance across many (thousands) independent machines. All machines are the same and generate the same metrics on a regularly updating database. I read a thread from this community from a couple years ago and it sounds like it’s possible to apply a single agent across multiple environments. Given these are all identical environments and denoted by a unique ID, I’m wondering if there’s a simple group by function I can apply while building the model to train it on a much wider array of data rather than training it on a single dataset before trying to deploy it across all the various environments. Thank you all again for your help! submitted by /u/Shikaze33_3 [link] [comments]
    Error implementation Dreamer v3 on vector base observation environment
    Please, what am I doing wrong? I tried training my agent with Dreamerv3 on my custom environment which is vector-based observation. I followed the Sheaprl method for adding a custom environment . Please, what am I doing wrong? I tried training my agent with Dreamerv3 on my custom environment which is vector-based observation. I followed the Sheaprl method for adding a custom environment. to train the agent to navigate to its goal while avoiding obstacles. I want to use Dreamer v3. My observation space is a flat vector-based observation(23), not an image and the action space is continuous with 2 dimensions. I want to write a script that will use Dreamer v3. If possible, the scripts below will be solely for my environment: - envs: My environment - configs.yaml - dreamer.py - exploration.py - models.py - networks.py - parallel.py - requirements.txt - tools.py I want a specific, clean code so that I can modify Dreamer v3 for other environments and options that are image-based. It's making it difficult for me to streamline what I want to do. I will share my environment and a simple script on how to test it. I would appreciate it if you could help me or guide me on how to use Dreamer v3 with a flat vector-based observation space. Thanks! this my zip File that contains my environment : https://www.dropbox.com/scl/fi/z13wkqgjtuszlt3oimxd4/SacNav.zip?rlkey=f0y2yctxa5ajjt442na8abk4l&dl=0 submitted by /u/Goodluck_o [link] [comments]
    learning sources
    hello I would like to hear if you guys have any recommendations for me as to how to move forward with my learning journey. so far I believe I have a solid understanding of how nns works, including the maths of it and coding it. I managed to implement a vanilla gradient decent algorithm to a simple environment. now I'm thinking on how to move forward. I like understanding things down to the math and then trying to implement it myself with numpy to make sure I understood everything. but also on top of that I want to start learning to use existing practical tools for rl and nns like pytorch or whatnot. do you guys have any recommendations for sources that tackle the theory and practical? btw im asking here because I'm specifically interested in deep rl tho it might seem pretty general. submitted by /u/urtypicalretarded [link] [comments]
    Easily implement parallel training.
    submitted by /u/NoteDancing [link] [comments]
    Why should I use RLLib over stable-baselines3?
    So I have used sb3 for about 3 years now, and I find it incredibly simple to setup experiments, add functional modifications/custom network architectures to the RL algorithm, optimise parameters using Optuna, and also do inference after training. All this with very little effort other than to understand the basic dynamics of how it works. So in my new company they use RLLib which, to say the least, is a monstrosity to do any of the above mentioned things I did easily in sb3. So the argument that's usually given is that it's easy to deploy the models with RLLib. But is it really that easy? Considering we could deploy the model trained using sb3 just easily or even faster with a FAST api and docker combination, is it worth going through the difficulties of RLLib just for the additional algorithms it offers? What's your experience like, for the people in the industry using RL? submitted by /u/sharafath28 [link] [comments]
    Is there any game that allow us to interact with it by python?
    i'm learning RL'Reinforcement Learning' and i've done a few big and small project to get the hang of it. i wrote a code(trained a RL-model) that play a game by capturing screen(mss) and sending inputs(PyDirectInputs) but it was TOO SLOW; like barely 1 fps.(whether in training, or using) so i was wondering is there any game that allow us to interact with it by python? like some API , interface , software, etc. that connects our codes directly to game. or something like this? submitted by /u/Mr_Lucifer_666 [link] [comments]
  • Open

    [P] Modified Tsetlin Machine implementation performance on 7950X3D
    Hey. I got some pretty impressive results for my pet-project that I've been working on for the past 1.5 years. MNIST inference performance using one flat layer without convolution on Ryzen 7950X3D CPU: 46 millions predictions per second, throughput: 25 GB/s, accuracy: 98.05%. AGI achieved. ACI (Artificial Collective Intelligence), to be honest. Modified Tsetlin Machine on MNIST performance submitted by /u/ArtemHnilov [link] [comments]  ( 9 min )
    [D] Can Regression Analysis Predict Stock Market Returns?
    Hi everyone, I'm exploring an idea and would appreciate insights from those more knowledgeable in this field. I'm considering using regression analysis to predict stock market returns. My approach involves selecting a few non-traditional variables to see if they significantly impact stock returns. For instance, one variable I'm considering is Research and Development expenses as a percentage of revenue. Using Excel, I've managed to set up a basic model. The formula for predicting future stock prices is derived from the regression: it's the intercept plus the coeffecent times each variables (variable1+ variable2, etc.). However, I'm not well-versed in advanced mathematics or coding, and my knowledge is limited to basic concepts. Here are my questions: Could this model potentially work for predicting stock market returns? If there are limitations or reasons why it might not work, what are they? What should I consider to improve the model's accuracy and reliability? Are there any additional inputs or variables that I should look into? Any guidance, critiques, or suggestions would be greatly appreciated. I'm here to learn and understand more about this concept. Thanks in advance! submitted by /u/lolleh10 [link] [comments]  ( 9 min )
    [D] Looking for similar educational content like this
    https://youtu.be/p1bfK8ZJgkE?si=439ofCFcaFX1EBie (Krish Naik - End to end deep learning project with deployment) I am a "full stack" data scientist with a couple of years experience, and I really like this type of content where they cover end-to-end projects with a high level of code and implementation quality (i.e. modular and reusable code, decorators, dataclasses, pydantic, mlops, dvc, ci/cd, deployment, etc.) Are there any similar courses, videos, creators, websites, or similar you would recommend that cover similar stuff with equal skill level? I have implemented end-to-end projects like this in my current job, but i wish to learn more and adhere to best practices when it comes the python development, machine learning application, and so on. submitted by /u/Fendrbud [link] [comments]  ( 9 min )
    [D] Kaggle competition beginner question
    i am a beginner in kaggle competition for a kaggle competition when i finish doing all my model (preprocessing data + training + validation+ saving the model in kaggle/working ) , when i submit the notebook to kaggle it re-run it and count the time of running , is it normal ? nowing that i have the model already saved and ready in kaggle/working , like when i submit it also re-run all the code even the preprocessing , if you want to help please clarify this point for me submitted by /u/Express-Pool-5480 [link] [comments]  ( 9 min )
    YUAN-2.0-102B, with code and weights. Scores between ChatGPT and GPT-4 on various benchmarks [R]
    submitted by /u/we_are_mammals [link] [comments]
    Mechanical engineer wants to do machine learning [D].
    I am a mechanical engineering student I want to break into the field of machine learning But I don't have any type of guidance. Even on how can a mechanical engineer become a machine learning expert. I saw one post from a person who worked as a data analyst in an mechanical engineer firm he was also a mechanical engineer. Please do guide. submitted by /u/AromaticEconomics113 [link] [comments]  ( 9 min )
    [P] MLM to parse through audio input to discern music from background ambience
    Hello all :) I'm working on an iOS audio app that processes microphone input to reactive visualizations synced to the music. I have an AVFoundation pipeline set up to record audio, perform realtime FFT analysis into frames of features like spectral centroid and chroma vectors, and map these to parameters driving the visuals. I would now like to train a model to distinguish music from ambient noise in the mic data to filter out segments with just irrelevant background talking or sounds. The app is built in Swift and needs to perform prediction very efficiently even on older iPhones (ideally lol, not the biggest concern). What type of neural network model would you recommend for classifying short sequences of audio features per frame? I'm considering LSTM or Transformer architectures. What feature sets provide the most discriminative signal for music vs noise? Should I quantify model accuracy by frame-level predictions or aggregate metrics like overall sequence accuracy? Any advice on optimizing models for real-time mobile inference? I have experience gathering and labeling timestamped training data pairs of raw audio and feature sets. Please let me know what other specifics around data, model configuration, metrics, or optimization would be useful to provide. Thanks in advance for any guidance! submitted by /u/zeke-001 [link] [comments]  ( 9 min )
    [D] Self-hosting embedding model vs using hosted embedding services
    Hi ML group, How does everyone using embeddings in your application? Do you self host the embedding models in your machine, or use hosted services such as https://embaas.io/? I assume for closed sourced model such as openai embedding or voyage ai embedding, there is no self hosted options. For open source models such as sentence-transformers/all-MiniLM-L6-v2, BAAI/bge-small-en, bge-large-en-v1.5, etc, do you prefer to self host the models, or use a hosted service? Also, when using a vector database, do you handle embeddings separately before inserting the data into the vector db, or leverage builtin embedding functionalities encapsulated by vector db products? Are there any limitations you encountered? submitted by /u/songrenchu [link] [comments]  ( 9 min )
    [D] Seeking Guidance: Best Learning Path for Aspiring ML/AI Engineer - Prioritize Core Programming(DSA) or Dive into Machine Learning Libraries?
    Hello everyone, warm regards to you all. I am currently an IT student pursuing a bachelor's degree, and I aspire to become a Machine Learning or Artificial Intelligence engineer in the future. I'm seeking advice on the ideal learning path. Should I prioritize core programming skills such as data structures and algorithms and gain knowledge of software engineering first? Alternatively, is it advisable to dive directly into the ML/AI world by learning different libraries like TensorFlow? If software engineering knowledge is crucial for success in the ML/AI field, how much proficiency in software engineering is necessary? Thank you for your insights. submitted by /u/neerazayn [link] [comments]  ( 9 min )
    [D] I'm interviewing Rich Sutton in a week, what should I ask him?
    Rich is an author of the RL book, and more recently, he founded the OpenMind Research Institute with some colleagues. ​ The interview is in 1 week. I have a background in RL and already have some ideas on questions and topics, but I also want to source questions outside of the Alberta RL bubble. Technical questions are the best, though I am open to anything. Thank you! ​ I'll post an update in this thread in a couple weeks after the interview is published. submitted by /u/ejmejm1 [link] [comments]  ( 9 min )
    AI to generate videos of hands playing piano from MIDI Files [D]
    Hey! I'm collaborating with a piano learning app, and I'd like to suggest them a way to scale the lesson creation process using AI. Has anyone here explored the feasibility of using AI to generate videos that mimic human hands playing piano pieces? I'd like to explore what's possible providing the MIDI file and/or score. Does anyone have experience with this or could point me to resources or research? 🙏 https://preview.redd.it/tdaznpayki3c1.png?width=856&format=png&auto=webp&s=5ebac31b502ffd1982f3645fda23e86ae918c319 ​ submitted by /u/brank87 [link] [comments]  ( 9 min )
    [P] We integrated Netron into GitHub for visualizing model architectures
    We LOVE the Netron library that Lutz Roeder created: https://github.com/lutzroeder/netron We wanted to explore what it felt like to be able to render Netron visualizations for ML models hosted in GitHub so we built a Netron integration: https://about.xethub.com/blog/visualizing-ml-models-github-netron Netron focuses on viewing 1 model file at a time but we also incorporated before-and-after model visualizations in pull requests: https://assets-global.website-files.com/6474aea6101c81b742144dd2/65689c07b7f2b060a925097c_github_pr2.png submitted by /u/semicausal [link] [comments]  ( 9 min )
    [D] AAMAS 2024 Reviews Are Out!
    I didn't see a discussion post, so I figured I'd make this one. submitted by /u/LessPoliticalAccount [link] [comments]  ( 9 min )
    [R] Hierarchically Gated Recurrent Neural Network for Sequence Modeling
    Paper: https://arxiv.org/abs/2311.04823 Code: https://github.com/OpenNLPLab/HGRN Models: https://huggingface.co/OpenNLPLab Abstract: Transformers have surpassed RNNs in popularity due to their superior abilities in parallel training and long-term dependency modeling. Recently, there has been a renewed interest in using linear RNNs for efficient sequence modeling. These linear RNNs often employ gating mechanisms in the output of the linear recurrence layer while ignoring the significance of using forget gates within the recurrence. In this paper, we propose a gated linear RNN model dubbed Hierarchically Gated Recurrent Neural Network (HGRN), which includes forget gates that are lower bounded by a learnable value. The lower bound increases monotonically when moving up layers. This allows the upper layers to model long-term dependencies and the lower layers to model more local, short-term dependencies. Experiments on language modeling, image classification, and long-range arena benchmarks showcase the efficiency and effectiveness of our proposed model. The source code is available at this https URL. https://preview.redd.it/thph9bpmjh3c1.png?width=965&format=png&auto=webp&s=8e4871cd280ef7e5b771b463435d47da11dca52d submitted by /u/APaperADay [link] [comments]  ( 9 min )
    [R] RO-LLaMA: Generalist LLM for Radiation Oncology via Noise Augmentation and Consistency Regularization
    submitted by /u/davidmezzetti [link] [comments]  ( 9 min )
    [D] NeurIPS Climbing Club (Gradient Ascent?)
    I'll be attending NeurIPS 2023 this year and won't know anyone. I'm an avid climber and there is a nice bouldering gym close to the conference center - would anyone else fancy a climb over the week? Reply below so we can see how many people and I can setup a meetup page or something. I'm a final year PhD from the UK working in video understanding. submitted by /u/fi5k3n [link] [comments]  ( 9 min )
    [N] stable-fast for SD inference: Faster than AITemplate, On par with TensorRT
    https://github.com/chengzeyi/stable-fast https://preview.redd.it/en9wfkfc6g3c1.png?width=700&format=png&auto=webp&s=f66a45e69ab37885c2973c3ce11841a698e05b35 stable-fast achieves SOTA inference performance on ALL kinds of diffuser models.And unlike TensorRT or AITemplate, which takes dozens of minutes to compile a model, stable-fast only takes a few seconds to compile a model. stable-fast also supports dynamic shape, LoRA and ControlNet out of the box. submitted by /u/ciiic [link] [comments]  ( 9 min )
    [D] Insights from Deploying CodeLlama 34Bn Model with Multiple Libraries
    Hi everyone, We've recently experimented with deploying the CodeLlama 34 Bn model and wanted to share our key findings for those interested: Best Performance: Quantized GPTQ, 4-bit CodeLlama-Python-34B model using vLLM. Results: Average lowest latency of 3.51 sec, average token generation at 58.40/sec, and a cold start time of 21.8 sec on our platform, using Nvidia A100 GPU. CodeLlama 34Bn Other Libraries Tested: HuggingFace Transformer Pipeline, AutoGPTQ, Text Generation Inference. Keen to hear your experiences and learnings in similar deployments! submitted by /u/Tiny_Cut_8440 [link] [comments]  ( 9 min )
    [D]: Understanding GPU Memory Allocation When Training Large Models
    TL;DR: Why does GPU memory usage spike during gradient update step (can't account for 10gbs) but then drop down? I've been working on fine-tuning some of the larger LMs available on HuggingFace (e.g. Falcon40B and Llama-2-70B) and so far all my estimates for memory requirements don't add up. I have access to 4 A100-80gb GPUs and was fairly confident that I should have enough RAM to fine-tune Falcon40B with LoRA but I keep getting CUDA OOMs errors. I have figured out ways to get things running, but this made me realize I don't really understand how memory is allocated during training. Here's my understanding of where memory goes when you want to train a model: Setting -> Defining a TOTAL_MEMORY = 0 (MB) and I will update it as I move through each step that adds memory. -> Checking memo…  ( 11 min )
    [P] EasyOCR in C++!
    Hello all! I just uploaded my C++ implementation of EasyOCR, a well known ocr library for python. Also dusted some cobwebbs from some audio related projects as well, feel free to leave feedback or contribute! I only implemented the most salient parts, so certainly could use some community help! Cheers! https://github.com/ksasso1028/EasyOCR-cpp submitted by /u/YamGloomy948 [link] [comments]  ( 9 min )
    [D] Сan I find a job in machine learning without education, but knowing how to do everything?
    I am learning ML engineering about 6 months but just now I was afraid that I wouldn’t be able to find a job. I would like to hear from those who managed to get a job without education submitted by /u/ExplorerMost2308 [link] [comments]  ( 1 min )
    [R] Controlling the Extraction of Memorized Data from Large Language Models via Prompt-Tuning
    submitted by /u/tell-me-the-truth- [link] [comments]  ( 1 min )
  • Open

    Welcome to a New Era of Building in the Cloud with Generative AI on AWS
    We believe generative AI has the potential over time to transform virtually every customer experience we know. The number of companies launching generative AI applications on AWS is substantial and building quickly, including adidas, Booking.com, Bridgewater Associates, Clariant, Cox Automotive, GoDaddy, and LexisNexis Legal & Professional, to name just a few. Innovative startups like Perplexity […]  ( 26 min )
    Package and deploy classical ML and LLMs easily with Amazon SageMaker, part 2: Interactive User Experiences in SageMaker Studio
    Amazon SageMaker is a fully managed service that enables developers and data scientists to quickly and easily build, train, and deploy machine learning (ML) models at scale. SageMaker makes it easy to deploy models into production directly through API calls to the service. Models are packaged into containers for robust and scalable deployments. SageMaker provides […]  ( 12 min )
    Package and deploy classical ML and LLMs easily with Amazon SageMaker, part 1: PySDK Improvements
    Amazon SageMaker is a fully managed service that enables developers and data scientists to quickly and effortlessly build, train, and deploy machine learning (ML) models at any scale. SageMaker makes it straightforward to deploy models into production directly through API calls to the service. Models are packaged into containers for robust and scalable deployments. Although […]  ( 15 min )
    New – Code Editor, based on Code-OSS VS Code Open Source now available in Amazon SageMaker Studio
    Today, we are excited to announce support for Code Editor, a new integrated development environment (IDE) option in Amazon SageMaker Studio. Code Editor is based on Code-OSS, Visual Studio Code Open Source, and provides access to the familiar environment and tools of the popular IDE that machine learning (ML) developers know and love, fully integrated […]  ( 9 min )
    Scale foundation model inference to hundreds of models with Amazon SageMaker – Part 1
    As democratization of foundation models (FMs) becomes more prevalent and demand for AI-augmented services increases, software as a service (SaaS) providers are looking to use machine learning (ML) platforms that support multiple tenants—for data scientists internal to their organization and external customers. More and more companies are realizing the value of using FMs to generate […]  ( 17 min )
    Reduce model deployment costs by 50% on average using the latest features of Amazon SageMaker
    As organizations deploy models to production, they are constantly looking for ways to optimize the performance of their foundation models (FMs) running on the latest accelerators, such as AWS Inferentia and GPUs, so they can reduce their costs and decrease response latency to provide the best experience to end-users. However, some FMs don’t fully utilize […]  ( 13 min )
    Minimize real-time inference latency by using Amazon SageMaker routing strategies
    Amazon SageMaker makes it straightforward to deploy machine learning (ML) models for real-time inference and offers a broad selection of ML instances spanning CPUs and accelerators such as AWS Inferentia. As a fully managed service, you can scale your model deployments, minimize inference costs, and manage your models more effectively in production with reduced operational […]  ( 6 min )
    Build and evaluate machine learning models with advanced configurations using the SageMaker Canvas model leaderboard
    Amazon SageMaker Canvas is a no-code workspace that enables analysts and citizen data scientists to generate accurate machine learning (ML) predictions for their business needs. Starting today, SageMaker Canvas supports advanced model build configurations such as selecting a training method (ensemble or hyperparameter optimization) and algorithms, customizing the training and validation data split ratio, and […]  ( 12 min )
    Introducing Amazon SageMaker HyperPod to train foundation models at scale
    Building foundation models (FMs) requires building, maintaining, and optimizing large clusters to train models with tens to hundreds of billions of parameters on vast amounts of data. Creating a resilient environment that can handle failures and environmental changes without losing days or weeks of model training progress is an operational challenge that requires you to […]  ( 10 min )
    Easily build semantic image search using Amazon Titan
    Digital publishers are continuously looking for ways to streamline and automate their media workflows to generate and publish new content as rapidly as they can, but without foregoing quality. Adding images to capture the essence of text can improve the reading experience. Machine learning techniques can help you discover such images. “A striking image is […]  ( 10 min )
    Evaluate large language models for quality and responsibility
    The risks associated with generative AI have been well-publicized. Toxicity, bias, escaped PII, and hallucinations negatively impact an organization’s reputation and damage customer trust. Research shows that not only do risks for bias and toxicity transfer from pre-trained foundation models (FM) to task-specific generative AI services, but that tuning an FM for specific tasks, on […]  ( 13 min )
    Accelerate data preparation for ML in Amazon SageMaker Canvas
    Data preparation is a crucial step in any machine learning (ML) workflow, yet it often involves tedious and time-consuming tasks. Amazon SageMaker Canvas now supports comprehensive data preparation capabilities powered by Amazon SageMaker Data Wrangler. With this integration, SageMaker Canvas provides customers with an end-to-end no-code workspace to prepare data, build and use ML and […]  ( 7 min )
    Operationalize LLM Evaluation at Scale using Amazon SageMaker Clarify and MLOps services
    In the last few years Large Language Models (LLMs) have risen to prominence as outstanding tools capable of understanding, generating and manipulating text with unprecedented proficiency. Their potential applications span from conversational agents to content generation and information retrieval, holding the promise of revolutionizing all industries. However, harnessing this potential while ensuring the responsible and […]  ( 15 min )
    Accelerate deep learning model training up to 35% with Amazon SageMaker smart sifting
    In today’s rapidly evolving landscape of artificial intelligence, deep learning models have found themselves at the forefront of innovation, with applications spanning computer vision (CV), natural language processing (NLP), and recommendation systems. However, the increasing cost associated with training and fine-tuning these models poses a challenge for enterprises. This cost is primarily driven by the […]  ( 8 min )
  • Open

    A few large enterprise software provider strategies for the knowledge graph market
    In November 2023, MarketsandMarkets announced the publication of its Knowledge Graph Market report. In its announcement, M&M estimated the 2023 global knowledge graph market at $0.9 billion, forecasting market growth to $2.4 billion by 2028, a compound annual growth rate of 21.9 percent. M&M also listed these 12 “key players” in its announcement: I haven’t… Read More »A few large enterprise software provider strategies for the knowledge graph market The post A few large enterprise software provider strategies for the knowledge graph market appeared first on Data Science Central.  ( 21 min )
    Towards Better GenAI: 5 Major Issues, and How to Fix Them
    After playing with GPT for some time, testing GenAI vendor solutions, designing my own, and reading feedback from other users, I uncovered a number of problems. Here I share some of the most common issues, and how to address them. It impacts LLM and synthetic data generation the most, including time series generation. 1. Poor… Read More »Towards Better GenAI: 5 Major Issues, and How to Fix Them The post Towards Better GenAI: 5 Major Issues, and How to Fix Them appeared first on Data Science Central.  ( 22 min )
  • Open

    How to memorize the periodic table
    Motivation Memorizing the periodic table has some practical value, especially if you’re a chemist, but in any case it’s an interesting exercise, easier to do than it may sound. And it’s a case study for how you might memorize other things of more practical value to you personally. Major system pegs The Major system is […] How to memorize the periodic table first appeared on John D. Cook.  ( 6 min )
  • Open

    AI Weirdness advent calendar 2023
    It's 2023 and the combo of GPT-4/DALL-E3 can generate passable versions of the saccharine Christmas drawings in an advent calendar. They cannot, however, label them correctly. Also sometimes you get sweatermugs. This means the 2023 AI-generated advent calendar is happening! Full descriptions of every door in the  ( 7 min )
    Bonus: more Christmas scenes
    AI Weirdness: the strange side of machine learning  ( 2 min )
  • Open

    ‘Call of Duty’ Comes to GeForce NOW
    Let the games begin — this GFN Thursday brings the highly anticipated Call of Duty: Modern Warfare III to the cloud, the first Activision title on GeForce NOW as part of the NVIDIA and Microsoft partnership. It’s joined by Call of Duty: Modern Warfare II and Call of Duty: Warzone — all three titles can Read article >  ( 8 min )
  • Open

    Pooling in graph neural networks
    Hi everyone, I'm trying to digest this paper called "Graph Convolutional Neural Networks for Histologic Classification of Pancreatic Cancer". It says in the paper: We concatenate the outputs of every GCN layer and then use the "self-attention pooling layer" to select the too 50% highly weighted nodes that determine the class of a graph. Github.com/bmirds/slide2graph I'm just wondering if someone can explain pooling in the graph convolutional neural networks in simple terms? I'm not in this field, so thanks a bunch. submitted by /u/kakashi1992 [link] [comments]  ( 9 min )
  • Open

    Quantum-classical simulation of quantum field theory by quantum circuit learning. (arXiv:2311.16297v1 [hep-th])
    We employ quantum circuit learning to simulate quantum field theories (QFTs). Typically, when simulating QFTs with quantum computers, we encounter significant challenges due to the technical limitations of quantum devices when implementing the Hamiltonian using Pauli spin matrices. To address this challenge, we leverage quantum circuit learning, employing a compact configuration of qubits and low-depth quantum circuits to predict real-time dynamics in quantum field theories. The key advantage of this approach is that a single-qubit measurement can accurately forecast various physical parameters, including fully-connected operators. To demonstrate the effectiveness of our method, we use it to predict quench dynamics, chiral dynamics and jet production in a 1+1-dimensional model of quantum electrodynamics. We find that our predictions closely align with the results of rigorous classical calculations, exhibiting a high degree of accuracy. This hybrid quantum-classical approach illustrates the feasibility of efficiently simulating large-scale QFTs on cutting-edge quantum devices.  ( 2 min )
    Neural General Circulation Models. (arXiv:2311.07222v2 [physics.ao-ph] UPDATED)
    General circulation models (GCMs) are the foundation of weather and climate prediction. GCMs are physics-based simulators which combine a numerical solver for large-scale dynamics with tuned representations for small-scale processes such as cloud formation. Recently, machine learning (ML) models trained on reanalysis data achieved comparable or better skill than GCMs for deterministic weather forecasting. However, these models have not demonstrated improved ensemble forecasts, or shown sufficient stability for long-term weather and climate simulations. Here we present the first GCM that combines a differentiable solver for atmospheric dynamics with ML components, and show that it can generate forecasts of deterministic weather, ensemble weather and climate on par with the best ML and physics-based methods. NeuralGCM is competitive with ML models for 1-10 day forecasts, and with the European Centre for Medium-Range Weather Forecasts ensemble prediction for 1-15 day forecasts. With prescribed sea surface temperature, NeuralGCM can accurately track climate metrics such as global mean temperature for multiple decades, and climate forecasts with 140 km resolution exhibit emergent phenomena such as realistic frequency and trajectories of tropical cyclones. For both weather and climate, our approach offers orders of magnitude computational savings over conventional GCMs. Our results show that end-to-end deep learning is compatible with tasks performed by conventional GCMs, and can enhance the large-scale physical simulations that are essential for understanding and predicting the Earth system.  ( 3 min )
    ChatTraffc: Text-to-Traffic Generation via Diffusion Model. (arXiv:2311.16203v1 [cs.LG])
    Traffic prediction is one of the most significant foundations in Intelligent Transportation Systems (ITS). Traditional traffic prediction methods rely only on historical traffic data to predict traffic trends and face two main challenges. 1) insensitivity to unusual events. 2) poor performance in long-term prediction. In this work, we explore how generative models combined with text describing the traffic system can be applied for traffic generation and name the task Text-to-Traffic Generation (TTG). The key challenge of the TTG task is how to associate text with the spatial structure of the road network and traffic data for generating traffic situations. To this end, we propose ChatTraffic, the first diffusion model for text-to-traffic generation. To guarantee the consistency between synthetic and real data, we augment a diffusion model with the Graph Convolutional Network (GCN) to extract spatial correlations of traffic data. In addition, we construct a large dataset containing text-traffic pairs for the TTG task. We benchmarked our model qualitatively and quantitatively on the released dataset. The experimental results indicate that ChatTraffic can generate realistic traffic situations from the text. Our code and dataset are available at https://github.com/ChyaZhang/ChatTraffic.  ( 2 min )
    LEDITS++: Limitless Image Editing using Text-to-Image Models. (arXiv:2311.16711v1 [cs.CV])
    Text-to-image diffusion models have recently received increasing interest for their astonishing ability to produce high-fidelity images from solely text inputs. Subsequent research efforts aim to exploit and apply their capabilities to real image editing. However, existing image-to-image methods are often inefficient, imprecise, and of limited versatility. They either require time-consuming fine-tuning, deviate unnecessarily strongly from the input image, and/or lack support for multiple, simultaneous edits. To address these issues, we introduce LEDITS++, an efficient yet versatile and precise textual image manipulation technique. LEDITS++'s novel inversion approach requires no tuning nor optimization and produces high-fidelity results with a few diffusion steps. Second, our methodology supports multiple simultaneous edits and is architecture-agnostic. Third, we use a novel implicit masking technique that limits changes to relevant image regions. We propose the novel TEdBench++ benchmark as part of our exhaustive evaluation. Our results demonstrate the capabilities of LEDITS++ and its improvements over previous methods. The project page is available at https://leditsplusplus-project.static.hf.space .  ( 2 min )
    A Combinatorial Approach to Robust PCA. (arXiv:2311.16416v1 [cs.DS])
    We study the problem of recovering Gaussian data under adversarial corruptions when the noises are low-rank and the corruptions are on the coordinate level. Concretely, we assume that the Gaussian noises lie in an unknown $k$-dimensional subspace $U \subseteq \mathbb{R}^d$, and $s$ randomly chosen coordinates of each data point fall into the control of an adversary. This setting models the scenario of learning from high-dimensional yet structured data that are transmitted through a highly-noisy channel, so that the data points are unlikely to be entirely clean. Our main result is an efficient algorithm that, when $ks^2 = O(d)$, recovers every single data point up to a nearly-optimal $\ell_1$ error of $\tilde O(ks/d)$ in expectation. At the core of our proof is a new analysis of the well-known Basis Pursuit (BP) method for recovering a sparse signal, which is known to succeed under additional assumptions (e.g., incoherence or the restricted isometry property) on the underlying subspace $U$. In contrast, we present a novel approach via studying a natural combinatorial problem and show that, over the randomness in the support of the sparse signal, a high-probability error bound is possible even if the subspace $U$ is arbitrary.  ( 2 min )
    Personalized Predictions of Glioblastoma Infiltration: Mathematical Models, Physics-Informed Neural Networks and Multimodal Scans. (arXiv:2311.16536v1 [cs.LG])
    Predicting the infiltration of Glioblastoma (GBM) from medical MRI scans is crucial for understanding tumor growth dynamics and designing personalized radiotherapy treatment plans.Mathematical models of GBM growth can complement the data in the prediction of spatial distributions of tumor cells. However, this requires estimating patient-specific parameters of the model from clinical data, which is a challenging inverse problem due to limited temporal data and the limited time between imaging and diagnosis. This work proposes a method that uses Physics-Informed Neural Networks (PINNs) to estimate patient-specific parameters of a reaction-diffusion PDE model of GBM growth from a single 3D structural MRI snapshot. PINNs embed both the data and the PDE into a loss function, thus integrating theory and data. Key innovations include the identification and estimation of characteristic non-dimensional parameters, a pre-training step that utilizes the non-dimensional parameters and a fine-tuning step to determine the patient specific parameters. Additionally, the diffuse domain method is employed to handle the complex brain geometry within the PINN framework. Our method is validated both on synthetic and patient datasets, and shows promise for real-time parametric inference in the clinical setting for personalized GBM treatment.  ( 2 min )
    Vision-Based Incoming Traffic Estimator Using Deep Neural Network on General Purpose Embedded Hardware. (arXiv:2311.16125v1 [cs.CV])
    Traffic management is a serious problem in many cities around the world. Even the suburban areas are now experiencing regular traffic congestion. Inappropriate traffic control wastes fuel, time, and the productivity of nations. Though traffic signals are used to improve traffic flow, they often cause problems due to inappropriate or obsolete timing that does not tally with the actual traffic intensity at the intersection. Traffic intensity determination based on statistical methods only gives the average intensity expected at any given time. However, to control traffic accurately, it is required to know the real-time traffic intensity. In this research, image processing and machine learning have been used to estimate actual traffic intensity in real time. General-purpose electronic hardware has been used for in-situ image processing based on the edge-detection method. A deep neural network (DNN) was trained to infer traffic intensity in each image in real time. The trained DNN estimated traffic intensity accurately in 90% of the real-time images during road tests. The electronic system was implemented on a Raspberry Pi single-board computer; hence, it is cost-effective for large-scale deployment.  ( 2 min )
    On the Long Range Abilities of Transformers. (arXiv:2311.16620v1 [cs.LG])
    Despite their dominance in modern DL and, especially, NLP domains, transformer architectures exhibit sub-optimal performance on long-range tasks compared to recent layers that are specifically designed for this purpose. In this work, drawing inspiration from key attributes of long-range layers, such as state-space layers, linear RNN layers, and global convolution layers, we demonstrate that minimal modifications to the transformer architecture can significantly enhance performance on the Long Range Arena (LRA) benchmark, thus narrowing the gap with these specialized layers. We identify that two key principles for long-range tasks are (i) incorporating an inductive bias towards smoothness, and (ii) locality. As we show, integrating these ideas into the attention mechanism improves results with a negligible amount of additional computation and without any additional trainable parameters. Our theory and experiments also shed light on the reasons for the inferior performance of transformers on long-range tasks and identify critical properties that are essential for successfully capturing long-range dependencies.  ( 2 min )
    Forward Gradients for Data-Driven CFD Wall Modeling. (arXiv:2311.11876v2 [physics.flu-dyn] UPDATED)
    Computational Fluid Dynamics (CFD) is used in the design and optimization of gas turbines and many other industrial/ scientific applications. However, the practical use is often limited by the high computational cost, and the accurate resolution of near-wall flow is a significant contributor to this cost. Machine learning (ML) and other data-driven methods can complement existing wall models. Nevertheless, training these models is bottlenecked by the large computational effort and memory footprint demanded by back-propagation. Recent work has presented alternatives for computing gradients of neural networks where a separate forward and backward sweep is not needed and storage of intermediate results between sweeps is not required because an unbiased estimator for the gradient is computed in a single forward sweep. In this paper, we discuss the application of this approach for training a subgrid wall model that could potentially be used as a surrogate in wall-bounded flow CFD simulations to reduce the computational overhead while preserving predictive accuracy.  ( 2 min )
    LasTGL: An Industrial Framework for Large-Scale Temporal Graph Learning. (arXiv:2311.16605v1 [cs.LG])
    Over the past few years, graph neural networks (GNNs) have become powerful and practical tools for learning on (static) graph-structure data. However, many real-world applications, such as social networks and e-commerce, involve temporal graphs where nodes and edges are dynamically evolving. Temporal graph neural networks (TGNNs) have progressively emerged as an extension of GNNs to address time-evolving graphs and have gradually become a trending research topic in both academics and industry. Advancing research in such an emerging field requires new tools to compose TGNN models and unify their different schemes in dealing with temporal graphs. To facilitate research and application in temporal graph learning, we introduce LasTGL, an industrial framework that integrates unified and extensible implementations of common temporal graph learning algorithms for various advanced tasks. The purpose of LasTGL is to provide the essential building blocks for solving temporal graph learning tasks, focusing on the guiding principles of user-friendliness and quick prototyping on which PyTorch is based. In particular, LasTGL provides comprehensive temporal graph datasets, TGNN models and utilities along with well-documented tutorials, making it suitable for both absolute beginners and expert deep learning practitioners alike.  ( 2 min )
    Beyond Labels: Advancing Cluster Analysis with the Entropy of Distance Distribution (EDD). (arXiv:2311.16621v1 [stat.ML])
    In the evolving landscape of data science, the accurate quantification of clustering in high-dimensional data sets remains a significant challenge, especially in the absence of predefined labels. This paper introduces a novel approach, the Entropy of Distance Distribution (EDD), which represents a paradigm shift in label-free clustering analysis. Traditional methods, reliant on discrete labels, often struggle to discern intricate cluster patterns in unlabeled data. EDD, however, leverages the characteristic differences in pairwise point-to-point distances to discern clustering tendencies, independent of data labeling. Our method employs the Shannon information entropy to quantify the 'peakedness' or 'flatness' of distance distributions in a data set. This entropy measure, normalized against its maximum value, effectively distinguishes between strongly clustered data (indicated by pronounced peaks in distance distribution) and more homogeneous, non-clustered data sets. This label-free quantification is resilient against global translations and permutations of data points, and with an additional dimension-wise z-scoring, it becomes invariant to data set scaling. We demonstrate the efficacy of EDD through a series of experiments involving two-dimensional data spaces with Gaussian cluster centers. Our findings reveal a monotonic increase in the EDD value with the widening of cluster widths, moving from well-separated to overlapping clusters. This behavior underscores the method's sensitivity and accuracy in detecting varying degrees of clustering. EDD's potential extends beyond conventional clustering analysis, offering a robust, scalable tool for unraveling complex data structures without reliance on pre-assigned labels.  ( 2 min )
    Opening the Black Box: Towards inherently interpretable energy data imputation models using building physics insight. (arXiv:2311.16632v1 [stat.ML])
    Missing data are frequently observed by practitioners and researchers in the building energy modeling community. In this regard, advanced data-driven solutions, such as Deep Learning methods, are typically required to reflect the non-linear behavior of these anomalies. As an ongoing research question related to Deep Learning, a model's applicability to limited data settings can be explored by introducing prior knowledge in the network. This same strategy can also lead to more interpretable predictions, hence facilitating the field application of the approach. For that purpose, the aim of this paper is to propose the use of Physics-informed Denoising Autoencoders (PI-DAE) for missing data imputation in commercial buildings. In particular, the presented method enforces physics-inspired soft constraints to the loss function of a Denoising Autoencoder (DAE). In order to quantify the benefits of the physical component, an ablation study between different DAE configurations is conducted. First, three univariate DAEs are optimized separately on indoor air temperature, heating, and cooling data. Then, two multivariate DAEs are derived from the previous configurations. Eventually, a building thermal balance equation is coupled to the last multivariate configuration to obtain PI-DAE. Additionally, two commonly used benchmarks are employed to support the findings. It is shown how introducing physical knowledge in a multivariate Denoising Autoencoder can enhance the inherent model interpretability through the optimized physics-based coefficients. While no significant improvement is observed in terms of reconstruction error with the proposed PI-DAE, its enhanced robustness to varying rates of missing data and the valuable insights derived from the physics-based coefficients create opportunities for wider applications within building systems and the built environment.  ( 3 min )
    GSP-KalmanNet: Tracking Graph Signals via Neural-Aided Kalman Filtering. (arXiv:2311.16602v1 [eess.SP])
    Dynamic systems of graph signals are encountered in various applications, including social networks, power grids, and transportation. While such systems can often be described as state space (SS) models, tracking graph signals via conventional tools based on the Kalman filter (KF) and its variants is typically challenging. This is due to the nonlinearity, high dimensionality, irregularity of the domain, and complex modeling associated with real-world dynamic systems of graph signals. In this work, we study the tracking of graph signals using a hybrid model-based/data-driven approach. We develop the GSP-KalmanNet, which tracks the hidden graphical states from the graphical measurements by jointly leveraging graph signal processing (GSP) tools and deep learning (DL) techniques. The derivations of the GSP-KalmanNet are based on extending the KF to exploit the inherent graph structure via graph frequency domain filtering, which considerably simplifies the computational complexity entailed in processing high-dimensional signals and increases the robustness to small topology changes. Then, we use data to learn the Kalman gain following the recently proposed KalmanNet framework, which copes with partial and approximated modeling, without forcing a specific model over the noise statistics. Our empirical results demonstrate that the proposed GSP-KalmanNet achieves enhanced accuracy and run time performance as well as improved robustness to model misspecifications compared with both model-based and data-driven benchmarks.  ( 2 min )
    Replay across Experiments: A Natural Extension of Off-Policy RL. (arXiv:2311.15951v2 [cs.LG] UPDATED)
    Replaying data is a principal mechanism underlying the stability and data efficiency of off-policy reinforcement learning (RL). We present an effective yet simple framework to extend the use of replays across multiple experiments, minimally adapting the RL workflow for sizeable improvements in controller performance and research iteration times. At its core, Replay Across Experiments (RaE) involves reusing experience from previous experiments to improve exploration and bootstrap learning while reducing required changes to a minimum in comparison to prior work. We empirically show benefits across a number of RL algorithms and challenging control domains spanning both locomotion and manipulation, including hard exploration tasks from egocentric vision. Through comprehensive ablations, we demonstrate robustness to the quality and amount of data available and various hyperparameter choices. Finally, we discuss how our approach can be applied more broadly across research life cycles and can increase resilience by reloading data across random seeds or hyperparameter variations.  ( 2 min )
    Generation of patient specific cardiac chamber models using generative neural networks under a Bayesian framework for electroanatomical mapping. (arXiv:2311.16197v1 [eess.IV])
    Electroanatomical mapping is a technique used in cardiology to create a detailed 3D map of the electrical activity in the heart. It is useful for diagnosis, treatment planning and real time guidance in cardiac ablation procedures to treat arrhythmias like atrial fibrillation. A probabilistic machine learning model trained on a library of CT/MRI scans of the heart can be used during electroanatomical mapping to generate a patient-specific 3D model of the chamber being mapped. The use of probabilistic machine learning models under a Bayesian framework provides a way to quantify uncertainty in results and provide a natural framework of interpretability of the model. Here we introduce a Bayesian approach to surface reconstruction of cardiac chamber models from a sparse 3D point cloud data acquired during electroanatomical mapping. We show how probabilistic graphical models trained on segmented CT/MRI data can be used to generate cardiac chamber models from few acquired locations thereby reducing procedure time and x-ray exposure. We show how they provide insight into what the neural network learns from the segmented CT/MRI images used to train the network, which provides explainability to the resulting cardiac chamber models generated by the model.  ( 2 min )
    More is Better in Modern Machine Learning: when Infinite Overparameterization is Optimal and Overfitting is Obligatory. (arXiv:2311.14646v2 [cs.LG] UPDATED)
    In our era of enormous neural networks, empirical progress has been driven by the philosophy that more is better. Recent deep learning practice has found repeatedly that larger model size, more data, and more computation (resulting in lower training loss) improves performance. In this paper, we give theoretical backing to these empirical observations by showing that these three properties hold in random feature (RF) regression, a class of models equivalent to shallow networks with only the last layer trained. Concretely, we first show that the test risk of RF regression decreases monotonically with both the number of features and the number of samples, provided the ridge penalty is tuned optimally. In particular, this implies that infinite width RF architectures are preferable to those of any finite width. We then proceed to demonstrate that, for a large class of tasks characterized by powerlaw eigenstructure, training to near-zero training loss is obligatory: near-optimal performance can only be achieved when the training error is much smaller than the test error. Grounding our theory in real-world data, we find empirically that standard computer vision tasks with convolutional neural tangent kernels clearly fall into this class. Taken together, our results tell a simple, testable story of the benefits of overparameterization, overfitting, and more data in random feature models.  ( 3 min )
    Towards Optimizing with Large Language Models. (arXiv:2310.05204v2 [cs.LG] UPDATED)
    In this work, we conduct an assessment of the optimization capabilities of LLMs across various tasks and data sizes. Each of these tasks corresponds to unique optimization domains, and LLMs are required to execute these tasks with interactive prompting. That is, in each optimization step, the LLM generates new solutions from the past generated solutions with their values, and then the new solutions are evaluated and considered in the next optimization step. Additionally, we introduce three distinct metrics for a comprehensive assessment of task performance from various perspectives. These metrics offer the advantage of being applicable for evaluating LLM performance across a broad spectrum of optimization tasks and are less sensitive to variations in test samples. By applying these metrics, we observe that LLMs exhibit strong optimization capabilities when dealing with small-sized samples. However, their performance is significantly influenced by factors like data size and values, underscoring the importance of further research in the domain of optimization tasks for LLMs.  ( 2 min )
    Constrained Optimization of Rank-One Functions with Indicator Variables. (arXiv:2303.18158v2 [math.OC] UPDATED)
    Optimization problems involving minimization of a rank-one convex function over constraints modeling restrictions on the support of the decision variables emerge in various machine learning applications. These problems are often modeled with indicator variables for identifying the support of the continuous variables. In this paper we investigate compact extended formulations for such problems through perspective reformulation techniques. In contrast to the majority of previous work that relies on support function arguments and disjunctive programming techniques to provide convex hull results, we propose a constructive approach that exploits a hidden conic structure induced by perspective functions. To this end, we first establish a convex hull result for a general conic mixed-binary set in which each conic constraint involves a linear function of independent continuous variables and a set of binary variables. We then demonstrate that extended representations of sets associated with epigraphs of rank-one convex functions over constraints modeling indicator relations naturally admit such a conic representation. This enables us to systematically give perspective formulations for the convex hull descriptions of these sets with nonlinear separable or non-separable objective functions, sign constraints on continuous variables, and combinatorial constraints on indicator variables. We illustrate the efficacy of our results on sparse nonnegative logistic regression problems.  ( 2 min )
    Training Image Derivatives: Increased Accuracy and Universal Robustness. (arXiv:2310.14045v2 [cs.LG] UPDATED)
    Derivative training is a known method that significantly improves the accuracy of neural networks in some low-dimensional applications. In this paper, a similar improvement is obtained for an image analysis problem: reconstructing the vertices of a cube from its image. By training the derivatives with respect to the 6 degrees of freedom of the cube, we obtain 25 times more accurate results for noiseless inputs. The derivatives also offer insight into the robustness problem, which is currently understood in terms of two types of network vulnerabilities. The first type involves small perturbations that dramatically change the output, and the second type relates to substantial image changes that the network erroneously ignores. Defense against each is possible, but safeguarding against both while maintaining the accuracy defies conventional training methods. The first type is analyzed using the network's gradient, while the second relies on human input evaluation, serving as an oracle substitute. For the task at hand, the nearest neighbor oracle can be defined and expanded into Taylor series using image derivatives. This allows for a robustness analysis that unifies both types of vulnerabilities and enables training where accuracy and universal robustness are limited only by network capacity.  ( 2 min )
    Big Data Analytics for Network Level Short-Term Travel Time Prediction with Hierarchical LSTM. (arXiv:2201.05760v3 [cs.LG] UPDATED)
    The travel time data collected from widespread traffic monitoring sensors necessitate big data analytic tools for querying, visualization, and identifying meaningful traffic patterns. This paper utilizes a large-scale travel time dataset from Caltrans Performance Measurement System (PeMS) system that is an overflow for traditional data processing and modeling tools. To overcome the challenges of the massive amount of data, the big data analytic engines Apache Spark and Apache MXNet are applied for data wrangling and modeling. Seasonality and autocorrelation were performed to explore and visualize the trend of time-varying data. Inspired by the success of the hierarchical architecture for many Artificial Intelligent (AI) tasks, we consolidate the cell and hidden states passed from low-level to the high-level LSTM with an attention pooling similar to how the human perception system operates. The designed hierarchical LSTM model can consider the dependencies at different time scales to capture the spatial-temporal correlations of network-level travel time. Another self-attention module is then devised to connect LSTM extracted features to the fully connected layers, predicting travel time for all corridors instead of a single link/route. The comparison results show that the Hierarchical LSTM with Attention (HierLSTMat) model gives the best prediction results at 30-minute and 45-min horizons and can successfully forecast unusual congestion. The efficiency gained from big data analytic tools was evaluated by comparing them with popular data science and deep learning frameworks.  ( 3 min )
    Long Range Graph Benchmark. (arXiv:2206.08164v4 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) that are based on the message passing (MP) paradigm generally exchange information between 1-hop neighbors to build node representations at each layer. In principle, such networks are not able to capture long-range interactions (LRI) that may be desired or necessary for learning a given task on graphs. Recently, there has been an increasing interest in development of Transformer-based methods for graphs that can consider full node connectivity beyond the original sparse structure, thus enabling the modeling of LRI. However, MP-GNNs that simply rely on 1-hop message passing often fare better in several existing graph benchmarks when combined with positional feature representations, among other innovations, hence limiting the perceived utility and ranking of Transformer-like architectures. Here, we present the Long Range Graph Benchmark (LRGB) with 5 graph learning datasets: PascalVOC-SP, COCO-SP, PCQM-Contact, Peptides-func and Peptides-struct that arguably require LRI reasoning to achieve strong performance in a given task. We benchmark both baseline GNNs and Graph Transformer networks to verify that the models which capture long-range dependencies perform significantly better on these tasks. Therefore, these datasets are suitable for benchmarking and exploration of MP-GNNs and Graph Transformer architectures that are intended to capture LRI.  ( 3 min )
    GraSS: Contrastive Learning with Gradient Guided Sampling Strategy for Remote Sensing Image Semantic Segmentation. (arXiv:2306.15868v3 [cs.LG] UPDATED)
    Self-supervised contrastive learning (SSCL) has achieved significant milestones in remote sensing image (RSI) understanding. Its essence lies in designing an unsupervised instance discrimination pretext task to extract image features from a large number of unlabeled images that are beneficial for downstream tasks. However, existing instance discrimination based SSCL suffer from two limitations when applied to the RSI semantic segmentation task: 1) Positive sample confounding issue; 2) Feature adaptation bias. It introduces a feature adaptation bias when applied to semantic segmentation tasks that require pixel-level or object-level features. In this study, We observed that the discrimination information can be mapped to specific regions in RSI through the gradient of unsupervised contrastive loss, these specific regions tend to contain singular ground objects. Based on this, we propose contrastive learning with Gradient guided Sampling Strategy (GraSS) for RSI semantic segmentation. GraSS consists of two stages: Instance Discrimination warm-up (ID warm-up) and Gradient guided Sampling contrastive training (GS training). The ID warm-up aims to provide initial discrimination information to the contrastive loss gradients. The GS training stage aims to utilize the discrimination information contained in the contrastive loss gradients and adaptively select regions in RSI patches that contain more singular ground objects, in order to construct new positive and negative samples. Experimental results on three open datasets demonstrate that GraSS effectively enhances the performance of SSCL in high-resolution RSI semantic segmentation. Compared to seven baseline methods from five different types of SSCL, GraSS achieves an average improvement of 1.57\% and a maximum improvement of 3.58\% in terms of mean intersection over the union. The source code is available at https://github.com/GeoX-Lab/GraSS  ( 3 min )
    BatteryML:An Open-source platform for Machine Learning on Battery Degradation. (arXiv:2310.14714v2 [cs.LG] UPDATED)
    Battery degradation remains a pivotal concern in the energy storage domain, with machine learning emerging as a potent tool to drive forward insights and solutions. However, this intersection of electrochemical science and machine learning poses complex challenges. Machine learning experts often grapple with the intricacies of battery science, while battery researchers face hurdles in adapting intricate models tailored to specific datasets. Beyond this, a cohesive standard for battery degradation modeling, inclusive of data formats and evaluative benchmarks, is conspicuously absent. Recognizing these impediments, we present BatteryML - a one-step, all-encompass, and open-source platform designed to unify data preprocessing, feature extraction, and the implementation of both traditional and state-of-the-art models. This streamlined approach promises to enhance the practicality and efficiency of research applications. BatteryML seeks to fill this void, fostering an environment where experts from diverse specializations can collaboratively contribute, thus elevating the collective understanding and advancement of battery research.The code for our project is publicly available on GitHub at https://github.com/microsoft/BatteryML.  ( 2 min )
    Diffusion Models for Interferometric Satellite Aperture Radar. (arXiv:2308.16847v2 [cs.CV] UPDATED)
    Probabilistic Diffusion Models (PDMs) have recently emerged as a very promising class of generative models, achieving high performance in natural image generation. However, their performance relative to non-natural images, like radar-based satellite data, remains largely unknown. Generating large amounts of synthetic (and especially labelled) satellite data is crucial to implement deep-learning approaches for the processing and analysis of (interferometric) satellite aperture radar data. Here, we leverage PDMs to generate several radar-based satellite image datasets. We show that PDMs succeed in generating images with complex and realistic structures, but that sampling time remains an issue. Indeed, accelerated sampling strategies, which work well on simple image datasets like MNIST, fail on our radar datasets. We provide a simple and versatile open-source https://github.com/thomaskerdreux/PDM_SAR_InSAR_generation to train, sample and evaluate PDMs using any dataset on a single GPU.  ( 2 min )
    Stitched ViTs are Flexible Vision Backbones. (arXiv:2307.00154v2 [cs.CV] UPDATED)
    Large pretrained plain vision Transformers (ViTs) have been the workhorse for many downstream tasks. However, existing works utilizing off-the-shelf ViTs are inefficient in terms of training and deployment, because adopting ViTs with individual sizes requires separate trainings and is restricted by fixed performance-efficiency trade-offs. In this paper, we are inspired by stitchable neural networks (SN-Net), which is a new framework that cheaply produces a single model that covers rich subnetworks by stitching pretrained model families, supporting diverse performance-efficiency trade-offs at runtime. Building upon this foundation, we introduce SN-Netv2, a systematically improved model stitching framework to facilitate downstream task adaptation. Specifically, we first propose a two-way stitching scheme to enlarge the stitching space. We then design a resource-constrained sampling strategy that takes into account the underlying FLOPs distributions in the space for better sampling. Finally, we observe that learning stitching layers as a low-rank update plays an essential role on downstream tasks to stabilize training and ensure a good Pareto frontier. With extensive experiments on ImageNet-1K, ADE20K, COCO-Stuff-10K and NYUv2, SN-Netv2 demonstrates superior performance over SN-Netv1 on downstream dense predictions and shows strong ability as a flexible vision backbone, achieving great advantages in both training efficiency and deployment flexibility. Code is available at https://github.com/ziplab/SN-Netv2.
    Revisiting LARS for Large Batch Training Generalization of Neural Networks. (arXiv:2309.14053v2 [cs.LG] UPDATED)
    LARS and LAMB have emerged as prominent techniques in Large Batch Learning (LBL) to ensure training stability in AI. Convergence stability is a challenge in LBL, where the AI agent usually gets trapped in the sharp minimizer. To address this challenge, warm-up is an efficient technique, but it lacks a strong theoretical foundation. Specifically, the warm-up process often reduces gradients in the early phase, inadvertently preventing the agent from escaping the sharp minimizer early on. In light of this situation, we conduct empirical experiments to analyze the behaviors of LARS and LAMB with and without a warm-up strategy. Our analyses give a comprehensive insight into the behaviors of LARS, LAMB, and the necessity of a warm-up technique in LBL, including an explanation of their failure in many cases. Building upon these insights, we propose a novel algorithm called Time Varying LARS (TVLARS), which facilitates robust training in the initial phase without the need for warm-up. A configurable sigmoid-like function is employed in TVLARS to replace the warm-up process to enhance training stability. Moreover, TVLARS stimulates gradient exploration in the early phase, thus allowing it to surpass the sharp minimizes early on and gradually transition to LARS and achieving robustness of LARS in the latter phases. Extensive experimental evaluations reveal that TVLARS consistently outperforms LARS and LAMB in most cases, with improvements of up to 2% in classification scenarios. Notably, in every case of self-supervised learning, TVLARS dominates LARS and LAMB with performance improvements of up to 10%.
    The Predicted-Updates Dynamic Model: Offline, Incremental, and Decremental to Fully Dynamic Transformations. (arXiv:2307.08890v2 [cs.DS] UPDATED)
    We formulate the predicted-updates dynamic model, one of the first beyond-worst-case models for dynamic algorithms, which generalizes a large set of well-studied dynamic models including the offline dynamic, incremental, and decremental models to the fully dynamic setting when given predictions about the update times of the elements. In the most basic form of our model, we receive a set of predicted update times for all of the updates that occur over the event horizon. We give a novel framework that "lifts" offline divide-and-conquer algorithms into the fully dynamic setting with little overhead. Using this, we are able to interpolate between the offline and fully dynamic settings; when the $\ell_1$ error of the prediction is linear in the number of updates, we achieve the offline runtime of the algorithm (up to $\mathrm{poly} \log n$ factors). Provided a fully dynamic backstop algorithm, our algorithm will never do worse than the backstop algorithm regardless of the prediction error. Furthermore, our framework achieves a smooth linear trade-off between $\ell_1$ error in the predictions and runtime. These correspond to the desiderata of consistency, robustness, and graceful degradation of the algorithms-with-predictions literature. We further extend our techniques to incremental and decremental settings, transforming algorithms in these settings when given predictions of only the deletion and insertion times, respectively. Our framework is general, and we apply it to obtain improved efficiency bounds over the state-of-the-art dynamic algorithms for a variety of problems including triconnectivity, planar digraph all pairs shortest paths, $k$-edge connectivity, and others, for prediction error of reasonable magnitude.
    StyleCap: Automatic Speaking-Style Captioning from Speech Based on Speech and Language Self-supervised Learning Models. (arXiv:2311.16509v1 [cs.CL])
    We propose StyleCap, a method to generate natural language descriptions of speaking styles appearing in speech. Although most of conventional techniques for para-/non-linguistic information recognition focus on the category classification or the intensity estimation of pre-defined labels, they cannot provide the reasoning of the recognition result in an interpretable manner. As a first step towards an end-to-end method for generating speaking-style prompts from speech, i.e., automatic speaking-style captioning, StyleCap uses paired data of speech and natural language descriptions to train neural networks that predict prefix vectors fed into a large language model (LLM)-based text decoder from a speech representation vector. We explore an appropriate text decoder and speech feature representation suitable for this new task. The experimental results demonstrate that our StyleCap leveraging richer LLMs for the text decoder, speech self-supervised learning (SSL) features, and sentence rephrasing augmentation improves the accuracy and diversity of generated speaking-style captions. Samples of speaking-style captions generated by our StyleCap are publicly available.
    Federated Learning with Diffusion Models for Privacy-Sensitive Vision Tasks. (arXiv:2311.16538v1 [cs.LG])
    Diffusion models have shown great potential for vision-related tasks, particularly for image generation. However, their training is typically conducted in a centralized manner, relying on data collected from publicly available sources. This approach may not be feasible or practical in many domains, such as the medical field, which involves privacy concerns over data collection. Despite the challenges associated with privacy-sensitive data, such domains could still benefit from valuable vision services provided by diffusion models. Federated learning (FL) plays a crucial role in enabling decentralized model training without compromising data privacy. Instead of collecting data, an FL system gathers model parameters, effectively safeguarding the private data of different parties involved. This makes FL systems vital for managing decentralized learning tasks, especially in scenarios where privacy-sensitive data is distributed across a network of clients. Nonetheless, FL presents its own set of challenges due to its distributed nature and privacy-preserving properties. Therefore, in this study, we explore the FL strategy to train diffusion models, paving the way for the development of federated diffusion models. We conduct experiments on various FL scenarios, and our findings demonstrate that federated diffusion models have great potential to deliver vision services to privacy-sensitive domains.
    Negative Feedback Training: A Novel Concept to Improve Robustness of NVCIM DNN Accelerators. (arXiv:2305.14561v2 [cs.LG] UPDATED)
    Compute-in-memory (CIM) accelerators built upon non-volatile memory (NVM) devices excel in energy efficiency and latency when performing Deep Neural Network (DNN) inference, thanks to their in-situ data processing capability. However, the stochastic nature and intrinsic variations of NVM devices often result in performance degradation in DNN inference. Introducing these non-ideal device behaviors during DNN training enhances robustness, but drawbacks include limited accuracy improvement, reduced prediction confidence, and convergence issues. This arises from a mismatch between the deterministic training and non-deterministic device variations, as such training, though considering variations, relies solely on the model's final output. In this work, we draw inspiration from the control theory and propose a novel training concept: Negative Feedback Training (NFT) leveraging the multi-scale noisy information captured from network. We develop two specific NFT instances, Oriented Variational Forward (OVF) and Intermediate Representation Snapshot (IRS). Extensive experiments show that our methods outperform existing state-of-the-art methods with up to a 46.71% improvement in inference accuracy while reducing epistemic uncertainty, boosting output confidence, and improving convergence probability. Their effectiveness highlights the generality and practicality of our NFT concept in enhancing DNN robustness against device variations.
    LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models. (arXiv:2310.08659v4 [cs.CL] UPDATED)
    Quantization is an indispensable technique for serving Large Language Models (LLMs) and has recently found its way into LoRA fine-tuning. In this work we focus on the scenario where quantization and LoRA fine-tuning are applied together on a pre-trained model. In such cases it is common to observe a consistent gap in the performance on downstream tasks between full fine-tuning and quantization plus LoRA fine-tuning approach. In response, we propose LoftQ (LoRA-Fine-Tuning-aware Quantization), a novel quantization framework that simultaneously quantizes an LLM and finds a proper low-rank initialization for LoRA fine-tuning. Such an initialization alleviates the discrepancy between the quantized and full-precision model and significantly improves generalization in downstream tasks. We evaluate our method on natural language understanding, question answering, summarization, and natural language generation tasks. Experiments show that our method is highly effective and outperforms existing quantization methods, especially in the challenging 2-bit and 2/4-bit mixed precision regimes. The code is available on https://github.com/yxli2123/LoftQ.
    Distributed Differentiable Dynamic Game for Multi-robot Coordination. (arXiv:2207.08892v3 [cs.RO] UPDATED)
    This paper develops a Distributed Differentiable Dynamic Game (D3G) framework, which can efficiently solve the forward and inverse problems in multi-robot coordination. We formulate multi-robot coordination as a dynamic game, where the behavior of a robot is dictated by its own dynamics and objective that also depends on others' behavior. In the forward problem, D3G enables all robots collaboratively to seek the Nash equilibrium of the game in a distributed manner, by developing a distributed shooting-based Nash solver. In the inverse problem, where each robot aims to find (learn) its objective (and dynamics) parameters to mimic given coordination demonstrations, D3G proposes a differentiation solver based on Differential Pontryagin's Maximum Principle, which allows each robot to update its parameters in a distributed and coordinated manner. We test the D3G in simulation with two types of robots given different task configurations. The results demonstrate the effectiveness of D3G for solving both forward and inverse problems in comparison with existing methods.  ( 2 min )
    DSI++: Updating Transformer Memory with New Documents. (arXiv:2212.09744v2 [cs.CL] UPDATED)
    Differentiable Search Indices (DSIs) encode a corpus of documents in model parameters and use the same model to answer user queries directly. Despite the strong performance of DSI models, deploying them in situations where the corpus changes over time is computationally expensive because reindexing the corpus requires re-training the model. In this work, we introduce DSI++, a continual learning challenge for DSI to incrementally index new documents while being able to answer queries related to both previously and newly indexed documents. Across different model scales and document identifier representations, we show that continual indexing of new documents leads to considerable forgetting of previously indexed documents. We also hypothesize and verify that the model experiences forgetting events during training, leading to unstable learning. To mitigate these issues, we investigate two approaches. The first focuses on modifying the training dynamics. Flatter minima implicitly alleviate forgetting, so we optimize for flatter loss basins and show that the model stably memorizes more documents ($+12\%$). Next, we introduce a generative memory to sample pseudo-queries for documents and supplement them during continual indexing to prevent forgetting for the retrieval task. Extensive experiments on novel continual indexing benchmarks based on Natural Questions (NQ) and MS MARCO demonstrate that our proposed solution mitigates forgetting significantly. Concretely, it improves the average Hits@10 by $+21.1\%$ over competitive baselines for NQ and requires $6$ times fewer model updates compared to re-training the DSI model for incrementally indexing five corpora in a sequence.  ( 3 min )
    Leveraging Out-of-Domain Data for Domain-Specific Prompt Tuning in Multi-Modal Fake News Detection. (arXiv:2311.16496v1 [cs.LG])
    The spread of fake news using out-of-context images has become widespread and is a challenging task in this era of information overload. Since annotating huge amounts of such data requires significant time of domain experts, it is imperative to develop methods which can work in limited annotated data scenarios. In this work, we explore whether out-of-domain data can help to improve out-of-context misinformation detection (termed here as multi-modal fake news detection) of a desired domain, eg. politics, healthcare, etc. Towards this goal, we propose a novel framework termed DPOD (Domain-specific Prompt-tuning using Out-of-Domain data). First, to compute generalizable features, we modify the Vision-Language Model, CLIP to extract features that helps to align the representations of the images and corresponding text captions of both the in-domain and out-of-domain data in a label-aware manner. Further, we propose a domain-specific prompt learning technique which leverages the training samples of all the available domains based on the the extent they can be useful to the desired domain. Extensive experiments on a large-scale benchmark dataset, namely NewsClippings demonstrate that the proposed framework achieves state of-the-art performance, significantly surpassing the existing approaches for this challenging task.
    Geometry-Aware Adaptation for Pretrained Models. (arXiv:2307.12226v2 [cs.LG] UPDATED)
    Machine learning models -- including prominent zero-shot models -- are often trained on datasets whose labels are only a small proportion of a larger label space. Such spaces are commonly equipped with a metric that relates the labels via distances between them. We propose a simple approach to exploit this information to adapt the trained model to reliably predict new classes -- or, in the case of zero-shot prediction, to improve its performance -- without any additional training. Our technique is a drop-in replacement of the standard prediction rule, swapping argmax with the Fr\'echet mean. We provide a comprehensive theoretical analysis for this approach, studying (i) learning-theoretic results trading off label space diameter, sample complexity, and model dimension, (ii) characterizations of the full range of scenarios in which it is possible to predict any unobserved class, and (iii) an optimal active learning-like next class selection procedure to obtain optimal training classes for when it is not possible to predict the entire range of unobserved classes. Empirically, using easily-available external metrics, our proposed approach, Loki, gains up to 29.7% relative improvement over SimCLR on ImageNet and scales to hundreds of thousands of classes. When no such metric is available, Loki can use self-derived metrics from class embeddings and obtains a 10.5% improvement on pretrained zero-shot models such as CLIP.
    Curve Your Enthusiasm: Concurvity Regularization in Differentiable Generalized Additive Models. (arXiv:2305.11475v3 [cs.LG] UPDATED)
    Generalized Additive Models (GAMs) have recently experienced a resurgence in popularity due to their interpretability, which arises from expressing the target value as a sum of non-linear transformations of the features. Despite the current enthusiasm for GAMs, their susceptibility to concurvity - i.e., (possibly non-linear) dependencies between the features - has hitherto been largely overlooked. Here, we demonstrate how concurvity can severly impair the interpretability of GAMs and propose a remedy: a conceptually simple, yet effective regularizer which penalizes pairwise correlations of the non-linearly transformed feature variables. This procedure is applicable to any differentiable additive model, such as Neural Additive Models or NeuralProphet, and enhances interpretability by eliminating ambiguities due to self-canceling feature contributions. We validate the effectiveness of our regularizer in experiments on synthetic as well as real-world datasets for time-series and tabular data. Our experiments show that concurvity in GAMs can be reduced without significantly compromising prediction quality, improving interpretability and reducing variance in the feature importances.  ( 2 min )
    Unsupervised Image Outlier Detection using RANSAC. (arXiv:2307.12301v2 [cs.CV] UPDATED)
    Image outlier detection (OD) is an essential tool to ensure the quality and accuracy of image datasets used in computer vision tasks. Most existing approaches, however, require a set of in-distribution data for training prior to outlier prediction. The quality and quantity of the data can influence the resulting performance. Thus, selecting a suitable in-distribution set often requires considerable effort. In this work, we propose RANSAC-NN, an unsupervised image OD algorithm designed to detect outliers within contaminated sets in a one-class classification fashion. Without any training, RANSAC-NN performs favorably in comparison to other well-established methods in a variety of OD benchmarks. Furthermore, we show that our method can enhance the robustness of existing OD methods by simply applying RANSAC-NN during pre-processing.
    MultiModal-Learning for Predicting Molecular Properties: A Framework Based on Image and Graph Structures. (arXiv:2311.16666v1 [cs.LG])
    The quest for accurate prediction of drug molecule properties poses a fundamental challenge in the realm of Artificial Intelligence Drug Discovery (AIDD). An effective representation of drug molecules emerges as a pivotal component in this pursuit. Contemporary leading-edge research predominantly resorts to self-supervised learning (SSL) techniques to extract meaningful structural representations from large-scale, unlabeled molecular data, subsequently fine-tuning these representations for an array of downstream tasks. However, an inherent shortcoming of these studies lies in their singular reliance on one modality of molecular information, such as molecule image or SMILES representations, thus neglecting the potential complementarity of various molecular modalities. In response to this limitation, we propose MolIG, a novel MultiModaL molecular pre-training framework for predicting molecular properties based on Image and Graph structures. MolIG model innovatively leverages the coherence and correlation between molecule graph and molecule image to execute self-supervised tasks, effectively amalgamating the strengths of both molecular representation forms. This holistic approach allows for the capture of pivotal molecular structural characteristics and high-level semantic information. Upon completion of pre-training, Graph Neural Network (GNN) Encoder is used for the prediction of downstream tasks. In comparison to advanced baseline models, MolIG exhibits enhanced performance in downstream tasks pertaining to molecular property prediction within benchmark groups such as MoleculeNet Benchmark Group and ADMET Benchmark Group.
    SR-OOD: Out-of-Distribution Detection via Sample Repairing. (arXiv:2305.18228v2 [cs.LG] UPDATED)
    Out-of-distribution (OOD) detection is a crucial task for ensuring the reliability and robustness of machine learning models. Recent works have shown that generative models often assign high confidence scores to OOD samples, indicating that they fail to capture the semantic information of the data. To tackle this problem, we take advantage of sample repairing and propose a novel OOD detection framework, namely SR-OOD. Our framework leverages the idea that repairing an OOD sample can reveal its semantic inconsistency with the in-distribution data. Specifically, our framework consists of two components: a sample repairing module and a detection module. The sample repairing module applies erosion to an input sample and uses a generative adversarial network to repair it. The detection module then determines whether the input sample is OOD using a distance metric. Our framework does not require any additional data or label information for detection, making it applicable to various scenarios. We conduct extensive experiments on three image datasets: CIFAR-10, CelebA, and Pokemon. The results demonstrate that our approach achieves superior performance over the state-of-the-art generative methods in OOD detection.  ( 2 min )
    Masked Diffusion Models Are Fast Distribution Learners. (arXiv:2306.11363v4 [cs.CV] UPDATED)
    Diffusion model has emerged as the \emph{de-facto} model for image generation, yet the heavy training overhead hinders its broader adoption in the research community. We observe that diffusion models are commonly trained to learn all fine-grained visual information from scratch. This paradigm may cause unnecessary training costs hence requiring in-depth investigation. In this work, we show that it suffices to train a strong diffusion model by first pre-training the model to learn some primer distribution that loosely characterizes the unknown real image distribution. Then the pre-trained model can be fine-tuned for various generation tasks efficiently. In the pre-training stage, we propose to mask a high proportion (e.g., up to 90\%) of input images to approximately represent the primer distribution and introduce a masked denoising score matching objective to train a model to denoise visible areas. In subsequent fine-tuning stage, we efficiently train diffusion model without masking. Utilizing the two-stage training framework, we achieves significant training acceleration and a new FID score record of 6.27 on CelebA-HQ $256 \times 256$ for ViT-based diffusion models. The generalizability of a pre-trained model further helps building models that perform better than ones trained from scratch on different downstream datasets. For instance, a diffusion model pre-trained on VGGFace2 attains a 46\% quality improvement when fine-tuned on a different dataset that contains only 3000 images. Our code is available at \url{https://github.com/jiachenlei/maskdm}.
    Edge Directionality Improves Learning on Heterophilic Graphs. (arXiv:2305.10498v3 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) have become the de-facto standard tool for modeling relational data. However, while many real-world graphs are directed, the majority of today's GNN models discard this information altogether by simply making the graph undirected. The reasons for this are historical: 1) many early variants of spectral GNNs explicitly required undirected graphs, and 2) the first benchmarks on homophilic graphs did not find significant gain from using direction. In this paper, we show that in heterophilic settings, treating the graph as directed increases the effective homophily of the graph, suggesting a potential gain from the correct use of directionality information. To this end, we introduce Directed Graph Neural Network (Dir-GNN), a novel general framework for deep learning on directed graphs. Dir-GNN can be used to extend any Message Passing Neural Network (MPNN) to account for edge directionality information by performing separate aggregations of the incoming and outgoing edges. We prove that Dir-GNN matches the expressivity of the Directed Weisfeiler-Lehman test, exceeding that of conventional MPNNs. In extensive experiments, we validate that while our framework leaves performance unchanged on homophilic datasets, it leads to large gains over base models such as GCN, GAT and GraphSage on heterophilic benchmarks, outperforming much more complex methods and achieving new state-of-the-art results.
    Making Self-supervised Learning Robust to Spurious Correlation via Learning-speed Aware Sampling. (arXiv:2311.16361v1 [cs.LG])
    Self-supervised learning (SSL) has emerged as a powerful technique for learning rich representations from unlabeled data. The data representations are able to capture many underlying attributes of data, and be useful in downstream prediction tasks. In real-world settings, spurious correlations between some attributes (e.g. race, gender and age) and labels for downstream tasks often exist, e.g. cancer is usually more prevalent among elderly patients. In this paper, we investigate SSL in the presence of spurious correlations and show that the SSL training loss can be minimized by capturing only a subset of the conspicuous features relevant to those sensitive attributes, despite the presence of other important predictive features for the downstream tasks. To address this issue, we investigate the learning dynamics of SSL and observe that the learning is slower for samples that conflict with such correlations (e.g. elder patients without cancer). Motivated by these findings, we propose a learning-speed aware SSL (LA-SSL) approach, in which we sample each training data with a probability that is inversely related to its learning speed. We evaluate LA-SSL on three datasets that exhibit spurious correlations between different attributes, demonstrating that it improves the robustness of pretrained representations on downstream classification tasks.
    CHeart: A Conditional Spatio-Temporal Generative Model for Cardiac Anatomy. (arXiv:2301.13098v2 [eess.IV] UPDATED)
    Two key questions in cardiac image analysis are to assess the anatomy and motion of the heart from images; and to understand how they are associated with non-imaging clinical factors such as gender, age and diseases. While the first question can often be addressed by image segmentation and motion tracking algorithms, our capability to model and to answer the second question is still limited. In this work, we propose a novel conditional generative model to describe the 4D spatio-temporal anatomy of the heart and its interaction with non-imaging clinical factors. The clinical factors are integrated as the conditions of the generative modelling, which allows us to investigate how these factors influence the cardiac anatomy. We evaluate the model performance in mainly two tasks, anatomical sequence completion and sequence generation. The model achieves a high performance in anatomical sequence completion, comparable to or outperforming other state-of-the-art generative models. In terms of sequence generation, given clinical conditions, the model can generate realistic synthetic 4D sequential anatomies that share similar distributions with the real data.  ( 2 min )
    Addressing the Impact of Localized Training Data in Graph Neural Networks. (arXiv:2307.12689v2 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) have achieved notable success in learning from graph-structured data, owing to their ability to capture intricate dependencies and relationships between nodes. They excel in various applications, including semi-supervised node classification, link prediction, and graph generation. However, it is important to acknowledge that the majority of state-of-the-art GNN models are built upon the assumption of an in-distribution setting, which hinders their performance on real-world graphs with dynamic structures. In this article, we aim to assess the impact of training GNNs on localized subsets of the graph. Such restricted training data may lead to a model that performs well in the specific region it was trained on but fails to generalize and make accurate predictions for the entire graph. In the context of graph-based semi-supervised learning (SSL), resource constraints often lead to scenarios where the dataset is large, but only a portion of it can be labeled, affecting the model's performance. This limitation affects tasks like anomaly detection or spam detection when labeling processes are biased or influenced by human subjectivity. To tackle the challenges posed by localized training data, we approach the problem as an out-of-distribution (OOD) data issue by by aligning the distributions between the training data, which represents a small portion of labeled data, and the graph inference process that involves making predictions for the entire graph. We propose a regularization method to minimize distributional discrepancies between localized training data and graph inference, improving model performance on OOD data. Extensive tests on popular GNN models show significant performance improvement on three citation GNN benchmark datasets. The regularization approach effectively enhances model adaptation and generalization, overcoming challenges posed by OOD data.  ( 3 min )
    STLGRU: Spatio-Temporal Lightweight Graph GRU for Traffic Flow Prediction. (arXiv:2212.04548v2 [cs.LG] UPDATED)
    Reliable forecasting of traffic flow requires efficient modeling of traffic data. Different correlations and influences arise in a dynamic traffic network, making modeling a complicated task. Existing literature has proposed many different methods to capture the complex underlying spatial-temporal relations of traffic networks. However, methods still struggle to capture different local and global dependencies of long-range nature. Also, as more and more sophisticated methods are being proposed, models are increasingly becoming memory-heavy and, thus, unsuitable for low-powered devices. In this paper, we focus on solving these problems by proposing a novel deep learning framework - STLGRU. Specifically, our proposed STLGRU can effectively capture both local and global spatial-temporal relations of a traffic network using memory-augmented attention and gating mechanism. Instead of employing separate temporal and spatial components, we show that our memory module and gated unit can learn the spatial-temporal dependencies successfully, allowing for reduced memory usage with fewer parameters. We extensively experiment on several real-world traffic prediction datasets to show that our model performs better than existing methods while the memory footprint remains lower. Code is available at \url{https://github.com/Kishor-Bhaumik/STLGRU}.  ( 2 min )
    In-Context Learning Dynamics with Random Binary Sequences. (arXiv:2310.17639v2 [cs.AI] UPDATED)
    Large language models (LLMs) trained on huge corpora of text datasets demonstrate intriguing capabilities, achieving state-of-the-art performance on tasks they were not explicitly trained for. The precise nature of LLM capabilities is often mysterious, and different prompts can elicit different capabilities through in-context learning. We propose a framework that enables us to analyze in-context learning dynamics to understand latent concepts underlying LLMs' behavioral patterns. This provides a more nuanced understanding than success-or-failure evaluation benchmarks, but does not require observing internal activations as a mechanistic interpretation of circuits would. Inspired by the cognitive science of human randomness perception, we use random binary sequences as context and study dynamics of in-context learning by manipulating properties of context data, such as sequence length. In the latest GPT-3.5+ models, we find emergent abilities to generate seemingly random numbers and learn basic formal languages, with striking in-context learning dynamics where model outputs transition sharply from seemingly random behaviors to deterministic repetition.  ( 2 min )
    FedAL: Black-Box Federated Knowledge Distillation Enabled by Adversarial Learning. (arXiv:2311.16584v1 [cs.LG])
    Knowledge distillation (KD) can enable collaborative learning among distributed clients that have different model architectures and do not share their local data and model parameters with others. Each client updates its local model using the average model output/feature of all client models as the target, known as federated KD. However, existing federated KD methods often do not perform well when clients' local models are trained with heterogeneous local datasets. In this paper, we propose Federated knowledge distillation enabled by Adversarial Learning (FedAL) to address the data heterogeneity among clients. First, to alleviate the local model output divergence across clients caused by data heterogeneity, the server acts as a discriminator to guide clients' local model training to achieve consensus model outputs among clients through a min-max game between clients and the discriminator. Moreover, catastrophic forgetting may happen during the clients' local training and global knowledge transfer due to clients' heterogeneous local data. Towards this challenge, we design the less-forgetting regularization for both local training and global knowledge transfer to guarantee clients' ability to transfer/learn knowledge to/from others. Experimental results show that FedAL and its variants achieve higher accuracy than other federated KD baselines.
    Empowering COVID-19 Detection: Optimizing Performance Through Fine-Tuned EfficientNet Deep Learning Architecture. (arXiv:2311.16593v1 [eess.IV])
    The worldwide COVID-19 pandemic has profoundly influenced the health and everyday experiences of individuals across the planet. It is a highly contagious respiratory disease requiring early and accurate detection to curb its rapid transmission. Initial testing methods primarily revolved around identifying the genetic composition of the coronavirus, exhibiting a relatively low detection rate and requiring a time-intensive procedure. To address this challenge, experts have suggested using radiological imagery, particularly chest X-rays, as a valuable approach within the diagnostic protocol. This study investigates the potential of leveraging radiographic imaging (X-rays) with deep learning algorithms to swiftly and precisely identify COVID-19 patients. The proposed approach elevates the detection accuracy by fine-tuning with appropriate layers on various established transfer learning models. The experimentation was conducted on a COVID-19 X-ray dataset containing 2000 images. The accuracy rates achieved were impressive of 100% for EfficientNetB4 model. The fine-tuned EfficientNetB4 achieved an excellent accuracy score, showcasing its potential as a robust COVID-19 detection model. Furthermore, EfficientNetB4 excelled in identifying Lung disease using Chest X-ray dataset containing 4,350 Images, achieving remarkable performance with an accuracy of 99.17%, precision of 99.13%, recall of 99.16%, and f1-score of 99.14%. These results highlight the promise of fine-tuned transfer learning for efficient lung detection through medical imaging, especially with X-ray images. This research offers radiologists an effective means of aiding rapid and precise COVID-19 diagnosis and contributes valuable assistance for healthcare professionals in accurately identifying affected patients.
    T-Rep: Representation Learning for Time Series using Time-Embeddings. (arXiv:2310.04486v2 [cs.LG] UPDATED)
    Multivariate time series present challenges to standard machine learning techniques, as they are often unlabeled, high dimensional, noisy, and contain missing data. To address this, we propose T-Rep, a self-supervised method to learn time series representations at a timestep granularity. T-Rep learns vector embeddings of time alongside its feature extractor, to extract temporal features such as trend, periodicity, or distribution shifts from the signal. These time-embeddings are leveraged in pretext tasks, to incorporate smooth and fine-grained temporal dependencies in the representations, as well as reinforce robustness to missing data. We evaluate T-Rep on downstream classification, forecasting, and anomaly detection tasks. It is compared to existing self-supervised algorithms for time series, which it outperforms in all three tasks. We test T-Rep in missing data regimes, where it proves more resilient than its counterparts. Finally, we provide latent space visualisation experiments, highlighting the interpretability of the learned representations.  ( 2 min )
    B-LSTM-MIONet: Bayesian LSTM-based Neural Operators for Learning the Response of Complex Dynamical Systems to Length-Variant Multiple Input Functions. (arXiv:2311.16519v1 [cs.LG])
    Deep Operator Network (DeepONet) is a neural network framework for learning nonlinear operators such as those from ordinary differential equations (ODEs) describing complex systems. Multiple-input deep neural operators (MIONet) extended DeepONet to allow multiple input functions in different Banach spaces. MIONet offers flexibility in training dataset grid spacing, without constraints on output location. However, it requires offline inputs and cannot handle varying sequence lengths in testing datasets, limiting its real-time application in dynamic complex systems. This work redesigns MIONet, integrating Long Short Term Memory (LSTM) to learn neural operators from time-dependent data. This approach overcomes data discretization constraints and harnesses LSTM's capability with variable-length, real-time data. Factors affecting learning performance, like algorithm extrapolation ability are presented. The framework is enhanced with uncertainty quantification through a novel Bayesian method, sampling from MIONet parameter distributions. Consequently, we develop the B-LSTM-MIONet, incorporating LSTM's temporal strengths with Bayesian robustness, resulting in a more precise and reliable model for noisy datasets.
    Mate! Are You Really Aware? An Explainability-Guided Testing Framework for Robustness of Malware Detectors. (arXiv:2111.10085v4 [cs.CR] UPDATED)
    Numerous open-source and commercial malware detectors are available. However, their efficacy is threatened by new adversarial attacks, whereby malware attempts to evade detection, e.g., by performing feature-space manipulation. In this work, we propose an explainability-guided and model-agnostic testing framework for robustness of malware detectors when confronted with adversarial attacks. The framework introduces the concept of Accrued Malicious Magnitude (AMM) to identify which malware features could be manipulated to maximize the likelihood of evading detection. We then use this framework to test several state-of-the-art malware detectors' abilities to detect manipulated malware. We find that (i) commercial antivirus engines are vulnerable to AMM-guided test cases; (ii) the ability of a manipulated malware generated using one detector to evade detection by another detector (i.e., transferability) depends on the overlap of features with large AMM values between the different detectors; and (iii) AMM values effectively measure the fragility of features (i.e., capability of feature-space manipulation to flip the prediction results) and explain the robustness of malware detectors facing evasion attacks. Our findings shed light on the limitations of current malware detectors, as well as how they can be improved.
    Continual Instruction Tuning for Large Multimodal Models. (arXiv:2311.16206v1 [cs.LG])
    Instruction tuning is now a widely adopted approach to aligning large multimodal models (LMMs) to follow human intent. It unifies the data format of vision-language tasks, enabling multi-task joint training. However, vision-language tasks are constantly being created in practice. Instead of always re-training LMMs when new tasks arrive, continual learning offers flexibility for models to continually and efficiently exploit the evolving data. This work aims to explore the following two questions: 1) Do LMMs still suffer from catastrophic forgetting in continual instruction tuning? 2) Are the existing three classes of continual learning methods still applicable to the continual instruction tuning of LMMs? An extensive study is conducted to address the above questions. First, we establish the first benchmark in this setting and reveal that catastrophic forgetting is still observed when continually instruction-tuning LMMs. However, the multi-task joint instruction tuning can facilitate the model's continual learning ability and mitigate forgetting. Second, we integrate and adapt classic continual learning methods to our context, demonstrating the efficacy of data replay and model expansion strategies across diverse scenarios. In contrast, regularization-based methods only perform well on models that have been jointly instruction-tuned on multiple tasks. Third, we delve into the correlation and forgetting dynamics between vision-language task pairs and propose task-similarity-informed regularization and model expansion methods for continual instruction tuning of LMMs. Experimental results show that our approach consistently boosts the model's performance.
    DGR: Tackling Drifted and Correlated Noise in Quantum Error Correction via Decoding Graph Re-weighting. (arXiv:2311.16214v1 [quant-ph])
    Quantum hardware suffers from high error rates and noise, which makes directly running applications on them ineffective. Quantum Error Correction (QEC) is a critical technique towards fault tolerance which encodes the quantum information distributively in multiple data qubits and uses syndrome qubits to check parity. Minimum-Weight-Perfect-Matching (MWPM) is a popular QEC decoder that takes the syndromes as input and finds the matchings between syndromes that infer the errors. However, there are two paramount challenges for MWPM decoders. First, as noise in real quantum systems can drift over time, there is a potential misalignment with the decoding graph's initial weights, leading to a severe performance degradation in the logical error rates. Second, while the MWPM decoder addresses independent errors, it falls short when encountering correlated errors typical on real hardware, such as those in the 2Q depolarizing channel. We propose DGR, an efficient decoding graph edge re-weighting strategy with no quantum overhead. It leverages the insight that the statistics of matchings across decoding iterations offer rich information about errors on real quantum hardware. By counting the occurrences of edges and edge pairs in decoded matchings, we can statistically estimate the up-to-date probabilities of each edge and the correlations between them. The reweighting process includes two vital steps: alignment re-weighting and correlation re-weighting. The former updates the MWPM weights based on statistics to align with actual noise, and the latter adjusts the weight considering edge correlations. Extensive evaluations on surface code and honeycomb code under various settings show that DGR reduces the logical error rate by 3.6x on average-case noise mismatch with exceeding 5000x improvement under worst-case mismatch.
    Masked Autoencoders are Scalable Learners of Cellular Morphology. (arXiv:2309.16064v2 [cs.CV] UPDATED)
    Inferring biological relationships from cellular phenotypes in high-content microscopy screens provides significant opportunity and challenge in biological research. Prior results have shown that deep vision models can capture biological signal better than hand-crafted features. This work explores how self-supervised deep learning approaches scale when training larger models on larger microscopy datasets. Our results show that both CNN- and ViT-based masked autoencoders significantly outperform weakly supervised baselines. At the high-end of our scale, a ViT-L/8 trained on over 3.5-billion unique crops sampled from 93-million microscopy images achieves relative improvements as high as 28% over our best weakly supervised baseline at inferring known biological relationships curated from public databases. Relevant code and select models released with this work can be found at: https://github.com/recursionpharma/maes_microscopy.  ( 2 min )
    A Graph Neural Network-Based QUBO-Formulated Hamiltonian-Inspired Loss Function for Combinatorial Optimization using Reinforcement Learning. (arXiv:2311.16277v1 [cs.LG])
    Quadratic Unconstrained Binary Optimization (QUBO) is a generic technique to model various NP-hard Combinatorial Optimization problems (CO) in the form of binary variables. Ising Hamiltonian is used to model the energy function of a system. QUBO to Ising Hamiltonian is regarded as a technique to solve various canonical optimization problems through quantum optimization algorithms. Recently, PI-GNN, a generic framework, has been proposed to address CO problems over graphs based on Graph Neural Network (GNN) architecture. They introduced a generic QUBO-formulated Hamiltonian-inspired loss function that was directly optimized using GNN. PI-GNN is highly scalable but there lies a noticeable decrease in the number of satisfied constraints when compared to problem-specific algorithms and becomes more pronounced with increased graph densities. Here, We identify a behavioral pattern related to it and devise strategies to improve its performance. Another group of literature uses Reinforcement learning (RL) to solve the aforementioned NP-hard problems using problem-specific reward functions. In this work, we also focus on creating a bridge between the RL-based solutions and the QUBO-formulated Hamiltonian. We formulate and empirically evaluate the compatibility of the QUBO-formulated Hamiltonian as the generic reward function in the RL-based paradigm in the form of rewards. Furthermore, we also introduce a novel Monty Carlo Tree Search-based strategy with GNN where we apply a guided search through manual perturbation of node labels during training. We empirically evaluated our methods and observed up to 44% improvement in the number of constraint violations compared to the PI-GNN.
    Gaussian Processes for Monitoring Air-Quality in Kampala. (arXiv:2311.16625v1 [cs.LG])
    Monitoring air pollution is of vital importance to the overall health of the population. Unfortunately, devices that can measure air quality can be expensive, and many cities in low and middle-income countries have to rely on a sparse allocation of them. In this paper, we investigate the use of Gaussian Processes for both nowcasting the current air-pollution in places where there are no sensors and forecasting the air-pollution in the future at the sensor locations. In particular, we focus on the city of Kampala in Uganda, using data from AirQo's network of sensors. We demonstrate the advantage of removing outliers, compare different kernel functions and additional inputs. We also compare two sparse approximations to allow for the large amounts of temporal data in the dataset.
    Towards Responsible Governance of Biological Design Tools. (arXiv:2311.15936v2 [cs.CY] UPDATED)
    Recent advancements in generative machine learning have enabled rapid progress in biological design tools (BDTs) such as protein structure and sequence prediction models. The unprecedented predictive accuracy and novel design capabilities of BDTs present new and significant dual-use risks. For example, their predictive accuracy allows biological agents, whether vaccines or pathogens, to be developed more quickly, while the design capabilities could be used to discover drugs or evade DNA screening techniques. Similar to other dual-use AI systems, BDTs present a wicked problem: how can regulators uphold public safety without stifling innovation? We highlight how current regulatory proposals that are primarily tailored toward large language models may be less effective for BDTs, which require fewer computational resources to train and are often developed in an open-source manner. We propose a range of measures to mitigate the risk that BDTs are misused, across the areas of responsible development, risk assessment, transparency, access management, cybersecurity, and investing in resilience. Implementing such measures will require close coordination between developers and governments.  ( 3 min )
    Towards Attributions of Input Variables in a Coalition. (arXiv:2309.13411v2 [cs.LG] UPDATED)
    This paper aims to develop a new attribution method to explain the conflict between individual variables' attributions and their coalition's attribution from a fully new perspective. First, we find that the Shapley value can be reformulated as the allocation of Harsanyi interactions encoded by the AI model. Second, based the re-alloction of interactions, we extend the Shapley value to the attribution of coalitions. Third we ective. We derive the fundamental mechanism behind the conflict. This conflict come from the interaction containing partial variables in their coalition.
    Extending CAM-based XAI methods for Remote Sensing Imagery Segmentation. (arXiv:2310.01837v2 [cs.CV] UPDATED)
    Current AI-based methods do not provide comprehensible physical interpretations of the utilized data, extracted features, and predictions/inference operations. As a result, deep learning models trained using high-resolution satellite imagery lack transparency and explainability and can be merely seen as a black box, which limits their wide-level adoption. Experts need help understanding the complex behavior of AI models and the underlying decision-making process. The explainable artificial intelligence (XAI) field is an emerging field providing means for robust, practical, and trustworthy deployment of AI models. Several XAI techniques have been proposed for image classification tasks, whereas the interpretation of image segmentation remains largely unexplored. This paper offers to bridge this gap by adapting the recent XAI classification algorithms and making them usable for muti-class image segmentation, where we mainly focus on buildings' segmentation from high-resolution satellite images. To benchmark and compare the performance of the proposed approaches, we introduce a new XAI evaluation methodology and metric based on "Entropy" to measure the model uncertainty. Conventional XAI evaluation methods rely mainly on feeding area-of-interest regions from the image back to the pre-trained (utility) model and then calculating the average change in the probability of the target class. Those evaluation metrics lack the needed robustness, and we show that using Entropy to monitor the model uncertainty in segmenting the pixels within the target class is more suitable. We hope this work will pave the way for additional XAI research for image segmentation and applications in the remote sensing discipline.
    Communication Efficiency Optimization of Federated Learning for Computing and Network Convergence of 6G Networks. (arXiv:2311.16540v1 [cs.LG])
    Federated learning effectively addresses issues such as data privacy by collaborating across participating devices to train global models. However, factors such as network topology and device computing power can affect its training or communication process in complex network environments. A new network architecture and paradigm with computing-measurable, perceptible, distributable, dispatchable, and manageable capabilities, computing and network convergence (CNC) of 6G networks can effectively support federated learning training and improve its communication efficiency. By guiding the participating devices' training in federated learning based on business requirements, resource load, network conditions, and arithmetic power of devices, CNC can reach this goal. In this paper, to improve the communication efficiency of federated learning in complex networks, we study the communication efficiency optimization of federated learning for computing and network convergence of 6G networks, methods that gives decisions on its training process for different network conditions and arithmetic power of participating devices in federated learning. The experiments address two architectures that exist for devices in federated learning and arrange devices to participate in training based on arithmetic power while achieving optimization of communication efficiency in the process of transferring model parameters. The results show that the method we proposed can (1) cope well with complex network situations (2) effectively balance the delay distribution of participating devices for local training (3) improve the communication efficiency during the transfer of model parameters (4) improve the resource utilization in the network.
    Pseudo-Likelihood Inference. (arXiv:2311.16656v1 [cs.LG])
    Simulation-Based Inference (SBI) is a common name for an emerging family of approaches that infer the model parameters when the likelihood is intractable. Existing SBI methods either approximate the likelihood, such as Approximate Bayesian Computation (ABC) or directly model the posterior, such as Sequential Neural Posterior Estimation (SNPE). While ABC is efficient on low-dimensional problems, on higher-dimensional tasks, it is generally outperformed by SNPE, which leverages function approximation. In this paper, we propose Pseudo-Likelihood Inference (PLI), a new method that brings neural approximation into ABC, making it competitive on challenging Bayesian system identification tasks. By utilizing integral probability metrics, we introduce a smooth likelihood kernel with an adaptive bandwidth that is updated based on information-theoretic trust regions. Thanks to this formulation, our method (i) allows for optimizing neural posteriors via gradient descent, (ii) does not rely on summary statistics, and (iii) enables multiple observations as input. In comparison to SNPE, it leads to improved performance when more data is available. The effectiveness of PLI is evaluated on four classical SBI benchmark tasks and on a highly dynamic physical system, showing particular advantages on stochastic simulations and multi-modal posterior landscapes.
    Near-Optimal Pure Exploration in Matrix Games: A Generalization of Stochastic Bandits & Dueling Bandits. (arXiv:2310.16252v2 [cs.LG] UPDATED)
    We study the sample complexity of identifying the pure strategy Nash equilibrium (PSNE) in a two-player zero-sum matrix game with noise. Formally, we are given a stochastic model where any learner can sample an entry $(i,j)$ of the input matrix $A\in[-1,1]^{n\times m}$ and observe $A_{i,j}+\eta$ where $\eta$ is a zero-mean 1-sub-Gaussian noise. The aim of the learner is to identify the PSNE of $A$, whenever it exists, with high probability while taking as few samples as possible. Zhou et al. (2017) presents an instance-dependent sample complexity lower bound that depends only on the entries in the row and column in which the PSNE lies. We design a near-optimal algorithm whose sample complexity matches the lower bound, up to log factors. The problem of identifying the PSNE also generalizes the problem of pure exploration in stochastic multi-armed bandits and dueling bandits, and our result matches the optimal bounds, up to log factors, in both the settings.  ( 2 min )
    A Multivariate Unimodality Test Harnenssing the Dip Statistic of Mahalanobis Distances Over Random Projections. (arXiv:2311.16614v1 [stat.ME])
    Unimodality, pivotal in statistical analysis, offers insights into dataset structures and drives sophisticated analytical procedures. While unimodality's confirmation is straightforward for one-dimensional data using methods like Silverman's approach and Hartigans' dip statistic, its generalization to higher dimensions remains challenging. By extrapolating one-dimensional unimodality principles to multi-dimensional spaces through linear random projections and leveraging point-to-point distancing, our method, rooted in $\alpha$-unimodality assumptions, presents a novel multivariate unimodality test named mud-pod. Both theoretical and empirical studies confirm the efficacy of our method in unimodality assessment of multidimensional datasets as well as in estimating the number of clusters.
    Modelling wildland fire burn severity in California using a spatial Super Learner approach. (arXiv:2311.16187v1 [cs.LG])
    Given the increasing prevalence of wildland fires in the Western US, there is a critical need to develop tools to understand and accurately predict burn severity. We develop a machine learning model to predict post-fire burn severity using pre-fire remotely sensed data. Hydrological, ecological, and topographical variables collected from four regions of California - the sites of the Kincade fire (2019), the CZU Lightning Complex fire (2020), the Windy fire (2021), and the KNP Fire (2021) - are used as predictors of the difference normalized burn ratio. We hypothesize that a Super Learner (SL) algorithm that accounts for spatial autocorrelation using Vecchia's Gaussian approximation will accurately model burn severity. In all combinations of test and training sets explored, the results of our model showed the SL algorithm outperformed standard Linear Regression methods. After fitting and verifying the performance of the SL model, we use interpretable machine learning tools to determine the main drivers of severe burn damage, including greenness, elevation and fire weather variables. These findings provide actionable insights that enable communities to strategize interventions, such as early fire detection systems, pre-fire season vegetation clearing activities, and resource allocation during emergency responses. When implemented, this model has the potential to minimize the loss of human life, property, resources, and ecosystems in California.
    Interpreting Reward Models in RLHF-Tuned Language Models Using Sparse Autoencoders. (arXiv:2310.08164v2 [cs.LG] UPDATED)
    Large language models (LLMs) aligned to human preferences via reinforcement learning from human feedback (RLHF) underpin many commercial applications of LLM technology. Despite this, the impacts of RLHF on LLM internals remain opaque. We propose a novel method for interpreting implicit reward models (IRMs) in LLMs learned through RLHF. Our approach trains pairs of autoencoders on activations from a base LLM and its RLHF-tuned variant. Through a comparison of autoencoder hidden spaces, we identify features that reflect the accuracy of the learned IRM. To illustrate our method, we fine-tune an LLM via RLHF to learn a token-utility mapping and maximize the aggregate utility of generated text. This is the first application of sparse autoencoders to interpreting IRMs. Our method provides an abstract approximation of reward integrity and holds promise for measuring alignment between specified objectives and learned model behaviors.  ( 2 min )
    Improving Lane Detection Generalization: A Novel Framework using HD Maps for Boosting Diversity. (arXiv:2311.16589v1 [cs.CV])
    Lane detection is a vital task for vehicles to navigate and localize their position on the road. To ensure reliable results, lane detection algorithms must have robust generalization performance in various road environments. However, despite the significant performance improvement of deep learning-based lane detection algorithms, their generalization performance in response to changes in road environments still falls short of expectations. In this paper, we present a novel framework for single-source domain generalization (SSDG) in lane detection. By decomposing data into lane structures and surroundings, we enhance diversity using High-Definition (HD) maps and generative models. Rather than expanding data volume, we strategically select a core subset of data, maximizing diversity and optimizing performance. Our extensive experiments demonstrate that our framework enhances the generalization performance of lane detection, comparable to the domain adaptation-based method.
    Inexpensive High Fidelity Melt Pool Models in Additive Manufacturing Using Generative Deep Diffusion. (arXiv:2311.16168v1 [cs.LG])
    Defects in laser powder bed fusion (L-PBF) parts often result from the meso-scale dynamics of the molten alloy near the laser, known as the melt pool. For instance, the melt pool can directly contribute to the formation of undesirable porosity, residual stress, and surface roughness in the final part. Experimental in-situ monitoring of the three-dimensional melt pool physical fields is challenging, due to the short length and time scales involved in the process. Multi-physics simulation methods can describe the three-dimensional dynamics of the melt pool, but are computationally expensive at the mesh refinement required for accurate predictions of complex effects, such as the formation of keyhole porosity. Therefore, in this work, we develop a generative deep learning model based on the probabilistic diffusion framework to map low-fidelity, coarse-grained simulation information to the high-fidelity counterpart. By doing so, we bypass the computational expense of conducting multiple high-fidelity simulations for analysis by instead upscaling lightweight coarse mesh simulations. Specifically, we implement a 2-D diffusion model to spatially upscale cross-sections of the coarsely simulated melt pool to their high-fidelity equivalent. We demonstrate the preservation of key metrics of the melting process between the ground truth simulation data and the diffusion model output, such as the temperature field, the melt pool dimensions and the variability of the keyhole vapor cavity. Specifically, we predict the melt pool depth within 3 $\mu m$ based on low-fidelity input data 4$\times$ coarser than the high-fidelity simulations, reducing analysis time by two orders of magnitude.
    Small and Dim Target Detection in IR Imagery: A Review. (arXiv:2311.16346v1 [cs.CV])
    While there has been significant progress in object detection using conventional image processing and machine learning algorithms, exploring small and dim target detection in the IR domain is a relatively new area of study. The majority of small and dim target detection methods are derived from conventional object detection algorithms, albeit with some alterations. The task of detecting small and dim targets in IR imagery is complex. This is because these targets often need distinct features, the background is cluttered with unclear details, and the IR signatures of the scene can change over time due to fluctuations in thermodynamics. The primary objective of this review is to highlight the progress made in this field. This is the first review in the field of small and dim target detection in infrared imagery, encompassing various methodologies ranging from conventional image processing to cutting-edge deep learning-based approaches. The authors have also introduced a taxonomy of such approaches. There are two main types of approaches: methodologies using several frames for detection, and single-frame-based detection techniques. Single frame-based detection techniques encompass a diverse range of methods, spanning from traditional image processing-based approaches to more advanced deep learning methodologies. Our findings indicate that deep learning approaches perform better than traditional image processing-based approaches. In addition, a comprehensive compilation of various available datasets has also been provided. Furthermore, this review identifies the gaps and limitations in existing techniques, paving the way for future research and development in this area.
    Evaluation of dynamic characteristics of power grid based on GNN and application on knowledge graph. (arXiv:2311.16522v1 [cs.LG])
    A novel method for detecting faults in power grids using a graph neural network (GNN) has been developed, aimed at enhancing intelligent fault diagnosis in network operation and maintenance. This GNN-based approach identifies faulty nodes within the power grid through a specialized electrical feature extraction model coupled with a knowledge graph. Incorporating temporal data, the method leverages the status of nodes from preceding and subsequent time periods to aid in current fault detection. To validate the effectiveness of this GNN in extracting node features, a correlation analysis of the output features from each node within the neural network layer was conducted. The results from experiments show that this method can accurately locate fault nodes in simulated scenarios with a remarkable 99.53% accuracy. Additionally, the graph neural network's feature modeling allows for a qualitative examination of how faults spread across nodes, providing valuable insights for analyzing fault nodes.
    Anchor Data Augmentation. (arXiv:2311.06965v2 [cs.LG] UPDATED)
    We propose a novel algorithm for data augmentation in nonlinear over-parametrized regression. Our data augmentation algorithm borrows from the literature on causality and extends the recently proposed Anchor regression (AR) method for data augmentation, which is in contrast to the current state-of-the-art domain-agnostic solutions that rely on the Mixup literature. Our Anchor Data Augmentation (ADA) uses several replicas of the modified samples in AR to provide more training examples, leading to more robust regression predictions. We apply ADA to linear and nonlinear regression problems using neural networks. ADA is competitive with state-of-the-art C-Mixup solutions.  ( 2 min )
    Decentralized Online Federated G-Network Learning for Lightweight Intrusion Detection. (arXiv:2306.13029v2 [cs.CR] UPDATED)
    Cyberattacks are increasingly threatening networked systems, often with the emergence of new types of unknown (zero-day) attacks and the rise of vulnerable devices. Such attacks can also target multiple components of a Supply Chain, which can be protected via Machine Learning (ML)-based Intrusion Detection Systems (IDSs). However, the need to learn large amounts of labelled data often limits the applicability of ML-based IDSs to cybersystems that only have access to private local data, while distributed systems such as Supply Chains have multiple components, each of which must preserve its private data while being targeted by the same attack To address this issue, this paper proposes a novel Decentralized and Online Federated Learning Intrusion Detection (DOF-ID) architecture based on the G-Network model with collaborative learning, that allows each IDS used by a specific component to learn from the experience gained in other components, in addition to its own local data, without violating the data privacy of other components. The performance evaluation results using public Kitsune and Bot-IoT datasets show that DOF-ID significantly improves the intrusion detection performance in all of the collaborating components, with acceptable computation time for online learning.
    Geometric instability of graph neural networks on large graphs. (arXiv:2308.10099v2 [cs.LG] UPDATED)
    We analyse the geometric instability of embeddings produced by graph neural networks (GNNs). Existing methods are only applicable for small graphs and lack context in the graph domain. We propose a simple, efficient and graph-native Graph Gram Index (GGI) to measure such instability which is invariant to permutation, orthogonal transformation, translation and order of evaluation. This allows us to study the varying instability behaviour of GNN embeddings on large graphs for both node classification and link prediction.
    Likelihood-based Sensor Calibration using Affine Transformation. (arXiv:2309.11526v2 [cs.LG] UPDATED)
    An important task in the field of sensor technology is the efficient implementation of adaptation procedures of measurements from one sensor to another sensor of identical design. One idea is to use the estimation of an affine transformation between different systems, which can be improved by the knowledge of experts. This paper presents an improved solution from Glacier Research that was published back in 1973. The results demonstrate the adaptability of this solution for various applications, including software calibration of sensors, implementation of expert-based adaptation, and paving the way for future advancements such as distributed learning methods. One idea here is to use the knowledge of experts for estimating an affine transformation between different systems. We evaluate our research with simulations and also with real measured data of a multi-sensor board with 8 identical sensors. Both data set and evaluation script are provided for download. The results show an improvement for both the simulation and the experiments with real data.
    Almost Equivariance via Lie Algebra Convolutions. (arXiv:2310.13164v2 [cs.LG] UPDATED)
    Recently, the equivariance of models with respect to a group action has become an important topic of research in machine learning. However, imbuing an architecture with a specific group equivariance imposes a strong prior on the types of data transformations that the model expects to see. While strictly-equivariant models enforce symmetries, real-world data does not always conform to such strict equivariances, be it due to noise in the data or underlying physical laws that encode only approximate or partial symmetries. In such cases, the prior of strict equivariance can actually prove too strong and cause models to underperform on real-world data. Therefore, in this work we study a closely related topic, that of almost equivariance. We provide a definition of almost equivariance that differs from those extant in the current literature and give a practical method for encoding almost equivariance in models by appealing to the Lie algebra of a Lie group. Specifically, we define Lie algebra convolutions and demonstrate that they offer several benefits over Lie group convolutions, including being well-defined for non-compact groups. From there, we pivot to the realm of theory and demonstrate connections between the notions of equivariance and isometry and those of almost equivariance and almost isometry, respectively. We prove two existence theorems, one showing the existence of almost isometries within bounded distance of isometries of a general manifold, and another showing the converse for Hilbert spaces. We then extend these theorems to prove the existence of almost equivariant manifold embeddings within bounded distance of fully equivariant embedding functions, subject to certain constraints on the group action and the function class. Finally, we demonstrate the validity of our approach by benchmarking against datasets in fully equivariant and almost equivariant settings.  ( 3 min )
    Learning Multimodal Latent Dynamics for Human-Robot Interaction. (arXiv:2311.16380v1 [cs.RO])
    This article presents a method for learning well-coordinated Human-Robot Interaction (HRI) from Human-Human Interactions (HHI). We devise a hybrid approach using Hidden Markov Models (HMMs) as the latent space priors for a Variational Autoencoder to model a joint distribution over the interacting agents. We leverage the interaction dynamics learned from HHI to learn HRI and incorporate the conditional generation of robot motions from human observations into the training, thereby predicting more accurate robot trajectories. The generated robot motions are further adapted with Inverse Kinematics to ensure the desired physical proximity with a human, combining the ease of joint space learning and accurate task space reachability. For contact-rich interactions, we modulate the robot's stiffness using HMM segmentation for a compliant interaction. We verify the effectiveness of our approach deployed on a Humanoid robot via a user study. Our method generalizes well to various humans despite being trained on data from just two humans. We find that Users perceive our method as more human-like, timely, and accurate and rank our method with a higher degree of preference over other baselines.
    FairShap: A Data Re-weighting Approach for Algorithmic Fairness based on Shapley Values. (arXiv:2303.01928v3 [cs.LG] UPDATED)
    Algorithmic fairness is of utmost societal importance, yet the current trend in large-scale machine learning models requires training with massive datasets that are frequently biased. In this context, pre-processing methods that focus on modeling and correcting bias in the data emerge as valuable approaches. In this paper, we propose FairShap, a novel instance-level data re-weighting method for fair algorithmic decision-making through data valuation by means of Shapley Values. FairShap is model-agnostic and easily interpretable, as it measures the contribution of each training data point to a predefined fairness metric. We empirically validate FairShap on several state-of-the-art datasets of different nature, with a variety of training scenarios and models and show how it yields fairer models with similar levels of accuracy than the baselines. We illustrate FairShap's interpretability by means of histograms and latent space visualizations. Moreover, we perform a utility-fairness study, and ablation and runtime experiments to illustrate the impact of the size of the reference dataset and FairShap's computational cost depending on the size of the dataset and the number of features. We believe that FairShap represents a promising direction in interpretable and model-agnostic approaches to algorithmic fairness that yield competitive accuracy even when only biased datasets are available.
    Kernelized Reinforcement Learning with Order Optimal Regret Bounds. (arXiv:2306.07745v2 [cs.LG] UPDATED)
    Reinforcement learning (RL) has shown empirical success in various real world settings with complex models and large state-action spaces. The existing analytical results, however, typically focus on settings with a small number of state-actions or simple models such as linearly modeled state-action value functions. To derive RL policies that efficiently handle large state-action spaces with more general value functions, some recent works have considered nonlinear function approximation using kernel ridge regression. We propose $\pi$-KRVI, an optimistic modification of least-squares value iteration, when the state-action value function is represented by a reproducing kernel Hilbert space (RKHS). We prove the first order-optimal regret guarantees under a general setting. Our results show a significant polynomial in the number of episodes improvement over the state of the art. In particular, with highly non-smooth kernels (such as Neural Tangent kernel or some Mat\'ern kernels) the existing results lead to trivial (superlinear in the number of episodes) regret bounds. We show a sublinear regret bound that is order optimal in the case of Mat\'ern kernels where a lower bound on regret is known.
    Trainable Loss Weights in Super-Resolution. (arXiv:2301.10575v2 [cs.CV] UPDATED)
    In recent years, limited research has discussed the loss function in the super-resolution process. The majority of those studies have only used perceptual similarity conventionally. This is while the development of appropriate loss can improve the quality of other methods as well. In this article, a new weighting method for pixel-wise loss is proposed. With the help of this method, it is possible to use trainable weights based on the general structure of the image and its perceptual features while maintaining the advantages of pixel-wise loss. Also, a criterion for comparing weights of loss is introduced so that the weights can be estimated directly by a convolutional neural network. In addition, in this article, the expectation-maximization method is used for the simultaneous estimation super-resolution network and weighting network. In addition, a new activation function, called "FixedSum", is introduced which can keep the sum of all components of vector constants while keeping the output components between zero and one. As experimental results shows, weighted loss by the proposed method leads to better results than the unweighted loss and weighted loss based on uncertainty in both signal-to-noise and perceptual similarity senses on the state-of-the-art networks. Code is available online.
    Sinkhorn Flow: A Continuous-Time Framework for Understanding and Generalizing the Sinkhorn Algorithm. (arXiv:2311.16706v1 [cs.LG])
    Many problems in machine learning can be formulated as solving entropy-regularized optimal transport on the space of probability measures. The canonical approach involves the Sinkhorn iterates, renowned for their rich mathematical properties. Recently, the Sinkhorn algorithm has been recast within the mirror descent framework, thus benefiting from classical optimization theory insights. Here, we build upon this result by introducing a continuous-time analogue of the Sinkhorn algorithm. This perspective allows us to derive novel variants of Sinkhorn schemes that are robust to noise and bias. Moreover, our continuous-time dynamics not only generalize but also offer a unified perspective on several recently discovered dynamics in machine learning and mathematics, such as the "Wasserstein mirror flow" of (Deb et al. 2023) or the "mean-field Schr\"odinger equation" of (Claisse et al. 2023).  ( 2 min )
    Marsellus: A Heterogeneous RISC-V AI-IoT End-Node SoC with 2-to-8b DNN Acceleration and 30%-Boost Adaptive Body Biasing. (arXiv:2305.08415v3 [cs.AR] UPDATED)
    Emerging Artificial Intelligence-enabled Internet-of-Things (AI-IoT) System-on-a-Chip (SoC) for augmented reality, personalized healthcare, and nano-robotics need to run many diverse tasks within a power envelope of a few tens of mW over a wide range of operating conditions: compute-intensive but strongly quantized Deep Neural Network (DNN) inference, as well as signal processing and control requiring high-precision floating-point. We present Marsellus, an all-digital heterogeneous SoC for AI-IoT end-nodes fabricated in GlobalFoundries 22nm FDX that combines 1) a general-purpose cluster of 16 RISC-V Digital Signal Processing (DSP) cores attuned for the execution of a diverse range of workloads exploiting 4-bit and 2-bit arithmetic extensions (XpulpNN), combined with fused MAC&LOAD operations and floating-point support; 2) a 2-8bit Reconfigurable Binary Engine (RBE) to accelerate 3x3 and 1x1 (pointwise) convolutions in DNNs; 3) a set of On-Chip Monitoring (OCM) blocks connected to an Adaptive Body Biasing (ABB) generator and a hardware control loop, enabling on-the-fly adaptation of transistor threshold voltages. Marsellus achieves up to 180 Gop/s or 3.32 Top/s/W on 2-bit precision arithmetic in software, and up to 637 Gop/s or 12.4 Top/s/W on hardware-accelerated DNN layers.
    FIXED: Frustratingly Easy Domain Generalization with Mixup. (arXiv:2211.05228v2 [cs.CV] UPDATED)
    Domain generalization (DG) aims to learn a generalizable model from multiple training domains such that it can perform well on unseen target domains. A popular strategy is to augment training data to benefit generalization through methods such as Mixup~\cite{zhang2018mixup}. While the vanilla Mixup can be directly applied, theoretical and empirical investigations uncover several shortcomings that limit its performance. Firstly, Mixup cannot effectively identify the domain and class information that can be used for learning invariant representations. Secondly, Mixup may introduce synthetic noisy data points via random interpolation, which lowers its discrimination capability. Based on the analysis, we propose a simple yet effective enhancement for Mixup-based DG, namely domain-invariant Feature mIXup (FIX). It learns domain-invariant representations for Mixup. To further enhance discrimination, we leverage existing techniques to enlarge margins among classes to further propose the domain-invariant Feature MIXup with Enhanced Discrimination (FIXED) approach. We present theoretical insights about guarantees on its effectiveness. Extensive experiments on seven public datasets across two modalities including image classification (Digits-DG, PACS, Office-Home) and time series (DSADS, PAMAP2, UCI-HAR, and USC-HAD) demonstrate that our approach significantly outperforms nine state-of-the-art related methods, beating the best performing baseline by 6.5\% on average in terms of test accuracy. Code is available at: https://github.com/jindongwang/transferlearning/tree/master/code/deep/fixed.
    Generative Social Choice. (arXiv:2309.01291v2 [cs.GT] UPDATED)
    Traditionally, social choice theory has only been applicable to choices among a few predetermined alternatives but not to more complex decisions such as collectively selecting a textual statement. We introduce generative social choice, a framework that combines the mathematical rigor of social choice theory with the capability of large language models to generate text and extrapolate preferences. This framework divides the design of AI-augmented democratic processes into two components: first, proving that the process satisfies rigorous representation guarantees when given access to oracle queries; second, empirically validating that these queries can be approximately implemented using a large language model. We apply this framework to the problem of generating a slate of statements that is representative of opinions expressed as free-form text; specifically, we develop a democratic process with representation guarantees and use this process to represent the opinions of participants in a survey about chatbot personalization. We find that 93 out of 100 participants feel "mostly" or "perfectly" represented by the slate of five statements we extracted.
    Policy Learning with Asymmetric Counterfactual Utilities. (arXiv:2206.10479v3 [stat.ML] UPDATED)
    Data-driven decision making plays an important role even in high stakes settings like medicine and public policy. Learning optimal policies from observed data requires a careful formulation of the utility function whose expected value is maximized across a population. Although researchers typically use utilities that depend on observed outcomes alone, in many settings the decision maker's utility function is more properly characterized by the joint set of potential outcomes under all actions. For example, the Hippocratic principle to "do no harm" implies that the cost of causing death to a patient who would otherwise survive without treatment is greater than the cost of forgoing life-saving treatment. We consider optimal policy learning with asymmetric counterfactual utility functions of this form that consider the joint set of potential outcomes. We show that asymmetric counterfactual utilities lead to an unidentifiable expected utility function, and so we first partially identify it. Drawing on statistical decision theory, we then derive minimax decision rules by minimizing the maximum expected utility loss relative to different alternative policies. We show that one can learn minimax loss decision rules from observed data by solving intermediate classification problems, and establish that the finite sample excess expected utility loss of this procedure is bounded by the regret of these intermediate classifiers. We apply this conceptual framework and methodology to the decision about whether or not to use right heart catheterization for patients with possible pulmonary hypertension.
    Rescuing referral failures during automated diagnosis of domain-shifted medical images. (arXiv:2311.16766v1 [cs.CV])
    The success of deep learning models deployed in the real world depends critically on their ability to generalize well across diverse data domains. Here, we address a fundamental challenge with selective classification during automated diagnosis with domain-shifted medical images. In this scenario, models must learn to avoid making predictions when label confidence is low, especially when tested with samples far removed from the training set (covariate shift). Such uncertain cases are typically referred to the clinician for further analysis and evaluation. Yet, we show that even state-of-the-art domain generalization approaches fail severely during referral when tested on medical images acquired from a different demographic or using a different technology. We examine two benchmark diagnostic medical imaging datasets exhibiting strong covariate shifts: i) diabetic retinopathy prediction with retinal fundus images and ii) multilabel disease prediction with chest X-ray images. We show that predictive uncertainty estimates do not generalize well under covariate shifts leading to non-monotonic referral curves, and severe drops in performance (up to 50%) at high referral rates (>70%). We evaluate novel combinations of robust generalization and post hoc referral approaches, that rescue these failures and achieve significant performance improvements, typically >10%, over baseline methods. Our study identifies a critical challenge with referral in domain-shifted medical images and finds key applications in reliable, automated disease diagnosis.  ( 2 min )
    Learning to design protein-protein interactions with enhanced generalization. (arXiv:2310.18515v2 [cs.LG] UPDATED)
    Discovering mutations enhancing protein-protein interactions (PPIs) is critical for advancing biomedical research and developing improved therapeutics. While machine learning approaches have substantially advanced the field, they often struggle to generalize beyond training data in practical scenarios. The contributions of this work are three-fold. First, we construct PPIRef, the largest and non-redundant dataset of 3D protein-protein interactions, enabling effective large-scale learning. Second, we leverage the PPIRef dataset to pre-train PPIformer, a new SE(3)-equivariant model generalizing across diverse protein-binder variants. We fine-tune PPIformer to predict effects of mutations on protein-protein interactions via a thermodynamically motivated adjustment of the pre-training loss function. Finally, we demonstrate the enhanced generalization of our new PPIformer approach by outperforming other state-of-the-art methods on new, non-leaking splits of standard labeled PPI mutational data and independent case studies optimizing a human antibody against SARS-CoV-2 and increasing the thrombolytic activity of staphylokinase.  ( 2 min )
    Robust Ocean Subgrid-Scale Parameterizations Using Fourier Neural Operators. (arXiv:2310.02691v2 [cs.LG] UPDATED)
    In climate simulations, small-scale processes shape ocean dynamics but remain computationally expensive to resolve directly. For this reason, their contributions are commonly approximated using empirical parameterizations, which lead to significant errors in long-term projections. In this work, we develop parameterizations based on Fourier Neural Operators, showcasing their accuracy and generalizability in comparison to other approaches. Finally, we discuss the potential and limitations of neural networks operating in the frequency domain, paving the way for future investigation.  ( 2 min )
    ID-like Prompt Learning for Few-Shot Out-of-Distribution Detection. (arXiv:2311.15243v2 [cs.CV] UPDATED)
    Out-of-distribution (OOD) detection methods often exploit auxiliary outliers to train model identifying OOD samples, especially discovering challenging outliers from auxiliary outliers dataset to improve OOD detection. However, they may still face limitations in effectively distinguishing between the most challenging OOD samples that are much like in-distribution (ID) data, i.e., ID-like samples. To this end, we propose a novel OOD detection framework that discovers ID-like outliers using CLIP from the vicinity space of the ID samples, thus helping to identify these most challenging OOD samples. Then a prompt learning framework is proposed that utilizes the identified ID-like outliers to further leverage the capabilities of CLIP for OOD detection. Benefiting from the powerful CLIP, we only need a small number of ID samples to learn the prompts of the model without exposing other auxiliary outlier datasets. By focusing on the most challenging ID-like OOD samples and elegantly exploiting the capabilities of CLIP, our method achieves superior few-shot learning performance on various real-world image datasets (e.g., in 4-shot OOD detection on the ImageNet-1k dataset, our method reduces the average FPR95 by 12.16% and improves the average AUROC by 2.76%, compared to state-of-the-art methods).  ( 2 min )
    COVID-19 detection using ViT transformer-based approach from Computed Tomography Images. (arXiv:2310.08165v2 [eess.IV] UPDATED)
    In here, we introduce a novel approach to enhance the accuracy and efficiency of COVID-19 diagnosis using CT images. Leveraging state-of-the-art Transformer models in computer vision, we employed the base ViT Transformer configured for 224x224-sized input images, modifying the output to suit the binary classification task. Notably, input images were resized from the standard CT scan size of 512x512 to match the model's expectations. Our method implements a systematic patient-level prediction strategy, classifying individual CT slices as COVID-19 or non-COVID. To determine the overall diagnosis for each patient, a majority voting approach as well as other thresholding approaches were employed. This method involves evaluating all CT slices for a given patient and assigning the patient the diagnosis that relates to the thresholding for the CT scan. This meticulous patient-level prediction process contributes to the robustness of our solution as it starts from 2D-slices to 3D-patient level. Throughout the evaluation process, our approach resulted in 0.7 macro F1 score on the COV19-CT -DB validation set. To ensure the reliability and effectiveness of our model, we rigorously validate it on the extensive COV-19 CT dataset, which is meticulously annotated for the task. This dataset, with its comprehensive annotations, reinforces the overall robustness of our solution.  ( 3 min )
    Certifying LLM Safety against Adversarial Prompting. (arXiv:2309.02705v2 [cs.CL] UPDATED)
    Large language models (LLMs) released for public use incorporate guardrails to ensure their output is safe, often referred to as "model alignment." An aligned language model should decline a user's request to produce harmful content. However, such safety measures are vulnerable to adversarial attacks, which add maliciously designed token sequences to a harmful prompt to bypass the model's safety guards. In this work, we introduce erase-and-check, the first framework to defend against adversarial prompts with verifiable safety guarantees. We defend against three attack modes: i) adversarial suffix, which appends an adversarial sequence at the end of the prompt; ii) adversarial insertion, where the adversarial sequence is inserted anywhere in the middle of the prompt; and iii) adversarial infusion, where adversarial tokens are inserted at arbitrary positions in the prompt, not necessarily as a contiguous block. Our experimental results demonstrate that this procedure can obtain strong certified safety guarantees on harmful prompts while maintaining good empirical performance on safe prompts. For example, against adversarial suffixes of length 20, it certifiably detects 92% of harmful prompts and labels 94% of safe prompts correctly using the open-source language model Llama 2 as the safety filter. We further improve the filter's performance, in terms of accuracy and speed, by replacing Llama 2 with a DistilBERT safety classifier fine-tuned on safe and harmful prompts. Additionally, we propose two efficient empirical defenses: i) RandEC, a randomized version of erase-and-check that evaluates the safety filter on a small subset of the erased subsequences, and ii) GradEC, a gradient-based version that optimizes the erased tokens to remove the adversarial sequence. The code for our experiments is available at https://github.com/aounon/certified-llm-safety.
    Towards Improving the Generation Quality of Autoregressive Slot VAEs. (arXiv:2206.01370v3 [cs.CV] UPDATED)
    Unconditional scene inference and generation are challenging to learn jointly with a single compositional model. Despite encouraging progress on models that extract object-centric representations (''slots'') from images, unconditional generation of scenes from slots has received less attention. This is primarily because learning the multi-object relations necessary to imagine coherent scenes is difficult. We hypothesize that most existing slot-based models have a limited ability to learn object correlations. We propose two improvements that strengthen object correlation learning. The first is to condition the slots on a global, scene-level variable that captures higher-order correlations between slots. Second, we address the fundamental lack of a canonical order for objects in images by proposing to learn a consistent order to use for the autoregressive generation of scene objects. Specifically, we train an autoregressive slot prior to sequentially generate scene objects following a learned order. Ordered slot inference entails first estimating a randomly ordered set of slots using existing approaches for extracting slots from images, then aligning those slots to ordered slots generated autoregressively with the slot prior. Our experiments across three multi-object environments demonstrate clear gains in unconditional scene generation quality. Detailed ablation studies are also provided that validate the two proposed improvements.
    Equilibrium in the Computing Continuum through Active Inference. (arXiv:2311.16769v1 [cs.DC])
    Computing Continuum (CC) systems are challenged to ensure the intricate requirements of each computational tier. Given the system's scale, the Service Level Objectives (SLOs) which are expressed as these requirements, must be broken down into smaller parts that can be decentralized. We present our framework for collaborative edge intelligence enabling individual edge devices to (1) develop a causal understanding of how to enforce their SLOs, and (2) transfer knowledge to speed up the onboarding of heterogeneous devices. Through collaboration, they (3) increase the scope of SLO fulfillment. We implemented the framework and evaluated a use case in which a CC system is responsible for ensuring Quality of Service (QoS) and Quality of Experience (QoE) during video streaming. Our results showed that edge devices required only ten training rounds to ensure four SLOs; furthermore, the underlying causal structures were also rationally explainable. The addition of new types of devices can be done a posteriori, the framework allowed them to reuse existing models, even though the device type had been unknown. Finally, rebalancing the load within a device cluster allowed individual edge devices to recover their SLO compliance after a network failure from 22% to 89%.
    MACE: A Multi-pattern Accommodated and Efficient Anomaly Detection Method in the Frequency Domain. (arXiv:2311.16191v1 [cs.LG])
    Anomaly detection significantly enhances the robustness of cloud systems. While neural network-based methods have recently demonstrated strong advantages, they encounter practical challenges in cloud environments: the contradiction between the impracticality of maintaining a unique model for each service and the limited ability of dealing with diverse normal patterns by a unified model, as well as issues with handling heavy traffic in real time and short-term anomaly detection sensitivity. Thus, we propose MACE, a Multi-pattern Accommodated and efficient Anomaly detection method in the frequency domain for time series anomaly detection. There are three novel characteristics of it: (i) a pattern extraction mechanism excelling at handling diverse normal patterns, which enables the model to identify anomalies by examining the correlation between the data sample and its service normal pattern, instead of solely focusing on the data sample itself; (ii) a dualistic convolution mechanism that amplifies short-term anomalies in the time domain and hinders the reconstruction of anomalies in the frequency domain, which enlarges the reconstruction error disparity between anomaly and normality and facilitates anomaly detection; (iii) leveraging the sparsity and parallelism of frequency domain to enhance model efficiency. We theoretically and experimentally prove that using a strategically selected subset of Fourier bases can not only reduce computational overhead but is also profit to distinguish anomalies, compared to using the complete spectrum. Moreover, extensive experiments demonstrate MACE's effectiveness in handling diverse normal patterns with a unified model and it achieves state-of-the-art performance with high efficiency. \end{abstract}
    Hyper-Relational Knowledge Graph Neural Network for Next POI. (arXiv:2311.16683v1 [cs.AI])
    With the advancement of mobile technology, Point of Interest (POI) recommendation systems in Location-based Social Networks (LBSN) have brought numerous benefits to both users and companies. Many existing works employ Knowledge Graph (KG) to alleviate the data sparsity issue in LBSN. These approaches primarily focus on modeling the pair-wise relations in LBSN to enrich the semantics and thereby relieve the data sparsity issue. However, existing approaches seldom consider the hyper-relations in LBSN, such as the mobility relation (a 3-ary relation: user-POI-time). This makes the model hard to exploit the semantics accurately. In addition, prior works overlook the rich structural information inherent in KG, which consists of higher-order relations and can further alleviate the impact of data sparsity.To this end, we propose a Hyper-Relational Knowledge Graph Neural Network (HKGNN) model. In HKGNN, a Hyper-Relational Knowledge Graph (HKG) that models the LBSN data is constructed to maintain and exploit the rich semantics of hyper-relations. Then we proposed a Hypergraph Neural Network to utilize the structural information of HKG in a cohesive way. In addition, a self-attention network is used to leverage sequential information and make personalized recommendations. Furthermore, side information, essential in reducing data sparsity by providing background knowledge of POIs, is not fully utilized in current methods. In light of this, we extended the current dataset with available side information to further lessen the impact of data sparsity. Results of experiments on four real-world LBSN datasets demonstrate the effectiveness of our approach compared to existing state-of-the-art methods.
    mvlearnR and Shiny App for multiview learning. (arXiv:2311.16181v1 [q-bio.GN])
    The package mvlearnR and accompanying Shiny App is intended for integrating data from multiple sources or views or modalities (e.g. genomics, proteomics, clinical and demographic data). Most existing software packages for multiview learning are decentralized and offer limited capabilities, making it difficult for users to perform comprehensive integrative analysis. The new package wraps statistical and machine learning methods and graphical tools, providing a convenient and easy data integration workflow. For users with limited programming language, we provide a Shiny Application to facilitate data integration anywhere and on any device. The methods have potential to offer deeper insights into complex disease mechanisms. Availability and Implementation: mvlearnR is available from the following GitHub repository: https://github.com/lasandrall/mvlearnR. The web application is hosted on shinyapps.io and available at: https://multi-viewlearn.shinyapps.io/MultiView_Modeling/
    PyTorch Geometric High Order: A Unified Library for High Order Graph Neural Network. (arXiv:2311.16670v1 [cs.LG])
    We introduce PyTorch Geometric High Order (PyGHO), a library for High Order Graph Neural Networks (HOGNNs) that extends PyTorch Geometric (PyG). Unlike ordinary Message Passing Neural Networks (MPNNs) that exchange messages between nodes, HOGNNs, encompassing subgraph GNNs and k-WL GNNs, encode node tuples, a method previously lacking a standardized framework and often requiring complex coding. PyGHO's main objective is to provide an unified and user-friendly interface for various HOGNNs. It accomplishes this through streamlined data structures for node tuples, comprehensive data processing utilities, and a flexible suite of operators for high-order GNN methodologies. In this work, we present a detailed in-depth of PyGHO and compare HOGNNs implemented with PyGHO with their official implementation on real-world tasks. PyGHO achieves up to $50\%$ acceleration and reduces the code needed for implementation by an order of magnitude. Our library is available at \url{https://github.com/GraphPKU/PygHO}.
    OccamNet: A Fast Neural Model for Symbolic Regression at Scale. (arXiv:2007.10784v3 [cs.LG] UPDATED)
    Neural networks' expressiveness comes at the cost of complex, black-box models that often extrapolate poorly beyond the domain of the training dataset, conflicting with the goal of finding compact analytic expressions to describe scientific data. We introduce OccamNet, a neural network model that finds interpretable, compact, and sparse symbolic fits to data, \`a la Occam's razor. Our model defines a probability distribution over functions with efficient sampling and function evaluation. We train by sampling functions and biasing the probability mass toward better fitting solutions, backpropagating using cross-entropy matching in a reinforcement-learning loss. OccamNet can identify symbolic fits for a variety of problems, including analytic and non-analytic functions, implicit functions, and simple image classification, and can outperform state-of-the-art symbolic regression methods on real-world regression datasets. Our method requires a minimal memory footprint, fits complicated functions in minutes on a single CPU, and scales on a GPU.
    Wasserstein Distributionally Robust Estimation in High Dimensions: Performance Analysis and Optimal Hyperparameter Tuning. (arXiv:2206.13269v2 [stat.ML] UPDATED)
    Wasserstein distributionally robust optimization has recently emerged as a powerful framework for robust estimation, enjoying good out-of-sample performance guarantees, well-understood regularization effects, and computationally tractable reformulations. In such framework, the estimator is obtained by minimizing the worst-case expected loss over all probability distributions which are close, in a Wasserstein sense, to the empirical distribution. In this paper, we propose a Wasserstein distributionally robust estimation framework to estimate an unknown parameter from noisy linear measurements, and we focus on the task of analyzing the squared error performance of such estimators. Our study is carried out in the modern high-dimensional proportional regime, where both the ambient dimension and the number of samples go to infinity at a proportional rate which encodes the under/over-parametrization of the problem. Under an isotropic Gaussian features assumption, we show that the squared error can be recovered as the solution of a convex-concave optimization problem which, surprinsingly, involves at most four scalar variables. Importantly, the precise quantification of the squared error allows to accurately and efficiently compare different ambiguity radii and to understand the effect of the under/over-parametrization on the estimation error. We conclude the paper with a list of exciting research directions enabled by our results.
    SIRAN: Sinkhorn Distance Regularized Adversarial Network for DEM Super-resolution using Discriminative Spatial Self-attention. (arXiv:2311.16490v1 [eess.IV])
    Digital Elevation Model (DEM) is an essential aspect in the remote sensing domain to analyze and explore different applications related to surface elevation information. In this study, we intend to address the generation of high-resolution DEMs using high-resolution multi-spectral (MX) satellite imagery by incorporating adversarial learning. To promptly regulate this process, we utilize the notion of polarized self-attention of discriminator spatial maps as well as introduce a Densely connected Multi-Residual Block (DMRB) module to assist in efficient gradient flow. Further, we present an objective function related to optimizing Sinkhorn distance with traditional GAN to improve the stability of adversarial learning. In this regard, we provide both theoretical and empirical substantiation of better performance in terms of vanishing gradient issues and numerical convergence. We demonstrate both qualitative and quantitative outcomes with available state-of-the-art methods. Based on our experiments on DEM datasets of Shuttle Radar Topographic Mission (SRTM) and Cartosat-1, we show that the proposed model performs preferably against other learning-based state-of-the-art methods. We also generate and visualize several high-resolution DEMs covering terrains with diverse signatures to show the performance of our model.
    Class-Adaptive Sampling Policy for Efficient Continual Learning. (arXiv:2311.16485v1 [cs.LG])
    Continual learning (CL) aims to acquire new knowledge while preserving information from previous experiences without forgetting. Though buffer-based methods (i.e., retaining samples from previous tasks) have achieved acceptable performance, determining how to allocate the buffer remains a critical challenge. Most recent research focuses on refining these methods but often fails to sufficiently consider the varying influence of samples on the learning process, and frequently overlooks the complexity of the classes/concepts being learned. Generally, these methods do not directly take into account the contribution of individual classes. However, our investigation indicates that more challenging classes necessitate preserving a larger number of samples compared to less challenging ones. To address this issue, we propose a novel method and policy named 'Class-Adaptive Sampling Policy' (CASP), which dynamically allocates storage space within the buffer. By utilizing concepts of class contribution and difficulty, CASP adaptively manages buffer space, allowing certain classes to occupy a larger portion of the buffer while reducing storage for others. This approach significantly improves the efficiency of knowledge retention and utilization. CASP provides a versatile solution to boost the performance and efficiency of CL. It meets the demand for dynamic buffer allocation, accommodating the varying contributions of different classes and their learning complexities over time.
    LC4SV: A Denoising Framework Learning to Compensate for Unseen Speaker Verification Models. (arXiv:2311.16604v1 [eess.AS])
    The performance of speaker verification (SV) models may drop dramatically in noisy environments. A speech enhancement (SE) module can be used as a front-end strategy. However, existing SE methods may fail to bring performance improvements to downstream SV systems due to artifacts in the predicted signals of SE models. To compensate for artifacts, we propose a generic denoising framework named LC4SV, which can serve as a pre-processor for various unknown downstream SV models. In LC4SV, we employ a learning-based interpolation agent to automatically generate the appropriate coefficients between the enhanced signal and its noisy input to improve SV performance in noisy environments. Our experimental results demonstrate that LC4SV consistently improves the performance of various unseen SV systems. To the best of our knowledge, this work is the first attempt to develop a learning-based interpolation scheme aiming at improving SV performance in noisy environments.
    Brain-Inspired Efficient Pruning: Exploiting Criticality in Spiking Neural Networks. (arXiv:2311.16141v1 [cs.NE])
    Spiking Neural Networks (SNNs) have been an attractive option for deployment on devices with limited computing resources and lower power consumption because of the event-driven computing characteristic. As such devices have limited computing and storage resources, pruning for SNNs has been widely focused recently. However, the binary and non-differentiable property of spike signals make pruning deep SNNs challenging, so existing methods require high time overhead to make pruning decisions. In this paper, inspired by critical brain hypothesis in neuroscience, we design a regeneration mechanism based on criticality to efficiently obtain the critical pruned networks. Firstly, we propose a low-cost metric for the criticality of pruning structures. Then we re-rank the pruned structures after pruning and regenerate those with higher criticality. We evaluate our method using VGG-16 and ResNet-19 for both unstructured pruning and structured pruning. Our method achieves higher performance compared to current state-of-the-art (SOTA) method with the same time overhead. We also achieve comparable performances (even better on VGG-16) compared to the SOTA method with 11.3x and 15.5x acceleration. Moreover, we investigate underlying mechanism of our method and find that it efficiently selects potential structures, learns the consistent feature representations and reduces the overfitting during the recovery phase.
    Outfit Completion via Conditional Set Transformation. (arXiv:2311.16630v1 [cs.LG])
    In this paper, we formulate the outfit completion problem as a set retrieval task and propose a novel framework for solving this problem. The proposal includes a conditional set transformation architecture with deep neural networks and a compatibility-based regularization method. The proposed method utilizes a map with permutation-invariant for the input set and permutation-equivariant for the condition set. This allows retrieving a set that is compatible with the input set while reflecting the properties of the condition set. In addition, since this structure outputs the element of the output set in a single inference, it can achieve a scalable inference speed with respect to the cardinality of the output set. Experimental results on real data reveal that the proposed method outperforms existing approaches in terms of accuracy of the outfit completion task, condition satisfaction, and compatibility of completion results.
    Video Anomaly Detection via Spatio-Temporal Pseudo-Anomaly Generation : A Unified Approach. (arXiv:2311.16514v1 [cs.CV])
    Video Anomaly Detection (VAD) is an open-set recognition task, which is usually formulated as a one-class classification (OCC) problem, where training data is comprised of videos with normal instances while test data contains both normal and anomalous instances. Recent works have investigated the creation of pseudo-anomalies (PAs) using only the normal data and making strong assumptions about real-world anomalies with regards to abnormality of objects and speed of motion to inject prior information about anomalies in an autoencoder (AE) based reconstruction model during training. This work proposes a novel method for generating generic spatio-temporal PAs by inpainting a masked out region of an image using a pre-trained Latent Diffusion Model and further perturbing the optical flow using mixup to emulate spatio-temporal distortions in the data. In addition, we present a simple unified framework to detect real-world anomalies under the OCC setting by learning three types of anomaly indicators, namely reconstruction quality, temporal irregularity and semantic inconsistency. Extensive experiments on four VAD benchmark datasets namely Ped2, Avenue, ShanghaiTech and UBnormal demonstrate that our method performs on par with other existing state-of-the-art PAs generation and reconstruction based methods under the OCC setting. Our analysis also examines the transferability and generalisation of PAs across these datasets, offering valuable insights by identifying real-world anomalies through PAs.
    Univariate Radial Basis Function Layers: Brain-inspired Deep Neural Layers for Low-Dimensional Inputs. (arXiv:2311.16148v1 [cs.NE])
    Deep Neural Networks (DNNs) became the standard tool for function approximation with most of the introduced architectures being developed for high-dimensional input data. However, many real-world problems have low-dimensional inputs for which standard Multi-Layer Perceptrons (MLPs) are the default choice. An investigation into specialized architectures is missing. We propose a novel DNN layer called Univariate Radial Basis Function (U-RBF) layer as an alternative. Similar to sensory neurons in the brain, the U-RBF layer processes each individual input dimension with a population of neurons whose activations depend on different preferred input values. We verify its effectiveness compared to MLPs in low-dimensional function regressions and reinforcement learning tasks. The results show that the U-RBF is especially advantageous when the target function becomes complex and difficult to approximate.
    Transfer Learning between Motor Imagery Datasets using Deep Learning -- Validation of Framework and Comparison of Datasets. (arXiv:2311.16109v1 [cs.CV])
    We present a simple deep learning-based framework commonly used in computer vision and demonstrate its effectiveness for cross-dataset transfer learning in mental imagery decoding tasks that are common in the field of Brain-Computer Interfaces (BCI). We investigate, on a large selection of 12 motor-imagery datasets, which ones are well suited for transfer, both as donors and as receivers. Challenges. Deep learning models typically require long training times and are data-hungry, which impedes their use for BCI systems that have to minimize the recording time for (training) examples and are subject to constraints induced by experiments involving human subjects. A solution to both issues is transfer learning, but it comes with its own challenge, i.e., substantial data distribution shifts between datasets, subjects and even between subsequent sessions of the same subject. Approach. For every pair of pre-training (donor) and test (receiver) dataset, we first train a model on the donor before training merely an additional new linear classification layer based on a few receiver trials. Performance of this transfer approach is then tested on other trials of the receiver dataset. Significance. First, we lower the threshold to use transfer learning between motor imagery datasets: the overall framework is extremely simple and nevertheless obtains decent classification scores. Second, we demonstrate that deep learning models are a good option for motor imagery cross-dataset transfer both for the reasons outlined in the first point and because the framework presented is viable in online scenarios. Finally, analysing which datasets are best suited for transfer learning can be used as a reference for future researchers to determine which to use for pre-training or benchmarking.
    MMPDE-Net and Moving Sampling Physics-informed Neural Networks Based On Moving Mesh Method. (arXiv:2311.16167v1 [math.NA])
    In this work, we propose an end-to-end adaptive sampling neural network (MMPDE-Net) based on the moving mesh PDE method, which can adaptively generate new coordinates of sampling points by solving the moving mesh PDE. This model focuses on improving the efficiency of individual sampling points. Moreover, we have developed an iterative algorithm based on MMPDE-Net, which makes the sampling points more precise and controllable. Since MMPDE-Net is a framework independent of the deep learning solver, we combine it with PINN to propose MS-PINN and demonstrate its effectiveness by performing error analysis under the assumptions given in this paper. Meanwhile, we demonstrate the performance improvement of MS-PINN compared to PINN through numerical experiments on four typical examples to verify the effectiveness of our method.
    On the Robustness of Decision-Focused Learning. (arXiv:2311.16487v1 [cs.LG])
    Decision-Focused Learning (DFL) is an emerging learning paradigm that tackles the task of training a machine learning (ML) model to predict missing parameters of an incomplete optimization problem, where the missing parameters are predicted. DFL trains an ML model in an end-to-end system, by integrating the prediction and optimization tasks, providing better alignment of the training and testing objectives. DFL has shown a lot of promise and holds the capacity to revolutionize decision-making in many real-world applications. However, very little is known about the performance of these models under adversarial attacks. We adopt ten unique DFL methods and benchmark their performance under two distinctly focused attacks adapted towards the Predict-then-Optimize problem setting. Our study proposes the hypothesis that the robustness of a model is highly correlated with its ability to find predictions that lead to optimal decisions without deviating from the ground-truth label. Furthermore, we provide insight into how to target the models that violate this condition and show how these models respond differently depending on the achieved optimality at the end of their training cycles.
    Scalable Label Distribution Learning for Multi-Label Classification. (arXiv:2311.16556v1 [cs.LG])
    Multi-label classification (MLC) refers to the problem of tagging a given instance with a set of relevant labels. Most existing MLC methods are based on the assumption that the correlation of two labels in each label pair is symmetric, which is violated in many real-world scenarios. Moreover, most existing methods design learning processes associated with the number of labels, which makes their computational complexity a bottleneck when scaling up to large-scale output space. To tackle these issues, we propose a novel MLC learning method named Scalable Label Distribution Learning (SLDL) for multi-label classification which can describe different labels as distributions in a latent space, where the label correlation is asymmetric and the dimension is independent of the number of labels. Specifically, SLDL first converts labels into continuous distributions within a low-dimensional latent space and leverages the asymmetric metric to establish the correlation between different labels. Then, it learns the mapping from the feature space to the latent space, resulting in the computational complexity is no longer related to the number of labels. Finally, SLDL leverages a nearest-neighbor-based strategy to decode the latent representations and obtain the final predictions. Our extensive experiments illustrate that SLDL can achieve very competitive classification performances with little computational consumption.
    Rethinking Backdoor Attacks on Dataset Distillation: A Kernel Method Perspective. (arXiv:2311.16646v1 [cs.LG])
    Dataset distillation offers a potential means to enhance data efficiency in deep learning. Recent studies have shown its ability to counteract backdoor risks present in original training samples. In this study, we delve into the theoretical aspects of backdoor attacks and dataset distillation based on kernel methods. We introduce two new theory-driven trigger pattern generation methods specialized for dataset distillation. Following a comprehensive set of analyses and experiments, we show that our optimization-based trigger design framework informs effective backdoor attacks on dataset distillation. Notably, datasets poisoned by our designed trigger prove resilient against conventional backdoor attack detection and mitigation methods. Our empirical results validate that the triggers developed using our approaches are proficient at executing resilient backdoor attacks.
    Text-Driven Image Editing via Learnable Regions. (arXiv:2311.16432v1 [cs.CV])
    Language has emerged as a natural interface for image editing. In this paper, we introduce a method for region-based image editing driven by textual prompts, without the need for user-provided masks or sketches. Specifically, our approach leverages an existing pretrained text-to-image model and introduces a bounding box generator to find the edit regions that are aligned with the textual prompts. We show that this simple approach enables flexible editing that is compatible with current image generation models, and is able to handle complex prompts featuring multiple objects, complex sentences or long paragraphs. We conduct an extensive user study to compare our method against state-of-the-art methods. Experiments demonstrate the competitive performance of our method in manipulating images with high fidelity and realism that align with the language descriptions provided. Our project webpage: https://yuanze-lin.me/LearnableRegions_page.
    Exploring Straighter Trajectories of Flow Matching with Diffusion Guidance. (arXiv:2311.16507v1 [cs.CV])
    Flow matching as a paradigm of generative model achieves notable success across various domains. However, existing methods use either multi-round training or knowledge within minibatches, posing challenges in finding a favorable coupling strategy for straight trajectories. To address this issue, we propose a novel approach, Straighter trajectories of Flow Matching (StraightFM). It straightens trajectories with the coupling strategy guided by diffusion model from entire distribution level. First, we propose a coupling strategy to straighten trajectories, creating couplings between image and noise samples under diffusion model guidance. Second, StraightFM also integrates real data to enhance training, employing a neural network to parameterize another coupling process from images to noise samples. StraightFM is jointly optimized with couplings from above two mutually complementary directions, resulting in straighter trajectories and enabling both one-step and few-step generation. Extensive experiments demonstrate that StraightFM yields high quality samples with fewer step. StraightFM generates visually appealing images with a lower FID among diffusion and traditional flow matching methods within 5 sampling steps when trained on pixel space. In the latent space (i.e., Latent Diffusion), StraightFM achieves a lower KID value compared to existing methods on the CelebA-HQ 256 dataset in fewer than 10 sampling steps.
    Elucidating Discrepancy in Explanations of Predictive Models Developed using EMR. (arXiv:2311.16654v1 [cs.LG])
    The lack of transparency and explainability hinders the clinical adoption of Machine learning (ML) algorithms. While explainable artificial intelligence (XAI) methods have been proposed, little research has focused on the agreement between these methods and expert clinical knowledge. This study applies current state-of-the-art explainability methods to clinical decision support algorithms developed for Electronic Medical Records (EMR) data to analyse the concordance between these factors and discusses causes for identified discrepancies from a clinical and technical perspective. Important factors for achieving trustworthy XAI solutions for clinical decision support are also discussed.
    Comprehensive Benchmarking of Entropy and Margin Based Scoring Metrics for Data Selection. (arXiv:2311.16302v1 [cs.LG])
    While data selection methods have been studied extensively in active learning, data pruning, and data augmentation settings, there is little evidence for the efficacy of these methods in industry scale settings, particularly in low-resource languages. Our work presents ways of assessing prospective training examples in those settings for their "usefulness" or "difficulty". We also demonstrate how these measures can be used in selecting important examples for training supervised machine learning models. We primarily experiment with entropy and Error L2-Norm (EL2N) scores. We use these metrics to curate high quality datasets from a large pool of \textit{Weak Signal Labeled} data, which assigns no-defect high confidence hypotheses during inference as ground truth labels. We then conduct training data augmentation experiments using these de-identified datasets and demonstrate that score-based selection can result in a 2% decrease in semantic error rate and 4%-7% decrease in domain classification error rate when compared to the baseline technique of random selection.
    Model-free Test Time Adaptation for Out-Of-Distribution Detection. (arXiv:2311.16420v1 [cs.LG])
    Out-of-distribution (OOD) detection is essential for the reliability of ML models. Most existing methods for OOD detection learn a fixed decision criterion from a given in-distribution dataset and apply it universally to decide if a data point is OOD. Recent work~\cite{fang2022is} shows that given only in-distribution data, it is impossible to reliably detect OOD data without extra assumptions. Motivated by the theoretical result and recent exploration of test-time adaptation methods, we propose a Non-Parametric Test Time \textbf{Ada}ptation framework for \textbf{O}ut-Of-\textbf{D}istribution \textbf{D}etection (\abbr). Unlike conventional methods, \abbr utilizes online test samples for model adaptation during testing, enhancing adaptability to changing data distributions. The framework incorporates detected OOD instances into decision-making, reducing false positive rates, particularly when ID and OOD distributions overlap significantly. We demonstrate the effectiveness of \abbr through comprehensive experiments on multiple OOD detection benchmarks, extensive empirical studies show that \abbr significantly improves the performance of OOD detection over state-of-the-art methods. Specifically, \abbr reduces the false positive rate (FPR95) by $23.23\%$ on the CIFAR-10 benchmarks and $38\%$ on the ImageNet-1k benchmarks compared to the advanced methods. Lastly, we theoretically verify the effectiveness of \abbr.
    Improving Denoising Diffusion Probabilistic Models via Exploiting Shared Representations. (arXiv:2311.16353v1 [cs.LG])
    In this work, we address the challenge of multi-task image generation with limited data for denoising diffusion probabilistic models (DDPM), a class of generative models that produce high-quality images by reversing a noisy diffusion process. We propose a novel method, SR-DDPM, that leverages representation-based techniques from few-shot learning to effectively learn from fewer samples across different tasks. Our method consists of a core meta architecture with shared parameters, i.e., task-specific layers with exclusive parameters. By exploiting the similarity between diverse data distributions, our method can scale to multiple tasks without compromising the image quality. We evaluate our method on standard image datasets and show that it outperforms both unconditional and conditional DDPM in terms of FID and SSIM metrics.
    Manifold Preserving Guided Diffusion. (arXiv:2311.16424v1 [cs.LG])
    Despite the recent advancements, conditional image generation still faces challenges of cost, generalizability, and the need for task-specific training. In this paper, we propose Manifold Preserving Guided Diffusion (MPGD), a training-free conditional generation framework that leverages pretrained diffusion models and off-the-shelf neural networks with minimal additional inference cost for a broad range of tasks. Specifically, we leverage the manifold hypothesis to refine the guided diffusion steps and introduce a shortcut algorithm in the process. We then propose two methods for on-manifold training-free guidance using pre-trained autoencoders and demonstrate that our shortcut inherently preserves the manifolds when applied to latent diffusion models. Our experiments show that MPGD is efficient and effective for solving a variety of conditional generation applications in low-compute settings, and can consistently offer up to 3.8x speed-ups with the same number of diffusion steps while maintaining high sample quality compared to the baselines.
    Eigenmatrix for unstructured sparse recovery. (arXiv:2311.16609v1 [math.NA])
    This paper considers the unstructured sparse recovery problems in a general form. Examples include rational approximation, spectral function estimation, Fourier inversion, Laplace inversion, and sparse deconvolution. The main challenges are the noise in the sample values and the unstructured nature of the sample locations. This paper proposes the eigenmatrix, a data-driven construction with desired approximate eigenvalues and eigenvectors. The eigenmatrix offers a new way for these sparse recovery problems. Numerical results are provided to demonstrate the efficiency of the proposed method.
    Domain-Specific Deep Learning Feature Extractor for Diabetic Foot Ulcer Detection. (arXiv:2311.16312v1 [cs.CV])
    Diabetic Foot Ulcer (DFU) is a condition requiring constant monitoring and evaluations for treatment. DFU patient population is on the rise and will soon outpace the available health resources. Autonomous monitoring and evaluation of DFU wounds is a much-needed area in health care. In this paper, we evaluate and identify the most accurate feature extractor that is the core basis for developing a deep-learning wound detection network. For the evaluation, we used mAP and F1-score on the publicly available DFU2020 dataset. A combination of UNet and EfficientNetb3 feature extractor resulted in the best evaluation among the 14 networks compared. UNet and Efficientnetb3 can be used as the classifier in the development of a comprehensive DFU domain-specific autonomous wound detection pipeline.
    Cross Entropy in Deep Learning of Classifiers Is Unnecessary -- ISBE Error is All You Need. (arXiv:2311.16357v1 [cs.LG])
    In deep learning classifiers, the cost function usually takes the form of a combination of SoftMax and CrossEntropy functions. The SoftMax unit transforms the scores predicted by the model network into assessments of the degree (probabilities) of an object's membership to a given class. On the other hand, CrossEntropy measures the divergence of this prediction from the distribution of target scores. This work introduces the ISBE functionality, justifying the thesis about the redundancy of cross entropy computation in deep learning of classifiers. Not only can we omit the calculation of entropy, but also, during back-propagation, there is no need to direct the error to the normalization unit for its backward transformation. Instead, the error is sent directly to the model's network. Using examples of perceptron and convolutional networks as classifiers of images from the MNIST collection, it is observed for ISBE that results are not degraded with SoftMax only, but also with other activation functions such as Sigmoid, Tanh, or their hard variants HardSigmoid and HardTanh. Moreover, up to three percent of time is saved within the total time of forward and backward stages. The article is addressed mainly to programmers and students interested in deep model learning. For example, it illustrates in code snippets possible ways to implement ISBE units, but also formally proves that the softmax trick only applies to the class of softmax functions with relocations.
    A statistical approach to latent dynamic modeling with differential equations. (arXiv:2311.16286v1 [stat.ME])
    Ordinary differential equations (ODEs) can provide mechanistic models of temporally local changes of processes, where parameters are often informed by external knowledge. While ODEs are popular in systems modeling, they are less established for statistical modeling of longitudinal cohort data, e.g., in a clinical setting. Yet, modeling of local changes could also be attractive for assessing the trajectory of an individual in a cohort in the immediate future given its current status, where ODE parameters could be informed by further characteristics of the individual. However, several hurdles so far limit such use of ODEs, as compared to regression-based function fitting approaches. The potentially higher level of noise in cohort data might be detrimental to ODEs, as the shape of the ODE solution heavily depends on the initial value. In addition, larger numbers of variables multiply such problems and might be difficult to handle for ODEs. To address this, we propose to use each observation in the course of time as the initial value to obtain multiple local ODE solutions and build a combined estimator of the underlying dynamics. Neural networks are used for obtaining a low-dimensional latent space for dynamic modeling from a potentially large number of variables, and for obtaining patient-specific ODE parameters from baseline variables. Simultaneous identification of dynamic models and of a latent space is enabled by recently developed differentiable programming techniques. We illustrate the proposed approach in an application with spinal muscular atrophy patients and a corresponding simulation study. In particular, modeling of local changes in health status at any point in time is contrasted to the interpretation of functions obtained from a global regression. This more generally highlights how different application settings might demand different modeling strategies.
    3D Teeth Reconstruction from Panoramic Radiographs using Neural Implicit Functions. (arXiv:2311.16524v1 [cs.CV])
    Panoramic radiography is a widely used imaging modality in dental practice and research. However, it only provides flattened 2D images, which limits the detailed assessment of dental structures. In this paper, we propose Occudent, a framework for 3D teeth reconstruction from panoramic radiographs using neural implicit functions, which, to the best of our knowledge, is the first work to do so. For a given point in 3D space, the implicit function estimates whether the point is occupied by a tooth, and thus implicitly determines the boundaries of 3D tooth shapes. Firstly, Occudent applies multi-label segmentation to the input panoramic radiograph. Next, tooth shape embeddings as well as tooth class embeddings are generated from the segmentation outputs, which are fed to the reconstruction network. A novel module called Conditional eXcitation (CX) is proposed in order to effectively incorporate the combined shape and class embeddings into the implicit function. The performance of Occudent is evaluated using both quantitative and qualitative measures. Importantly, Occudent is trained and validated with actual panoramic radiographs as input, distinct from recent works which used synthesized images. Experiments demonstrate the superiority of Occudent over state-of-the-art methods.
    Utilizing Multiple Inputs Autoregressive Models for Bearing Remaining Useful Life Prediction. (arXiv:2311.16192v1 [cs.LG])
    Accurate prediction of the Remaining Useful Life (RUL) of rolling bearings is crucial in industrial production, yet existing models often struggle with limited generalization capabilities due to their inability to fully process all vibration signal patterns. We introduce a novel multi-input autoregressive model to address this challenge in RUL prediction for bearings. Our approach uniquely integrates vibration signals with previously predicted Health Indicator (HI) values, employing feature fusion to output current window HI values. Through autoregressive iterations, the model attains a global receptive field, effectively overcoming the limitations in generalization. Furthermore, we innovatively incorporate a segmentation method and multiple training iterations to mitigate error accumulation in autoregressive models. Empirical evaluation on the PMH2012 dataset demonstrates that our model, compared to other backbone networks using similar autoregressive approaches, achieves significantly lower Root Mean Square Error (RMSE) and Score. Notably, it outperforms traditional autoregressive models that use label values as inputs and non-autoregressive networks, showing superior generalization abilities with a marked lead in RMSE and Score metrics.
    Target-Free Compound Activity Prediction via Few-Shot Learning. (arXiv:2311.16328v1 [cs.LG])
    Predicting the activities of compounds against protein-based or phenotypic assays using only a few known compounds and their activities is a common task in target-free drug discovery. Existing few-shot learning approaches are limited to predicting binary labels (active/inactive). However, in real-world drug discovery, degrees of compound activity are highly relevant. We study Few-Shot Compound Activity Prediction (FS-CAP) and design a novel neural architecture to meta-learn continuous compound activities across large bioactivity datasets. Our model aggregates encodings generated from the known compounds and their activities to capture assay information. We also introduce a separate encoder for the unknown compound. We show that FS-CAP surpasses traditional similarity-based techniques as well as other state of the art few-shot learning methods on a variety of target-free drug discovery settings and datasets.
    HiFA: High-fidelity Text-to-3D Generation with Advanced Diffusion Guidance. (arXiv:2305.18766v3 [cs.CV] UPDATED)
    The advancements in automatic text-to-3D generation have been remarkable. Most existing methods use pre-trained text-to-image diffusion models to optimize 3D representations like Neural Radiance Fields (NeRFs) via latent-space denoising score matching. Yet, these methods often result in artifacts and inconsistencies across different views due to their suboptimal optimization approaches and limited understanding of 3D geometry. Moreover, the inherent constraints of NeRFs in rendering crisp geometry and stable textures usually lead to a two-stage optimization to attain high-resolution details. This work proposes holistic sampling and smoothing approaches to achieve high-quality text-to-3D generation, all in a single-stage optimization. We compute denoising scores in the text-to-image diffusion model's latent and image spaces. Instead of randomly sampling timesteps (also referred to as noise levels in denoising score matching), we introduce a novel timestep annealing approach that progressively reduces the sampled timestep throughout optimization. To generate high-quality renderings in a single-stage optimization, we propose regularization for the variance of z-coordinates along NeRF rays. To address texture flickering issues in NeRFs, we introduce a kernel smoothing technique that refines importance sampling weights coarse-to-fine, ensuring accurate and thorough sampling in high-density regions. Extensive experiments demonstrate the superiority of our method over previous approaches, enabling the generation of highly detailed and view-consistent 3D assets through a single-stage training process.
    Enhancing Sentiment Analysis Results through Outlier Detection Optimization. (arXiv:2311.16185v1 [cs.LG])
    When dealing with text data containing subjective labels like speaker emotions, inaccuracies or discrepancies among labelers are not uncommon. Such discrepancies can significantly affect the performance of machine learning algorithms. This study investigates the potential of identifying and addressing outliers in text data with subjective labels, aiming to enhance classification outcomes. We utilized the Deep SVDD algorithm, a one-class classification method, to detect outliers in nine text-based emotion and sentiment analysis datasets. By employing both a small-sized language model (DistilBERT base model with 66 million parameters) and non-deep learning machine learning algorithms (decision tree, KNN, Logistic Regression, and LDA) as the classifier, our findings suggest that the removal of outliers can lead to enhanced results in most cases. Additionally, as outliers in such datasets are not necessarily unlearnable, we experienced utilizing a large language model -- DeBERTa v3 large with 131 million parameters, which can capture very complex patterns in data. We continued to observe performance enhancements across multiple datasets.
    Evolutionary Machine Learning and Games. (arXiv:2311.16172v1 [cs.NE])
    Evolutionary machine learning (EML) has been applied to games in multiple ways, and for multiple different purposes. Importantly, AI research in games is not only about playing games; it is also about generating game content, modeling players, and many other applications. Many of these applications pose interesting problems for EML. We will structure this chapter on EML for games based on whether evolution is used to augment machine learning (ML) or ML is used to augment evolution. For completeness, we also briefly discuss the usage of ML and evolution separately in games.
    CarbNN: A Novel Active Transfer Learning Neural Network To Build De Novo Metal Organic Frameworks (MOFs) for Carbon Capture. (arXiv:2311.16158v1 [cs.LG])
    Over the past decade, climate change has become an increasing problem with one of the major contributing factors being carbon dioxide (CO2) emissions; almost 51% of total US carbon emissions are from factories. Current materials used in CO2 capture are lacking either in efficiency, sustainability, or cost. Electrocatalysis of CO2 is a new approach where CO2 can be reduced and the components used industrially as fuel, saving transportation costs, creating financial incentives. Metal Organic Frameworks (MOFs) are crystals made of organo-metals that adsorb, filter, and electrocatalyze CO2. The current available MOFs for capture & electrocatalysis are expensive to manufacture and inefficient at capture. The goal therefore is to computationally design a MOF that can adsorb CO2 and catalyze carbon monoxide & oxygen with low cost. A novel active transfer learning neural network was developed, utilizing transfer learning due to limited available data on 15 MOFs. Using the Cambridge Structural Database with 10,000 MOFs, the model used incremental mutations to fit a trained fitness hyper-heuristic function. Eventually, a Selenium MOF (C18MgO25Se11Sn20Zn5) was converged on. Through analysis of predictions & literature, the converged MOF was shown to be more effective & more synthetically accessible than existing MOFs, showing the model had an understanding of effective electrocatalytic structures in the material space. This novel network can be implemented for other gas separations and catalysis applications that have limited training accessible datasets.
    RACH-Space: Reconstructing Adaptive Convex Hull Space with applications in weak supervision. (arXiv:2307.04870v4 [cs.LG] UPDATED)
    We introduce RACH-Space, a novel classification method in ensemble learning. In particular, we show its applicability as a label model for weakly supervised learning. RACH-Space offers simplicity in implementation with minimal assumptions on the data or weak signals. The model is well suited for scenarios where fully labeled data is not available. Our method is built upon geometrical interpretation of the space spanned by weak signals. Our analysis of the high dimensional convex hull structure underlying general set of weak signals bridges geometry with machine learning. Empirical results also demonstrate that RACH-Space works well in practice and compares favorably to best existing label models for weakly supervised learning.
    Bayesian Formulations for Graph Spectral Denoising. (arXiv:2311.16378v1 [cs.LG])
    We consider noisy signals which are defined on the vertices of a graph and present smoothing algorithms for the cases of Gaussian, dropout, and uniformly distributed noise. The signals are assumed to follow a prior distribution defined in the frequency domain which favors signals which are smooth across the edges of the graph. By pairing this prior distribution with our three models of noise generation, we propose \textit{Maximum A Posteriori} (M.A.P.) estimates of the true signal in the presence of noisy data and provide algorithms for computing the M.A.P. Finally, we demonstrate the algorithms' ability to effectively restore white noise on image data, and from severe dropout in toy \& EHR data.
    Value Approximation for Two-Player General-Sum Differential Games with State Constraints. (arXiv:2311.16520v1 [cs.RO])
    Solving Hamilton-Jacobi-Isaacs (HJI) PDEs enables equilibrial feedback control in two-player differential games, yet faces the curse of dimensionality (CoD). While physics-informed machine learning has been adopted to address CoD in solving PDEs, this method falls short in learning discontinuous solutions due to its sampling nature, leading to poor safety performance of the resulting controllers in robotics applications where values are discontinuous due to state or other temporal logic constraints. In this study, we explore three potential solutions to this problem: (1) a hybrid learning method that uses both equilibrium demonstrations and the HJI PDE, (2) a value-hardening method where a sequence of HJIs are solved with increasing Lipschitz constant on the constraint violation penalty, and (3) the epigraphical technique that lifts the value to a higher dimensional auxiliary state space where the value becomes continuous. Evaluations through 5D and 9D vehicle simulations and 13D drone simulations reveal that the hybrid method outperforms others in terms of generalization and safety performance.
    InstructMol: Multi-Modal Integration for Building a Versatile and Reliable Molecular Assistant in Drug Discovery. (arXiv:2311.16208v1 [q-bio.BM])
    The rapid evolution of artificial intelligence in drug discovery encounters challenges with generalization and extensive training, yet Large Language Models (LLMs) offer promise in reshaping interactions with complex molecular data. Our novel contribution, InstructMol, a multi-modal LLM, effectively aligns molecular structures with natural language via an instruction-tuning approach, utilizing a two-stage training strategy that adeptly combines limited domain-specific data with molecular and textual information. InstructMol showcases substantial performance improvements in drug discovery-related molecular tasks, surpassing leading LLMs and significantly reducing the gap with specialized models, thereby establishing a robust foundation for a versatile and dependable drug discovery assistant.
    Use of Deep Neural Networks for Uncertain Stress Functions with Extensions to Impact Mechanics. (arXiv:2311.16135v1 [cond-mat.mtrl-sci])
    Stress-strain curves, or more generally, stress functions, are an extremely important characterization of a material's mechanical properties. However, stress functions are often difficult to derive and are narrowly tailored to a specific material. Further, large deformations, high strain-rates, temperature sensitivity, and effect of material parameters compound modeling challenges. We propose a generalized deep neural network approach to model stress as a state function with quantile regression to capture uncertainty. We extend these models to uniaxial impact mechanics using stochastic differential equations to demonstrate a use case and provide a framework for implementing this uncertainty-aware stress function. We provide experiments benchmarking our approach against leading constitutive, machine learning, and transfer learning approaches to stress and impact mechanics modeling on publicly available and newly presented data sets. We also provide a framework to optimize material parameters given multiple competing impact scenarios.
    Shortcut Bias Mitigation via Ensemble Diversity Using Diffusion Probabilistic Models. (arXiv:2311.16176v1 [cs.LG])
    Spurious correlations in the data, where multiple cues are predictive of the target labels, often lead to a phenomenon known as simplicity bias, where a model relies on erroneous, easy-to-learn cues while ignoring reliable ones. In this work, we propose an ensemble diversification framework exploiting Diffusion Probabilistic Models (DPMs) for shortcut bias mitigation. We show that at particular training intervals, DPMs can generate images with novel feature combinations, even when trained on images displaying correlated input features. We leverage this crucial property to generate synthetic counterfactuals to increase model diversity via ensemble disagreement. We show that DPM-guided diversification is sufficient to remove dependence on primary shortcut cues, without a need for additional supervised signals. We further empirically quantify its efficacy on several diversification objectives, and finally show improved generalization and diversification performance on par with prior work that relies on auxiliary data collection.
    From Reactive to Proactive Volatility Modeling with Hemisphere Neural Networks. (arXiv:2311.16333v1 [econ.EM])
    We reinvigorate maximum likelihood estimation (MLE) for macroeconomic density forecasting through a novel neural network architecture with dedicated mean and variance hemispheres. Our architecture features several key ingredients making MLE work in this context. First, the hemispheres share a common core at the entrance of the network which accommodates for various forms of time variation in the error variance. Second, we introduce a volatility emphasis constraint that breaks mean/variance indeterminacy in this class of overparametrized nonlinear models. Third, we conduct a blocked out-of-bag reality check to curb overfitting in both conditional moments. Fourth, the algorithm utilizes standard deep learning software and thus handles large data sets - both computationally and statistically. Ergo, our Hemisphere Neural Network (HNN) provides proactive volatility forecasts based on leading indicators when it can, and reactive volatility based on the magnitude of previous prediction errors when it must. We evaluate point and density forecasts with an extensive out-of-sample experiment and benchmark against a suite of models ranging from classics to more modern machine learning-based offerings. In all cases, HNN fares well by consistently providing accurate mean/variance forecasts for all targets and horizons. Studying the resulting volatility paths reveals its versatility, while probabilistic forecasting evaluation metrics showcase its enviable reliability. Finally, we also demonstrate how this machinery can be merged with other structured deep learning models by revisiting Goulet Coulombe (2022)'s Neural Phillips Curve.
    Context-lumpable stochastic bandits. (arXiv:2306.13053v2 [cs.LG] UPDATED)
    We consider a contextual bandit problem with $S$ contexts and $K$ actions. In each round $t=1,2,\dots$, the learner observes a random context and chooses an action based on its past experience. The learner then observes a random reward whose mean is a function of the context and the action for the round. Under the assumption that the contexts can be lumped into $r\le \min\{S,K\}$ groups such that the mean reward for the various actions is the same for any two contexts that are in the same group, we give an algorithm that outputs an $\epsilon$-optimal policy after using at most $\widetilde O(r (S +K )/\epsilon^2)$ samples with high probability and provide a matching $\Omega(r(S+K)/\epsilon^2)$ lower bound. In the regret minimization setting, we give an algorithm whose cumulative regret up to time $T$ is bounded by $\widetilde O(\sqrt{r^3(S+K)T})$. To the best of our knowledge, we are the first to show the near-optimal sample complexity in the PAC setting and $\widetilde O(\sqrt{{poly}(r)(S+K)T})$ minimax regret in the online setting for this problem. We also show our algorithms can be applied to more general low-rank bandits and get improved regret bounds in some scenarios.
    Protein-ligand binding representation learning from fine-grained interactions. (arXiv:2311.16160v1 [q-bio.BM])
    The binding between proteins and ligands plays a crucial role in the realm of drug discovery. Previous deep learning approaches have shown promising results over traditional computationally intensive methods, but resulting in poor generalization due to limited supervised data. In this paper, we propose to learn protein-ligand binding representation in a self-supervised learning manner. Different from existing pre-training approaches which treat proteins and ligands individually, we emphasize to discern the intricate binding patterns from fine-grained interactions. Specifically, this self-supervised learning problem is formulated as a prediction of the conclusive binding complex structure given a pocket and ligand with a Transformer based interaction module, which naturally emulates the binding process. To ensure the representation of rich binding information, we introduce two pre-training tasks, i.e.~atomic pairwise distance map prediction and mask ligand reconstruction, which comprehensively model the fine-grained interactions from both structure and feature space. Extensive experiments have demonstrated the superiority of our method across various binding tasks, including protein-ligand affinity prediction, virtual screening and protein-ligand docking.
    Multi-Agent Learning of Efficient Fulfilment and Routing Strategies in E-Commerce. (arXiv:2311.16171v1 [cs.AI])
    This paper presents an integrated algorithmic framework for minimising product delivery costs in e-commerce (known as the cost-to-serve or C2S). One of the major challenges in e-commerce is the large volume of spatio-temporally diverse orders from multiple customers, each of which has to be fulfilled from one of several warehouses using a fleet of vehicles. This results in two levels of decision-making: (i) selection of a fulfillment node for each order (including the option of deferral to a future time), and then (ii) routing of vehicles (each of which can carry multiple orders originating from the same warehouse). We propose an approach that combines graph neural networks and reinforcement learning to train the node selection and vehicle routing agents. We include real-world constraints such as warehouse inventory capacity, vehicle characteristics such as travel times, service times, carrying capacity, and customer constraints including time windows for delivery. The complexity of this problem arises from the fact that outcomes (rewards) are driven both by the fulfillment node mapping as well as the routing algorithms, and are spatio-temporally distributed. Our experiments show that this algorithmic pipeline outperforms pure heuristic policies.
    Semantic Generative Augmentations for Few-Shot Counting. (arXiv:2311.16122v1 [cs.CV])
    With the availability of powerful text-to-image diffusion models, recent works have explored the use of synthetic data to improve image classification performances. These works show that it can effectively augment or even replace real data. In this work, we investigate how synthetic data can benefit few-shot class-agnostic counting. This requires to generate images that correspond to a given input number of objects. However, text-to-image models struggle to grasp the notion of count. We propose to rely on a double conditioning of Stable Diffusion with both a prompt and a density map in order to augment a training dataset for few-shot counting. Due to the small dataset size, the fine-tuned model tends to generate images close to the training images. We propose to enhance the diversity of synthesized images by exchanging captions between images thus creating unseen configurations of object types and spatial layout. Our experiments show that our diversified generation strategy significantly improves the counting accuracy of two recent and performing few-shot counting models on FSC147 and CARPK.
    A Hierarchical Training Paradigm for Antibody Structure-sequence Co-design. (arXiv:2311.16126v1 [q-bio.BM])
    Therapeutic antibodies are an essential and rapidly expanding drug modality. The binding specificity between antibodies and antigens is decided by complementarity-determining regions (CDRs) at the tips of these Y-shaped proteins. In this paper, we propose a hierarchical training paradigm (HTP) for the antibody sequence-structure co-design. HTP consists of four levels of training stages, each corresponding to a specific protein modality within a particular protein domain. Through carefully crafted tasks in different stages, HTP seamlessly and effectively integrates geometric graph neural networks (GNNs) with large-scale protein language models to excavate evolutionary information from not only geometric structures but also vast antibody and non-antibody sequence databases, which determines ligand binding pose and strength. Empirical experiments show that HTP sets the new state-of-the-art performance in the co-design problem as well as the fix-backbone design. Our research offers a hopeful path to unleash the potential of deep generative architectures and seeks to illuminate the way forward for the antibody sequence and structure co-design challenge.
    GNNBleed: Inference Attacks to Unveil Private Edges in Graphs with Realistic Access to GNN Models. (arXiv:2311.16139v1 [cs.CR])
    Graph Neural Networks (GNNs) have increasingly become an indispensable tool in learning from graph-structured data, catering to various applications including social network analysis, recommendation systems, etc. At the heart of these networks are the edges which are crucial in guiding GNN models' predictions. In many scenarios, these edges represent sensitive information, such as personal associations or financial dealings -- thus requiring privacy assurance. However, their contributions to GNN model predictions may in turn be exploited by the adversary to compromise their privacy. Motivated by these conflicting requirements, this paper investigates edge privacy in contexts where adversaries possess black-box GNN model access, restricted further by access controls, preventing direct insights into arbitrary node outputs. In this context, we introduce a series of privacy attacks grounded on the message-passing mechanism of GNNs. These strategies allow adversaries to deduce connections between two nodes not by directly analyzing the model's output for these pairs but by analyzing the output for nodes linked to them. Our evaluation with seven real-life datasets and four GNN architectures underlines a significant vulnerability: even in systems fortified with access control mechanisms, an adaptive adversary can decipher private connections between nodes, thereby revealing potentially sensitive relationships and compromising the confidentiality of the graph.
    MAPSeg: Unified Unsupervised Domain Adaptation for Heterogeneous Medical Image Segmentation Based on 3D Masked Autoencoding and Pseudo-Labeling. (arXiv:2303.09373v2 [cs.CV] UPDATED)
    Robust segmentation is critical for deriving quantitative measures from large-scale, multi-center, and longitudinal medical scans. Manually annotating medical scans, however, is expensive and labor-intensive and may not always be available in every domain. Unsupervised domain adaptation (UDA) is a well-studied technique that alleviates this label-scarcity problem by leveraging available labels from another domain. In this study, we introduce Masked Autoencoding and Pseudo-Labeling Segmentation (MAPSeg), a $\textbf{unified}$ UDA framework with great versatility and superior performance for heterogeneous and volumetric medical image segmentation. To the best of our knowledge, this is the first study that systematically reviews and develops a framework to tackle four different domain shifts in medical image segmentation. More importantly, MAPSeg is the first framework that can be applied to $\textbf{centralized}$, $\textbf{federated}$, and $\textbf{test-time}$ UDA while maintaining comparable performance. We compare MAPSeg with previous state-of-the-art methods on a private infant brain MRI dataset and a public cardiac CT-MRI dataset, and MAPSeg outperforms others by a large margin (10.5 Dice improvement on the private MRI dataset and 5.7 on the public CT-MRI dataset). MAPSeg poses great practical value and can be applied to real-world problems. Our code and pretrained model will be available later.
    GeoTop: Advancing Image Classification with Geometric-Topological Analysis. (arXiv:2311.16157v1 [cs.CV])
    In this study, we explore the application of Topological Data Analysis (TDA) and Lipschitz-Killing Curvatures (LKCs) as powerful tools for feature extraction and classification in the context of biomedical multiomics problems. TDA allows us to capture topological features and patterns within complex datasets, while LKCs provide essential geometric insights. We investigate the potential of combining both methods to improve classification accuracy. Using a dataset of biomedical images, we demonstrate that TDA and LKCs can effectively extract topological and geometrical features, respectively. The combination of these features results in enhanced classification performance compared to using each method individually. This approach offers promising results and has the potential to advance our understanding of complex biological processes in various biomedical applications. Our findings highlight the value of integrating topological and geometrical information in biomedical data analysis. As we continue to delve into the intricacies of multiomics problems, the fusion of these insights holds great promise for unraveling the underlying biological complexities.
    Ransomware Detection and Classification using Machine Learning. (arXiv:2311.16143v1 [cs.CR])
    Vicious assaults, malware, and various ransomware pose a cybersecurity threat, causing considerable damage to computer structures, servers, and mobile and web apps across various industries and businesses. These safety concerns are important and must be addressed immediately. Ransomware detection and classification are critical for guaranteeing rapid reaction and prevention. This study uses the XGBoost classifier and Random Forest (RF) algorithms to detect and classify ransomware attacks. This approach involves analyzing the behaviour of ransomware and extracting relevant features that can help distinguish between different ransomware families. The models are evaluated on a dataset of ransomware attacks and demonstrate their effectiveness in accurately detecting and classifying ransomware. The results show that the XGBoost classifier, Random Forest Classifiers, can effectively detect and classify different ransomware attacks with high accuracy, thereby providing a valuable tool for enhancing cybersecurity.
    Sluggish and Chemically-Biased Interstitial Diffusion in Concentrated Solid Solution Alloys: Mechanisms and Methods. (arXiv:2311.16727v1 [cond-mat.mtrl-sci])
    Interstitial diffusion is a pivotal process that governs the phase stability and irradiation response of materials in non-equilibrium conditions. In this work, we study sluggish and chemically-biased interstitial diffusion in Fe-Ni concentrated solid solution alloys (CSAs) by combining machine learning (ML) and kinetic Monte Carlo (kMC), where ML is used to accurately and efficiently predict the migration energy barriers on-the-fly. The ML-kMC reproduces the diffusivity that was reported by molecular dynamics results at high temperatures. With this powerful tool, we find that the observed sluggish diffusion and the "Ni-Ni-Ni"-biased diffusion in Fe-Ni alloys are ascribed to a unique "Barrier Lock" mechanism, whereas the "Fe-Fe-Fe"-biased diffusion is influenced by a "Component Dominance" mechanism. Inspired by the mentioned mechanisms, a practical AvgS-kMC method is proposed for conveniently and swiftly determining interstitial-mediated diffusivity by only relying on the mean energy barriers of migration patterns. Combining the AvgS-kMC with the differential evolutionary algorithm, an inverse design strategy for optimizing sluggish diffusion properties is applied to emphasize the crucial role of favorable migration patterns.
    On the Role of Randomization in Adversarially Robust Classification. (arXiv:2302.07221v3 [cs.LG] UPDATED)
    Deep neural networks are known to be vulnerable to small adversarial perturbations in test data. To defend against adversarial attacks, probabilistic classifiers have been proposed as an alternative to deterministic ones. However, literature has conflicting findings on the effectiveness of probabilistic classifiers in comparison to deterministic ones. In this paper, we clarify the role of randomization in building adversarially robust classifiers. Given a base hypothesis set of deterministic classifiers, we show the conditions under which a randomized ensemble outperforms the hypothesis set in adversarial risk, extending previous results. Additionally, we show that for any probabilistic binary classifier (including randomized ensembles), there exists a deterministic classifier that outperforms it. Finally, we give an explicit description of the deterministic hypothesis set that contains such a deterministic classifier for many types of commonly used probabilistic classifiers, i.e. randomized ensembles and parametric/input noise injection.
    ACHO: Adaptive Conformal Hyperparameter Optimization. (arXiv:2207.03017v3 [cs.LG] UPDATED)
    Several novel frameworks for hyperparameter search have emerged in the last decade, but most rely on strict, often normal, distributional assumptions, limiting search model flexibility. This paper proposes a novel optimization framework based on upper confidence bound sampling of conformal confidence intervals, whose weaker assumption of exchangeability enables greater choice of search model architectures. Several such architectures were explored and benchmarked on hyperparameter search of random forests and convolutional neural networks, displaying satisfactory interval coverage and superior tuning performance to random search.
    Rethinking Intermediate Layers design in Knowledge Distillation for Kidney and Liver Tumor Segmentation. (arXiv:2311.16700v1 [cs.CV])
    Knowledge distillation(KD) has demonstrated remarkable success across various domains, but its application to medical imaging tasks, such as kidney and liver tumor segmentation, has encountered challenges. Many existing KD methods are not specifically tailored for these tasks. Moreover, prevalent KD methods often lack a careful consideration of what and from where to distill knowledge from the teacher to the student. This oversight may lead to issues like the accumulation of training bias within shallower student layers, potentially compromising the effectiveness of KD. To address these challenges, we propose Hierarchical Layer-selective Feedback Distillation (HLFD). HLFD strategically distills knowledge from a combination of middle layers to earlier layers and transfers final layer knowledge to intermediate layers at both the feature and pixel levels. This design allows the model to learn higher-quality representations from earlier layers, resulting in a robust and compact student model. Extensive quantitative evaluations reveal that HLFD outperforms existing methods by a significant margin. For example, in the kidney segmentation task, HLFD surpasses the student model (without KD) by over 10pp, significantly improving its focus on tumor-specific features. From a qualitative standpoint, the student model trained using HLFD excels at suppressing irrelevant information and can focus sharply on tumor-specific details, which opens a new pathway for more efficient and accurate diagnostic tools.
    FeTrIL: Feature Translation for Exemplar-Free Class-Incremental Learning. (arXiv:2211.13131v2 [cs.CV] UPDATED)
    Exemplar-free class-incremental learning is very challenging due to the negative effect of catastrophic forgetting. A balance between stability and plasticity of the incremental process is needed in order to obtain good accuracy for past as well as new classes. Existing exemplar-free class-incremental methods focus either on successive fine tuning of the model, thus favoring plasticity, or on using a feature extractor fixed after the initial incremental state, thus favoring stability. We introduce a method which combines a fixed feature extractor and a pseudo-features generator to improve the stability-plasticity balance. The generator uses a simple yet effective geometric translation of new class features to create representations of past classes, made of pseudo-features. The translation of features only requires the storage of the centroid representations of past classes to produce their pseudo-features. Actual features of new classes and pseudo-features of past classes are fed into a linear classifier which is trained incrementally to discriminate between all classes. The incremental process is much faster with the proposed method compared to mainstream ones which update the entire deep model. Experiments are performed with three challenging datasets, and different incremental settings. A comparison with ten existing methods shows that our method outperforms the others in most cases.
    Post-hoc Interpretability for Neural NLP: A Survey. (arXiv:2108.04840v5 [cs.CL] UPDATED)
    Neural networks for NLP are becoming increasingly complex and widespread, and there is a growing concern if these models are responsible to use. Explaining models helps to address the safety and ethical concerns and is essential for accountability. Interpretability serves to provide these explanations in terms that are understandable to humans. Additionally, post-hoc methods provide explanations after a model is learned and are generally model-agnostic. This survey provides a categorization of how recent post-hoc interpretability methods communicate explanations to humans, it discusses each method in-depth, and how they are validated, as the latter is often a common concern.
    Adapting Segment Anything Model (SAM) through Prompt-based Learning for Enhanced Protein Identification in Cryo-EM Micrographs. (arXiv:2311.16140v1 [cs.CV])
    Cryo-electron microscopy (cryo-EM) remains pivotal in structural biology, yet the task of protein particle picking, integral for 3D protein structure construction, is laden with manual inefficiencies. While recent AI tools such as Topaz and crYOLO are advancing the field, they do not fully address the challenges of cryo-EM images, including low contrast, complex shapes, and heterogeneous conformations. This study explored prompt-based learning to adapt the state-of-the-art image segmentation foundation model Segment Anything Model (SAM) for cryo-EM. This focus was driven by the desire to optimize model performance with a small number of labeled data without altering pre-trained parameters, aiming for a balance between adaptability and foundational knowledge retention. Through trials with three prompt-based learning strategies, namely head prompt, prefix prompt, and encoder prompt, we observed enhanced performance and reduced computational requirements compared to the fine-tuning approach. This work not only highlights the potential of prompting SAM in protein identification from cryo-EM micrographs but also suggests its broader promise in biomedical image segmentation and object detection.
    Data Analytics with Differential Privacy. (arXiv:2311.16104v1 [cs.LG])
    Differential privacy is the state-of-the-art definition for privacy, guaranteeing that any analysis performed on a sensitive dataset leaks no information about the individuals whose data are contained therein. In this thesis, we develop differentially private algorithms to analyze distributed and streaming data. In the distributed model, we consider the particular problem of learning -- in a distributed fashion -- a global model of the data, that can subsequently be used for arbitrary analyses. We build upon PrivBayes, a differentially private method that approximates the high-dimensional distribution of a centralized dataset as a product of low-order distributions, utilizing a Bayesian Network model. We examine three novel approaches to learning a global Bayesian Network from distributed data, while offering the differential privacy guarantee to all local datasets. Our work includes a detailed theoretical analysis of the distributed, differentially private entropy estimator which we use in one of our algorithms, as well as a detailed experimental evaluation, using both synthetic and real-world data. In the streaming model, we focus on the problem of estimating the density of a stream of users, which expresses the fraction of all users that actually appear in the stream. We offer one of the strongest privacy guarantees for the streaming model, user-level pan-privacy, which ensures that the privacy of any user is protected, even against an adversary that observes the internal state of the algorithm. We provide a detailed analysis of an existing, sampling-based algorithm for the problem and propose two novel modifications that significantly improve it, both theoretically and experimentally, by optimally using all the allocated "privacy budget."  ( 3 min )
    Utility Fairness in Contextual Dynamic Pricing with Demand Learning. (arXiv:2311.16528v1 [stat.ML])
    This paper introduces a novel contextual bandit algorithm for personalized pricing under utility fairness constraints in scenarios with uncertain demand, achieving an optimal regret upper bound. Our approach, which incorporates dynamic pricing and demand learning, addresses the critical challenge of fairness in pricing strategies. We first delve into the static full-information setting to formulate an optimal pricing policy as a constrained optimization problem. Here, we propose an approximation algorithm for efficiently and approximately computing the ideal policy. We also use mathematical analysis and computational studies to characterize the structures of optimal contextual pricing policies subject to fairness constraints, deriving simplified policies which lays the foundations of more in-depth research and extensions. Further, we extend our study to dynamic pricing problems with demand learning, establishing a non-standard regret lower bound that highlights the complexity added by fairness constraints. Our research offers a comprehensive analysis of the cost of fairness and its impact on the balance between utility and revenue maximization. This work represents a step towards integrating ethical considerations into algorithmic efficiency in data-driven dynamic pricing.
    The Graph Convolutional Network with Multi-representation Alignment for Drug Synergy Prediction. (arXiv:2311.16207v1 [q-bio.QM])
    Drug combination refers to the use of two or more drugs to treat a specific disease at the same time. It is currently the mainstream way to treat complex diseases. Compared with single drugs, drug combinations have better efficacy and can better inhibit toxicity and drug resistance. The computational model based on deep learning concatenates the representation of multiple drugs and the corresponding cell line feature as input, and the output is whether the drug combination can have an inhibitory effect on the cell line. However, this strategy of concatenating multiple representations has the following defects: the alignment of drug representation and cell line representation is ignored, resulting in the synergistic relationship not being reflected positionally in the embedding space. Moreover, the alignment measurement function in deep learning cannot be suitable for drug synergy prediction tasks due to differences in input types. Therefore, in this work, we propose a graph convolutional network with multi-representation alignment (GCNMRA) for predicting drug synergy. In the GCNMRA model, we designed a multi-representation alignment function suitable for the drug synergy prediction task so that the positional relationship between drug representations and cell line representation is reflected in the embedding space. In addition, the vector modulus of drug representations and cell line representation is considered to improve the accuracy of calculation results and accelerate model convergence. Finally, many relevant experiments were run on multiple drug synergy datasets to verify the effectiveness of the above innovative elements and the excellence of the GCNMRA model.
    Ultra-short-term multi-step wind speed prediction for wind farms based on adaptive noise reduction technology and temporal convolutional network. (arXiv:2311.16198v1 [cs.LG])
    As an important clean and renewable kind of energy, wind power plays an important role in coping with energy crisis and environmental pollution. However, the volatility and intermittency of wind speed restrict the development of wind power. To improve the utilization of wind power, this study proposes a new wind speed prediction model based on data noise reduction technology, temporal convolutional network (TCN), and gated recurrent unit (GRU). Firstly, an adaptive data noise reduction algorithm P-SSA is proposed based on singular spectrum analysis (SSA) and Pearson correlation coefficient. The original wind speed is decomposed into multiple subsequences by SSA and then reconstructed. When the Pearson correlation coefficient between the reconstructed sequence and the original sequence is greater than 0.99, other noise subsequences are deleted to complete the data denoising. Then, the receptive field of the samples is expanded through the causal convolution and dilated convolution of TCN, and the characteristics of wind speed change are extracted. Then, the time feature information of the sequence is extracted by GRU, and then the wind speed is predicted to form the wind speed sequence prediction model of P-SSA-TCN-GRU. The proposed model was validated on three wind farms in Shandong Province. The experimental results show that the prediction performance of the proposed model is better than that of the traditional model and other models based on TCN, and the wind speed prediction of wind farms with high precision and strong stability is realized. The wind speed predictions of this model have the potential to become the data that support the operation and management of wind farms. The code is available at link.
    Reward Shaping for Improved Learning in Real-time Strategy Game Play. (arXiv:2311.16339v1 [cs.LG])
    We investigate the effect of reward shaping in improving the performance of reinforcement learning in the context of the real-time strategy, capture-the-flag game. The game is characterized by sparse rewards that are associated with infrequently occurring events such as grabbing or capturing the flag, or tagging the opposing player. We show that appropriately designed reward shaping functions applied to different game events can significantly improve the player's performance and training times of the player's learning algorithm. We have validated our reward shaping functions within a simulated environment for playing a marine capture-the-flag game between two players. Our experimental results demonstrate that reward shaping can be used as an effective means to understand the importance of different sub-tasks during game-play towards winning the game, to encode a secondary objective functions such as energy efficiency into a player's game-playing behavior, and, to improve learning generalizable policies that can perform well against different skill levels of the opponent.
    Learning Noise-Robust Joint Representation for Multimodal Emotion Recognition under Realistic Incomplete Data Scenarios. (arXiv:2311.16114v1 [cs.CV])
    Multimodal emotion recognition (MER) in practical scenarios presents a significant challenge due to the presence of incomplete data, such as missing or noisy data. Traditional methods often discard missing data or replace it with a zero vector, neglecting the availability issue of noisy data. Consequently, these approaches are not fully applicable to realistic scenarios, where both missing and noisy data are prevalent. To address this problem, we propose a novel noise-robust MER model, named NMER, which effectively learns robust multimodal joint representations from incomplete data containing noise. Our approach incorporates two key components. First, we introduce a noise scheduler that adjusts the type and level of noise in the training data, emulating the characteristics of incomplete data in realistic scenarios. Second, we employ a Variational AutoEncoder (VAE)-based NMER model to generate robust multimodal joint representations from the noisy data, leveraging the modality invariant feature. The experimental results on the benchmark dataset IEMOCAP indicate the proposed NMER outperforms state-of-the-art MER systems. The ablation results also confirm the effectiveness of the VAE structure. We release our code at \href{https://github.com/WooyoohL/Noise-robust_MER.
    A novel RNA pseudouridine site prediction model using Utility Kernel and data-driven parameters. (arXiv:2311.16132v1 [q-bio.BM])
    RNA protein Interactions (RPIs) play an important role in biological systems. Recently, we have enumerated the RPIs at the residue level and have elucidated the minimum structural unit (MSU) in these interactions to be a stretch of five residues (Nucleotides/amino acids). Pseudouridine is the most frequent modification in RNA. The conversion of uridine to pseudouridine involves interactions between pseudouridine synthase and RNA. The existing models to predict the pseudouridine sites in a given RNA sequence mainly depend on user-defined features such as mono and dinucleotide composition/propensities of RNA sequences. Predicting pseudouridine sites is a non-linear classification problem with limited data points. Deep Learning models are efficient discriminators when the data set size is reasonably large and fail when there is a paucity of data ($<1000$ samples). To mitigate this problem, we propose a Support Vector Machine (SVM) Kernel based on utility theory from Economics, and using data-driven parameters (i.e. MSU) as features. For this purpose, we have used position-specific tri/quad/pentanucleotide composition/propensity (PSPC/PSPP) besides nucleotide and dineculeotide composition as features. SVMs are known to work well in small data regimes and kernels in SVM are designed to classify non-linear data. The proposed model outperforms the existing state-of-the-art models significantly (10%-15% on average).
    Streaming Lossless Volumetric Compression of Medical Images Using Gated Recurrent Convolutional Neural Network. (arXiv:2311.16200v1 [eess.IV])
    Deep learning-based lossless compression methods offer substantial advantages in compressing medical volumetric images. Nevertheless, many learning-based algorithms encounter a trade-off between practicality and compression performance. This paper introduces a hardware-friendly streaming lossless volumetric compression framework, utilizing merely one-thousandth of the model weights compared to other learning-based compression frameworks. We propose a gated recurrent convolutional neural network that combines diverse convolutional structures and fusion gate mechanisms to capture the inter-slice dependencies in volumetric images. Based on such contextual information, we can predict the pixel-by-pixel distribution for entropy coding. Guided by hardware/software co-design principles, we implement the proposed framework on Field Programmable Gate Array to achieve enhanced real-time performance. Extensive experimental results indicate that our method outperforms traditional lossless volumetric compressors and state-of-the-art learning-based lossless compression methods across various medical image benchmarks. Additionally, our method exhibits robust generalization ability and competitive compression speed
    Co-learning synaptic delays, weights and adaptation in spiking neural networks. (arXiv:2311.16112v1 [cs.NE])
    Spiking neural networks (SNN) distinguish themselves from artificial neural networks (ANN) because of their inherent temporal processing and spike-based computations, enabling a power-efficient implementation in neuromorphic hardware. In this paper, we demonstrate that data processing with spiking neurons can be enhanced by co-learning the connection weights with two other biologically inspired neuronal features: 1) a set of parameters describing neuronal adaptation processes and 2) synaptic propagation delays. The former allows the spiking neuron to learn how to specifically react to incoming spikes based on its past. The trained adaptation parameters result in neuronal heterogeneity, which is found in the brain and also leads to a greater variety in available spike patterns. The latter enables to learn to explicitly correlate patterns that are temporally distanced. Synaptic delays reflect the time an action potential requires to travel from one neuron to another. We show that each of the co-learned features separately leads to an improvement over the baseline SNN and that the combination of both leads to state-of-the-art SNN results on all speech recognition datasets investigated with a simple 2-hidden layer feed-forward network. Our SNN outperforms the ANN on the neuromorpic datasets (Spiking Heidelberg Digits and Spiking Speech Commands), even with fewer trainable parameters. On the 35-class Google Speech Commands dataset, our SNN also outperforms a GRU of similar size. Our work presents brain-inspired improvements to SNN that enable them to excel over an equivalent ANN of similar size on tasks with rich temporal dynamics.
    D4AM: A General Denoising Framework for Downstream Acoustic Models. (arXiv:2311.16595v1 [cs.SD])
    The performance of acoustic models degrades notably in noisy environments. Speech enhancement (SE) can be used as a front-end strategy to aid automatic speech recognition (ASR) systems. However, existing training objectives of SE methods are not fully effective at integrating speech-text and noisy-clean paired data for training toward unseen ASR systems. In this study, we propose a general denoising framework, D4AM, for various downstream acoustic models. Our framework fine-tunes the SE model with the backward gradient according to a specific acoustic model and the corresponding classification objective. In addition, our method aims to consider the regression objective as an auxiliary loss to make the SE model generalize to other unseen acoustic models. To jointly train an SE unit with regression and classification objectives, D4AM uses an adjustment scheme to directly estimate suitable weighting coefficients rather than undergoing a grid search process with additional training costs. The adjustment scheme consists of two parts: gradient calibration and regression objective weighting. The experimental results show that D4AM can consistently and effectively provide improvements to various unseen acoustic models and outperforms other combination setups. Specifically, when evaluated on the Google ASR API with real noisy data completely unseen during SE training, D4AM achieves a relative WER reduction of 24.65% compared with the direct feeding of noisy input. To our knowledge, this is the first work that deploys an effective combination scheme of regression (denoising) and classification (ASR) objectives to derive a general pre-processor applicable to various unseen ASR systems. Our code is available at https://github.com/ChangLee0903/D4AM.
    Contrastive encoder pre-training-based clustered federated learning for heterogeneous data. (arXiv:2311.16535v1 [cs.LG])
    Federated learning (FL) is a promising approach that enables distributed clients to collaboratively train a global model while preserving their data privacy. However, FL often suffers from data heterogeneity problems, which can significantly affect its performance. To address this, clustered federated learning (CFL) has been proposed to construct personalized models for different client clusters. One effective client clustering strategy is to allow clients to choose their own local models from a model pool based on their performance. However, without pre-trained model parameters, such a strategy is prone to clustering failure, in which all clients choose the same model. Unfortunately, collecting a large amount of labeled data for pre-training can be costly and impractical in distributed environments. To overcome this challenge, we leverage self-supervised contrastive learning to exploit unlabeled data for the pre-training of FL systems. Together, self-supervised pre-training and client clustering can be crucial components for tackling the data heterogeneity issues of FL. Leveraging these two crucial strategies, we propose contrastive pre-training-based clustered federated learning (CP-CFL) to improve the model convergence and overall performance of FL systems. In this work, we demonstrate the effectiveness of CP-CFL through extensive experiments in heterogeneous FL settings, and present various interesting observations.
    Enabling Fast 2-bit LLM on GPUs: Memory Alignment, Sparse Outlier, and Asynchronous Dequantization. (arXiv:2311.16442v1 [cs.LG])
    Large language models (LLMs) have demonstrated impressive abilities in various domains while the inference cost is expensive. The state-of-the-art methods use 2-bit quantization for mainstream LLMs. However, challenges still exist: (1) Nonnegligible accuracy loss for 2-bit quantization. Weights are quantized by groups, while the ranges of weights are large in some groups, resulting in large quantization errors and nonnegligible accuracy loss (e.g. >3% for Llama2-7b with 2-bit quantization in GPTQ and Greenbit). (2) Limited accuracy improvement by adding 4-bit weights. Increasing 10% extra average bit more 4-bit weights only leads to 50% execution time, hindering the potential of reducing LLM inference cost. To tackle these challenges, we propose the following techniques: (1) We only quantize a small fraction of groups with the larger range using 4-bit with memory alignment consideration on GPUs. (2) We point out that the distribution of the sparse outliers with larger weights is different in 2-bit and 4-bit groups, and only a small fraction of outliers require 16-bit quantization. Such design leads to >0.5% accuracy improvement with <3% average increased bit for Llama2-7b. (3) We design the asynchronous dequantization on GPUs, leading to up to 3.92X speedup. We conduct extensive experiments on different model families and model sizes. We achieve 2.85-bit for each weight and the end-to-end speedup for Llama2-7b is 1.74X over the original model, and we reduce both runtime cost and hardware cost by up to 2.70X and 2.81X with less GPU requirements.
    Deep Learning for Time Series Classification of Parkinson's Disease Eye Tracking Data. (arXiv:2311.16381v1 [cs.LG])
    Eye-tracking is an accessible and non-invasive technology that provides information about a subject's motor and cognitive abilities. As such, it has proven to be a valuable resource in the study of neurodegenerative diseases such as Parkinson's disease. Saccade experiments, in particular, have proven useful in the diagnosis and staging of Parkinson's disease. However, to date, no single eye-movement biomarker has been found to conclusively differentiate patients from healthy controls. In the present work, we investigate the use of state-of-the-art deep learning algorithms to perform Parkinson's disease classification using eye-tracking data from saccade experiments. In contrast to previous work, instead of using hand-crafted features from the saccades, we use raw $\sim1.5\,s$ long fixation intervals recorded during the preparatory phase before each trial. Using these short time series as input we implement two different classification models, InceptionTime and ROCKET. We find that the models are able to learn the classification task and generalize to unseen subjects. InceptionTime achieves $78\%$ accuracy, while ROCKET achieves $88\%$ accuracy. We also employ a novel method for pruning the ROCKET model to improve interpretability and generalizability, achieving an accuracy of $96\%$. Our results suggest that fixation data has low inter-subject variability and potentially carries useful information about brain cognitive and motor conditions, making it suitable for use with machine learning in the discovery of disease-relevant biomarkers.
    On robust overfitting: adversarial training induced distribution matters. (arXiv:2311.16526v1 [cs.LG])
    Adversarial training may be regarded as standard training with a modified loss function. But its generalization error appears much larger than standard training under standard loss. This phenomenon, known as robust overfitting, has attracted significant research attention and remains largely as a mystery. In this paper, we first show empirically that robust overfitting correlates with the increasing generalization difficulty of the perturbation-induced distributions along the trajectory of adversarial training (specifically PGD-based adversarial training). We then provide a novel upper bound for generalization error with respect to the perturbation-induced distributions, in which a notion of the perturbation operator, referred to "local dispersion", plays an important role.
    Adversarial Distribution Balancing for Counterfactual Reasoning. (arXiv:2311.16616v1 [cs.LG])
    The development of causal prediction models is challenged by the fact that the outcome is only observable for the applied (factual) intervention and not for its alternatives (the so-called counterfactuals); in medicine we only know patients' survival for the administered drug and not for other therapeutic options. Machine learning approaches for counterfactual reasoning have to deal with both unobserved outcomes and distributional differences due to non-random treatment administration. Unsupervised domain adaptation (UDA) addresses similar issues; one has to deal with unobserved outcomes -- the labels of the target domain -- and distributional differences between source and target domain. We propose Adversarial Distribution Balancing for Counterfactual Reasoning (ADBCR), which directly uses potential outcome estimates of the counterfactuals to remove spurious causal relations. We show that ADBCR outcompetes state-of-the-art methods on three benchmark datasets, and demonstrate that ADBCR's performance can be further improved if unlabeled validation data are included in the training procedure to better adapt the model to the validation domain.
    Conditions for Length Generalization in Learning Reasoning Skills. (arXiv:2311.16173v1 [cs.AI])
    Reasoning is a fundamental capability of AI agents. Recently, large language models (LLMs) have shown remarkable abilities to perform reasoning tasks. However, numerous evaluations of the reasoning capabilities of LLMs have also showed some limitations. An outstanding limitation is length generalization, meaning that when trained on reasoning problems of smaller lengths or sizes, the resulting models struggle with problems of larger sizes or lengths. This potentially indicates some theoretical limitations of generalization in learning reasoning skills. These evaluations and their observations motivated us to perform a theoretical study of the length generalization problem. This work focused on reasoning tasks that can be formulated as Markov dynamic processes (MDPs) and/or directed acyclic graphs (DAGs). It identifies and proves conditions that decide whether the length generalization problem can be solved or not for a reasoning task in a particular representation. Experiments are also conducted to verify the theoretical results.
    Symmetry-regularized neural ordinary differential equations. (arXiv:2311.16628v1 [stat.ML])
    Neural Ordinary Differential Equations (Neural ODEs) is a class of deep neural network models that interpret the hidden state dynamics of neural networks as an ordinary differential equation, thereby capable of capturing system dynamics in a continuous time framework. In this work, I integrate symmetry regularization into Neural ODEs. In particular, I use continuous Lie symmetry of ODEs and PDEs associated with the model to derive conservation laws and add them to the loss function, making it physics-informed. This incorporation of inherent structural properties into the loss function could significantly improve robustness and stability of the model during training. To illustrate this method, I employ a toy model that utilizes a cosine rate of change in the hidden state, showcasing the process of identifying Lie symmetries, deriving conservation laws, and constructing a new loss function.
    On the Effect of Defections in Federated Learning and How to Prevent Them. (arXiv:2311.16459v1 [cs.LG])
    Federated learning is a machine learning protocol that enables a large population of agents to collaborate over multiple rounds to produce a single consensus model. There are several federated learning applications where agents may choose to defect permanently$-$essentially withdrawing from the collaboration$-$if they are content with their instantaneous model in that round. This work demonstrates the detrimental impact of such defections on the final model's robustness and ability to generalize. We also show that current federated optimization algorithms fail to disincentivize these harmful defections. We introduce a novel optimization algorithm with theoretical guarantees to prevent defections while ensuring asymptotic convergence to an effective solution for all participating agents. We also provide numerical experiments to corroborate our findings and demonstrate the effectiveness of our algorithm.
    Physics-Informed Neural Network for Discovering Systems with Unmeasurable States with Application to Lithium-Ion Batteries. (arXiv:2311.16374v1 [cs.LG])
    Combining machine learning with physics is a trending approach for discovering unknown dynamics, and one of the most intensively studied frameworks is the physics-informed neural network (PINN). However, PINN often fails to optimize the network due to its difficulty in concurrently minimizing multiple losses originating from the system's governing equations. This problem can be more serious when the system's states are unmeasurable, like lithium-ion batteries (LiBs). In this work, we introduce a robust method for training PINN that uses fewer loss terms and thus constructs a less complex landscape for optimization. In particular, instead of having loss terms from each differential equation, this method embeds the dynamics into a loss function that quantifies the error between observed and predicted system outputs. This is accomplished by numerically integrating the predicted states from the neural network(NN) using known dynamics and transforming them to obtain a sequence of predicted outputs. Minimizing such a loss optimizes the NN to predict states consistent with observations given the physics. Further, the system's parameters can be added to the optimization targets. To demonstrate the ability of this method to perform various modeling and control tasks, we apply it to a battery model to concurrently estimate its states and parameters.
    A Foundational Framework and Methodology for Personalized Early and Timely Diagnosis. (arXiv:2311.16195v1 [cs.LG])
    Early diagnosis of diseases holds the potential for deep transformation in healthcare by enabling better treatment options, improving long-term survival and quality of life, and reducing overall cost. With the advent of medical big data, advances in diagnostic tests as well as in machine learning and statistics, early or timely diagnosis seems within reach. Early diagnosis research often neglects the potential for optimizing individual diagnostic paths. To enable personalized early diagnosis, a foundational framework is needed that delineates the diagnosis process and systematically identifies the time-dependent value of various diagnostic tests for an individual patient given their unique characteristics. Here, we propose the first foundational framework for early and timely diagnosis. It builds on decision-theoretic approaches to outline the diagnosis process and integrates machine learning and statistical methodology for estimating the optimal personalized diagnostic path. To describe the proposed framework as well as possibly other frameworks, we provide essential definitions. The development of a foundational framework is necessary for several reasons: 1) formalism provides clarity for the development of decision support tools; 2) observed information can be complemented with estimates of the future patient trajectory; 3) the net benefit of counterfactual diagnostic paths and associated uncertainties can be modeled for individuals 4) 'early' and 'timely' diagnosis can be clearly defined; 5) a mechanism emerges for assessing the value of technologies in terms of their impact on personalized early diagnosis, resulting health outcomes and incurred costs. Finally, we hope that this foundational framework will unlock the long-awaited potential of timely diagnosis and intervention, leading to improved outcomes for patients and higher cost-effectiveness for healthcare systems.
    Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image Generation. (arXiv:2311.16201v1 [cs.CV])
    Recent advances in image tokenizers, such as VQ-VAE, have enabled text-to-image generation using auto-regressive methods, similar to language modeling. However, these methods have yet to leverage pre-trained language models, despite their adaptability to various downstream tasks. In this work, we explore this gap by adapting a pre-trained language model for auto-regressive text-to-image generation, and find that pre-trained language models offer limited help. We provide a two-fold explanation by analyzing tokens from each modality. First, we demonstrate that image tokens possess significantly different semantics compared to text tokens, rendering pre-trained language models no more effective in modeling them than randomly initialized ones. Second, the text tokens in the image-text datasets are too simple compared to normal language model pre-training data, which causes the catastrophic degradation of language models' capability.
    Aiming to Minimize Alcohol-Impaired Road Fatalities: Utilizing Fairness-Aware and Domain Knowledge-Infused Artificial Intelligence. (arXiv:2311.16180v1 [cs.LG])
    Approximately 30% of all traffic fatalities in the United States are attributed to alcohol-impaired driving. This means that, despite stringent laws against this offense in every state, the frequency of drunk driving accidents is alarming, resulting in approximately one person being killed every 45 minutes. The process of charging individuals with Driving Under the Influence (DUI) is intricate and can sometimes be subjective, involving multiple stages such as observing the vehicle in motion, interacting with the driver, and conducting Standardized Field Sobriety Tests (SFSTs). Biases have been observed through racial profiling, leading to some groups and geographical areas facing fewer DUI tests, resulting in many actual DUI incidents going undetected, ultimately leading to a higher number of fatalities. To tackle this issue, our research introduces an Artificial Intelligence-based predictor that is both fairness-aware and incorporates domain knowledge to analyze DUI-related fatalities in different geographic locations. Through this model, we gain intriguing insights into the interplay between various demographic groups, including age, race, and income. By utilizing the provided information to allocate policing resources in a more equitable and efficient manner, there is potential to reduce DUI-related fatalities and have a significant impact on road safety.
    Seeing Beyond Cancer: Multi-Institutional Validation of Object Localization and 3D Semantic Segmentation using Deep Learning for Breast MRI. (arXiv:2311.16213v1 [eess.IV])
    The clinical management of breast cancer depends on an accurate understanding of the tumor and its anatomical context to adjacent tissues and landmark structures. This context may be provided by semantic segmentation methods; however, previous works have been largely limited to a singular focus on the tumor alone and rarely other tissue types. In contrast, we present a method that exploits tissue-tissue interactions to accurately segment every major tissue type in the breast including: chest wall, skin, adipose tissue, fibroglandular tissue, vasculature and tumor via standard-of-care Dynamic Contrast Enhanced MRI. Comparing our method to prior state-of-the-art, we achieved a superior Dice score on tumor segmentation while maintaining competitive performance on other studied tissues across multiple institutions. Briefly, our method proceeds by localizing the tumor using 2D object detectors, then segmenting the tumor and surrounding tissues independently using two 3D U-nets, and finally integrating these results while mitigating false positives by checking for anatomically plausible tissue-tissue contacts. The object detection models were pre-trained on ImageNet and COCO, and operated on MIP (maximum intensity projection) images in the axial and sagittal planes, establishing a 3D tumor bounding box. By integrating multiple relevant peri-tumoral tissues, our work enables clinical applications in breast cancer staging, prognosis and surgical planning.
    Symphony: Symmetry-Equivariant Point-Centered Spherical Harmonics for Molecule Generation. (arXiv:2311.16199v1 [cs.LG])
    We present Symphony, an $E(3)$-equivariant autoregressive generative model for 3D molecular geometries that iteratively builds a molecule from molecular fragments. Existing autoregressive models such as G-SchNet and G-SphereNet for molecules utilize rotationally invariant features to respect the 3D symmetries of molecules. In contrast, Symphony uses message-passing with higher-degree $E(3)$-equivariant features. This allows a novel representation of probability distributions via spherical harmonic signals to efficiently model the 3D geometry of molecules. We show that Symphony is able to accurately generate small molecules from the QM9 dataset, outperforming existing autoregressive models and approaching the performance of diffusion models.
    Influence Scores at Scale for Efficient Language Data Sampling. (arXiv:2311.16298v1 [cs.LG])
    Modern ML systems ingest data aggregated from diverse sources, such as synthetic, human-annotated, and live customer traffic. Understanding \textit{which} examples are important to the performance of a learning algorithm is crucial for efficient model training. Recently, a growing body of literature has given rise to various "influence scores," which use training artifacts such as model confidence or checkpointed gradients to identify important subsets of data. However, these methods have primarily been developed in computer vision settings, and it remains unclear how well they generalize to language-based tasks using pretrained models. In this paper, we explore the applicability of influence scores in language classification tasks. We evaluate a diverse subset of these scores on the SNLI dataset by quantifying accuracy changes in response to pruning training data through random and influence-score-based sampling. We then stress-test one of the scores -- "variance of gradients" (VoG) from Agarwal et al. (2022) -- in an NLU model stack that was exposed to dynamic user speech patterns in a voice assistant type of setting. Our experiments demonstrate that in many cases, encoder-based language models can be finetuned on roughly 50% of the original data without degradation in performance metrics. Along the way, we summarize lessons learned from applying out-of-the-box implementations of influence scores, quantify the effects of noisy and class-imbalanced data, and offer recommendations on score-based sampling for better accuracy and training efficiency.
    Imperceptible CMOS camera dazzle for adversarial attacks on deep neural networks. (arXiv:2311.16118v1 [cs.CR])
    Despite the outstanding performance of deep neural networks, they are vulnerable to adversarial attacks. While there are many invisible attacks in the digital domain, most physical world adversarial attacks are visible. Here we present an invisible optical adversarial attack that uses a light source to dazzle a CMOS camera with a rolling shutter. We present the photopic conditions required to keep the attacking light source completely invisible while sufficiently jamming the captured image so that a deep neural network applied to it is deceived.
    Asynchronous Wireless Federated Learning with Probabilistic Client Selection. (arXiv:2311.16741v1 [cs.LG])
    Federated learning (FL) is a promising distributed learning framework where distributed clients collaboratively train a machine learning model coordinated by a server. To tackle the stragglers issue in asynchronous FL, we consider that each client keeps local updates and probabilistically transmits the local model to the server at arbitrary times. We first derive the (approximate) expression for the convergence rate based on the probabilistic client selection. Then, an optimization problem is formulated to trade off the convergence rate of asynchronous FL and mobile energy consumption by joint probabilistic client selection and bandwidth allocation. We develop an iterative algorithm to solve the non-convex problem globally optimally. Experiments demonstrate the superiority of the proposed approach compared with the traditional schemes.
    Deep Learning-Based Frequency Offset Estimation. (arXiv:2311.16155v1 [eess.SP])
    In wireless communication systems, the asynchronization of the oscillators in the transmitter and the receiver along with the Doppler shift due to relative movement may lead to the presence of carrier frequency offset (CFO) in the received signals. Estimation of CFO is crucial for subsequent processing such as coherent demodulation. In this brief, we demonstrate the utilization of deep learning for CFO estimation by employing a residual network (ResNet) to learn and extract signal features from the raw in-phase (I) and quadrature (Q) components of the signals. We use multiple modulation schemes in the training set to make the trained model adaptable to multiple modulations or even new signals. In comparison to the commonly used traditional CFO estimation methods, our proposed IQ-ResNet method exhibits superior performance across various scenarios including different oversampling ratios, various signal lengths, and different channels
    Estimating Post-Synaptic Effects for Online Training of Feed-Forward SNNs. (arXiv:2311.16151v1 [cs.NE])
    Facilitating online learning in spiking neural networks (SNNs) is a key step in developing event-based models that can adapt to changing environments and learn from continuous data streams in real-time. Although forward-mode differentiation enables online learning, its computational requirements restrict scalability. This is typically addressed through approximations that limit learning in deep models. In this study, we propose Online Training with Postsynaptic Estimates (OTPE) for training feed-forward SNNs, which approximates Real-Time Recurrent Learning (RTRL) by incorporating temporal dynamics not captured by current approximations, such as Online Training Through Time (OTTT) and Online Spatio-Temporal Learning (OSTL). We show improved scaling for multi-layer networks using a novel approximation of temporal effects on the subsequent layer's activity. This approximation incurs minimal overhead in the time and space complexity compared to similar algorithms, and the calculation of temporal effects remains local to each layer. We characterize the learning performance of our proposed algorithms on multiple SNN model configurations for rate-based and time-based encoding. OTPE exhibits the highest directional alignment to exact gradients, calculated with backpropagation through time (BPTT), in deep networks and, on time-based encoding, outperforms other approximate methods. We also observe sizeable gains in average performance over similar algorithms in offline training of Spiking Heidelberg Digits with equivalent hyper-parameters (OTTT/OSTL - 70.5%; OTPE - 75.2%; BPTT - 78.1%).
    Dual-Stream Attention Transformers for Sewer Defect Classification. (arXiv:2311.16145v1 [cs.CV])
    We propose a dual-stream multi-scale vision transformer (DS-MSHViT) architecture that processes RGB and optical flow inputs for efficient sewer defect classification. Unlike existing methods that combine the predictions of two separate networks trained on each modality, we jointly train a single network with two branches for RGB and motion. Our key idea is to use self-attention regularization to harness the complementary strengths of the RGB and motion streams. The motion stream alone struggles to generate accurate attention maps, as motion images lack the rich visual features present in RGB images. To facilitate this, we introduce an attention consistency loss between the dual streams. By leveraging motion cues through a self-attention regularizer, we align and enhance RGB attention maps, enabling the network to concentrate on pertinent input regions. We evaluate our data on a public dataset as well as cross-validate our model performance in a novel dataset. Our method outperforms existing models that utilize either convolutional neural networks (CNNs) or multi-scale hybrid vision transformers (MSHViTs) without employing attention regularization between the two streams.
    TFMQ-DM: Temporal Feature Maintenance Quantization for Diffusion Models. (arXiv:2311.16503v1 [cs.CV])
    The Diffusion model, a prevalent framework for image generation, encounters significant challenges in terms of broad applicability due to its extended inference times and substantial memory requirements. Efficient Post-training Quantization (PTQ) is pivotal for addressing these issues in traditional models. Different from traditional models, diffusion models heavily depend on the time-step $t$ to achieve satisfactory multi-round denoising. Usually, $t$ from the finite set $\{1, \ldots, T\}$ is encoded to a temporal feature by a few modules totally irrespective of the sampling data. However, existing PTQ methods do not optimize these modules separately. They adopt inappropriate reconstruction targets and complex calibration methods, resulting in a severe disturbance of the temporal feature and denoising trajectory, as well as a low compression efficiency. To solve these, we propose a Temporal Feature Maintenance Quantization (TFMQ) framework building upon a Temporal Information Block which is just related to the time-step $t$ and unrelated to the sampling data. Powered by the pioneering block design, we devise temporal information aware reconstruction (TIAR) and finite set calibration (FSC) to align the full-precision temporal features in a limited time. Equipped with the framework, we can maintain the most temporal information and ensure the end-to-end generation quality. Extensive experiments on various datasets and diffusion models prove our state-of-the-art results. Remarkably, our quantization approach, for the first time, achieves model performance nearly on par with the full-precision model under 4-bit weight quantization. Additionally, our method incurs almost no extra computational cost and accelerates quantization time by $2.0 \times$ on LSUN-Bedrooms $256 \times 256$ compared to previous works.
  • Open

    Attentional Graph Neural Networks for Robust Massive Network Localization. (arXiv:2311.16856v1 [cs.LG])
    Graph neural networks (GNNs) have gained significant popularity for classification tasks in machine learning, yet their applications to regression problems remain limited. Concurrently, attention mechanisms have emerged as powerful tools in sequential learning tasks. In this paper, we employ GNNs and attention mechanisms to address a classical but challenging nonlinear regression problem: network localization. We propose a novel GNN-based network localization method that achieves exceptional stability and accuracy in the presence of severe non-line-of-sight (NLOS) propagations, while eliminating the need for laborious offline calibration or NLOS identification. Extensive experimental results validate the effectiveness and high accuracy of our GNN-based localization model, particularly in challenging NLOS scenarios. However, the proposed GNN-based model exhibits limited flexibility, and its accuracy is highly sensitive to a specific hyperparameter that determines the graph structure. To address the limitations and extend the applicability of the GNN-based model to real scenarios, we introduce two attentional graph neural networks (AGNNs) that offer enhanced flexibility and the ability to automatically learn the optimal hyperparameter for each node. Experimental results confirm that the AGNN models are able to enhance localization accuracy, providing a promising solution for real-world applications. We also provide some analyses of the improved performance achieved by the AGNN models from the perspectives of dynamic attention and signal denoising characteristics.  ( 2 min )
    Utility Fairness in Contextual Dynamic Pricing with Demand Learning. (arXiv:2311.16528v1 [stat.ML])
    This paper introduces a novel contextual bandit algorithm for personalized pricing under utility fairness constraints in scenarios with uncertain demand, achieving an optimal regret upper bound. Our approach, which incorporates dynamic pricing and demand learning, addresses the critical challenge of fairness in pricing strategies. We first delve into the static full-information setting to formulate an optimal pricing policy as a constrained optimization problem. Here, we propose an approximation algorithm for efficiently and approximately computing the ideal policy. We also use mathematical analysis and computational studies to characterize the structures of optimal contextual pricing policies subject to fairness constraints, deriving simplified policies which lays the foundations of more in-depth research and extensions. Further, we extend our study to dynamic pricing problems with demand learning, establishing a non-standard regret lower bound that highlights the complexity added by fairness constraints. Our research offers a comprehensive analysis of the cost of fairness and its impact on the balance between utility and revenue maximization. This work represents a step towards integrating ethical considerations into algorithmic efficiency in data-driven dynamic pricing.  ( 2 min )
    Gaussian Processes for Monitoring Air-Quality in Kampala. (arXiv:2311.16625v1 [cs.LG])
    Monitoring air pollution is of vital importance to the overall health of the population. Unfortunately, devices that can measure air quality can be expensive, and many cities in low and middle-income countries have to rely on a sparse allocation of them. In this paper, we investigate the use of Gaussian Processes for both nowcasting the current air-pollution in places where there are no sensors and forecasting the air-pollution in the future at the sensor locations. In particular, we focus on the city of Kampala in Uganda, using data from AirQo's network of sensors. We demonstrate the advantage of removing outliers, compare different kernel functions and additional inputs. We also compare two sparse approximations to allow for the large amounts of temporal data in the dataset.  ( 2 min )
    Multinomial belief networks. (arXiv:2311.16909v1 [stat.ML])
    A Bayesian approach to machine learning is attractive when we need to quantify uncertainty, deal with missing observations, when samples are scarce, or when the data is sparse. All of these commonly apply when analysing healthcare data. To address these analytical requirements, we propose a deep generative model for multinomial count data where both the weights and hidden units of the network are Dirichlet distributed. A Gibbs sampling procedure is formulated that takes advantage of a series of augmentation relations, analogous to the Zhou-Cong-Chen model. We apply the model on small handwritten digits, and a large experimental dataset of DNA mutations in cancer, and we show how the model is able to extract biologically meaningful meta-signatures in a fully data-driven way.  ( 2 min )
    Pseudo-Likelihood Inference. (arXiv:2311.16656v1 [cs.LG])
    Simulation-Based Inference (SBI) is a common name for an emerging family of approaches that infer the model parameters when the likelihood is intractable. Existing SBI methods either approximate the likelihood, such as Approximate Bayesian Computation (ABC) or directly model the posterior, such as Sequential Neural Posterior Estimation (SNPE). While ABC is efficient on low-dimensional problems, on higher-dimensional tasks, it is generally outperformed by SNPE, which leverages function approximation. In this paper, we propose Pseudo-Likelihood Inference (PLI), a new method that brings neural approximation into ABC, making it competitive on challenging Bayesian system identification tasks. By utilizing integral probability metrics, we introduce a smooth likelihood kernel with an adaptive bandwidth that is updated based on information-theoretic trust regions. Thanks to this formulation, our method (i) allows for optimizing neural posteriors via gradient descent, (ii) does not rely on summary statistics, and (iii) enables multiple observations as input. In comparison to SNPE, it leads to improved performance when more data is available. The effectiveness of PLI is evaluated on four classical SBI benchmark tasks and on a highly dynamic physical system, showing particular advantages on stochastic simulations and multi-modal posterior landscapes.  ( 2 min )
    Identifiable Feature Learning for Spatial Data with Nonlinear ICA. (arXiv:2311.16849v1 [stat.ML])
    Recently, nonlinear ICA has surfaced as a popular alternative to the many heuristic models used in deep representation learning and disentanglement. An advantage of nonlinear ICA is that a sophisticated identifiability theory has been developed; in particular, it has been proven that the original components can be recovered under sufficiently strong latent dependencies. Despite this general theory, practical nonlinear ICA algorithms have so far been mainly limited to data with one-dimensional latent dependencies, especially time-series data. In this paper, we introduce a new nonlinear ICA framework that employs $t$-process (TP) latent components which apply naturally to data with higher-dimensional dependency structures, such as spatial and spatio-temporal data. In particular, we develop a new learning and inference algorithm that extends variational inference methods to handle the combination of a deep neural network mixing function with the TP prior, and employs the method of inducing points for computational efficacy. On the theoretical side, we show that such TP independent components are identifiable under very general conditions. Further, Gaussian Process (GP) nonlinear ICA is established as a limit of the TP Nonlinear ICA model, and we prove that the identifiability of the latent components at this GP limit is more restricted. Namely, those components are identifiable if and only if they have distinctly different covariance kernels. Our algorithm and identifiability theorems are explored on simulated spatial data and real world spatio-temporal data.  ( 2 min )
    Wasserstein Distributionally Robust Estimation in High Dimensions: Performance Analysis and Optimal Hyperparameter Tuning. (arXiv:2206.13269v2 [stat.ML] UPDATED)
    Wasserstein distributionally robust optimization has recently emerged as a powerful framework for robust estimation, enjoying good out-of-sample performance guarantees, well-understood regularization effects, and computationally tractable reformulations. In such framework, the estimator is obtained by minimizing the worst-case expected loss over all probability distributions which are close, in a Wasserstein sense, to the empirical distribution. In this paper, we propose a Wasserstein distributionally robust estimation framework to estimate an unknown parameter from noisy linear measurements, and we focus on the task of analyzing the squared error performance of such estimators. Our study is carried out in the modern high-dimensional proportional regime, where both the ambient dimension and the number of samples go to infinity at a proportional rate which encodes the under/over-parametrization of the problem. Under an isotropic Gaussian features assumption, we show that the squared error can be recovered as the solution of a convex-concave optimization problem which, surprinsingly, involves at most four scalar variables. Importantly, the precise quantification of the squared error allows to accurately and efficiently compare different ambiguity radii and to understand the effect of the under/over-parametrization on the estimation error. We conclude the paper with a list of exciting research directions enabled by our results.
    Computational Hypergraph Discovery, a Gaussian Process framework for connecting the dots. (arXiv:2311.17007v1 [cs.LG])
    Most scientific challenges can be framed into one of the following three levels of complexity of function approximation. Type 1: Approximate an unknown function given input/output data. Type 2: Consider a collection of variables and functions, some of which are unknown, indexed by the nodes and hyperedges of a hypergraph (a generalized graph where edges can connect more than two vertices). Given partial observations of the variables of the hypergraph (satisfying the functional dependencies imposed by its structure), approximate all the unobserved variables and unknown functions. Type 3: Expanding on Type 2, if the hypergraph structure itself is unknown, use partial observations of the variables of the hypergraph to discover its structure and approximate its unknown functions. While most Computational Science and Engineering and Scientific Machine Learning challenges can be framed as Type 1 and Type 2 problems, many scientific problems can only be categorized as Type 3. Despite their prevalence, these Type 3 challenges have been largely overlooked due to their inherent complexity. Although Gaussian Process (GP) methods are sometimes perceived as well-founded but old technology limited to Type 1 curve fitting, their scope has recently been expanded to Type 2 problems. In this paper, we introduce an interpretable GP framework for Type 3 problems, targeting the data-driven discovery and completion of computational hypergraphs. Our approach is based on a kernel generalization of Row Echelon Form reduction from linear systems to nonlinear ones and variance-based analysis. Here, variables are linked via GPs and those contributing to the highest data variance unveil the hypergraph's structure. We illustrate the scope and efficiency of the proposed approach with applications to (algebraic) equation discovery, network discovery (gene pathways, chemical, and mechanical) and raw data analysis.  ( 3 min )
    Optimal minimax rate of learning interaction kernels. (arXiv:2311.16852v1 [math.ST])
    Nonparametric estimation of nonlocal interaction kernels is crucial in various applications involving interacting particle systems. The inference challenge, situated at the nexus of statistical learning and inverse problems, comes from the nonlocal dependency. A central question is whether the optimal minimax rate of convergence for this problem aligns with the rate of $M^{-\frac{2\beta}{2\beta+1}}$ in classical nonparametric regression, where $M$ is the sample size and $\beta$ represents the smoothness exponent of the radial kernel. Our study confirms this alignment for systems with a finite number of particles. We introduce a tamed least squares estimator (tLSE) that attains the optimal convergence rate for a broad class of exchangeable distributions. The tLSE bridges the smallest eigenvalue of random matrices and Sobolev embedding. This estimator relies on nonasymptotic estimates for the left tail probability of the smallest eigenvalue of the normal matrix. The lower minimax rate is derived using the Fano-Tsybakov hypothesis testing method. Our findings reveal that provided the inverse problem in the large sample limit satisfies a coercivity condition, the left tail probability does not alter the bias-variance tradeoff, and the optimal minimax rate remains intact. Our tLSE method offers a straightforward approach for establishing the optimal minimax rate for models with either local or nonlocal dependency.  ( 2 min )
    Imputation using training labels and classification via label imputation. (arXiv:2311.16877v1 [cs.LG])
    Missing data is a common problem in practical settings. Various imputation methods have been developed to deal with missing data. However, even though the label is usually available in the training data, the common practice of imputation usually only relies on the input and ignores the label. In this work, we illustrate how stacking the label into the input can significantly improve the imputation of the input. In addition, we propose a classification strategy that initializes the predicted test label with missing values and stacks the label with the input for imputation. This allows imputing the label and the input at the same time. Also, the technique is capable of handling data training with missing labels without any prior imputation and is applicable to continuous, categorical, or mixed-type data. Experiments show promising results in terms of accuracy.  ( 2 min )
    The HR-Calculus: Enabling Information Processing with Quaternion Algebra. (arXiv:2311.16771v1 [stat.ML])
    From their inception, quaternions and their division algebra have proven to be advantageous in modelling rotation/orientation in three-dimensional spaces and have seen use from the initial formulation of electromagnetic filed theory through to forming the basis of quantum filed theory. Despite their impressive versatility in modelling real-world phenomena, adaptive information processing techniques specifically designed for quaternion-valued signals have only recently come to the attention of the machine learning, signal processing, and control communities. The most important development in this direction is introduction of the HR-calculus, which provides the required mathematical foundation for deriving adaptive information processing techniques directly in the quaternion domain. In this article, the foundations of the HR-calculus are revised and the required tools for deriving adaptive learning techniques suitable for dealing with quaternion-valued signals, such as the gradient operator, chain and product derivative rules, and Taylor series expansion are presented. This serves to establish the most important applications of adaptive information processing in the quaternion domain for both single-node and multi-node formulations. The article is supported by Supplementary Material, which will be referred to as SM.  ( 2 min )
    Sinkhorn Flow: A Continuous-Time Framework for Understanding and Generalizing the Sinkhorn Algorithm. (arXiv:2311.16706v1 [cs.LG])
    Many problems in machine learning can be formulated as solving entropy-regularized optimal transport on the space of probability measures. The canonical approach involves the Sinkhorn iterates, renowned for their rich mathematical properties. Recently, the Sinkhorn algorithm has been recast within the mirror descent framework, thus benefiting from classical optimization theory insights. Here, we build upon this result by introducing a continuous-time analogue of the Sinkhorn algorithm. This perspective allows us to derive novel variants of Sinkhorn schemes that are robust to noise and bias. Moreover, our continuous-time dynamics not only generalize but also offer a unified perspective on several recently discovered dynamics in machine learning and mathematics, such as the "Wasserstein mirror flow" of (Deb et al. 2023) or the "mean-field Schr\"odinger equation" of (Claisse et al. 2023).  ( 2 min )
    Symmetry-regularized neural ordinary differential equations. (arXiv:2311.16628v1 [stat.ML])
    Neural Ordinary Differential Equations (Neural ODEs) is a class of deep neural network models that interpret the hidden state dynamics of neural networks as an ordinary differential equation, thereby capable of capturing system dynamics in a continuous time framework. In this work, I integrate symmetry regularization into Neural ODEs. In particular, I use continuous Lie symmetry of ODEs and PDEs associated with the model to derive conservation laws and add them to the loss function, making it physics-informed. This incorporation of inherent structural properties into the loss function could significantly improve robustness and stability of the model during training. To illustrate this method, I employ a toy model that utilizes a cosine rate of change in the hidden state, showcasing the process of identifying Lie symmetries, deriving conservation laws, and constructing a new loss function.  ( 2 min )
    A unified weighting framework for evaluating nearest neighbour classification. (arXiv:2311.16872v1 [cs.LG])
    We present the first comprehensive and large-scale evaluation of classical (NN), fuzzy (FNN) and fuzzy rough (FRNN) nearest neighbour classification. We show that existing proposals for nearest neighbour weighting can be standardised in the form of kernel functions, applied to the distance values and/or ranks of the nearest neighbours of a test instance. Furthermore, we identify three commonly used distance functions and four scaling measures. We systematically evaluate these choices on a collection of 85 real-life classification datasets. We find that NN, FNN and FRNN all perform best with Boscovich distance. NN and FRNN perform best with a combination of Samworth rank- and distance weights and scaling by the mean absolute deviation around the median ($r_1$), the standard deviaton ($r_2$) or the interquartile range ($r_{\infty}^*$), while FNN performs best with only Samworth distance-weights and $r_1$- or $r_2$-scaling. We also introduce a new kernel based on fuzzy Yager negation, and show that NN achieves comparable performance with Yager distance-weights, which are simpler to implement than a combination of Samworth distance- and rank-weights. Finally, we demonstrate that FRNN generally outperforms NN, which in turns performs systematically better than FNN.  ( 2 min )
    A Combinatorial Approach to Robust PCA. (arXiv:2311.16416v1 [cs.DS])
    We study the problem of recovering Gaussian data under adversarial corruptions when the noises are low-rank and the corruptions are on the coordinate level. Concretely, we assume that the Gaussian noises lie in an unknown $k$-dimensional subspace $U \subseteq \mathbb{R}^d$, and $s$ randomly chosen coordinates of each data point fall into the control of an adversary. This setting models the scenario of learning from high-dimensional yet structured data that are transmitted through a highly-noisy channel, so that the data points are unlikely to be entirely clean. Our main result is an efficient algorithm that, when $ks^2 = O(d)$, recovers every single data point up to a nearly-optimal $\ell_1$ error of $\tilde O(ks/d)$ in expectation. At the core of our proof is a new analysis of the well-known Basis Pursuit (BP) method for recovering a sparse signal, which is known to succeed under additional assumptions (e.g., incoherence or the restricted isometry property) on the underlying subspace $U$. In contrast, we present a novel approach via studying a natural combinatorial problem and show that, over the randomness in the support of the sparse signal, a high-probability error bound is possible even if the subspace $U$ is arbitrary.  ( 2 min )
    Optimal Categorical Instrumental Variables. (arXiv:2311.17021v1 [econ.EM])
    This paper discusses estimation with a categorical instrumental variable in settings with potentially few observations per category. The proposed categorical instrumental variable estimator (CIV) leverages a regularization assumption that implies existence of a latent categorical variable with fixed finite support achieving the same first stage fit as the observed instrument. In asymptotic regimes that allow the number of observations per category to grow at arbitrary small polynomial rate with the sample size, I show that when the cardinality of the support of the optimal instrument is known, CIV is root-n asymptotically normal, achieves the same asymptotic variance as the oracle IV estimator that presumes knowledge of the optimal instrument, and is semiparametrically efficient under homoskedasticity. Under-specifying the number of support points reduces efficiency but maintains asymptotic normality.  ( 2 min )
    Geometry-Aware Adaptation for Pretrained Models. (arXiv:2307.12226v2 [cs.LG] UPDATED)
    Machine learning models -- including prominent zero-shot models -- are often trained on datasets whose labels are only a small proportion of a larger label space. Such spaces are commonly equipped with a metric that relates the labels via distances between them. We propose a simple approach to exploit this information to adapt the trained model to reliably predict new classes -- or, in the case of zero-shot prediction, to improve its performance -- without any additional training. Our technique is a drop-in replacement of the standard prediction rule, swapping argmax with the Fr\'echet mean. We provide a comprehensive theoretical analysis for this approach, studying (i) learning-theoretic results trading off label space diameter, sample complexity, and model dimension, (ii) characterizations of the full range of scenarios in which it is possible to predict any unobserved class, and (iii) an optimal active learning-like next class selection procedure to obtain optimal training classes for when it is not possible to predict the entire range of unobserved classes. Empirically, using easily-available external metrics, our proposed approach, Loki, gains up to 29.7% relative improvement over SimCLR on ImageNet and scales to hundreds of thousands of classes. When no such metric is available, Loki can use self-derived metrics from class embeddings and obtains a 10.5% improvement on pretrained zero-shot models such as CLIP.  ( 2 min )
    Curve Your Enthusiasm: Concurvity Regularization in Differentiable Generalized Additive Models. (arXiv:2305.11475v3 [cs.LG] UPDATED)
    Generalized Additive Models (GAMs) have recently experienced a resurgence in popularity due to their interpretability, which arises from expressing the target value as a sum of non-linear transformations of the features. Despite the current enthusiasm for GAMs, their susceptibility to concurvity - i.e., (possibly non-linear) dependencies between the features - has hitherto been largely overlooked. Here, we demonstrate how concurvity can severly impair the interpretability of GAMs and propose a remedy: a conceptually simple, yet effective regularizer which penalizes pairwise correlations of the non-linearly transformed feature variables. This procedure is applicable to any differentiable additive model, such as Neural Additive Models or NeuralProphet, and enhances interpretability by eliminating ambiguities due to self-canceling feature contributions. We validate the effectiveness of our regularizer in experiments on synthetic as well as real-world datasets for time-series and tabular data. Our experiments show that concurvity in GAMs can be reduced without significantly compromising prediction quality, improving interpretability and reducing variance in the feature importances.  ( 2 min )
    Kernel Learning in Ridge Regression "Automatically" Yields Exact Low Rank Solution. (arXiv:2310.11736v2 [math.ST] UPDATED)
    We consider kernels of the form $(x,x') \mapsto \phi(\|x-x'\|^2_\Sigma)$ parametrized by $\Sigma$. For such kernels, we study a variant of the kernel ridge regression problem which simultaneously optimizes the prediction function and the parameter $\Sigma$ of the reproducing kernel Hilbert space. The eigenspace of the $\Sigma$ learned from this kernel ridge regression problem can inform us which directions in covariate space are important for prediction. Assuming that the covariates have nonzero explanatory power for the response only through a low dimensional subspace (central mean subspace), we find that the global minimizer of the finite sample kernel learning objective is also low rank with high probability. More precisely, the rank of the minimizing $\Sigma$ is with high probability bounded by the dimension of the central mean subspace. This phenomenon is interesting because the low rankness property is achieved without using any explicit regularization of $\Sigma$, e.g., nuclear norm penalization. Our theory makes correspondence between the observed phenomenon and the notion of low rank set identifiability from the optimization literature. The low rankness property of the finite sample solutions exists because the population kernel learning objective grows "sharply" when moving away from its minimizers in any direction perpendicular to the central mean subspace.  ( 2 min )
    Kernelized Reinforcement Learning with Order Optimal Regret Bounds. (arXiv:2306.07745v2 [cs.LG] UPDATED)
    Reinforcement learning (RL) has shown empirical success in various real world settings with complex models and large state-action spaces. The existing analytical results, however, typically focus on settings with a small number of state-actions or simple models such as linearly modeled state-action value functions. To derive RL policies that efficiently handle large state-action spaces with more general value functions, some recent works have considered nonlinear function approximation using kernel ridge regression. We propose $\pi$-KRVI, an optimistic modification of least-squares value iteration, when the state-action value function is represented by a reproducing kernel Hilbert space (RKHS). We prove the first order-optimal regret guarantees under a general setting. Our results show a significant polynomial in the number of episodes improvement over the state of the art. In particular, with highly non-smooth kernels (such as Neural Tangent kernel or some Mat\'ern kernels) the existing results lead to trivial (superlinear in the number of episodes) regret bounds. We show a sublinear regret bound that is order optimal in the case of Mat\'ern kernels where a lower bound on regret is known.  ( 2 min )
    OccamNet: A Fast Neural Model for Symbolic Regression at Scale. (arXiv:2007.10784v3 [cs.LG] UPDATED)
    Neural networks' expressiveness comes at the cost of complex, black-box models that often extrapolate poorly beyond the domain of the training dataset, conflicting with the goal of finding compact analytic expressions to describe scientific data. We introduce OccamNet, a neural network model that finds interpretable, compact, and sparse symbolic fits to data, \`a la Occam's razor. Our model defines a probability distribution over functions with efficient sampling and function evaluation. We train by sampling functions and biasing the probability mass toward better fitting solutions, backpropagating using cross-entropy matching in a reinforcement-learning loss. OccamNet can identify symbolic fits for a variety of problems, including analytic and non-analytic functions, implicit functions, and simple image classification, and can outperform state-of-the-art symbolic regression methods on real-world regression datasets. Our method requires a minimal memory footprint, fits complicated functions in minutes on a single CPU, and scales on a GPU.  ( 2 min )
    Policy Learning with Asymmetric Counterfactual Utilities. (arXiv:2206.10479v3 [stat.ML] UPDATED)
    Data-driven decision making plays an important role even in high stakes settings like medicine and public policy. Learning optimal policies from observed data requires a careful formulation of the utility function whose expected value is maximized across a population. Although researchers typically use utilities that depend on observed outcomes alone, in many settings the decision maker's utility function is more properly characterized by the joint set of potential outcomes under all actions. For example, the Hippocratic principle to "do no harm" implies that the cost of causing death to a patient who would otherwise survive without treatment is greater than the cost of forgoing life-saving treatment. We consider optimal policy learning with asymmetric counterfactual utility functions of this form that consider the joint set of potential outcomes. We show that asymmetric counterfactual utilities lead to an unidentifiable expected utility function, and so we first partially identify it. Drawing on statistical decision theory, we then derive minimax decision rules by minimizing the maximum expected utility loss relative to different alternative policies. We show that one can learn minimax loss decision rules from observed data by solving intermediate classification problems, and establish that the finite sample excess expected utility loss of this procedure is bounded by the regret of these intermediate classifiers. We apply this conceptual framework and methodology to the decision about whether or not to use right heart catheterization for patients with possible pulmonary hypertension.  ( 3 min )
    Constrained Optimization of Rank-One Functions with Indicator Variables. (arXiv:2303.18158v2 [math.OC] UPDATED)
    Optimization problems involving minimization of a rank-one convex function over constraints modeling restrictions on the support of the decision variables emerge in various machine learning applications. These problems are often modeled with indicator variables for identifying the support of the continuous variables. In this paper we investigate compact extended formulations for such problems through perspective reformulation techniques. In contrast to the majority of previous work that relies on support function arguments and disjunctive programming techniques to provide convex hull results, we propose a constructive approach that exploits a hidden conic structure induced by perspective functions. To this end, we first establish a convex hull result for a general conic mixed-binary set in which each conic constraint involves a linear function of independent continuous variables and a set of binary variables. We then demonstrate that extended representations of sets associated with epigraphs of rank-one convex functions over constraints modeling indicator relations naturally admit such a conic representation. This enables us to systematically give perspective formulations for the convex hull descriptions of these sets with nonlinear separable or non-separable objective functions, sign constraints on continuous variables, and combinatorial constraints on indicator variables. We illustrate the efficacy of our results on sparse nonnegative logistic regression problems.  ( 2 min )
    A statistical approach to latent dynamic modeling with differential equations. (arXiv:2311.16286v1 [stat.ME])
    Ordinary differential equations (ODEs) can provide mechanistic models of temporally local changes of processes, where parameters are often informed by external knowledge. While ODEs are popular in systems modeling, they are less established for statistical modeling of longitudinal cohort data, e.g., in a clinical setting. Yet, modeling of local changes could also be attractive for assessing the trajectory of an individual in a cohort in the immediate future given its current status, where ODE parameters could be informed by further characteristics of the individual. However, several hurdles so far limit such use of ODEs, as compared to regression-based function fitting approaches. The potentially higher level of noise in cohort data might be detrimental to ODEs, as the shape of the ODE solution heavily depends on the initial value. In addition, larger numbers of variables multiply such problems and might be difficult to handle for ODEs. To address this, we propose to use each observation in the course of time as the initial value to obtain multiple local ODE solutions and build a combined estimator of the underlying dynamics. Neural networks are used for obtaining a low-dimensional latent space for dynamic modeling from a potentially large number of variables, and for obtaining patient-specific ODE parameters from baseline variables. Simultaneous identification of dynamic models and of a latent space is enabled by recently developed differentiable programming techniques. We illustrate the proposed approach in an application with spinal muscular atrophy patients and a corresponding simulation study. In particular, modeling of local changes in health status at any point in time is contrasted to the interpretation of functions obtained from a global regression. This more generally highlights how different application settings might demand different modeling strategies.  ( 3 min )
    Opening the Black Box: Towards inherently interpretable energy data imputation models using building physics insight. (arXiv:2311.16632v1 [stat.ML])
    Missing data are frequently observed by practitioners and researchers in the building energy modeling community. In this regard, advanced data-driven solutions, such as Deep Learning methods, are typically required to reflect the non-linear behavior of these anomalies. As an ongoing research question related to Deep Learning, a model's applicability to limited data settings can be explored by introducing prior knowledge in the network. This same strategy can also lead to more interpretable predictions, hence facilitating the field application of the approach. For that purpose, the aim of this paper is to propose the use of Physics-informed Denoising Autoencoders (PI-DAE) for missing data imputation in commercial buildings. In particular, the presented method enforces physics-inspired soft constraints to the loss function of a Denoising Autoencoder (DAE). In order to quantify the benefits of the physical component, an ablation study between different DAE configurations is conducted. First, three univariate DAEs are optimized separately on indoor air temperature, heating, and cooling data. Then, two multivariate DAEs are derived from the previous configurations. Eventually, a building thermal balance equation is coupled to the last multivariate configuration to obtain PI-DAE. Additionally, two commonly used benchmarks are employed to support the findings. It is shown how introducing physical knowledge in a multivariate Denoising Autoencoder can enhance the inherent model interpretability through the optimized physics-based coefficients. While no significant improvement is observed in terms of reconstruction error with the proposed PI-DAE, its enhanced robustness to varying rates of missing data and the valuable insights derived from the physics-based coefficients create opportunities for wider applications within building systems and the built environment.  ( 3 min )
    Beyond Labels: Advancing Cluster Analysis with the Entropy of Distance Distribution (EDD). (arXiv:2311.16621v1 [stat.ML])
    In the evolving landscape of data science, the accurate quantification of clustering in high-dimensional data sets remains a significant challenge, especially in the absence of predefined labels. This paper introduces a novel approach, the Entropy of Distance Distribution (EDD), which represents a paradigm shift in label-free clustering analysis. Traditional methods, reliant on discrete labels, often struggle to discern intricate cluster patterns in unlabeled data. EDD, however, leverages the characteristic differences in pairwise point-to-point distances to discern clustering tendencies, independent of data labeling. Our method employs the Shannon information entropy to quantify the 'peakedness' or 'flatness' of distance distributions in a data set. This entropy measure, normalized against its maximum value, effectively distinguishes between strongly clustered data (indicated by pronounced peaks in distance distribution) and more homogeneous, non-clustered data sets. This label-free quantification is resilient against global translations and permutations of data points, and with an additional dimension-wise z-scoring, it becomes invariant to data set scaling. We demonstrate the efficacy of EDD through a series of experiments involving two-dimensional data spaces with Gaussian cluster centers. Our findings reveal a monotonic increase in the EDD value with the widening of cluster widths, moving from well-separated to overlapping clusters. This behavior underscores the method's sensitivity and accuracy in detecting varying degrees of clustering. EDD's potential extends beyond conventional clustering analysis, offering a robust, scalable tool for unraveling complex data structures without reliance on pre-assigned labels.  ( 2 min )

  • Open

    [D] how to explain why RL is difficult to someone who knows nothing about it?
    How to explain why RL is difficult to someone who knows nothing about it? I’ve been working on an RL project at work. The person who assigned it to me is a computer scientist who is not an expert on RL, but understands it’s a difficult problem. (My boss is on equal footing with the person who assigned the project to me. My boss is not a computer scientist and doesn’t know anything about RL.) This guys boss is a business manager who doesn’t know anything about RL and knows very little about ML. The business manager wants a report on how the project is going from me and I’m getting the sense that he doesn’t really understand why this is taking so long. For context, I’ve been working on this project for about 4 months for 15 hours per week. In that time, I’ve built an entire code base for the problem from scratch and programmed up several models. I have one that mostly works at the moment, but I need to make some changes to the reward functions to get it performing well consistently. I’m the only one working on this project, so I’ve done all of this myself. I also had only done vanilla RL prior to this, so I’ve had to learn a ton about deep RL to make this work. Luckily I know someone who’s an expert in deep RL (outside work) and has been able to give me pointers. I’m feeling like I’ve made a ton of progress and am nearing the home stretch in terms of having a completely polished model. However I’m getting the sense that this guy is not super thrilled with me. This guy doesn’t have any official authority over me, so this is mainly about trying to explain how much work RL is in addition to mg normal slides about the project and where I’m at. submitted by /u/savvyms [link] [comments]  ( 1 min )
    [D] Transformer Libraries/Benchmarks?
    For those of you who work with Transformers, are there any good pytorch libraries or performance benchmarks out there? Note that I am not looking for LLMs, I would just like implementations of the standard Transformer and various newer methods such as the Linear Transformer, Performer, Linformer, Reformer, etc. Ideally, I would like a library that also includes benchmarks, where various models can be compared in performance on simple to complex tasks. I would like to get into research in this space, but I don't know where to start. submitted by /u/Acciorocketships [link] [comments]  ( 1 min )
    [D] Certification/coursework recommended for traditional statisticians
    I work as a statistician in a big pharma company. Our department is filled with mid to senior level statisticians that have done more traditional statistics work for the majority of their careers. Everyone has at least an MS in statistics and 5 years of experience. People program almost entirely in SAS, but a few use R. We typically use GLMs, set up sampling plans and are responsible for planning and analyzing DoEs. Our department found extra money for training that we can have if we use it by the end of the year. I was looking for recommended training programs/certifications/coursework which would be good for this group of people that is recognized in the industry. I am looking for something that could provide an appropriate level of detail for modern machine learning techniques, as it seems the world is moving in that direction. Obviously, we're probably a bit behind in the software side, but can handle the math portions. Im hoping for outcomes anywhere between conversational to practitioner of the new techniques, depending on the level of work the statistician is willing to put into the coursework. submitted by /u/flapjaxrfun [link] [comments]  ( 1 min )
    [R] Positional description matters for transformer arithmetic
    A new paper focuses on improving arithmetic skills in LLMs, primarily GPT-like models. There are three primary challenges faced by LLMs in arithmetic: Complex Calculations: LLMs struggle with intricate arithmetic, particularly involving large numbers, leading to difficulties in internal intermediate steps. Length Limitations: LLMs are limited to handling numbers within the range of their training data, restricting practicality. Integration with Language: Merging arithmetic and natural language data encounters obstacles due to differences in surface formats, causing position-dependent representations that conflict. To address these challenges, the article introduces techniques to enhance multiplication: Padding: Number factors are padded to a fixed 15-digit length, ensuring uniformity and position invariance. Reordering: The order of digits in the product is reversed to align with the natural progression of multiplication. The outcomes are impressive. In testing, their approach achieves 99% accuracy in calculating products for numbers up to 12 digits. Simply asking GPT-4 to multiple two 4-digit numbers has an accuracy of less than 1%, by comparison. To overcome length limitations, the paper explores data formats and positional encodings, including random spacing and alternative encodings. These innovations enable LLMs to generalize addition to handle additional digits. The article also addresses the integration of arithmetic and language data by randomizing formats and using alternative positional encodings, enabling effective data integration. TLDR: As the paper title says, positional description matters for transformer arithmetic. Full summary is here. Paper is here. submitted by /u/Successful-Western27 [link] [comments]  ( 1 min )
    [R] MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
    Paper: https://arxiv.org/abs/2311.16502 Blog: https://mmmu-benchmark.github.io/ Abstract: We introduce MMMU: a new benchmark designed to evaluate multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning. MMMU includes 11.5K meticulously collected multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering. These questions span 30 subjects and 183 subfields, comprising 30 highly heterogeneous image types, such as charts, diagrams, maps, tables, music sheets, and chemical structures. Unlike existing benchmarks, MMMU focuses on advanced perception and reasoning with domainspecific knowledge, challenging models to perform tasks akin to those faced by experts. Our evaluation of 14 open-source LMMs and the proprietary GPT-4V(ision) highlights the substantial challenges posed by MMMU. Even the advanced GPT-4V only achieves a 56% accuracy, indicating significant room for improvement. We believe MMMU will stimulate the community to build next-generation multimodal foundation models towards expert artificial general intelligence. https://preview.redd.it/0k2e074fbc3c1.jpg?width=1663&format=pjpg&auto=webp&s=03c5b80bbab288919bff3c7838f65ecb79cf8174 https://preview.redd.it/g646d94fbc3c1.jpg?width=1475&format=pjpg&auto=webp&s=65f5fa706e8295d4f296186ab7f082deb5968a6d https://preview.redd.it/ueftb64fbc3c1.jpg?width=1665&format=pjpg&auto=webp&s=e3a945765372dc3fc8ae2cb387a5c48b7c5d215b https://preview.redd.it/e10k3a4fbc3c1.jpg?width=1667&format=pjpg&auto=webp&s=cbb71e4030cdb6067dae4fe86baaf8403ff99ec8 submitted by /u/Singularian2501 [link] [comments]  ( 9 min )
    Adversarial Diffusion Distillation
    submitted by /u/KarlKani44 [link] [comments]  ( 9 min )
    [P] Multi-Objective Optimization
    I have training data for 11 inputs and 2 outputs. Is there a library that can optimize the 11 inputs to minimize 1 output and maximize 1 output? submitted by /u/bsiegelwax [link] [comments]  ( 1 min )
    [R] Millions of new materials discovered with deep learning
    Post: https://deepmind.google/discover/blog/millions-of-new-materials-discovered-with-deep-learning/ Paper: https://www.nature.com/articles/s41586-023-06735-9 Abstract: Novel functional materials enable fundamental breakthroughs across technological applications from clean energy to information processing. From microchips to batteries and photovoltaics, discovery of inorganic crystals has been bottlenecked by expensive trial-and-error approaches. Concurrently, deep-learning models for language, vision and biology have showcased emergent predictive capabilities with increasing data and computation. Here we show that graph networks trained at scale can reach unprecedented levels of generalization, improving the efficiency of materials discovery by an order of magnitude. Building on 48,000 stable crystals identified in continuing studies, improved efficiency enables the discovery of 2.2 million structures below the current convex hull, many of which escaped previous human chemical intuition. Our work represents an order-of-magnitude expansion in stable materials known to humanity. Stable discoveries that are on the final convex hull will be made available to screen for technological applications, as we demonstrate for layered materials and solid-electrolyte candidates. Of the stable structures, 736 have already been independently experimentally realized. The scale and diversity of hundreds of millions of first-principles calculations also unlock modelling capabilities for downstream applications, leading in particular to highly accurate and robust learned interatomic potentials that can be used in condensed-phase molecular-dynamics simulations and high-fidelity zero-shot prediction of ionic conductivity. submitted by /u/RobbinDeBank [link] [comments]  ( 9 min )
    [P] GPU Benchmark: Stable Diffusion v1.5 on 23 consumer GPUs (generating 460,000 fancy QR codes)
    We benchmarked SD v1.5 on 23 consumer GPUs - To generate 460,000 fancy QR codes. The best performing GPU/backend combination delivered almost 20,000 images generated per dollar (512x512 resolution). You can read the full benchmark here: https://blog.salad.com/stable-diffusion-v1-5-benchmark/ Some key observations: Do not use the GTX series GPUs for production stable diffusion inference. Absolute performance and cost performance are dismal in the GTX series, and in many cases the benchmark could not be fully completed, with jobs repeatedly running out of CUDA memory. Additionally, many images generated on these GPUs came out all black, instead of as fancy QR codes as desired. There are very few surprises for which GPU is fastest for each backend. Newer GPUs with higher model numbers …  ( 1 min )
    [D] [P] Images to triangles
    Images to triangles Say you have an image, which is the input ... You now need to output a set of n triangles that will most closely replicate it, where n is fixed. How will you go about this problem today Genetic Algos have kind of tackled this problem, but takes forever or converges easily to some picture ... I tried running some preliminary things here on Neat playground here .... https://jerryjohnthomas.github.io/30pieces/ Each triangle is something takes space and color ... So try to group pixels of same color as triangle ... Like a knn thing I think we could maybe do a much better job without cnns, adding a simplifying assumption say the image indeed is made of n triangles and exact replication is possible. Is there anything else you would try. submitted by /u/Existing-Pie-3723 [link] [comments]  ( 1 min )
    [discussion] text to voice generation for textbooks (non-math part)
    i was listening to lex podcast on some stuff i study and wanted to ask, are there any good natural-enough sounding local text to voice models out there? i would very much like to use it to turn the text parts of a book into an audio where i could listen to it while reading. i used edge's tts for speech by giving a paragraph to clipboard and to edge-tts in order to listen the text but it causes two problems: 1. you need internet connection and have the book opened 2. can only do paragraph by paragraph, and is prone to errors or sometimes if you use it too much it wont convert the full text afterwards. the idea would be to turn a chapter of a book into an audio file and transfer it so that i could listen to it on my mobile phone on the fly. what is the status of offline models where they could afford to output an okay voice (or even be able to give the voice of a tutor from their lectures and train it)? submitted by /u/sweetchocolotepie [link] [comments]  ( 1 min )
    [D] Annotation Apps recommendation to draw ground truths for my project
    I am currently doing a project for my feature engineering class, the aim of the project is to detect field boundaries or roads. So for the ground truth I have used image segmentation from matlab which has various tools such as flood fill and others to fill the fields black and making the roads white. As you can see in the second image which is the ground that I have drawn, it has a lot of noise. Our professor suggested us to use iPad to draw groundtruth images. Can anyone please suggest apps for me drawn the same exact groundtruth but without noise submitted by /u/Certain-Seesaw-2274 [link] [comments]  ( 1 min )
    [D] Model that takes a reference image and generates additional images of the subject?
    Hi, I'm looking for a paper that I remember seeing and forgot to save. The paper discusses taking a single input image, e.g. with a person, and then generating additional images with the same person in other contexts. I vaguely remember a demo image of taking a painting of a woman as the reference image, then inpainting that woman into an existing image of an astronaut. I did some research and break-a-scene is the closest to what I'm talking about. But break-a-scene talks about extracting multiple tokens from the source image and I think the paper I'm looking for only talks about a single subject. And the demo image isn't there. Here are a bunch of papers that I found that do similar things that I also don't think are what I saw: https://realfill.github.io/ https://github.com/garibida/cross-image-attention https://github.com/SUDO-AI-3D/zero123plus https://omriavrahami.com/the-chosen-one/ https://omriavrahami.com/break-a-scene/ https://github.com/TencentARC/CustomNet https://damo-vilab.github.io/AnyDoor-Page/ https://github.com/OPPO-Mente-Lab/Subject-Diffusion Lmk if you know of any other papers that handles this or a related task, thanks! submitted by /u/synapticpaint [link] [comments]  ( 1 min )
    [R] Survey of Consciousness Theory from Computational Perspective
    Paper: https://arxiv.org/abs/2309.10063 Abstract: Human consciousness has been a long-lasting mystery for centuries, while machine intelligence and consciousness is an arduous pursuit. Researchers have developed diverse theories for interpreting the consciousness phenomenon in human brains from different perspectives and levels. This paper surveys several main branches of consciousness theories originating from different subjects including information theory, quantum physics, cognitive psychology, physiology and computer science, with the aim of bridging these theories from a computational perspective. It also discusses the existing evaluation metrics of consciousness and possibility for current computational models to be conscious. Breaking the mystery of consciousness can be an essential step in building general artificial intelligence with computing machines. ​ submitted by /u/APaperADay [link] [comments]  ( 1 min )
    [R] Rankitect: Ranking Architecture Search Battling World-class Engineers at Meta Scale
    Paper: https://arxiv.org/abs/2311.08430 Abstract: Neural Architecture Search (NAS) has demonstrated its efficacy in computer vision and potential for ranking systems. However, prior work focused on academic problems, which are evaluated at small scale under well-controlled fixed baselines. In industry system, such as ranking system in Meta, it is unclear whether NAS algorithms from the literature can outperform production baselines because of: (1) scale - Meta ranking systems serve billions of users, (2) strong baselines - the baselines are production models optimized by hundreds to thousands of world-class engineers for years since the rise of deep learning, (3) dynamic baselines - engineers may have established new and stronger baselines during NAS search, and (4) efficiency - the search pipeline must yield results quickly in alignment with the productionization life cycle. In this paper, we present Rankitect, a NAS software framework for ranking systems at Meta. Rankitect seeks to build brand new architectures by composing low level building blocks from scratch. Rankitect implements and improves state-of-the-art (SOTA) NAS methods for comprehensive and fair comparison under the same search space, including sampling-based NAS, one-shot NAS, and Differentiable NAS (DNAS). We evaluate Rankitect by comparing to multiple production ranking models at Meta. We find that Rankitect can discover new models from scratch achieving competitive tradeoff between Normalized Entropy loss and FLOPs. When utilizing search space designed by engineers, Rankitect can generate better models than engineers, achieving positive offline evaluation and online A/B test at Meta scale. https://preview.redd.it/n1l2fhle3b3c1.png?width=1360&format=png&auto=webp&s=772bbd4885bd87fb14461d76f8aca804b32fe1b8 submitted by /u/APaperADay [link] [comments]  ( 1 min )
    [R] Recommendations for ML/DL Workloads in Resource-Limited Environments
    Helloooo! I'm looking into the field of ML Systems, with a particular interest in executing deep learning tasks in environments with limited resources, such as commodity GPUs. Recently, I've noticed a significant focus on Large Language Models (LLMs) in contemporary research. However, I'm curious about other vital types of workloads in this domain. Could you please share your suggestions or insights on other meaningful ML/DL workloads that are crucial in resource-constrained settings? Your input would be greatly appreciated! submitted by /u/E-fazz [link] [comments]  ( 1 min )
    Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks (and why fine-tuning is easily undone)
    submitted by /u/hazardoussouth [link] [comments]  ( 1 min )
    [R] NVAutoNet
    Did anyone read the new paper from NVIDIA called NVAutoNet and knows how did they get to the specified dimension of the feature maps? Because I tried to apply the operations from the table manually and don’t get the same result as the formula. Do they use any kind of padding or if anyone can explain how the operations are applied. Thank you. submitted by /u/Neat_Cucumber_3282 [link] [comments]  ( 1 min )
    [R] "It's not just memorizing the training data" they said: Scalable Extraction of Training Data from (Production) Language Models
    submitted by /u/wojcech [link] [comments]  ( 9 min )
    MeshGPT: Generating Triangle Meshes with Decoder-Only Transformers [R]
    submitted by /u/we_are_mammals [link] [comments]  ( 1 min )
    [D] Is it legal to train on audio that is copyright infringing but not traditionally copyrighted?
    I want to train a model for a musician based on piano covers they have recorded of popular songs and therefore own, but because of copyright infringement they still pay royalties to the writers of these songs. Curious if this is legal or not -- I would presume not since the recordings are not being monetized in this scenario but rather fed into a model, and the recordings are still owned 100% by the musician licensing them to me. submitted by /u/SuperwhizAJ [link] [comments]  ( 1 min )
    [D] What is the motivation for parameter-efficient fine tuning if there's no significant reduction in runtime or GPU memory usage?
    I'm been playing around with methods such as prompt tuning and LoRA, which are parameter efficient as they only fine-tune a very small fraction (that is, 8.1GB, which is very minimal. Fine-tuning time reduction also isn't really a major advantage, with finetuning the same model reduced by 20ms per batch, from 210ms to 190ms. This begs the question - what really is the practical reason for the popularity of parameter-efficie…  ( 1 min )
  • Open

    I wanna co-found cultural a center or a learning hub gravitating around AI projects in Lisbon, Portugal. I'm aiming that it will become a conscious humane-neuronal network+AI. Which organizations can give out funding or have grants for this type of cultural projects?
    Which Artistic or Technological companies /institutions /startup incubators etc can finance us/give us grants? Some insight's a bout this Open Sandbox project: - There would be two main divisions, Creative Generative AI focused projects (image/video/fashion/tattoos/etc) and Bard or Chatgpt guided /helped geekier projects. ​ - I would like "movable" sound-dampening folding screens tall as walls (these will be enough for starting)onsite. ​ - Some kind of poles with big wheels with a monitor stand for a screen connected to cloud computing or space for a laptop, (the screen holder can be easily changed for seated or standing PC use) and a collapsible chair that you can take on the base of the pole, maybe even outside. the pole has a battery inside for the pc with a plug. ​ - there are …  ( 1 min )
    State of the AI race between big tech companies
    submitted by /u/a1a3a5a7a9 [link] [comments]  ( 1 min )
    RunwayML's new Motion Brush feature is unreal
    ​ https://reddit.com/link/186ylge/video/7r46uzdzec3c1/player ​ submitted by /u/ThatNoCodeGuy [link] [comments]  ( 1 min )
    Internet Show with AI Co-Host
    submitted by /u/codeADVanced [link] [comments]  ( 1 min )
    Two questions: what is your outlook on artificial intelligence and how close is your financial well-being tied to the industry?
    I am obviously curious if there’s a correlation submitted by /u/BoomBapBiBimBop [link] [comments]  ( 1 min )
    I’ve never seen a model that can do good pixel art until now.
    submitted by /u/katiecharm [link] [comments]  ( 1 min )
    Mother plucker: Steel fingers guided by AI pluck weeds rapidly and autonomously
    submitted by /u/SAT0725 [link] [comments]  ( 1 min )
    SDXL Turbo is crazy fast and good!!
    submitted by /u/ShooBum-T [link] [comments]  ( 1 min )
    Please correct my understanding of "memory" in LLMs
    I'm trying to understand how GPTs/LLMs work, on a conceptual level and using the correct terminology. Here's my understanding so far (please correct if I'm wrong): GPTs are pre-trained so that for any given input it spits out the statistically best matching output based on its training. It does this token by token, without "understanding" the output, just that this token is often followed by this other token. It gains this knowledge during its training, when the LLM was fed a large number of embeddings (ie its "knowledge"). A LLM can be fine-tuned after the training stage, which builds on its training data to become more accurate for a particular domain. This happens by feeding it domain-specific labelled data, and the model's parameters are modified to match the desired accuracy in the new data. Here's the bit I don't understand about "memory". Afaik, LLMs do not have long-term memory in the human sense (if I tell you I have a 6 year old son today, a year from now you would know little Billy is 7 years old). So how are these models able to answer related follow-up questions in the chat? eg "tell me a story" "make it shorter" ​ Is the application just passing the previous Q&A in the context window? Will the context window and number of tokens required just keep growing the longer the conversation proceeds? Are there architectures where the model queries some database ("select * from user_history") before answering? Is that what vector databases are used for? Or is there an architecture running a near-realtime fine-tuning of the model when the chat begins? Is that how those "speak with your PDF" apps work? Feel free to be technical - I'm a software engineer, but a noob at the AI stuff. ​ ​ submitted by /u/fartzilla21 [link] [comments]  ( 1 min )
    If your country had an animal king...
    submitted by /u/Sea_Permit5660 [link] [comments]  ( 1 min )
    "AI Titans Unleashed: Amazon Q, ChatGPT, Copilot, Google Bard - A Race for Technological Supremacy"
    submitted by /u/Fit-Code-5141 [link] [comments]  ( 1 min )
    One-Minute Daily AI News 11/28/2023
    Chinese tech giant ByteDance has established a new division, named Flow, which will focus on artificial intelligence applications.[1] European tech is starting to bounce back — and AI is turbocharging the recovery. The continent is now home to more highly skilled professionals in the field than the US, according to new research by Atomico, a VC firm headquartered in the UK.[2] Pika, which is building AI tools to generate and edit videos, raises $55M.[3] OpenAI has hired Google’s Richard Ho as the head of hardware to lead the new division to help “influence roadmaps for hardware partners for data center networks, racks, and buildings.”[4] Sources: [1] https://www.yicaiglobal.com/news/chinas-bytedance-sets-up-new-division-focusing-on-ai-applications [2] https://thenextweb.com/news/european-ai-supporting-tech-sector-recovery [3] https://finance.yahoo.com/news/pika-labs-building-ai-tools-133710041.html [4] https://baxtel.com/news/openai-hires-google-s-richard-ho-to-head-new-hardware-department submitted by /u/Excellent-Target-847 [link] [comments]  ( 1 min )
    Most AI startups are doomed
    Most AI startups are doomed because they lack defensibility and differentiation. Startups that simply glue together AI APIs and create UIs are not sustainable. Even if a startup has a better UI, competitors can easily copy it. The same logic applies to the underlying technology of AI models like ChatGPT. These models have no real moat and can be replicated by any large internet company. Building the best version of an AI model is also not sustainable because the technological frontier of the AI industry is constantly moving. The AI research community has more firepower and companies quickly adopt the global state-of-the-art. Lasting value in AI requires continuous innovation. Source : https://weightythoughts.com/p/most-ai-startups-are-doomed submitted by /u/NuseAI [link] [comments]
    10 years of NLP history explained in 50 concepts!
    Hey people, sharing a video from my YT channel where I summarized the last 10 years of NLP research, from Word2Vec to RNNs to Transformers to LLMs to Human Alignment… here’s the link for those interested… https://youtu.be/uocYQH0cWTs submitted by /u/AvvYaa [link] [comments]  ( 1 min )
    Why AI Will Save the World
    Artificial Intelligence (AI) is not a destructive force, but rather a tool that can enhance human intelligence and improve various aspects of life. The development and proliferation of AI is seen as a moral obligation and an opportunity to create a better world. However, there is currently a moral panic surrounding AI, fueled by irrational fears and exaggerated concerns. Historically, new technologies have often sparked moral panics, but these panics tend to be irrational and hinder progress. Some actors within the AI panic may have ulterior motives, using the panic to push for regulations and restrictions that benefit themselves. Approaching the conversation about AI with rationality and discernment is crucial to avoid falling into the trap of moral panic. Source : https://a16z.com/ai-will-save-the-world/ submitted by /u/NuseAI [link] [comments]  ( 1 min )
  • Open

    I would Like to use Dreamerv3 (pytorch) to train my environment
    I would Like to use Dreamerv3 (pytorch) to train my environment. my custom environment is a vector-based gym environment.import numpy as npimport os, sys, globimport gymfrom hparams import HyperParams as hpfrom hparams import RobotFrame_Continuous_Datasets_Timestep_1 as data# from hparams import RobotFrame_Continuous_Datasets_Timestep_3 as data# from hparams import RobotFrameContinuousDatasetsTimestep_0_25 as data# from hparams import RobotFrameContinuousDatasetsTimestep_0_5 as dataimport syssys.path.append('./gsoc22-socnavenv')import randomimport socnavenvimport osimport torchdef discrete_to_continuous_action(action:int):"""Function to return a continuous space action for a given discrete action"""# Adjust the possible action values to be either -0.5, 0, 0.5 or 1move_dict = {0: -0.5, 1: 0…  ( 1 min )
    [Help Needed] Project for university: Path finding trough board with holes with sensors - Seeking Advice
    Hello r/reinforcementlearning community, for our masters in product development we should implement and build a system (with real components), that is based on reinforcement learning. However, none of us have a profound understanding of RL, so we have not been able to train successfully so far, and unfortunately our professors can no longer help us and have encouraged us to ask for external help. Perhaps there is someone here who would like to take up the challenge :) We would be happy to receive new ideas or input on our code and how to proceed further. Project Overview: The agent should reach the top of the board (we like to call it cheeseboard) without falling into any holes. To support this task, he has two sensors diagonally above him on the left and right, which return whether or …
    Combining multiprocessing with CUDA
    I am a little bit of a newbie with regards to High Performance Computing. Therefore, forgive me if my question seems too trivial. ​ I am thinking of applying multiprocessing to a CUDA RL program. Right now I am trying to train an environment that takes around 9 hours to train. I'd like to play around with the hyperparameters and a few other things. Therefore, I'd like to run 10 parallel runs of my model using the same GPU. Here is what I am taking care of, after Googling - GPU Size - I understand that my GPU should be large enough to take in data from 10 runs. GPU Year - Apparently older GPUs don't allow more than a single program accessing it. Is there something else that I am missing? My assumption is that if it takes 9 hours to run a single model, then through multiprocessing I can run 10 models in 9 hours, provided the above constraints are met. submitted by /u/Academic-Rent7800 [link] [comments]  ( 1 min )
    Rankitect: Ranking Architecture Search Battling World-class Engineers at Meta Scale
    Paper: https://arxiv.org/abs/2311.08430 Abstract: Neural Architecture Search (NAS) has demonstrated its efficacy in computer vision and potential for ranking systems. However, prior work focused on academic problems, which are evaluated at small scale under well-controlled fixed baselines. In industry system, such as ranking system in Meta, it is unclear whether NAS algorithms from the literature can outperform production baselines because of: (1) scale - Meta ranking systems serve billions of users, (2) strong baselines - the baselines are production models optimized by hundreds to thousands of world-class engineers for years since the rise of deep learning, (3) dynamic baselines - engineers may have established new and stronger baselines during NAS search, and (4) efficiency - the search pipeline must yield results quickly in alignment with the productionization life cycle. In this paper, we present Rankitect, a NAS software framework for ranking systems at Meta. Rankitect seeks to build brand new architectures by composing low level building blocks from scratch. Rankitect implements and improves state-of-the-art (SOTA) NAS methods for comprehensive and fair comparison under the same search space, including sampling-based NAS, one-shot NAS, and Differentiable NAS (DNAS). We evaluate Rankitect by comparing to multiple production ranking models at Meta. We find that Rankitect can discover new models from scratch achieving competitive tradeoff between Normalized Entropy loss and FLOPs. When utilizing search space designed by engineers, Rankitect can generate better models than engineers, achieving positive offline evaluation and online A/B test at Meta scale. https://preview.redd.it/mlceky7x7b3c1.png?width=1379&format=png&auto=webp&s=2acc4e0451db9fbf455b03f9e293e68cc61d25bf submitted by /u/APaperADay [link] [comments]  ( 1 min )
    My Gym Cartpole agent learns with a running mean of around 200 but when I play using the same q table it runs only for 10 points. Please help.
    gridworld/cartpole.ipynb at main · bherwanisuraj/gridworld (github.com) Here is the link for my git repo. If someone can help would be best. submitted by /u/tlevelup [link] [comments]  ( 1 min )
    On "Q*" speculation: some relevant research background on search with LLMs & synthetic data
    submitted by /u/gwern [link] [comments]  ( 1 min )
    "Learning few-shot imitation as cultural transmission", Bhoopchand et al 2023 {DM}
    submitted by /u/gwern [link] [comments]
  • Open

    What does the future hold for generative AI?
    Rodney Brooks, co-founder of iRobot, kicks off an MIT symposium on the promise and potential pitfalls of increasingly powerful AI tools like ChatGPT.  ( 12 min )
  • Open

    Schedule Amazon SageMaker notebook jobs and manage multi-step notebook workflows using APIs
    Amazon SageMaker Studio provides a fully managed solution for data scientists to interactively build, train, and deploy machine learning (ML) models. Amazon SageMaker notebook jobs allow data scientists to run their notebooks on demand or on a schedule with a few clicks in SageMaker Studio. With this launch, you can programmatically run notebooks as jobs […]  ( 11 min )
    Announcing new tools and capabilities to enable responsible AI innovation
    The rapid growth of generative AI brings promising new innovation, and at the same time raises new challenges. These challenges include some that were common before generative AI, such as bias and explainability, and new ones unique to foundation models (FMs), including hallucination and toxicity. At AWS, we are committed to developing generative AI responsibly, […]  ( 9 min )
    Introducing the AWS Generative AI Innovation Center’s Custom Model Program for Anthropic Claude
    Since launching in June 2023, the AWS Generative AI Innovation Center team of strategists, data scientists, machine learning (ML) engineers, and solutions architects have worked with hundreds of customers worldwide, and helped them ideate, prioritize, and build bespoke solutions that harness the power of generative AI. Customers worked closely with us to prioritize use cases, […]  ( 4 min )
  • Open

    Sam Altman returns as CEO, OpenAI has a new initial board
    Mira Murati as CTO, Greg Brockman returns as President. Read messages from CEO Sam Altman and board chair Bret Taylor.  ( 5 min )
  • Open

    The Power of Prompting
    Microsoft Chief Scientific Officer Eric Horvitz explains how new prompting strategies can enable generalist large language models like GPT-4 to achieve exceptional expertise in specific domains, such as medicine, and outperform fine-tuned specialist models. The post The Power of Prompting appeared first on Microsoft Research.  ( 9 min )
  • Open

    Asymptotic Bounds for Smoothness Parameter Estimates in Gaussian Process Interpolation. (arXiv:2203.05400v5 [math.ST] UPDATED)
    It is common to model a deterministic response function, such as the output of a computer experiment, as a Gaussian process with a Mat\'ern covariance kernel. The smoothness parameter of a Mat\'ern kernel determines many important properties of the model in the large data limit, including the rate of convergence of the conditional mean to the response function. We prove that the maximum likelihood estimate of the smoothness parameter cannot asymptotically undersmooth the truth when the data are obtained on a fixed bounded subset of $\mathbb{R}^d$. That is, if the data-generating response function has Sobolev smoothness $\nu_0 > d/2$, then the smoothness parameter estimate cannot be asymptotically less than $\nu_0$. The lower bound is sharp. Additionally, we show that maximum likelihood estimation recovers the true smoothness for a class of compactly supported self-similar functions. For cross-validation we prove an asymptotic lower bound $\nu_0 - d/2$, which however is unlikely to be sharp. The results are based on approximation theory in Sobolev spaces and some general theorems that restrict the set of values that the parameter estimators can take.  ( 2 min )
    Variable Selection with the Knockoffs: Composite Null Hypotheses. (arXiv:2203.02849v4 [math.ST] UPDATED)
    The fixed-X knockoff filter is a flexible framework for variable selection with false discovery rate (FDR) control in linear models with arbitrary design matrices (of full column rank) and it allows for finite-sample selective inference via the Lasso estimates. In this paper, we extend the theory of the knockoff procedure to tests with composite null hypotheses, which are usually more relevant to real-world problems. The main technical challenge lies in handling composite nulls in tandem with dependent features from arbitrary designs. We develop two methods for composite inference with the knockoffs, namely, shifted ordinary least-squares (S-OLS) and feature-response product perturbation (FRPP), building on new structural properties of test statistics under composite nulls. We also propose two heuristic variants of S-OLS method that outperform the celebrated Benjamini-Hochberg (BH) procedure for composite nulls, which serves as a heuristic baseline under dependent test statistics. Finally, we analyze the loss in FDR when the original knockoff procedure is naively applied on composite tests.  ( 2 min )
    A Note on "Efficient Task-Specific Data Valuation for Nearest Neighbor Algorithms". (arXiv:2304.04258v2 [stat.ML] UPDATED)
    Data valuation is a growing research field that studies the influence of individual data points for machine learning (ML) models. Data Shapley, inspired by cooperative game theory and economics, is an effective method for data valuation. However, it is well-known that the Shapley value (SV) can be computationally expensive. Fortunately, Jia et al. (2019) showed that for K-Nearest Neighbors (KNN) models, the computation of Data Shapley is surprisingly simple and efficient. In this note, we revisit the work of Jia et al. (2019) and propose a more natural and interpretable utility function that better reflects the performance of KNN models. We derive the corresponding calculation procedure for the Data Shapley of KNN classifiers/regressors with the new utility functions. Our new approach, dubbed soft-label KNN-SV, achieves the same time complexity as the original method. We further provide an efficient approximation algorithm for soft-label KNN-SV based on locality sensitive hashing (LSH). Our experimental results demonstrate that Soft-label KNN-SV outperforms the original method on most datasets in the task of mislabeled data detection, making it a better baseline for future work on data valuation.  ( 2 min )
    Bias-Variance Trade-off in Physics-Informed Neural Networks with Randomized Smoothing for High-Dimensional PDEs. (arXiv:2311.15283v1 [cs.LG])
    While physics-informed neural networks (PINNs) have been proven effective for low-dimensional partial differential equations (PDEs), the computational cost remains a hurdle in high-dimensional scenarios. This is particularly pronounced when computing high-order and high-dimensional derivatives in the physics-informed loss. Randomized Smoothing PINN (RS-PINN) introduces Gaussian noise for stochastic smoothing of the original neural net model, enabling Monte Carlo methods for derivative approximation, eliminating the need for costly auto-differentiation. Despite its computational efficiency in high dimensions, RS-PINN introduces biases in both loss and gradients, negatively impacting convergence, especially when coupled with stochastic gradient descent (SGD). We present a comprehensive analysis of biases in RS-PINN, attributing them to the nonlinearity of the Mean Squared Error (MSE) loss and the PDE nonlinearity. We propose tailored bias correction techniques based on the order of PDE nonlinearity. The unbiased RS-PINN allows for a detailed examination of its pros and cons compared to the biased version. Specifically, the biased version has a lower variance and runs faster than the unbiased version, but it is less accurate due to the bias. To optimize the bias-variance trade-off, we combine the two approaches in a hybrid method that balances the rapid convergence of the biased version with the high accuracy of the unbiased version. In addition, we present an enhanced implementation of RS-PINN. Extensive experiments on diverse high-dimensional PDEs, including Fokker-Planck, HJB, viscous Burgers', Allen-Cahn, and Sine-Gordon equations, illustrate the bias-variance trade-off and highlight the effectiveness of the hybrid RS-PINN. Empirical guidelines are provided for selecting biased, unbiased, or hybrid versions, depending on the dimensionality and nonlinearity of the specific PDE problem.  ( 3 min )
    Optimal and Fair Encouragement Policy Evaluation and Learning. (arXiv:2309.07176v2 [cs.LG] UPDATED)
    In consequential domains, it is often impossible to compel individuals to take treatment, so that optimal policy rules are merely suggestions in the presence of human non-adherence to treatment recommendations. In these same domains, there may be heterogeneity both in who responds in taking-up treatment, and heterogeneity in treatment efficacy. While optimal treatment rules can maximize causal outcomes across the population, access parity constraints or other fairness considerations can be relevant in the case of encouragement. For example, in social services, a persistent puzzle is the gap in take-up of beneficial services among those who may benefit from them the most. When in addition the decision-maker has distributional preferences over both access and average outcomes, the optimal decision rule changes. We study causal identification, statistical variance-reduced estimation, and robust estimation of optimal treatment rules, including under potential violations of positivity. We consider fairness constraints such as demographic parity in treatment take-up, and other constraints, via constrained optimization. Our framework can be extended to handle algorithmic recommendations under an often-reasonable covariate-conditional exclusion restriction, using our robustness checks for lack of positivity in the recommendation. We develop a two-stage algorithm for solving over parametrized policy classes under general constraints to obtain variance-sensitive regret bounds. We illustrate the methods in two case studies based on data from randomized encouragement to enroll in insurance and from pretrial supervised release with electronic monitoring.  ( 2 min )
    Label Differential Privacy via Aggregation. (arXiv:2310.10092v3 [cs.LG] UPDATED)
    In many real-world applications, due to recent developments in the privacy landscape, training data may be aggregated to preserve the privacy of sensitive training labels. In the learning from label proportions (LLP) framework, the dataset is partitioned into bags of feature-vectors which are available only with the sum of the labels per bag. A further restriction, which we call learning from bag aggregates (LBA) is where instead of individual feature-vectors, only the (possibly weighted) sum of the feature-vectors per bag is available. We study whether such aggregation techniques can provide privacy guarantees under the notion of label differential privacy (label-DP) previously studied in for e.g. [Chaudhuri-Hsu'11, Ghazi et al.'21, Esfandiari et al.'22]. It is easily seen that naive LBA and LLP do not provide label-DP. Our main result however, shows that weighted LBA using iid Gaussian weights with $m$ randomly sampled disjoint $k$-sized bags is in fact $(\varepsilon, \delta)$-label-DP for any $\varepsilon > 0$ with $\delta \approx \exp(-\Omega(\sqrt{k}))$ assuming a lower bound on the linear-mse regression loss. Further, the $\ell_2^2$-regressor which minimizes the loss on the aggregated dataset has a loss within $\left(1 + o(1)\right)$-factor of the optimum on the original dataset w.p. $\approx 1 - exp(-\Omega(m))$. We emphasize that no additive label noise is required. The analogous weighted-LLP does not however admit label-DP. Nevertheless, we show that if additive $N(0, 1)$ noise can be added to any constant fraction of the instance labels, then the noisy weighted-LLP admits similar label-DP guarantees without assumptions on the dataset, while preserving the utility of Lipschitz-bounded neural mse-regression tasks. Our work is the first to demonstrate that label-DP can be achieved by randomly weighted aggregation for regression tasks, using no or little additive noise.  ( 3 min )
    Threshold KNN-Shapley: A Linear-Time and Privacy-Friendly Approach to Data Valuation. (arXiv:2308.15709v2 [cs.LG] UPDATED)
    Data valuation aims to quantify the usefulness of individual data sources in training machine learning (ML) models, and is a critical aspect of data-centric ML research. However, data valuation faces significant yet frequently overlooked privacy challenges despite its importance. This paper studies these challenges with a focus on KNN-Shapley, one of the most practical data valuation methods nowadays. We first emphasize the inherent privacy risks of KNN-Shapley, and demonstrate the significant technical difficulties in adapting KNN-Shapley to accommodate differential privacy (DP). To overcome these challenges, we introduce TKNN-Shapley, a refined variant of KNN-Shapley that is privacy-friendly, allowing for straightforward modifications to incorporate DP guarantee (DP-TKNN-Shapley). We show that DP-TKNN-Shapley has several advantages and offers a superior privacy-utility tradeoff compared to naively privatized KNN-Shapley in discerning data quality. Moreover, even non-private TKNN-Shapley achieves comparable performance as KNN-Shapley. Overall, our findings suggest that TKNN-Shapley is a promising alternative to KNN-Shapley, particularly for real-world applications involving sensitive data.  ( 2 min )
    Metric Space Magnitude for Evaluating Unsupervised Representation Learning. (arXiv:2311.16054v1 [cs.LG])
    The magnitude of a metric space was recently established as a novel invariant, providing a measure of the `effective size' of a space across multiple scales. By capturing both geometrical and topological properties of data, magnitude is poised to address challenges in unsupervised representation learning tasks. We formalise a novel notion of dissimilarity between magnitude functions of finite metric spaces and use them to derive a quality measure for dimensionality reduction tasks. Our measure is provably stable under perturbations of the data, can be efficiently calculated, and enables a rigorous multi-scale comparison of embeddings. We show the utility of our measure in an experimental suite that comprises different domains and tasks, including the comparison of data visualisations.  ( 2 min )
    Testing High-dimensional Multinomials with Applications to Text Analysis. (arXiv:2301.01381v2 [stat.ME] UPDATED)
    Motivated by applications in text mining and discrete distribution inference, we investigate the testing for equality of probability mass functions of $K$ groups of high-dimensional multinomial distributions. A test statistic, which is shown to have an asymptotic standard normal distribution under the null, is proposed. The optimal detection boundary is established, and the proposed test is shown to achieve this optimal detection boundary across the entire parameter space of interest. The proposed method is demonstrated in simulation studies and applied to analyze two real-world datasets to examine variation among consumer reviews of Amazon movies and diversity of statistical paper abstracts.  ( 2 min )
    Elastic Shape Analysis of Tree-like 3D Objects using Extended SRVF Representation. (arXiv:2110.08693v4 [cs.LG] UPDATED)
    How can one analyze detailed 3D biological objects, such as neurons and botanical trees, that exhibit complex geometrical and topological variation? In this paper, we develop a novel mathematical framework for representing, comparing, and computing geodesic deformations between the shapes of such tree-like 3D objects. A hierarchical organization of subtrees characterizes these objects -- each subtree has the main branch with some side branches attached -- and one needs to match these structures across objects for meaningful comparisons. We propose a novel representation that extends the Square-Root Velocity Function (SRVF), initially developed for Euclidean curves, to tree-shaped 3D objects. We then define a new metric that quantifies the bending, stretching, and branch sliding needed to deform one tree-shaped object into the other. Compared to the current metrics, such as the Quotient Euclidean Distance (QED) and the Tree Edit Distance (TED), the proposed representation and metric capture the full elasticity of the branches (i.e., bending and stretching) as well as the topological variations (i.e., branch death/birth and sliding). It completely avoids the shrinkage that results from the edge collapse and node split operations of the QED and TED metrics. We demonstrate the utility of this framework in comparing, matching, and computing geodesics between biological objects such as neurons and botanical trees. The framework is also applied to various shape analysis tasks: (i) symmetry analysis and symmetrization of tree-shaped 3D objects, (ii) computing summary statistics (means and modes of variations) of populations of tree-shaped 3D objects, (iii) fitting parametric probability distributions to such populations, and (iv) finally synthesizing novel tree-shaped 3D objects through random sampling from estimated probability distributions.  ( 3 min )
    A review of ensemble learning and data augmentation models for class imbalanced problems: combination, implementation and evaluation. (arXiv:2304.02858v3 [cs.LG] UPDATED)
    Class imbalance (CI) in classification problems arises when the number of observations belonging to one class is lower than the other. Ensemble learning combines multiple models to obtain a robust model and has been prominently used with data augmentation methods to address class imbalance problems. In the last decade, a number of strategies have been added to enhance ensemble learning and data augmentation methods, along with new methods such as generative adversarial networks (GANs). A combination of these has been applied in many studies, and the evaluation of different combinations would enable a better understanding and guidance for different application domains. In this paper, we present a computational study to evaluate data augmentation and ensemble learning methods used to address prominent benchmark CI problems. We present a general framework that evaluates 9 data augmentation and 9 ensemble learning methods for CI problems. Our objective is to identify the most effective combination for improving classification performance on imbalanced datasets. The results indicate that combinations of data augmentation methods with ensemble learning can significantly improve classification performance on imbalanced datasets. We find that traditional data augmentation methods such as the synthetic minority oversampling technique (SMOTE) and random oversampling (ROS) are not only better in performance for selected CI problems, but also computationally less expensive than GANs. Our study is vital for the development of novel models for handling imbalanced datasets.  ( 3 min )
    AutoWS-Bench-101: Benchmarking Automated Weak Supervision with 100 Labels. (arXiv:2208.14362v2 [cs.LG] UPDATED)
    Weak supervision (WS) is a powerful method to build labeled datasets for training supervised models in the face of little-to-no labeled data. It replaces hand-labeling data with aggregating multiple noisy-but-cheap label estimates expressed by labeling functions (LFs). While it has been used successfully in many domains, weak supervision's application scope is limited by the difficulty of constructing labeling functions for domains with complex or high-dimensional features. To address this, a handful of methods have proposed automating the LF design process using a small set of ground truth labels. In this work, we introduce AutoWS-Bench-101: a framework for evaluating automated WS (AutoWS) techniques in challenging WS settings -- a set of diverse application domains on which it has been previously difficult or impossible to apply traditional WS techniques. While AutoWS is a promising direction toward expanding the application-scope of WS, the emergence of powerful methods such as zero-shot foundation models reveals the need to understand how AutoWS techniques compare or cooperate with modern zero-shot or few-shot learners. This informs the central question of AutoWS-Bench-101: given an initial set of 100 labels for each task, we ask whether a practitioner should use an AutoWS method to generate additional labels or use some simpler baseline, such as zero-shot predictions from a foundation model or supervised learning. We observe that in many settings, it is necessary for AutoWS methods to incorporate signal from foundation models if they are to outperform simple few-shot baselines, and AutoWS-Bench-101 promotes future research in this direction. We conclude with a thorough ablation study of AutoWS methods.  ( 3 min )
    Applying statistical learning theory to deep learning. (arXiv:2311.15404v1 [cs.LG])
    Although statistical learning theory provides a robust framework to understand supervised learning, many theoretical aspects of deep learning remain unclear, in particular how different architectures may lead to inductive bias when trained using gradient based methods. The goal of these lectures is to provide an overview of some of the main questions that arise when attempting to understand deep learning from a learning theory perspective. After a brief reminder on statistical learning theory and stochastic optimization, we discuss implicit bias in the context of benign overfitting. We then move to a general description of the mirror descent algorithm, showing how we may go back and forth between a parameter space and the corresponding function space for a given learning problem, as well as how the geometry of the learning problem may be represented by a metric tensor. Building on this framework, we provide a detailed study of the implicit bias of gradient descent on linear diagonal networks for various regression tasks, showing how the loss function, scale of parameters at initialization and depth of the network may lead to various forms of implicit bias, in particular transitioning between kernel or feature learning.  ( 2 min )
    Closing the ODE-SDE gap in score-based diffusion models through the Fokker-Planck equation. (arXiv:2311.15996v1 [cs.LG])
    Score-based diffusion models have emerged as one of the most promising frameworks for deep generative modelling, due to their state-of-the art performance in many generation tasks while relying on mathematical foundations such as stochastic differential equations (SDEs) and ordinary differential equations (ODEs). Empirically, it has been reported that ODE based samples are inferior to SDE based samples. In this paper we rigorously describe the range of dynamics and approximations that arise when training score-based diffusion models, including the true SDE dynamics, the neural approximations, the various approximate particle dynamics that result, as well as their associated Fokker--Planck equations and the neural network approximations of these Fokker--Planck equations. We systematically analyse the difference between the ODE and SDE dynamics of score-based diffusion models, and link it to an associated Fokker--Planck equation. We derive a theoretical upper bound on the Wasserstein 2-distance between the ODE- and SDE-induced distributions in terms of a Fokker--Planck residual. We also show numerically that conventional score-based diffusion models can exhibit significant differences between ODE- and SDE-induced distributions which we demonstrate using explicit comparisons. Moreover, we show numerically that reducing the Fokker--Planck residual by adding it as an additional regularisation term leads to closing the gap between ODE- and SDE-induced distributions. Our experiments suggest that this regularisation can improve the distribution generated by the ODE, however that this can come at the cost of degraded SDE sample quality.  ( 3 min )
    Bridging Adversarial and Nonstationary Multi-armed Bandit. (arXiv:2201.01628v3 [cs.LG] UPDATED)
    In the multi-armed bandit framework, there are two formulations that are commonly employed to handle time-varying reward distributions: adversarial bandit and nonstationary bandit. Although their oracles, algorithms, and regret analysis differ significantly, we provide a unified formulation in this paper that smoothly bridges the two as special cases. The formulation uses an oracle that takes the best-fixed arm within time windows. Depending on the window size, it turns into the oracle in hindsight in the adversarial bandit and dynamic oracle in the nonstationary bandit. We provide algorithms that attain the optimal regret with the matching lower bound.  ( 2 min )
    Environment Invariant Linear Least Squares. (arXiv:2303.03092v2 [math.ST] UPDATED)
    This paper considers a multi-environment linear regression model in which data from multiple experimental settings are collected. The joint distribution of the response variable and covariates may vary across different environments, yet the conditional expectations of $y$ given the unknown set of important variables are invariant. Such a statistical model is related to the problem of endogeneity, causal inference, and transfer learning. The motivation behind it is illustrated by how the goals of prediction and attribution are inherent in estimating the true parameter and the important variable set. We construct a novel environment invariant linear least squares (EILLS) objective function, a multi-environment version of linear least-squares regression that leverages the above conditional expectation invariance structure and heterogeneity among different environments to determine the true parameter. Our proposed method is applicable without any additional structural knowledge and can identify the true parameter under a near-minimal identification condition. We establish non-asymptotic $\ell_2$ error bounds on the estimation error for the EILLS estimator in the presence of spurious variables. Moreover, we further show that the $\ell_0$ penalized EILLS estimator can achieve variable selection consistency in high-dimensional regimes. These non-asymptotic results demonstrate the sample efficiency of the EILLS estimator and its capability to circumvent the curse of endogeneity in an algorithmic manner without any prior structural knowledge. To the best of our knowledge, this paper is the first to realize statistically efficient invariance learning in the general linear model.  ( 3 min )
    Universal Event Detection in Time Series. (arXiv:2311.15654v1 [stat.ML])
    In our previously published work, we introduced a supervised deep learning method for event detection in multivariate time series data, employing regression instead of binary classification. This simplification avoids the need for point-wise labels throughout the entire dataset, relying solely on ground truth events defined as time points or intervals. In this paper, we establish mathematically that our method is universal, and capable of detecting any type of event with arbitrary precision under mild continuity assumptions on the time series. These events may encompass change points, frauds, anomalies, physical occurrences, and more. We substantiate our theoretical results using the universal approximation theorem for feed-forward neural networks (FFN). Additionally, we provide empirical validations that confirm our claims, demonstrating that our method, with a limited number of parameters, outperforms other deep learning approaches, particularly for rare events and imbalanced datasets from different domains.  ( 2 min )
    Large Catapults in Momentum Gradient Descent with Warmup: An Empirical Study. (arXiv:2311.15051v1 [cs.LG])
    Although gradient descent with momentum is widely used in modern deep learning, a concrete understanding of its effects on the training trajectory still remains elusive. In this work, we empirically show that momentum gradient descent with a large learning rate and learning rate warmup displays large catapults, driving the iterates towards flatter minima than those found by gradient descent. We then provide empirical evidence and theoretical intuition that the large catapult is caused by momentum "amplifying" the self-stabilization effect (Damian et al., 2023).  ( 2 min )
    A Nearly Optimal and Low-Switching Algorithm for Reinforcement Learning with General Function Approximation. (arXiv:2311.15238v1 [cs.LG])
    The exploration-exploitation dilemma has been a central challenge in reinforcement learning (RL) with complex model classes. In this paper, we propose a new algorithm, Monotonic Q-Learning with Upper Confidence Bound (MQL-UCB) for RL with general function approximation. Our key algorithmic design includes (1) a general deterministic policy-switching strategy that achieves low switching cost, (2) a monotonic value function structure with carefully controlled function class complexity, and (3) a variance-weighted regression scheme that exploits historical trajectories with high data efficiency. MQL-UCB achieves minimax optimal regret of $\tilde{O}(d\sqrt{HK})$ when $K$ is sufficiently large and near-optimal policy switching cost of $\tilde{O}(dH)$, with $d$ being the eluder dimension of the function class, $H$ being the planning horizon, and $K$ being the number of episodes. Our work sheds light on designing provably sample-efficient and deployment-efficient Q-learning with nonlinear function approximation.  ( 2 min )
    Multi-fidelity Constrained Optimization for Stochastic Black Box Simulators. (arXiv:2311.15137v1 [math.OC])
    Constrained optimization of the parameters of a simulator plays a crucial role in a design process. These problems become challenging when the simulator is stochastic, computationally expensive, and the parameter space is high-dimensional. One can efficiently perform optimization only by utilizing the gradient with respect to the parameters, but these gradients are unavailable in many legacy, black-box codes. We introduce the algorithm Scout-Nd (Stochastic Constrained Optimization for N dimensions) to tackle the issues mentioned earlier by efficiently estimating the gradient, reducing the noise of the gradient estimator, and applying multi-fidelity schemes to further reduce computational effort. We validate our approach on standard benchmarks, demonstrating its effectiveness in optimizing parameters highlighting better performance compared to existing methods.  ( 2 min )
    Decoding trust: A reinforcement learning perspective. (arXiv:2309.14598v2 [q-bio.PE] UPDATED)
    Behavioral experiments on the trust game have shown that trust and trustworthiness are universal among human beings, contradicting the prediction by assuming \emph{Homo economicus} in orthodox Economics. This means some mechanism must be at work that favors their emergence. Most previous explanations however need to resort to some factors based upon imitative learning, a simple version of social learning. Here, we turn to the paradigm of reinforcement learning, where individuals update their strategies by evaluating the long-term return through accumulated experience. Specifically, we investigate the trust game with the Q-learning algorithm, where each participant is associated with two evolving Q-tables that guide one's decision making as trustor and trustee respectively. In the pairwise scenario, we reveal that high levels of trust and trustworthiness emerge when individuals appreciate both their historical experience and returns in the future. Mechanistically, the evolution of the Q-tables shows a crossover that resembles human's psychological changes. We also provide the phase diagram for the game parameters, where the boundary analysis is conducted. These findings are robust when the scenario is extended to a latticed population. Our results thus provide a natural explanation for the emergence of trust and trustworthiness without external factors involved. More importantly, the proposed paradigm shows the potential in deciphering many puzzles in human behaviors.  ( 3 min )
    Energy Discrepancies: A Score-Independent Loss for Energy-Based Models. (arXiv:2307.06431v2 [stat.ML] UPDATED)
    Energy-based models are a simple yet powerful class of probabilistic models, but their widespread adoption has been limited by the computational burden of training them. We propose a novel loss function called Energy Discrepancy (ED) which does not rely on the computation of scores or expensive Markov chain Monte Carlo. We show that ED approaches the explicit score matching and negative log-likelihood loss under different limits, effectively interpolating between both. Consequently, minimum ED estimation overcomes the problem of nearsightedness encountered in score-based estimation methods, while also enjoying theoretical guarantees. Through numerical experiments, we demonstrate that ED learns low-dimensional data distributions faster and more accurately than explicit score matching or contrastive divergence. For high-dimensional image data, we describe how the manifold hypothesis puts limitations on our approach and demonstrate the effectiveness of energy discrepancy by training the energy-based model as a prior of a variational decoder model.  ( 2 min )
    Online Estimation and Optimization of Utility-Based Shortfall Risk. (arXiv:2111.08805v3 [stat.ML] UPDATED)
    Utility-Based Shortfall Risk (UBSR) is a risk metric that is increasingly popular in financial applications, owing to certain desirable properties that it enjoys. We consider the problem of estimating UBSR in a recursive setting, where samples from the underlying loss distribution are available one-at-a-time. We cast the UBSR estimation problem as a root finding problem, and propose stochastic approximation-based estimations schemes. We derive non-asymptotic bounds on the estimation error in the number of samples. We also consider the problem of UBSR optimization within a parameterized class of random variables. We propose a stochastic gradient descent based algorithm for UBSR optimization, and derive non-asymptotic bounds on its convergence.  ( 2 min )
    Dimensionality Reduction and Wasserstein Stability for Kernel Regression. (arXiv:2203.09347v3 [stat.ML] UPDATED)
    In a high-dimensional regression framework, we study consequences of the naive two-step procedure where first the dimension of the input variables is reduced and second, the reduced input variables are used to predict the output variable with kernel regression. In order to analyze the resulting regression errors, a novel stability result for kernel regression with respect to the Wasserstein distance is derived. This allows us to bound errors that occur when perturbed input data is used to fit the regression function. We apply the general stability result to principal component analysis (PCA). Exploiting known estimates from the literature on both principal component analysis and kernel regression, we deduce convergence rates for the two-step procedure. The latter turns out to be particularly useful in a semi-supervised setting.  ( 2 min )
    Bayesian Approach to Linear Bayesian Networks. (arXiv:2311.15610v1 [stat.ML])
    This study proposes the first Bayesian approach for learning high-dimensional linear Bayesian networks. The proposed approach iteratively estimates each element of the topological ordering from backward and its parent using the inverse of a partial covariance matrix. The proposed method successfully recovers the underlying structure when Bayesian regularization for the inverse covariance matrix with unequal shrinkage is applied. Specifically, it shows that the number of samples $n = \Omega( d_M^2 \log p)$ and $n = \Omega(d_M^2 p^{2/m})$ are sufficient for the proposed algorithm to learn linear Bayesian networks with sub-Gaussian and 4m-th bounded-moment error distributions, respectively, where $p$ is the number of nodes and $d_M$ is the maximum degree of the moralized graph. The theoretical findings are supported by extensive simulation studies including real data analysis. Furthermore the proposed method is demonstrated to outperform state-of-the-art frequentist approaches, such as the BHLSM, LISTEN, and TD algorithms in synthetic data.  ( 2 min )
    On the near-optimality of betting confidence sets for bounded means. (arXiv:2310.01547v2 [math.ST] UPDATED)
    Constructing nonasymptotic confidence intervals (CIs) for the mean of a univariate distribution from independent and identically distributed (i.i.d.) observations is a fundamental task in statistics. For bounded observations, a classical nonparametric approach proceeds by inverting standard concentration bounds, such as Hoeffding's or Bernstein's inequalities. Recently, an alternative betting-based approach for defining CIs and their time-uniform variants called confidence sequences (CSs), has been shown to be empirically superior to the classical methods. In this paper, we provide theoretical justification for this improved empirical performance of betting CIs and CSs. Our main contributions are as follows: (i) We first compare CIs using the values of their first-order asymptotic widths (scaled by $\sqrt{n}$), and show that the betting CI of Waudby-Smith and Ramdas (2023) has a smaller limiting width than existing empirical Bernstein (EB)-CIs. (ii) Next, we establish two lower bounds that characterize the minimum width achievable by any method for constructing CIs/CSs in terms of certain inverse information projections. (iii) Finally, we show that the betting CI and CS match the fundamental limits, modulo an additive logarithmic term and a multiplicative constant. Overall these results imply that the betting CI~(and CS) admit stronger theoretical guarantees than the existing state-of-the-art EB-CI~(and CS); both in the asymptotic and finite-sample regimes.  ( 2 min )
    Fantastic Generalization Measures are Nowhere to be Found. (arXiv:2309.13658v2 [cs.LG] UPDATED)
    We study the notion of a generalization bound being uniformly tight, meaning that the difference between the bound and the population loss is small for all learning algorithms and all population distributions. Numerous generalization bounds have been proposed in the literature as potential explanations for the ability of neural networks to generalize in the overparameterized setting. However, in their paper ``Fantastic Generalization Measures and Where to Find Them,'' Jiang et al. (2020) examine more than a dozen generalization bounds, and show empirically that none of them are uniformly tight. This raises the question of whether uniformly-tight generalization bounds are at all possible in the overparameterized setting. We consider two types of generalization bounds: (1) bounds that may depend on the training set and the learned hypothesis (e.g., margin bounds). We prove mathematically that no such bound can be uniformly tight in the overparameterized setting; (2) bounds that may in addition also depend on the learning algorithm (e.g., stability bounds). For these bounds, we show a trade-off between the algorithm's performance and the bound's tightness. Namely, if the algorithm achieves good accuracy on certain distributions, then no generalization bound can be uniformly tight for it in the overparameterized setting. We explain how these formal results can, in our view, inform research on generalization bounds for neural networks, while stressing that other interpretations of these results are also possible.  ( 3 min )
    Augmentation Invariant Manifold Learning. (arXiv:2211.00460v2 [stat.ML] UPDATED)
    Data augmentation is a widely used technique and an essential ingredient in the recent advance in self-supervised representation learning. By preserving the similarity between augmented data, the resulting data representation can improve various downstream analyses and achieve state-of-the-art performance in many applications. Despite the empirical effectiveness, most existing methods lack theoretical understanding under a general nonlinear setting. To fill this gap, we develop a statistical framework on a low-dimension product manifold to model the data augmentation transformation. Under this framework, we introduce a new representation learning method called augmentation invariant manifold learning and design a computationally efficient algorithm by reformulating it as a stochastic optimization problem. Compared with existing self-supervised methods, the new method simultaneously exploits the manifold's geometric structure and invariant property of augmented data and has an explicit theoretical guarantee. Our theoretical investigation characterizes the role of data augmentation in the proposed method and reveals why and how the data representation learned from augmented data can improve the $k$-nearest neighbor classifier in the downstream analysis, showing that a more complex data augmentation leads to more improvement in downstream analysis. Finally, numerical experiments on simulated and real datasets are presented to demonstrate the merit of the proposed method.  ( 2 min )
    Assessing Deep Neural Networks as Probability Estimators. (arXiv:2111.08239v2 [cs.LG] UPDATED)
    Deep Neural Networks (DNNs) have performed admirably in classification tasks. However, the characterization of their classification uncertainties, required for certain applications, has been lacking. In this work, we investigate the issue by assessing DNNs' ability to estimate conditional probabilities and propose a framework for systematic uncertainty characterization. Denoting the input sample as x and the category as y, the classification task of assigning a category y to a given input x can be reduced to the task of estimating the conditional probabilities p(y|x), as approximated by the DNN at its last layer using the softmax function. Since softmax yields a vector whose elements all fall in the interval (0, 1) and sum to 1, it suggests a probabilistic interpretation to the DNN's outcome. Using synthetic and real-world datasets, we look into the impact of various factors, e.g., probability density f(x) and inter-categorical sparsity, on the precision of DNNs' estimations of p(y|x), and find that the likelihood probability density and the inter-categorical sparsity have greater impacts than the prior probability to DNNs' classification uncertainty.
    A Neural Framework for Generalized Causal Sensitivity Analysis. (arXiv:2311.16026v1 [cs.LG])
    Unobserved confounding is common in many applications, making causal inference from observational data challenging. As a remedy, causal sensitivity analysis is an important tool to draw causal conclusions under unobserved confounding with mathematical guarantees. In this paper, we propose NeuralCSA, a neural framework for generalized causal sensitivity analysis. Unlike previous work, our framework is compatible with (i) a large class of sensitivity models, including the marginal sensitivity model, f-sensitivity models, and Rosenbaum's sensitivity model; (ii) different treatment types (i.e., binary and continuous); and (iii) different causal queries, including (conditional) average treatment effects and simultaneous effects on multiple outcomes. The generality of \frameworkname is achieved by learning a latent distribution shift that corresponds to a treatment intervention using two conditional normalizing flows. We provide theoretical guarantees that NeuralCSA is able to infer valid bounds on the causal query of interest and also demonstrate this empirically using both simulated and real-world data.  ( 2 min )
    DIVA: A Dirichlet Process Mixtures Based Incremental Deep Clustering Algorithm via Variational Auto-Encoder. (arXiv:2305.14067v3 [cs.LG] UPDATED)
    Generative model-based deep clustering frameworks excel in classifying complex data, but are limited in handling dynamic and complex features because they require prior knowledge of the number of clusters. In this paper, we propose a nonparametric deep clustering framework that employs an infinite mixture of Gaussians as a prior. Our framework utilizes a memoized online variational inference method that enables the "birth" and "merge" moves of clusters, allowing our framework to cluster data in a "dynamic-adaptive" manner, without requiring prior knowledge of the number of features. We name the framework as DIVA, a Dirichlet Process-based Incremental deep clustering framework via Variational Auto-Encoder. Our framework, which outperforms state-of-the-art baselines, exhibits superior performance in classifying complex data with dynamically changing features, particularly in the case of incremental features. We released our source code implementation at: https://github.com/Ghiara/diva
    Should We Learn Most Likely Functions or Parameters?. (arXiv:2311.15990v1 [cs.LG])
    Standard regularized training procedures correspond to maximizing a posterior distribution over parameters, known as maximum a posteriori (MAP) estimation. However, model parameters are of interest only insomuch as they combine with the functional form of a model to provide a function that can make good predictions. Moreover, the most likely parameters under the parameter posterior do not generally correspond to the most likely function induced by the parameter posterior. In fact, we can re-parametrize a model such that any setting of parameters can maximize the parameter posterior. As an alternative, we investigate the benefits and drawbacks of directly estimating the most likely function implied by the model and the data. We show that this procedure leads to pathological solutions when using neural networks and prove conditions under which the procedure is well-behaved, as well as a scalable approximation. Under these conditions, we find that function-space MAP estimation can lead to flatter minima, better generalization, and improved robustness to overfitting.  ( 2 min )
    Minimizing robust density power-based divergences for general parametric density models. (arXiv:2307.05251v3 [stat.ME] UPDATED)
    Density power divergence (DPD) is designed to robustly estimate the underlying distribution of observations, in the presence of outliers. However, DPD involves an integral of the power of the parametric density models to be estimated; the explicit form of the integral term can be derived only for specific densities, such as normal and exponential densities. While we may perform a numerical integration for each iteration of the optimization algorithms, the computational complexity has hindered the practical application of DPD-based estimation to more general parametric densities. To address the issue, this study introduces a stochastic approach to minimize DPD for general parametric density models. The proposed approach also can be employed to minimize other density power-based $\gamma$-divergences, by leveraging unnormalized models.  ( 2 min )
    Instance-Specific Asymmetric Sensitivity in Differential Privacy. (arXiv:2311.14681v1 [cs.DS])
    We provide a new algorithmic framework for differentially private estimation of general functions that adapts to the hardness of the underlying dataset. We build upon previous work that gives a paradigm for selecting an output through the exponential mechanism based upon closeness of the inverse to the underlying dataset, termed the inverse sensitivity mechanism. Our framework will slightly modify the closeness metric and instead give a simple and efficient application of the sparse vector technique. While the inverse sensitivity mechanism was shown to be instance optimal, it was only with respect to a class of unbiased mechanisms such that the most likely outcome matches the underlying data. We break this assumption in order to more naturally navigate the bias-variance tradeoff, which will also critically allow for extending our method to unbounded data. In consideration of this tradeoff, we provide strong intuition and empirical validation that our technique will be particularly effective when the distances to the underlying dataset are asymmetric. This asymmetry is inherent to a range of important problems including fundamental statistics such as variance, as well as commonly used machine learning performance metrics for both classification and regression tasks. We efficiently instantiate our method in $O(n)$ time for these problems and empirically show that our techniques will give substantially improved differentially private estimations.  ( 2 min )
    Optimal Clustering of Discrete Mixtures: Binomial, Poisson, Block Models, and Multi-layer Networks. (arXiv:2311.15598v1 [math.ST])
    In this paper, we first study the fundamental limit of clustering networks when a multi-layer network is present. Under the mixture multi-layer stochastic block model (MMSBM), we show that the minimax optimal network clustering error rate, which takes an exponential form and is characterized by the Renyi divergence between the edge probability distributions of the component networks. We propose a novel two-stage network clustering method including a tensor-based initialization algorithm involving both node and sample splitting and a refinement procedure by likelihood-based Lloyd algorithm. Network clustering must be accompanied by node community detection. Our proposed algorithm achieves the minimax optimal network clustering error rate and allows extreme network sparsity under MMSBM. Numerical simulations and real data experiments both validate that our method outperforms existing methods. Oftentimes, the edges of networks carry count-type weights. We then extend our methodology and analysis framework to study the minimax optimal clustering error rate for mixture of discrete distributions including Binomial, Poisson, and multi-layer Poisson networks. The minimax optimal clustering error rates in these discrete mixtures all take the same exponential form characterized by the Renyi divergences. These optimal clustering error rates in discrete mixtures can also be achieved by our proposed two-stage clustering algorithm.  ( 2 min )
    Reducing sequential change detection to sequential estimation. (arXiv:2309.09111v2 [math.ST] UPDATED)
    We consider the problem of sequential change detection, where the goal is to design a scheme for detecting any changes in a parameter or functional $\theta$ of the data stream distribution that has small detection delay, but guarantees control on the frequency of false alarms in the absence of changes. In this paper, we describe a simple reduction from sequential change detection to sequential estimation using confidence sequences: we begin a new $(1-\alpha)$-confidence sequence at each time step, and proclaim a change when the intersection of all active confidence sequences becomes empty. We prove that the average run length is at least $1/\alpha$, resulting in a change detection scheme with minimal structural assumptions~(thus allowing for possibly dependent observations, and nonparametric distribution classes), but strong guarantees. Our approach bears an interesting parallel with the reduction from change detection to sequential testing of Lorden (1971) and the e-detector of Shin et al. (2022).  ( 2 min )
    Ergodicity of the underdamped mean-field Langevin dynamics. (arXiv:2007.14660v3 [math.PR] UPDATED)
    We study the long time behavior of an underdamped mean-field Langevin (MFL) equation, and provide a general convergence as well as an exponential convergence rate result under different conditions. The results on the MFL equation can be applied to study the convergence of the Hamiltonian gradient descent algorithm for the overparametrized optimization. We then provide a numerical example of the algorithm to train a generative adversarial networks (GAN).  ( 2 min )
    Maximum Likelihood Estimation is All You Need for Well-Specified Covariate Shift. (arXiv:2311.15961v1 [stat.ML])
    A key challenge of modern machine learning systems is to achieve Out-of-Distribution (OOD) generalization -- generalizing to target data whose distribution differs from that of source data. Despite its significant importance, the fundamental question of ``what are the most effective algorithms for OOD generalization'' remains open even under the standard setting of covariate shift. This paper addresses this fundamental question by proving that, surprisingly, classical Maximum Likelihood Estimation (MLE) purely using source data (without any modification) achieves the minimax optimality for covariate shift under the well-specified setting. That is, no algorithm performs better than MLE in this setting (up to a constant factor), justifying MLE is all you need. Our result holds for a very rich class of parametric models, and does not require any boundedness condition on the density ratio. We illustrate the wide applicability of our framework by instantiating it to three concrete examples -- linear regression, logistic regression, and phase retrieval. This paper further complement the study by proving that, under the misspecified setting, MLE is no longer the optimal choice, whereas Maximum Weighted Likelihood Estimator (MWLE) emerges as minimax optimal in certain scenarios.
    The Lipschitz-Variance-Margin Tradeoff for Enhanced Randomized Smoothing. (arXiv:2309.16883v2 [cs.LG] UPDATED)
    Real-life applications of deep neural networks are hindered by their unsteady predictions when faced with noisy inputs and adversarial attacks. The certified radius is in this context a crucial indicator of the robustness of models. However how to design an efficient classifier with a sufficient certified radius? Randomized smoothing provides a promising framework by relying on noise injection in inputs to obtain a smoothed and more robust classifier. In this paper, we first show that the variance introduced by randomized smoothing closely interacts with two other important properties of the classifier, \textit{i.e.} its Lipschitz constant and margin. More precisely, our work emphasizes the dual impact of the Lipschitz constant of the base classifier, on both the smoothed classifier and the empirical variance. Moreover, to increase the certified robust radius, we introduce a different simplex projection technique for the base classifier to leverage the variance-margin trade-off thanks to Bernstein's concentration inequality, along with an enhanced Lipschitz bound. Experimental results show a significant improvement in certified accuracy compared to current state-of-the-art methods. Our novel certification procedure allows us to use pre-trained models that are used with randomized smoothing, effectively improving the current certification radius in a zero-shot manner.
    Deep Backtracking Counterfactuals for Causally Compliant Explanations. (arXiv:2310.07665v2 [cs.AI] UPDATED)
    Counterfactuals can offer valuable insights by answering what would have been observed under altered circumstances, conditional on a factual observation. Whereas the classical interventional interpretation of counterfactuals has been studied extensively, backtracking constitutes a less studied alternative the backtracking principle has emerged as an alternative philosophy where all causal laws are kept intact. In the present work, we introduce a practical method for computing backtracking counterfactuals in structural causal models that consist of deep generative components. To this end, we impose conditions on the structural assignments that enable the generation of counterfactuals by solving a tractable constrained optimization problem in the structured latent space of a causal model. Our formulation also facilitates a comparison with methods in the field of counterfactual explanations. Compared to these, our method represents a versatile, modular and causally compliant alternative. We demonstrate these properties experimentally on a modified version of MNIST and CelebA.  ( 2 min )
    Global $\mathcal{L}^2$ minimization with certainty via geometrically adapted gradient descent in Deep Learning. (arXiv:2311.15487v1 [cs.LG])
    We consider the gradient descent flow widely used for the minimization of the $\mathcal{L}^2$ cost function in Deep Learning networks, and introduce two modified versions; one adapted for the overparametrized setting, and the other for the underparametrized setting. Both have a clear and natural invariant geometric meaning, taking into account the pullback vector bundle structure in the overparametrized, and the pushforward vector bundle structure in the underparametrized setting. In the overparametrized case, we prove that, provided that a rank condition holds, all orbits of the modified gradient descent drive the $\mathcal{L}^2$ cost to its global minimum at a uniform exponential convergence rate. We point out relations of the latter to sub-Riemannian geometry.  ( 2 min )
    Effective Structural Encodings via Local Curvature Profiles. (arXiv:2311.14864v1 [cs.LG])
    Structural and Positional Encodings can significantly improve the performance of Graph Neural Networks in downstream tasks. Recent literature has begun to systematically investigate differences in the structural properties that these approaches encode, as well as performance trade-offs between them. However, the question of which structural properties yield the most effective encoding remains open. In this paper, we investigate this question from a geometric perspective. We propose a novel structural encoding based on discrete Ricci curvature (Local Curvature Profiles, short LCP) and show that it significantly outperforms existing encoding approaches. We further show that combining local structural encodings, such as LCP, with global positional encodings improves downstream performance, suggesting that they capture complementary geometric information. Finally, we compare different encoding types with (curvature-based) rewiring techniques. Rewiring has recently received a surge of interest due to its ability to improve the performance of Graph Neural Networks by mitigating over-smoothing and over-squashing effects. Our results suggest that utilizing curvature information for structural encodings delivers significantly larger performance increases than rewiring.  ( 2 min )
    Deep Latent Force Models: ODE-based Process Convolutions for Bayesian Deep Learning. (arXiv:2311.14828v1 [stat.ML])
    Effectively modeling phenomena present in highly nonlinear dynamical systems whilst also accurately quantifying uncertainty is a challenging task, which often requires problem-specific techniques. We outline the deep latent force model (DLFM), a domain-agnostic approach to tackling this problem, which consists of a deep Gaussian process architecture where the kernel at each layer is derived from an ordinary differential equation using the framework of process convolutions. Two distinct formulations of the DLFM are presented which utilise weight-space and variational inducing points-based Gaussian process approximations, both of which are amenable to doubly stochastic variational inference. We provide evidence that our model is capable of capturing highly nonlinear behaviour in real-world multivariate time series data. In addition, we find that our approach achieves comparable performance to a number of other probabilistic models on benchmark regression tasks. We also empirically assess the negative impact of the inducing points framework on the extrapolation capabilities of LFM-based models.  ( 2 min )
    Forecasting Cryptocurrency Prices Using Deep Learning: Integrating Financial, Blockchain, and Text Data. (arXiv:2311.14759v1 [q-fin.ST])
    This paper explores the application of Machine Learning (ML) and Natural Language Processing (NLP) techniques in cryptocurrency price forecasting, specifically Bitcoin (BTC) and Ethereum (ETH). Focusing on news and social media data, primarily from Twitter and Reddit, we analyse the influence of public sentiment on cryptocurrency valuations using advanced deep learning NLP methods. Alongside conventional price regression, we treat cryptocurrency price forecasting as a classification problem. This includes both the prediction of price movements (up or down) and the identification of local extrema. We compare the performance of various ML models, both with and without NLP data integration. Our findings reveal that incorporating NLP data significantly enhances the forecasting performance of our models. We discover that pre-trained models, such as Twitter-RoBERTa and BART MNLI, are highly effective in capturing market sentiment, and that fine-tuning Large Language Models (LLMs) also yields substantial forecasting improvements. Notably, the BART MNLI zero-shot classification model shows considerable proficiency in extracting bullish and bearish signals from textual data. All of our models consistently generate profit across different validation scenarios, with no observed decline in profits or reduction in the impact of NLP data over time. The study highlights the potential of text analysis in improving financial forecasts and demonstrates the effectiveness of various NLP techniques in capturing nuanced market sentiment.  ( 2 min )
    Learning Multi-Frequency Partial Correlation Graphs. (arXiv:2311.15756v1 [cs.LG])
    Despite the large research effort devoted to learning dependencies between time series, the state of the art still faces a major limitation: existing methods learn partial correlations but fail to discriminate across distinct frequency bands. Motivated by many applications in which this differentiation is pivotal, we overcome this limitation by learning a block-sparse, frequency-dependent, partial correlation graph, in which layers correspond to different frequency bands, and partial correlations can occur over just a few layers. To this aim, we formulate and solve two nonconvex learning problems: the first has a closed-form solution and is suitable when there is prior knowledge about the number of partial correlations; the second hinges on an iterative solution based on successive convex approximation, and is effective for the general case where no prior knowledge is available. Numerical results on synthetic data show that the proposed methods outperform the current state of the art. Finally, the analysis of financial time series confirms that partial correlations exist only within a few frequency bands, underscoring how our methods enable the gaining of valuable insights that would be undetected without discriminating along the frequency domain.  ( 2 min )
    ConstraintMatch for Semi-constrained Clustering. (arXiv:2311.15395v1 [cs.LG])
    Constrained clustering allows the training of classification models using pairwise constraints only, which are weak and relatively easy to mine, while still yielding full-supervision-level model performance. While they perform well even in the absence of the true underlying class labels, constrained clustering models still require large amounts of binary constraint annotations for training. In this paper, we propose a semi-supervised context whereby a large amount of \textit{unconstrained} data is available alongside a smaller set of constraints, and propose \textit{ConstraintMatch} to leverage such unconstrained data. While a great deal of progress has been made in semi-supervised learning using full labels, there are a number of challenges that prevent a naive application of the resulting methods in the constraint-based label setting. Therefore, we reason about and analyze these challenges, specifically 1) proposing a \textit{pseudo-constraining} mechanism to overcome the confirmation bias, a major weakness of pseudo-labeling, 2) developing new methods for pseudo-labeling towards the selection of \textit{informative} unconstrained samples, 3) showing that this also allows the use of pairwise loss functions for the initial and auxiliary losses which facilitates semi-constrained model training. In extensive experiments, we demonstrate the effectiveness of ConstraintMatch over relevant baselines in both the regular clustering and overclustering scenarios on five challenging benchmarks and provide analyses of its several components.  ( 2 min )
    SAMoSSA: Multivariate Singular Spectrum Analysis with Stochastic Autoregressive Noise. (arXiv:2305.16491v2 [cs.LG] UPDATED)
    The well-established practice of time series analysis involves estimating deterministic, non-stationary trend and seasonality components followed by learning the residual stochastic, stationary components. Recently, it has been shown that one can learn the deterministic non-stationary components accurately using multivariate Singular Spectrum Analysis (mSSA) in the absence of a correlated stationary component; meanwhile, in the absence of deterministic non-stationary components, the Autoregressive (AR) stationary component can also be learnt readily, e.g. via Ordinary Least Squares (OLS). However, a theoretical underpinning of multi-stage learning algorithms involving both deterministic and stationary components has been absent in the literature despite its pervasiveness. We resolve this open question by establishing desirable theoretical guarantees for a natural two-stage algorithm, where mSSA is first applied to estimate the non-stationary components despite the presence of a correlated stationary AR component, which is subsequently learned from the residual time series. We provide a finite-sample forecasting consistency bound for the proposed algorithm, SAMoSSA, which is data-driven and thus requires minimal parameter tuning. To establish theoretical guarantees, we overcome three hurdles: (i) we characterize the spectra of Page matrices of stable AR processes, thus extending the analysis of mSSA; (ii) we extend the analysis of AR process identification in the presence of arbitrary bounded perturbations; (iii) we characterize the out-of-sample or forecasting error, as opposed to solely considering model identification. Through representative empirical studies, we validate the superior performance of SAMoSSA compared to existing baselines. Notably, SAMoSSA's ability to account for AR noise structure yields improvements ranging from 5% to 37% across various benchmark datasets.  ( 3 min )
    On the Linear Convergence of Policy Gradient under Hadamard Parameterization. (arXiv:2305.19575v2 [math.OC] UPDATED)
    The convergence of deterministic policy gradient under the Hadamard parameterization is studied in the tabular setting and the linear convergence of the algorithm is established. To this end, we first show that the error decreases at an $O(\frac{1}{k})$ rate for all the iterations. Based on this result, we further show that the algorithm has a faster local linear convergence rate after $k_0$ iterations, where $k_0$ is a constant that only depends on the MDP problem and the initialization. To show the local linear convergence of the algorithm, we have indeed established the contraction of the sub-optimal probability $b_s^k$ (i.e., the probability of the output policy $\pi^k$ on non-optimal actions) when $k\ge k_0$.  ( 2 min )
    A latent linear model for nonlinear coupled oscillators on graphs. (arXiv:2311.14910v1 [math.DS])
    A system of coupled oscillators on an arbitrary graph is locally driven by the tendency to mutual synchronization between nearby oscillators, but can and often exhibit nonlinear behavior on the whole graph. Understanding such nonlinear behavior has been a key challenge in predicting whether all oscillators in such a system will eventually synchronize. In this paper, we demonstrate that, surprisingly, such nonlinear behavior of coupled oscillators can be effectively linearized in certain latent dynamic spaces. The key insight is that there is a small number of `latent dynamics filters', each with a specific association with synchronizing and non-synchronizing dynamics on subgraphs so that any observed dynamics on subgraphs can be approximated by a suitable linear combination of such elementary dynamic patterns. Taking an ensemble of subgraph-level predictions provides an interpretable predictor for whether the system on the whole graph reaches global synchronization. We propose algorithms based on supervised matrix factorization to learn such latent dynamics filters. We demonstrate that our method performs competitively in synchronization prediction tasks against baselines and black-box classification algorithms, despite its simple and interpretable architecture.  ( 2 min )
    Low-degree learning and the metric entropy of polynomials. (arXiv:2203.09659v3 [cs.LG] UPDATED)
    Let $\mathscr{F}_{n,d}$ be the class of all functions $f:\{-1,1\}^n\to[-1,1]$ on the $n$-dimensional discrete hypercube of degree at most $d$. In the first part of this paper, we prove that any (deterministic or randomized) algorithm which learns $\mathscr{F}_{n,d}$ with $L_2$-accuracy $\varepsilon$ requires at least $\Omega((1-\sqrt{\varepsilon})2^d\log n)$ queries for large enough $n$, thus establishing the sharpness as $n\to\infty$ of a recent upper bound of Eskenazis and Ivanisvili (2021). To do this, we show that the $L_2$-packing numbers $\mathsf{M}(\mathscr{F}_{n,d},\|\cdot\|_{L_2},\varepsilon)$ of the concept class $\mathscr{F}_{n,d}$ satisfy the two-sided estimate $$c(1-\varepsilon)2^d\log n \leq \log \mathsf{M}(\mathscr{F}_{n,d},\|\cdot\|_{L_2},\varepsilon) \leq \frac{2^{Cd}\log n}{\varepsilon^4}$$ for large enough $n$, where $c, C>0$ are universal constants. In the second part of the paper, we present a logarithmic upper bound for the randomized query complexity of classes of bounded approximate polynomials whose Fourier spectra are concentrated on few subsets. As an application, we prove new estimates for the number of random queries required to learn approximate juntas of a given degree, functions with rapidly decaying Fourier tails and constant depth circuits of given size. Finally, we obtain bounds for the number of queries required to learn the polynomial class $\mathscr{F}_{n,d}$ without error in the query and random example models.  ( 2 min )
    Robust and Automatic Data Clustering: Dirichlet Process meets Median-of-Means. (arXiv:2311.15384v1 [stat.ML])
    Clustering stands as one of the most prominent challenges within the realm of unsupervised machine learning. Among the array of centroid-based clustering algorithms, the classic $k$-means algorithm, rooted in Lloyd's heuristic, takes center stage as one of the extensively employed techniques in the literature. Nonetheless, both $k$-means and its variants grapple with noteworthy limitations. These encompass a heavy reliance on initial cluster centroids, susceptibility to converging into local minima of the objective function, and sensitivity to outliers and noise in the data. When confronted with data containing noisy or outlier-laden observations, the Median-of-Means (MoM) estimator emerges as a stabilizing force for any centroid-based clustering framework. On a different note, a prevalent constraint among existing clustering methodologies resides in the prerequisite knowledge of the number of clusters prior to analysis. Utilizing model-based methodologies, such as Bayesian nonparametric models, offers the advantage of infinite mixture models, thereby circumventing the need for such requirements. Motivated by these facts, in this article, we present an efficient and automatic clustering technique by integrating the principles of model-based and centroid-based methodologies that mitigates the effect of noise on the quality of clustering while ensuring that the number of clusters need not be specified in advance. Statistical guarantees on the upper bound of clustering error, and rigorous assessment through simulated and real datasets suggest the advantages of our proposed method over existing state-of-the-art clustering algorithms.  ( 2 min )
    $k$-Means Clustering for Persistent Homology. (arXiv:2210.10003v4 [stat.AP] UPDATED)
    Persistent homology is a methodology central to topological data analysis that extracts and summarizes the topological features within a dataset as a persistence diagram; it has recently gained much popularity from its myriad successful applications to many domains. However, its algebraic construction induces a metric space of persistence diagrams with a highly complex geometry. In this paper, we prove convergence of the $k$-means clustering algorithm on persistence diagram space and establish theoretical properties of the solution to the optimization problem in the Karush--Kuhn--Tucker framework. Additionally, we perform numerical experiments on various representations of persistent homology, including embeddings of persistence diagrams as well as diagrams themselves and their generalizations as persistence measures; we find that $k$-means clustering performance directly on persistence diagrams and measures outperform their vectorized representations.  ( 2 min )
    Individualized Treatment Allocations with Distributional Welfare. (arXiv:2311.15878v1 [stat.ME])
    In this paper, we explore optimal treatment allocation policies that target distributional welfare. Most literature on treatment choice has considered utilitarian welfare based on the conditional average treatment effect (ATE). While average welfare is intuitive, it may yield undesirable allocations especially when individuals are heterogeneous (e.g., with outliers) - the very reason individualized treatments were introduced in the first place. This observation motivates us to propose an optimal policy that allocates the treatment based on the conditional \emph{quantile of individual treatment effects} (QoTE). Depending on the choice of the quantile probability, this criterion can accommodate a policymaker who is either prudent or negligent. The challenge of identifying the QoTE lies in its requirement for knowledge of the joint distribution of the counterfactual outcomes, which is generally hard to recover even with experimental data. Therefore, we introduce minimax optimal policies that are robust to model uncertainty. We then propose a range of identifying assumptions under which we can point or partially identify the QoTE. We establish the asymptotic bound on the regret of implementing the proposed policies. We consider both stochastic and deterministic rules. In simulations and two empirical applications, we compare optimal decisions based on the QoTE with decisions based on other criteria.  ( 2 min )
    GLIME: General, Stable and Local LIME Explanation. (arXiv:2311.15722v1 [cs.LG])
    As black-box machine learning models grow in complexity and find applications in high-stakes scenarios, it is imperative to provide explanations for their predictions. Although Local Interpretable Model-agnostic Explanations (LIME) [22] is a widely adpoted method for understanding model behaviors, it is unstable with respect to random seeds [35,24,3] and exhibits low local fidelity (i.e., how well the explanation approximates the model's local behaviors) [21,16]. Our study shows that this instability problem stems from small sample weights, leading to the dominance of regularization and slow convergence. Additionally, LIME's sampling neighborhood is non-local and biased towards the reference, resulting in poor local fidelity and sensitivity to reference choice. To tackle these challenges, we introduce GLIME, an enhanced framework extending LIME and unifying several prior methods. Within the GLIME framework, we derive an equivalent formulation of LIME that achieves significantly faster convergence and improved stability. By employing a local and unbiased sampling distribution, GLIME generates explanations with higher local fidelity compared to LIME. GLIME explanations are independent of reference choice. Moreover, GLIME offers users the flexibility to choose a sampling distribution based on their specific scenarios.  ( 2 min )
    Will More Expressive Graph Neural Networks do Better on Generative Tasks?. (arXiv:2308.11978v2 [cs.LG] UPDATED)
    Graph generation poses a significant challenge as it involves predicting a complete graph with multiple nodes and edges based on simply a given label. This task also carries fundamental importance to numerous real-world applications, including de-novo drug and molecular design. In recent years, several successful methods have emerged in the field of graph generation. However, these approaches suffer from two significant shortcomings: (1) the underlying Graph Neural Network (GNN) architectures used in these methods are often underexplored; and (2) these methods are often evaluated on only a limited number of metrics. To fill this gap, we investigate the expressiveness of GNNs under the context of the molecular graph generation task, by replacing the underlying GNNs of graph generative models with more expressive GNNs. Specifically, we analyse the performance of six GNNs on six different molecular generative objectives on the ZINC-250k dataset in two different generative frameworks: autoregressive generation models, such as GCPN and GraphAF, and one-shot generation models, such as GraphEBM. Through our extensive experiments, we demonstrate that advanced GNNs can indeed improve the performance of GCPN, GraphAF, and GraphEBM on molecular generation tasks, but GNN expressiveness is not a necessary condition for a good GNN-based generative model. Moreover, we show that GCPN and GraphAF with advanced GNNs can achieve state-of-the-art results across 17 other non-GNN-based graph generative approaches, such as variational autoencoders and Bayesian optimisation models, on the proposed molecular generative objectives (DRD2, Median1, Median2), which are important metrics for de-novo molecular design.  ( 3 min )
    The Local Landscape of Phase Retrieval Under Limited Samples. (arXiv:2311.15221v1 [cs.IT])
    In this paper, we provide a fine-grained analysis of the local landscape of phase retrieval under the regime with limited samples. Our aim is to ascertain the minimal sample size necessary to guarantee a benign local landscape surrounding global minima in high dimensions. Let $n$ and $d$ denote the sample size and input dimension, respectively. We first explore the local convexity and establish that when $n=o(d\log d)$, for almost every fixed point in the local ball, the Hessian matrix must have negative eigenvalues as long as $d$ is sufficiently large. Consequently, the local landscape is highly non-convex. We next consider the one-point strong convexity and show that as long as $n=\omega(d)$, with high probability, the landscape is one-point strongly convex in the local annulus: $\{w\in\mathbb{R}^d: o_d(1)\leqslant \|w-w^*\|\leqslant c\}$, where $w^*$ is the ground truth and $c$ is an absolute constant. This implies that gradient descent initialized from any point in this domain can converge to an $o_d(1)$-loss solution exponentially fast. Furthermore, we show that when $n=o(d\log d)$, there is a radius of $\widetilde\Theta\left(\sqrt{1/d}\right)$ such that one-point convexity breaks in the corresponding smaller local ball. This indicates an impossibility to establish a convergence to exact $w^*$ for gradient descent under limited samples by relying solely on one-point convexity.  ( 2 min )
    Selective Inference for Changepoint detection by Recurrent Neural Network. (arXiv:2311.14964v1 [stat.ML])
    In this study, we investigate the quantification of the statistical reliability of detected change points (CPs) in time series using a Recurrent Neural Network (RNN). Thanks to its flexibility, RNN holds the potential to effectively identify CPs in time series characterized by complex dynamics. However, there is an increased risk of erroneously detecting random noise fluctuations as CPs. The primary goal of this study is to rigorously control the risk of false detections by providing theoretically valid p-values to the CPs detected by RNN. To achieve this, we introduce a novel method based on the framework of Selective Inference (SI). SI enables valid inferences by conditioning on the event of hypothesis selection, thus mitigating selection bias. In this study, we apply SI framework to RNN-based CP detection, where characterizing the complex process of RNN selecting CPs is our main technical challenge. We demonstrate the validity and effectiveness of the proposed method through artificial and real data experiments.  ( 2 min )
  • Open

    SAMoSSA: Multivariate Singular Spectrum Analysis with Stochastic Autoregressive Noise. (arXiv:2305.16491v2 [cs.LG] UPDATED)
    The well-established practice of time series analysis involves estimating deterministic, non-stationary trend and seasonality components followed by learning the residual stochastic, stationary components. Recently, it has been shown that one can learn the deterministic non-stationary components accurately using multivariate Singular Spectrum Analysis (mSSA) in the absence of a correlated stationary component; meanwhile, in the absence of deterministic non-stationary components, the Autoregressive (AR) stationary component can also be learnt readily, e.g. via Ordinary Least Squares (OLS). However, a theoretical underpinning of multi-stage learning algorithms involving both deterministic and stationary components has been absent in the literature despite its pervasiveness. We resolve this open question by establishing desirable theoretical guarantees for a natural two-stage algorithm, where mSSA is first applied to estimate the non-stationary components despite the presence of a correlated stationary AR component, which is subsequently learned from the residual time series. We provide a finite-sample forecasting consistency bound for the proposed algorithm, SAMoSSA, which is data-driven and thus requires minimal parameter tuning. To establish theoretical guarantees, we overcome three hurdles: (i) we characterize the spectra of Page matrices of stable AR processes, thus extending the analysis of mSSA; (ii) we extend the analysis of AR process identification in the presence of arbitrary bounded perturbations; (iii) we characterize the out-of-sample or forecasting error, as opposed to solely considering model identification. Through representative empirical studies, we validate the superior performance of SAMoSSA compared to existing baselines. Notably, SAMoSSA's ability to account for AR noise structure yields improvements ranging from 5% to 37% across various benchmark datasets.  ( 3 min )
    Aggregating Capacity in FL through Successive Layer Training for Computationally-Constrained Devices. (arXiv:2305.17005v2 [cs.LG] UPDATED)
    Federated learning (FL) is usually performed on resource-constrained edge devices, e.g., with limited memory for the computation. If the required memory to train a model exceeds this limit, the device will be excluded from the training. This can lead to a lower accuracy as valuable data and computation resources are excluded from training, also causing bias and unfairness. The FL training process should be adjusted to such constraints. The state-of-the-art techniques propose training subsets of the FL model at constrained devices, reducing their resource requirements for training. But these techniques largely limit the co-adaptation among parameters of the model and are highly inefficient, as we show: it is actually better to train a smaller (less accurate) model by the system where all the devices can train the model end-to-end, than applying such techniques. We propose a new method that enables successive freezing and training of the parameters of the FL model at devices, reducing the training's resource requirements at the devices, while still allowing enough co-adaptation between parameters. We show through extensive experimental evaluation that our technique greatly improves the accuracy of the trained model (by 52.4 p.p.) compared with the state of the art, efficiently aggregating the computation capacity available on distributed devices.  ( 2 min )
    Going Beyond Neural Network Feature Similarity: The Network Feature Complexity and Its Interpretation Using Category Theory. (arXiv:2310.06756v2 [cs.LG] UPDATED)
    The behavior of neural networks still remains opaque, and a recently widely noted phenomenon is that networks often achieve similar performance when initialized with different random parameters. This phenomenon has attracted significant attention in measuring the similarity between features learned by distinct networks. However, feature similarity could be vague in describing the same feature since equivalent features hardly exist. In this paper, we expand the concept of equivalent feature and provide the definition of what we call functionally equivalent features. These features produce equivalent output under certain transformations. Using this definition, we aim to derive a more intrinsic metric for the so-called feature complexity regarding the redundancy of features learned by a neural network at each layer. We offer a formal interpretation of our approach through the lens of category theory, a well-developed area in mathematics. To quantify the feature complexity, we further propose an efficient algorithm named Iterative Feature Merging. Our experimental results validate our ideas and theories from various perspectives. We empirically demonstrate that the functionally equivalence widely exists among different features learned by the same neural network and we could reduce the number of parameters of the network without affecting the performance.The IFM shows great potential as a data-agnostic model prune method. We have also drawn several interesting empirical findings regarding the defined feature complexity.  ( 3 min )
    Early Detection of Bark Beetle Attack Using Remote Sensing and Machine Learning: A Review. (arXiv:2210.03829v3 [cs.LG] UPDATED)
    This paper provides a comprehensive review of past and current advances in the early detection of bark beetle-induced tree mortality from three primary perspectives: bark beetle & host interactions, RS, and ML/DL. In contrast to prior efforts, this review encompasses all RS systems and emphasizes ML/DL methods to investigate their strengths and weaknesses. We parse existing literature based on multi- or hyper-spectral analyses and distill their knowledge based on: bark beetle species & attack phases with a primary emphasis on early stages of attacks, host trees, study regions, RS platforms & sensors, spectral/spatial/temporal resolutions, spectral signatures, spectral vegetation indices (SVIs), ML approaches, learning schemes, task categories, models, algorithms, classes/clusters, features, and DL networks & architectures. Although DL-based methods and the random forest (RF) algorithm showed promising results, highlighting their potential to detect subtle changes across visible, thermal, and short-wave infrared (SWIR) spectral regions, they still have limited effectiveness and high uncertainties. To inspire novel solutions to these shortcomings, we delve into the principal challenges & opportunities from different perspectives, enabling a deeper understanding of the current state of research and guiding future research directions.  ( 3 min )
    Domain knowledge-informed Synthetic fault sample generation with Health Data Map for cross-domain Planetary Gearbox Fault Diagnosis. (arXiv:2305.19569v5 [cs.LG] UPDATED)
    Extensive research has been conducted on fault diagnosis of planetary gearboxes using vibration signals and deep learning (DL) approaches. However, DL-based methods are susceptible to the domain shift problem caused by varying operating conditions of the gearbox. Although domain adaptation and data synthesis methods have been proposed to overcome such domain shifts, they are often not directly applicable in real-world situations where only healthy data is available in the target domain. To tackle the challenge of extreme domain shift scenarios where only healthy data is available in the target domain, this paper proposes two novel domain knowledge-informed data synthesis methods utilizing the health data map (HDMap). The two proposed approaches are referred to as scaled CutPaste and FaultPaste. The HDMap is used to physically represent the vibration signal of the planetary gearbox as an image-like matrix, allowing for visualization of fault-related features. CutPaste and FaultPaste are then applied to generate faulty samples based on the healthy data in the target domain, using domain knowledge and fault signatures extracted from the source domain, respectively. In addition to generating realistic faults, the proposed methods introduce scaling of fault signatures for controlled synthesis of faults with various severity levels. A case study is conducted on a planetary gearbox testbed to evaluate the proposed approaches. The results show that the proposed methods are capable of accurately diagnosing faults, even in cases of extreme domain shift, and can estimate the severity of faults that have not been previously observed in the target domain.  ( 3 min )
    From Isolated Islands to Pangea: Unifying Semantic Space for Human Action Understanding. (arXiv:2304.00553v3 [cs.CV] UPDATED)
    As a vital step toward the intelligent agent, Action understanding matters for intelligent agents and has attracted long-term attention. It can be formed as the mapping from the action physical space to the semantic space. Typically, researchers built action datasets according to idiosyncratic choices to define classes and push the envelope of benchmarks respectively. Thus, datasets are incompatible with each other like "Isolated Islands" due to semantic gaps and various class granularities, e.g., do housework in dataset A and wash plate in dataset B. We argue that a more principled semantic space is an urgent need to concentrate the community efforts and enable us to use all datasets together to pursue generalizable action learning. To this end, we design a structured action semantic space in view of verb taxonomy hierarchy and covering massive actions. By aligning the classes of previous datasets to our semantic space, we gather (image/video/skeleton/MoCap) datasets into a unified database in a unified label system, i.e., bridging ``isolated islands'' into a "Pangea". Accordingly, we propose a novel model mapping from the physical space to semantic space to fully use Pangea. In extensive experiments, our new system shows significant superiority, especially in transfer learning. Code and data will be made publicly available.  ( 3 min )
    A Note on "Efficient Task-Specific Data Valuation for Nearest Neighbor Algorithms". (arXiv:2304.04258v2 [stat.ML] UPDATED)
    Data valuation is a growing research field that studies the influence of individual data points for machine learning (ML) models. Data Shapley, inspired by cooperative game theory and economics, is an effective method for data valuation. However, it is well-known that the Shapley value (SV) can be computationally expensive. Fortunately, Jia et al. (2019) showed that for K-Nearest Neighbors (KNN) models, the computation of Data Shapley is surprisingly simple and efficient. In this note, we revisit the work of Jia et al. (2019) and propose a more natural and interpretable utility function that better reflects the performance of KNN models. We derive the corresponding calculation procedure for the Data Shapley of KNN classifiers/regressors with the new utility functions. Our new approach, dubbed soft-label KNN-SV, achieves the same time complexity as the original method. We further provide an efficient approximation algorithm for soft-label KNN-SV based on locality sensitive hashing (LSH). Our experimental results demonstrate that Soft-label KNN-SV outperforms the original method on most datasets in the task of mislabeled data detection, making it a better baseline for future work on data valuation.  ( 2 min )
    Kernel Normalized Convolutional Networks. (arXiv:2205.10089v3 [cs.LG] UPDATED)
    Existing convolutional neural network architectures frequently rely upon batch normalization (BatchNorm) to effectively train the model. BatchNorm, however, performs poorly with small batch sizes, and is inapplicable to differential privacy. To address these limitations, we propose kernel normalization and kernel normalized convolutional layers, and incorporate them into kernel normalized convolutional networks (KNConvNets) as the main building blocks. We implement KNConvNets corresponding to the state-of-the-art ResNets while forgoing BatchNorm layers. Through extensive experiments, we illustrate KNConvNets achieve higher or competitive performance compared to the BatchNorm counterparts in image classification and semantic segmentation. They also significantly outperform their batch-independent competitors including layer and group normalization in non-private and differentially private training. Given that, KNConvNets combine the batch-independence property of layer and group normalization with the performance advantage of BatchNorm.  ( 2 min )
    G-PECNet: Towards a Generalizable Pedestrian Trajectory Prediction System. (arXiv:2210.09846v2 [cs.AI] UPDATED)
    Navigating dynamic physical environments without obstructing or damaging human assets is of quintessential importance for social robots. In this work, we solve autonomous drone navigation's sub-problem of predicting out-of-domain human and agent trajectories using a deep generative model. Our method: General-PECNet or G-PECNet observes an improvement of 9.5\% on the Final Displacement Error (FDE) on 2020's benchmark: PECNet through a combination of architectural improvements inspired by periodic activation functions and synthetic trajectory (data) augmentations using Hidden Markov Models (HMMs) and Reinforcement Learning (RL). Additionally, we propose a simple geometry-inspired metric for trajectory non-linearity and outlier detection, helpful for the task. Code available at $\href{https://github.com/Aryan-Garg/PECNet-Pedestrian-Trajectory-Prediction.git}{GitHub}$  ( 2 min )
    In-Context Impersonation Reveals Large Language Models' Strengths and Biases. (arXiv:2305.14930v2 [cs.AI] UPDATED)
    In everyday conversations, humans can take on different roles and adapt their vocabulary to their chosen roles. We explore whether LLMs can take on, that is impersonate, different roles when they generate text in-context. We ask LLMs to assume different personas before solving vision and language tasks. We do this by prefixing the prompt with a persona that is associated either with a social identity or domain expertise. In a multi-armed bandit task, we find that LLMs pretending to be children of different ages recover human-like developmental stages of exploration. In a language-based reasoning task, we find that LLMs impersonating domain experts perform better than LLMs impersonating non-domain experts. Finally, we test whether LLMs' impersonations are complementary to visual information when describing different categories. We find that impersonation can improve performance: an LLM prompted to be a bird expert describes birds better than one prompted to be a car expert. However, impersonation can also uncover LLMs' biases: an LLM prompted to be a man describes cars better than one prompted to be a woman. These findings demonstrate that LLMs are capable of taking on diverse roles and that this in-context impersonation can be used to uncover their hidden strengths and biases.  ( 2 min )
    Ergodicity of the underdamped mean-field Langevin dynamics. (arXiv:2007.14660v3 [math.PR] UPDATED)
    We study the long time behavior of an underdamped mean-field Langevin (MFL) equation, and provide a general convergence as well as an exponential convergence rate result under different conditions. The results on the MFL equation can be applied to study the convergence of the Hamiltonian gradient descent algorithm for the overparametrized optimization. We then provide a numerical example of the algorithm to train a generative adversarial networks (GAN).  ( 2 min )
    MA-RECON: Mask-aware deep-neural-network for robust fast MRI k-space interpolation. (arXiv:2209.00462v2 [eess.IV] UPDATED)
    High-quality reconstruction of MRI images from under-sampled `k-space' data, which is in the Fourier domain, is crucial for shortening MRI acquisition times and ensuring superior temporal resolution. Over recent years, a wealth of deep neural network (DNN) methods have emerged, aiming to tackle the complex, ill-posed inverse problem linked to this process. However, their instability against variations in the acquisition process and anatomical distribution exposes a deficiency in the generalization of relevant physical models within these DNN architectures. The goal of our work is to enhance the generalization capabilities of DNN methods for k-space interpolation by introducing `MA-RECON', an innovative mask-aware DNN architecture and associated training method. Unlike preceding approaches, our `MA-RECON' architecture encodes not only the observed data but also the under-sampling mask within the model structure. It implements a tailored training approach that leverages data generated with a variety of under-sampling masks to stimulate the model's generalization of the under-sampled MRI reconstruction problem. Therefore, effectively represents the associated inverse problem, akin to the classical compressed sensing approach. The benefits of our MA-RECON approach were affirmed through rigorous testing with the widely accessible fastMRI dataset. Compared to standard DNN methods and DNNs trained with under-sampling mask augmentation, our approach demonstrated superior generalization capabilities. This resulted in a considerable improvement in robustness against variations in both the acquisition process and anatomical distribution, especially in regions with pathology. In conclusion, our mask-aware strategy holds promise for enhancing the generalization capacity and robustness of DNN-based methodologies for MRI reconstruction from undersampled k-space data.  ( 3 min )
    Probabilistic Multi-Dimensional Classification. (arXiv:2306.06517v2 [cs.LG] UPDATED)
    Multi-dimensional classification (MDC) can be employed in a range of applications where one needs to predict multiple class variables for each given instance. Many existing MDC methods suffer from at least one of inaccuracy, scalability, limited use to certain types of data, hardness of interpretation or lack of probabilistic (uncertainty) estimations. This paper is an attempt to address all these disadvantages simultaneously. We propose a formal framework for probabilistic MDC in which learning an optimal multi-dimensional classifier can be decomposed, without loss of generality, into learning a set of (smaller) single-variable multi-class probabilistic classifiers and a directed acyclic graph. Current and future developments of both probabilistic classification and graphical model learning can directly enhance our framework, which is flexible and provably optimal. A collection of experiments is conducted to highlight the usefulness of this MDC framework.  ( 2 min )
    Parallel Coordinates for Discovery of Interpretable Machine Learning Models. (arXiv:2305.18434v2 [cs.LG] UPDATED)
    This work uses visual knowledge discovery in parallel coordinates to advance methods of interpretable machine learning. The graphic data representation in parallel coordinates made the concepts of hypercubes and hyperblocks (HBs) simple to understand for end users. It is suggested to use mixed and pure hyperblocks in the proposed data classifier algorithm Hyper. It is shown that Hyper models generalize decision trees. The algorithm is presented in several settings and options to discover interactively or automatically overlapping or non-overlapping hyperblocks. Additionally, the use of hyperblocks in conjunction with language descriptions of visual patterns is demonstrated. The benchmark data from the UCI ML repository were used to evaluate the Hyper algorithm. It enabled the discovery of mixed and pure HBs evaluated using 10-fold cross validation. Connections among hyperblocks, dimension reduction and visualization have been established. The capability of end users to find and observe hyperblocks, as well as the ability of side-by-side visualizations to make patterns evident, are among major advantages ofhyperblock technology and the Hyper algorithm. A new method to visualize incomplete n-D data with missing values is proposed, while the traditional parallel coordinates do not support it. The ability of HBs to better prevent both overgeneralization and overfitting of data over decision trees is demonstrated as another benefit of the hyperblocks. The features of VisCanvas 2.0 software tool that implements Hyper technology are presented.  ( 3 min )
    Designing Optimal Behavioral Experiments Using Machine Learning. (arXiv:2305.07721v2 [cs.LG] UPDATED)
    Computational models are powerful tools for understanding human cognition and behavior. They let us express our theories clearly and precisely, and offer predictions that can be subtle and often counter-intuitive. However, this same richness and ability to surprise means our scientific intuitions and traditional tools are ill-suited to designing experiments to test and compare these models. To avoid these pitfalls and realize the full potential of computational modeling, we require tools to design experiments that provide clear answers about what models explain human behavior and the auxiliary assumptions those models must make. Bayesian optimal experimental design (BOED) formalizes the search for optimal experimental designs by identifying experiments that are expected to yield informative data. In this work, we provide a tutorial on leveraging recent advances in BOED and machine learning to find optimal experiments for any kind of model that we can simulate data from, and show how by-products of this procedure allow for quick and straightforward evaluation of models and their parameters against real experimental data. As a case study, we consider theories of how people balance exploration and exploitation in multi-armed bandit decision-making tasks. We validate the presented approach using simulations and a real-world experiment. As compared to experimental designs commonly used in the literature, we show that our optimal designs more efficiently determine which of a set of models best account for individual human behavior, and more efficiently characterize behavior given a preferred model. At the same time, formalizing a scientific question such that it can be adequately addressed with BOED can be challenging and we discuss several potential caveats and pitfalls that practitioners should be aware of. We provide code and tutorial notebooks to replicate all analyses.  ( 3 min )
    AI-Augmented Surveys: Leveraging Large Language Models and Surveys for Opinion Prediction. (arXiv:2305.09620v2 [cs.CL] UPDATED)
    Large language models (LLMs) that produce human-like responses have begun to revolutionize research practices in the social sciences. This paper shows how we can integrate LLMs and social surveys to accurately predict individual responses to survey questions that were not asked before. We develop a novel methodological framework to personalize LLMs by considering the meaning of survey questions derived from their text, the latent beliefs of individuals inferred from their response patterns, and the temporal contexts across different survey periods through fine-tuning LLMs with survey data. Using the General Social Survey from 1972 to 2021, we show that the fine-tuned model based on Alpaca-7b can predict individual responses to survey questions that are partially missing as well as entirely missing. The remarkable prediction capabilities allow us to fill in missing trends with high confidence and pinpoint when public attitudes changed, such as the rising support for same-sex marriage. We discuss practical constraints, socio-demographic representation, and ethical concerns regarding individual autonomy and privacy when using LLMs for opinion prediction. This study demonstrates that LLMs and surveys can mutually enhance each other's capabilities: LLMs broaden survey potential, while surveys improve the alignment of LLMs.  ( 2 min )
    TODM: Train Once Deploy Many Efficient Supernet-Based RNN-T Compression For On-device ASR Models. (arXiv:2309.01947v2 [cs.CL] UPDATED)
    Automatic Speech Recognition (ASR) models need to be optimized for specific hardware before they can be deployed on devices. This can be done by tuning the model's hyperparameters or exploring variations in its architecture. Re-training and re-validating models after making these changes can be a resource-intensive task. This paper presents TODM (Train Once Deploy Many), a new approach to efficiently train many sizes of hardware-friendly on-device ASR models with comparable GPU-hours to that of a single training job. TODM leverages insights from prior work on Supernet, where Recurrent Neural Network Transducer (RNN-T) models share weights within a Supernet. It reduces layer sizes and widths of the Supernet to obtain subnetworks, making them smaller models suitable for all hardware types. We introduce a novel combination of three techniques to improve the outcomes of the TODM Supernet: adaptive dropouts, an in-place Alpha-divergence knowledge distillation, and the use of ScaledAdam optimizer. We validate our approach by comparing Supernet-trained versus individually tuned Multi-Head State Space Model (MH-SSM) RNN-T using LibriSpeech. Results demonstrate that our TODM Supernet either matches or surpasses the performance of manually tuned models by up to a relative of 3% better in word error rate (WER), while efficiently keeping the cost of training many models at a small constant.
    Emerging Trends in Federated Learning: From Model Fusion to Federated X Learning. (arXiv:2102.12920v3 [cs.LG] UPDATED)
    Federated learning is a new learning paradigm that decouples data collection and model training via multi-party computation and model aggregation. As a flexible learning setting, federated learning has the potential to integrate with other learning frameworks. We conduct a focused survey of federated learning in conjunction with other learning algorithms. Specifically, we explore various learning algorithms to improve the vanilla federated averaging algorithm and review model fusion methods such as adaptive aggregation, regularization, clustered methods, and Bayesian methods. Following the emerging trends, we also discuss federated learning in the intersection with other learning paradigms, termed federated X learning, where X includes multitask learning, meta-learning, transfer learning, unsupervised learning, and reinforcement learning. This survey reviews the state of the art, challenges, and future directions.  ( 2 min )
    ISLE: An Intelligent Streaming Framework for High-Throughput AI Inference in Medical Imaging. (arXiv:2305.15617v2 [eess.IV] UPDATED)
    As the adoption of Artificial Intelligence (AI) systems within the clinical environment grows, limitations in bandwidth and compute can create communication bottlenecks when streaming imaging data, leading to delays in patient care and increased cost. As such, healthcare providers and AI vendors will require greater computational infrastructure, therefore dramatically increasing costs. To that end, we developed ISLE, an intelligent streaming framework for high-throughput, compute- and bandwidth- optimized, and cost effective AI inference for clinical decision making at scale. In our experiments, ISLE on average reduced data transmission by 98.02% and decoding time by 98.09%, while increasing throughput by 2,730%. We show that ISLE results in faster turnaround times, and reduced overall cost of data, transmission, and compute, without negatively impacting clinical decision making using AI systems.  ( 2 min )
    Understanding and Mitigating Extrapolation Failures in Physics-Informed Neural Networks. (arXiv:2306.09478v2 [cs.LG] UPDATED)
    Physics-informed Neural Networks (PINNs) have recently gained popularity due to their effective approximation of partial differential equations (PDEs) using deep neural networks (DNNs). However, their out of domain behavior is not well understood, with previous work speculating that the presence of high frequency components in the solution function might be to blame for poor extrapolation performance. In this paper, we study the extrapolation behavior of PINNs on a representative set of PDEs of different types, including high-dimensional PDEs. We find that failure to extrapolate is not caused by high frequencies in the solution function, but rather by shifts in the support of the Fourier spectrum over time. We term these spectral shifts and quantify them by introducing a Weighted Wasserstein-Fourier distance (WWF). We show that the WWF can be used to predict PINN extrapolation performance, and that in the absence of significant spectral shifts, PINN predictions stay close to the true solution even in extrapolation. Finally, we propose a transfer learning-based strategy to mitigate the effects of larger spectral shifts, which decreases extrapolation errors by up to 82%.  ( 2 min )
    View it like a radiologist: Shifted windows for deep learning augmentation of CT images. (arXiv:2311.14990v1 [eess.IV])
    Deep learning has the potential to revolutionize medical practice by automating and performing important tasks like detecting and delineating the size and locations of cancers in medical images. However, most deep learning models rely on augmentation techniques that treat medical images as natural images. For contrast-enhanced Computed Tomography (CT) images in particular, the signals producing the voxel intensities have physical meaning, which is lost during preprocessing and augmentation when treating such images as natural images. To address this, we propose a novel preprocessing and intensity augmentation scheme inspired by how radiologists leverage multiple viewing windows when evaluating CT images. Our proposed method, window shifting, randomly places the viewing windows around the region of interest during training. This approach improves liver lesion segmentation performance and robustness on images with poorly timed contrast agent. Our method outperforms classical intensity augmentations as well as the intensity augmentation pipeline of the popular nn-UNet on multiple datasets.  ( 2 min )
    DP-HyPO: An Adaptive Private Hyperparameter Optimization Framework. (arXiv:2306.05734v2 [cs.LG] UPDATED)
    Hyperparameter optimization, also known as hyperparameter tuning, is a widely recognized technique for improving model performance. Regrettably, when training private ML models, many practitioners often overlook the privacy risks associated with hyperparameter optimization, which could potentially expose sensitive information about the underlying dataset. Currently, the sole existing approach to allow privacy-preserving hyperparameter optimization is to uniformly and randomly select hyperparameters for a number of runs, subsequently reporting the best-performing hyperparameter. In contrast, in non-private settings, practitioners commonly utilize ``adaptive'' hyperparameter optimization methods such as Gaussian process-based optimization, which select the next candidate based on information gathered from previous outputs. This substantial contrast between private and non-private hyperparameter optimization underscores a critical concern. In our paper, we introduce DP-HyPO, a pioneering framework for ``adaptive'' private hyperparameter optimization, aiming to bridge the gap between private and non-private hyperparameter optimization. To accomplish this, we provide a comprehensive differential privacy analysis of our framework. Furthermore, we empirically demonstrate the effectiveness of DP-HyPO on a diverse set of real-world datasets.
    Quantum Deep Hedging. (arXiv:2303.16585v2 [quant-ph] UPDATED)
    Quantum machine learning has the potential for a transformative impact across industry sectors and in particular in finance. In our work we look at the problem of hedging where deep reinforcement learning offers a powerful framework for real markets. We develop quantum reinforcement learning methods based on policy-search and distributional actor-critic algorithms that use quantum neural network architectures with orthogonal and compound layers for the policy and value functions. We prove that the quantum neural networks we use are trainable, and we perform extensive simulations that show that quantum models can reduce the number of trainable parameters while achieving comparable performance and that the distributional approach obtains better performance than other standard approaches, both classical and quantum. We successfully implement the proposed models on a trapped-ion quantum processor, utilizing circuits with up to $16$ qubits, and observe performance that agrees well with noiseless simulation. Our quantum techniques are general and can be applied to other reinforcement learning problems beyond hedging.
    On the Effectiveness of Log Representation for Log-based Anomaly Detection. (arXiv:2308.08736v2 [cs.SE] UPDATED)
    Logs are an essential source of information for people to understand the running status of a software system. Due to the evolving modern software architecture and maintenance methods, more research efforts have been devoted to automated log analysis. In particular, machine learning (ML) has been widely used in log analysis tasks. In ML-based log analysis tasks, converting textual log data into numerical feature vectors is a critical and indispensable step. However, the impact of using different log representation techniques on the performance of the downstream models is not clear, which limits researchers and practitioners' opportunities of choosing the optimal log representation techniques in their automated log analysis workflows. Therefore, this work investigates and compares the commonly adopted log representation techniques from previous log analysis research. Particularly, we select six log representation techniques and evaluate them with seven ML models and four public log datasets (i.e., HDFS, BGL, Spirit and Thunderbird) in the context of log-based anomaly detection. We also examine the impacts of the log parsing process and the different feature aggregation approaches when they are employed with log representation techniques. From the experiments, we provide some heuristic guidelines for future researchers and developers to follow when designing an automated log analysis workflow. We believe our comprehensive comparison of log representation techniques can help researchers and practitioners better understand the characteristics of different log representation techniques and provide them with guidance for selecting the most suitable ones for their ML-based log analysis workflow.
    Are Human-generated Demonstrations Necessary for In-context Learning?. (arXiv:2309.14681v3 [cs.LG] UPDATED)
    Despite the promising few-shot ability of large language models (LLMs), the standard paradigm of In-context Learning (ICL) suffers the disadvantages of susceptibility to selected demonstrations and the intricacy to generate these demonstrations. In this paper, we raise the fundamental question that whether human-generated demonstrations are necessary for ICL. To answer this question, we propose self-contemplation prompting strategy (SEC), a paradigm free from human-crafted demonstrations. The key point of SEC is that, instead of using hand-crafted examples as demonstrations in ICL, SEC asks LLMs to first create demonstrations on their own, based on which the final output is generated. SEC is a flexible framework and can be adapted to both the vanilla ICL and the chain-of-thought (CoT), but with greater ease: as the manual-generation process of both examples and rationale can be saved. Extensive experiments in arithmetic reasoning, commonsense reasoning, multi-task language understanding, and code generation benchmarks, show that SEC, which does not require hand-crafted demonstrations, significantly outperforms the zero-shot learning strategy, and achieves comparable results to ICL with hand-crafted demonstrations. This demonstrates that, for many tasks, contemporary LLMs possess a sufficient level of competence to exclusively depend on their own capacity for decision making, removing the need for external training data. Code is available at https://github.com/ruili33/SEC.
    RCT Rejection Sampling for Causal Estimation Evaluation. (arXiv:2307.15176v2 [cs.AI] UPDATED)
    Confounding is a significant obstacle to unbiased estimation of causal effects from observational data. For settings with high-dimensional covariates -- such as text data, genomics, or the behavioral social sciences -- researchers have proposed methods to adjust for confounding by adapting machine learning methods to the goal of causal estimation. However, empirical evaluation of these adjustment methods has been challenging and limited. In this work, we build on a promising empirical evaluation strategy that simplifies evaluation design and uses real data: subsampling randomized controlled trials (RCTs) to create confounded observational datasets while using the average causal effects from the RCTs as ground-truth. We contribute a new sampling algorithm, which we call RCT rejection sampling, and provide theoretical guarantees that causal identification holds in the observational data to allow for valid comparisons to the ground-truth RCT. Using synthetic data, we show our algorithm indeed results in low bias when oracle estimators are evaluated on the confounded samples, which is not always the case for a previously proposed algorithm. In addition to this identification result, we highlight several finite data considerations for evaluation designers who plan to use RCT rejection sampling on their own datasets. As a proof of concept, we implement an example evaluation pipeline and walk through these finite data considerations with a novel, real-world RCT -- which we release publicly -- consisting of approximately 70k observations and text data as high-dimensional covariates. Together, these contributions build towards a broader agenda of improved empirical evaluation for causal estimation.
    On the Linear Convergence of Policy Gradient under Hadamard Parameterization. (arXiv:2305.19575v2 [math.OC] UPDATED)
    The convergence of deterministic policy gradient under the Hadamard parameterization is studied in the tabular setting and the linear convergence of the algorithm is established. To this end, we first show that the error decreases at an $O(\frac{1}{k})$ rate for all the iterations. Based on this result, we further show that the algorithm has a faster local linear convergence rate after $k_0$ iterations, where $k_0$ is a constant that only depends on the MDP problem and the initialization. To show the local linear convergence of the algorithm, we have indeed established the contraction of the sub-optimal probability $b_s^k$ (i.e., the probability of the output policy $\pi^k$ on non-optimal actions) when $k\ge k_0$.
    Energy Discrepancies: A Score-Independent Loss for Energy-Based Models. (arXiv:2307.06431v2 [stat.ML] UPDATED)
    Energy-based models are a simple yet powerful class of probabilistic models, but their widespread adoption has been limited by the computational burden of training them. We propose a novel loss function called Energy Discrepancy (ED) which does not rely on the computation of scores or expensive Markov chain Monte Carlo. We show that ED approaches the explicit score matching and negative log-likelihood loss under different limits, effectively interpolating between both. Consequently, minimum ED estimation overcomes the problem of nearsightedness encountered in score-based estimation methods, while also enjoying theoretical guarantees. Through numerical experiments, we demonstrate that ED learns low-dimensional data distributions faster and more accurately than explicit score matching or contrastive divergence. For high-dimensional image data, we describe how the manifold hypothesis puts limitations on our approach and demonstrate the effectiveness of energy discrepancy by training the energy-based model as a prior of a variational decoder model.  ( 2 min )
    Trainable Noise Model as an XAI evaluation method: application on Sobol for remote sensing image segmentation. (arXiv:2310.01828v2 [cs.CV] UPDATED)
    eXplainable Artificial Intelligence (XAI) has emerged as an essential requirement when dealing with mission-critical applications, ensuring transparency and interpretability of the employed black box AI models. The significance of XAI spans various domains, from healthcare to finance, where understanding the decision-making process of deep learning algorithms is essential. Most AI-based computer vision models are often black boxes; hence, providing explainability of deep neural networks in image processing is crucial for their wide adoption and deployment in medical image analysis, autonomous driving, and remote sensing applications. Recently, several XAI methods for image classification tasks have been introduced. On the contrary, image segmentation has received comparatively less attention in the context of explainability, although it is a fundamental task in computer vision applications, especially in remote sensing. Only some research proposes gradient-based XAI algorithms for image segmentation. This paper adapts the recent gradient-free Sobol XAI method for semantic segmentation. To measure the performance of the Sobol method for segmentation, we propose a quantitative XAI evaluation method based on a learnable noise model. The main objective of this model is to induce noise on the explanation maps, where higher induced noise signifies low accuracy and vice versa. A benchmark analysis is conducted to evaluate and compare performance of three XAI methods, including Seg-Grad-CAM, Seg-Grad-CAM++ and Seg-Sobol using the proposed noise-based evaluation technique. This constitutes the first attempt to run and evaluate XAI methods using high-resolution satellite images.
    Exploring the Relationship Between Model Architecture and In-Context Learning Ability. (arXiv:2310.08049v2 [cs.LG] UPDATED)
    What is the relationship between model architecture and the ability to perform in-context learning? In this empirical study, we take the first steps toward answering this question. We evaluate twelve model architectures capable of causal language modeling across a suite of synthetic in-context learning tasks. These selected architectures represent a broad range of paradigms, including recurrent and convolution-based neural networks, transformers, state-space model inspired, and other emerging attention alternatives. We discover that all the considered architectures can perform in-context learning under a wider range of conditions than previously documented. Additionally, we observe stark differences in statistical efficiency and consistency by varying context length and task difficulty. We also measure each architecture's predisposition towards in-context learning when presented with alternative routes for task resolution. Finally, and somewhat surprisingly, we find that several attention alternatives are more robust in-context learners than transformers. Given that such approaches have constant-sized memory footprints at inference time, this result opens the possibility of scaling up in-context learning to accommodate vastly larger numbers of in-context examples.  ( 2 min )
    The Lipschitz-Variance-Margin Tradeoff for Enhanced Randomized Smoothing. (arXiv:2309.16883v2 [cs.LG] UPDATED)
    Real-life applications of deep neural networks are hindered by their unsteady predictions when faced with noisy inputs and adversarial attacks. The certified radius is in this context a crucial indicator of the robustness of models. However how to design an efficient classifier with a sufficient certified radius? Randomized smoothing provides a promising framework by relying on noise injection in inputs to obtain a smoothed and more robust classifier. In this paper, we first show that the variance introduced by randomized smoothing closely interacts with two other important properties of the classifier, \textit{i.e.} its Lipschitz constant and margin. More precisely, our work emphasizes the dual impact of the Lipschitz constant of the base classifier, on both the smoothed classifier and the empirical variance. Moreover, to increase the certified robust radius, we introduce a different simplex projection technique for the base classifier to leverage the variance-margin trade-off thanks to Bernstein's concentration inequality, along with an enhanced Lipschitz bound. Experimental results show a significant improvement in certified accuracy compared to current state-of-the-art methods. Our novel certification procedure allows us to use pre-trained models that are used with randomized smoothing, effectively improving the current certification radius in a zero-shot manner.  ( 2 min )
    MovePose: A High-performance Human Pose Estimation Algorithm on Mobile and Edge Devices. (arXiv:2308.09084v2 [cs.CV] UPDATED)
    We present MovePose, an optimized lightweight convolutional neural network designed specifically for real-time body pose estimation on CPU-based mobile devices. The current solutions do not provide satisfactory accuracy and speed for human posture estimation, and MovePose addresses this gap. It aims to maintain real-time performance while improving the accuracy of human posture estimation for mobile devices. The network produces 17 keypoints for each individual at a rate exceeding 11 frames per second, making it suitable for real-time applications such as fitness tracking, sign language interpretation, and advanced mobile human posture estimation. Our MovePose algorithm has attained an Mean Average Precision (mAP) score of 67.7 on the COCO \cite{cocodata} validation dataset. The MovePose algorithm displayed efficiency with a performance of 69+ frames per second (fps) when run on an Intel i9-10920x CPU. Additionally, it showcased an increased performance of 452+ fps on an NVIDIA RTX3090 GPU. On an Android phone equipped with a Snapdragon 8 + 4G processor, the fps reached above 11. To enhance accuracy, we incorporated three techniques: deconvolution, large kernel convolution, and coordinate classification methods. Compared to basic upsampling, deconvolution is trainable, improves model capacity, and enhances the receptive field. Large kernel convolution strengthens these properties at a decreased computational cost. In summary, MovePose provides high accuracy and real-time performance, marking it a potential tool for a variety of applications, including those focused on mobile-side human posture estimation. The code and models for this algorithm will be made publicly accessible.  ( 3 min )
    Efficient Gradient Estimation via Adaptive Sampling and Importance Sampling. (arXiv:2311.14468v2 [cs.LG] UPDATED)
    Machine learning problems rely heavily on stochastic gradient descent (SGD) for optimization. The effectiveness of SGD is contingent upon accurately estimating gradients from a mini-batch of data samples. Instead of the commonly used uniform sampling, adaptive or importance sampling reduces noise in gradient estimation by forming mini-batches that prioritize crucial data points. Previous research has suggested that data points should be selected with probabilities proportional to their gradient norm. Nevertheless, existing algorithms have struggled to efficiently integrate importance sampling into machine learning frameworks. In this work, we make two contributions. First, we present an algorithm that can incorporate existing importance functions into our framework. Second, we propose a simplified importance function that relies solely on the loss gradient of the output layer. By leveraging our proposed gradient estimation techniques, we observe improved convergence in classification and regression tasks with minimal computational overhead. We validate the effectiveness of our adaptive and importance-sampling approach on image and point-cloud datasets.  ( 2 min )
    Computational and Storage Efficient Quadratic Neurons for Deep Neural Networks. (arXiv:2306.07294v2 [cs.LG] UPDATED)
    Deep neural networks (DNNs) have been widely deployed across diverse domains such as computer vision and natural language processing. However, the impressive accomplishments of DNNs have been realized alongside extensive computational demands, thereby impeding their applicability on resource-constrained devices. To address this challenge, many researchers have been focusing on basic neuron structures, the fundamental building blocks of neural networks, to alleviate the computational and storage cost. In this work, an efficient quadratic neuron architecture distinguished by its enhanced utilization of second-order computational information is introduced. By virtue of their better expressivity, DNNs employing the proposed quadratic neurons can attain similar accuracy with fewer neurons and computational cost. Experimental results have demonstrated that the proposed quadratic neuron structure exhibits superior computational and storage efficiency across various tasks when compared with both linear and non-linear neurons in prior work.  ( 2 min )
    Augmentation Invariant Manifold Learning. (arXiv:2211.00460v2 [stat.ML] UPDATED)
    Data augmentation is a widely used technique and an essential ingredient in the recent advance in self-supervised representation learning. By preserving the similarity between augmented data, the resulting data representation can improve various downstream analyses and achieve state-of-the-art performance in many applications. Despite the empirical effectiveness, most existing methods lack theoretical understanding under a general nonlinear setting. To fill this gap, we develop a statistical framework on a low-dimension product manifold to model the data augmentation transformation. Under this framework, we introduce a new representation learning method called augmentation invariant manifold learning and design a computationally efficient algorithm by reformulating it as a stochastic optimization problem. Compared with existing self-supervised methods, the new method simultaneously exploits the manifold's geometric structure and invariant property of augmented data and has an explicit theoretical guarantee. Our theoretical investigation characterizes the role of data augmentation in the proposed method and reveals why and how the data representation learned from augmented data can improve the $k$-nearest neighbor classifier in the downstream analysis, showing that a more complex data augmentation leads to more improvement in downstream analysis. Finally, numerical experiments on simulated and real datasets are presented to demonstrate the merit of the proposed method.  ( 2 min )
    DeepTSF: Codeless machine learning operations for time series forecasting. (arXiv:2308.00709v2 [cs.LG] UPDATED)
    This paper presents DeepTSF, a comprehensive machine learning operations (MLOps) framework aiming to innovate time series forecasting through workflow automation and codeless modeling. DeepTSF automates key aspects of the ML lifecycle, making it an ideal tool for data scientists and MLops engineers engaged in machine learning (ML) and deep learning (DL)-based forecasting. DeepTSF empowers users with a robust and user-friendly solution, while it is designed to seamlessly integrate with existing data analysis workflows, providing enhanced productivity and compatibility. The framework offers a front-end user interface (UI) suitable for data scientists, as well as other higher-level stakeholders, enabling comprehensive understanding through insightful visualizations and evaluation metrics. DeepTSF also prioritizes security through identity management and access authorization mechanisms. The application of DeepTSF in real-life use cases of the I-NERGY project has already proven DeepTSF's efficacy in DL-based load forecasting, showcasing its significant added value in the electrical power and energy systems domain.  ( 2 min )
    Segmentation of diagnostic tissue compartments on whole slide images with renal thrombotic microangiopathies (TMAs). (arXiv:2311.14971v1 [cs.CV])
    The thrombotic microangiopathies (TMAs) manifest in renal biopsy histology with a broad spectrum of acute and chronic findings. Precise diagnostic criteria for a renal biopsy diagnosis of TMA are missing. As a first step towards a machine learning- and computer vision-based analysis of wholes slide images from renal biopsies, we trained a segmentation model for the decisive diagnostic kidney tissue compartments artery, arteriole, glomerulus on a set of whole slide images from renal biopsies with TMAs and Mimickers (distinct diseases with a similar nephropathological appearance as TMA like severe benign nephrosclerosis, various vasculitides, Bevacizumab-plug glomerulopathy, arteriolar light chain deposition disease). Our segmentation model combines a U-Net-based tissue detection with a Shifted windows-transformer architecture to reach excellent segmentation results for even the most severely altered glomeruli, arterioles and arteries, even on unseen staining domains from a different nephropathology lab. With accurate automatic segmentation of the decisive renal biopsy compartments in human renal vasculopathies, we have laid the foundation for large-scale compartment-specific machine learning and computer vision analysis of renal biopsy repositories with TMAs.  ( 3 min )
    DIVA: A Dirichlet Process Mixtures Based Incremental Deep Clustering Algorithm via Variational Auto-Encoder. (arXiv:2305.14067v3 [cs.LG] UPDATED)
    Generative model-based deep clustering frameworks excel in classifying complex data, but are limited in handling dynamic and complex features because they require prior knowledge of the number of clusters. In this paper, we propose a nonparametric deep clustering framework that employs an infinite mixture of Gaussians as a prior. Our framework utilizes a memoized online variational inference method that enables the "birth" and "merge" moves of clusters, allowing our framework to cluster data in a "dynamic-adaptive" manner, without requiring prior knowledge of the number of features. We name the framework as DIVA, a Dirichlet Process-based Incremental deep clustering framework via Variational Auto-Encoder. Our framework, which outperforms state-of-the-art baselines, exhibits superior performance in classifying complex data with dynamically changing features, particularly in the case of incremental features. We released our source code implementation at: https://github.com/Ghiara/diva
    Identification of morphological fingerprint in perinatal brains using quasi-conformal mapping and contrastive learning. (arXiv:2311.14955v1 [cs.LG])
    The morphological fingerprint in the brain is capable of identifying the uniqueness of an individual. However, whether such individual patterns are present in perinatal brains, and which morphological attributes or cortical regions better characterize the individual differences of ne-onates remain unclear. In this study, we proposed a deep learning framework that projected three-dimensional spherical meshes of three morphological features (i.e., cortical thickness, mean curvature, and sulcal depth) onto two-dimensional planes through quasi-conformal mapping, and employed the ResNet18 and contrastive learning for individual identification. We used the cross-sectional structural MRI data of 682 infants, incorporating with data augmentation, to train the model and fine-tuned the parameters based on 60 infants who had longitudinal scans. The model was validated on 30 longitudinal scanned infant data, and remarkable Top1 and Top5 accuracies of 71.37% and 84.10% were achieved, respectively. The sensorimotor and visual cortices were recognized as the most contributive regions in individual identification. Moreover, the folding morphology demonstrated greater discriminative capability than the cortical thickness, which could serve as the morphological fingerprint in perinatal brains. These findings provided evidence for the emergence of morphological fingerprints in the brain at the beginning of the third trimester, which may hold promising implications for understanding the formation of in-dividual uniqueness in the brain during early development.  ( 2 min )
    Unsupervised Discovery of Interpretable Directions in h-space of Pre-trained Diffusion Models. (arXiv:2310.09912v2 [cs.CV] UPDATED)
    We propose the first unsupervised and learning-based method to identify interpretable directions in h-space of pre-trained diffusion models. Our method is derived from an existing technique that operates on the GAN latent space. Specifically, we employ a shift control module that works on h-space of pre-trained diffusion models to manipulate a sample into a shifted version of itself, followed by a reconstructor to reproduce both the type and the strength of the manipulation. By jointly optimizing them, the model will spontaneously discover disentangled and interpretable directions. To prevent the discovery of meaningless and destructive directions, we employ a discriminator to maintain the fidelity of shifted sample. Due to the iterative generative process of diffusion models, our training requires a substantial amount of GPU VRAM to store numerous intermediate tensors for back-propagating gradient. To address this issue, we propose a general VRAM-efficient training algorithm based on gradient checkpointing technique to back-propagate any gradient through the whole generative process, with acceptable occupancy of VRAM and sacrifice of training efficiency. Compared with existing related works on diffusion models, our method inherently identifies global and scalable directions, without necessitating any other complicated procedures. Extensive experiments on various datasets demonstrate the effectiveness of our method.  ( 2 min )
    AudioMNIST: Exploring Explainable Artificial Intelligence for Audio Analysis on a Simple Benchmark. (arXiv:1807.03418v3 [cs.SD] UPDATED)
    Explainable Artificial Intelligence (XAI) is targeted at understanding how models perform feature selection and derive their classification decisions. This paper explores post-hoc explanations for deep neural networks in the audio domain. Notably, we present a novel Open Source audio dataset consisting of 30,000 audio samples of English spoken digits which we use for classification tasks on spoken digits and speakers' biological sex. We use the popular XAI technique Layer-wise Relevance Propagation (LRP) to identify relevant features for two neural network architectures that process either waveform or spectrogram representations of the data. Based on the relevance scores obtained from LRP, hypotheses about the neural networks' feature selection are derived and subsequently tested through systematic manipulations of the input data. Further, we take a step beyond visual explanations and introduce audible heatmaps. We demonstrate the superior interpretability of audible explanations over visual ones in a human user study.  ( 2 min )
    CALICO: Self-Supervised Camera-LiDAR Contrastive Pre-training for BEV Perception. (arXiv:2306.00349v2 [cs.CV] UPDATED)
    Perception is crucial in the realm of autonomous driving systems, where bird's eye view (BEV)-based architectures have recently reached state-of-the-art performance. The desirability of self-supervised representation learning stems from the expensive and laborious process of annotating 2D and 3D data. Although previous research has investigated pretraining methods for both LiDAR and camera-based 3D object detection, a unified pretraining framework for multimodal BEV perception is missing. In this study, we introduce CALICO, a novel framework that applies contrastive objectives to both LiDAR and camera backbones. Specifically, CALICO incorporates two stages: point-region contrast (PRC) and region-aware distillation (RAD). PRC better balances the region- and scene-level representation learning on the LiDAR modality and offers significant performance improvement compared to existing methods. RAD effectively achieves contrastive distillation on our self-trained teacher model. CALICO's efficacy is substantiated by extensive evaluations on 3D object detection and BEV map segmentation tasks, where it delivers significant performance improvements. Notably, CALICO outperforms the baseline method by 10.5% and 8.6% on NDS and mAP. Moreover, CALICO boosts the robustness of multimodal 3D object detection against adversarial attacks and corruption. Additionally, our framework can be tailored to different backbones and heads, positioning it as a promising approach for multimodal BEV perception.
    Task adaption by biologically inspired stochastic comodulation. (arXiv:2311.15053v1 [cs.LG])
    Brain representations must strike a balance between generalizability and adaptability. Neural codes capture general statistical regularities in the world, while dynamically adjusting to reflect current goals. One aspect of this adaptation is stochastically co-modulating neurons' gains based on their task relevance. These fluctuations then propagate downstream to guide decision-making. Here, we test the computational viability of such a scheme in the context of multi-task learning. We show that fine-tuning convolutional networks by stochastic gain modulation improves on deterministic gain modulation, achieving state-of-the-art results on the CelebA dataset. To better understand the mechanisms supporting this improvement, we explore how fine-tuning performance is affected by architecture using Cifar-100. Overall, our results suggest that stochastic comodulation can enhance learning efficiency and performance in multi-task learning, without additional learnable parameters. This offers a promising new direction for developing more flexible and robust intelligent systems.  ( 2 min )
    MPCNN: A Novel Matrix Profile Approach for CNN-based Sleep Apnea Classification. (arXiv:2311.15041v1 [cs.LG])
    Sleep apnea (SA) is a significant respiratory condition that poses a major global health challenge. Previous studies have investigated several machine and deep learning models for electrocardiogram (ECG)-based SA diagnoses. Despite these advancements, conventional feature extractions derived from ECG signals, such as R-peaks and RR intervals, may fail to capture crucial information encompassed within the complete PQRST segments. In this study, we propose an innovative approach to address this diagnostic gap by delving deeper into the comprehensive segments of the ECG signal. The proposed methodology draws inspiration from Matrix Profile algorithms, which generate an Euclidean distance profile from fixed-length signal subsequences. From this, we derived the Min Distance Profile (MinDP), Max Distance Profile (MaxDP), and Mean Distance Profile (MeanDP) based on the minimum, maximum, and mean of the profile distances, respectively. To validate the effectiveness of our approach, we use the modified LeNet-5 architecture as the primary CNN model, along with two existing lightweight models, BAFNet and SE-MSCNN, for ECG classification tasks. Our extensive experimental results on the PhysioNet Apnea-ECG dataset revealed that with the new feature extraction method, we achieved a per-segment accuracy up to 92.11 \% and a per-recording accuracy of 100\%. Moreover, it yielded the highest correlation compared to state-of-the-art methods, with a correlation coefficient of 0.989. By introducing a new feature extraction method based on distance relationships, we enhanced the performance of certain lightweight models, showing potential for home sleep apnea test (HSAT) and SA detection in IoT devices. The source code for this work is made publicly available in GitHub: https://github.com/vinuni-vishc/MPCNN-Sleep-Apnea.  ( 3 min )
    Support Vector Machine Implementation on MPI-CUDA and Tensorflow Framework. (arXiv:2311.14908v1 [cs.DC])
    Support Vector Machine (SVM) algorithm requires a high computational cost (both in memory and time) to solve a complex quadratic programming (QP) optimization problem during the training process. Consequently, SVM necessitates high computing hardware capabilities. The central processing unit (CPU) clock frequency cannot be increased due to physical limitations in the miniaturization process. However, the potential of parallel multi-architecture, available in both multi-core CPUs and highly scalable GPUs, emerges as a promising solution to enhance algorithm performance. Therefore, there is an opportunity to reduce the high computational time required by SVM for solving the QP optimization problem. This paper presents a comparative study that implements the SVM algorithm on different parallel architecture frameworks. The experimental results show that SVM MPI-CUDA implementation achieves a speedup over SVM TensorFlow implementation on different datasets. Moreover, SVM TensorFlow implementation provides a cross-platform solution that can be migrated to alternative hardware components, which will reduces the development time.  ( 2 min )
    Text2Cohort: Facilitating Intuitive Access to Biomedical Data with Natural Language Cohort Discovery. (arXiv:2305.07637v3 [cs.LG] UPDATED)
    The Imaging Data Commons (IDC) is a cloud-based database that provides researchers with open access to cancer imaging data, with the goal of facilitating collaboration. However, cohort discovery within the IDC database has a significant technical learning curve. Recently, large language models (LLM) have demonstrated exceptional utility for natural language processing tasks. We developed Text2Cohort, a LLM-powered toolkit to facilitate user-friendly natural language cohort discovery in the IDC. Our method translates user input into IDC queries using grounding techniques and returns the query's response. We evaluate Text2Cohort on 50 natural language inputs, from information extraction to cohort discovery. Our toolkit successfully generated responses with an 88% accuracy and 0.94 F1 score. We demonstrate that Text2Cohort can enable researchers to discover and curate cohorts on IDC with high levels of accuracy using natural language in a more intuitive and user-friendly way.
    A review of ensemble learning and data augmentation models for class imbalanced problems: combination, implementation and evaluation. (arXiv:2304.02858v3 [cs.LG] UPDATED)
    Class imbalance (CI) in classification problems arises when the number of observations belonging to one class is lower than the other. Ensemble learning combines multiple models to obtain a robust model and has been prominently used with data augmentation methods to address class imbalance problems. In the last decade, a number of strategies have been added to enhance ensemble learning and data augmentation methods, along with new methods such as generative adversarial networks (GANs). A combination of these has been applied in many studies, and the evaluation of different combinations would enable a better understanding and guidance for different application domains. In this paper, we present a computational study to evaluate data augmentation and ensemble learning methods used to address prominent benchmark CI problems. We present a general framework that evaluates 9 data augmentation and 9 ensemble learning methods for CI problems. Our objective is to identify the most effective combination for improving classification performance on imbalanced datasets. The results indicate that combinations of data augmentation methods with ensemble learning can significantly improve classification performance on imbalanced datasets. We find that traditional data augmentation methods such as the synthetic minority oversampling technique (SMOTE) and random oversampling (ROS) are not only better in performance for selected CI problems, but also computationally less expensive than GANs. Our study is vital for the development of novel models for handling imbalanced datasets.
    Detection of developmental language disorder in Cypriot Greek children using a machine learning neural network algorithm. (arXiv:2311.15054v1 [cs.CL])
    Children with developmental language disorder (DLD) encounter difficulties in acquiring various language structures. Early identification and intervention are crucial to prevent negative long-term outcomes impacting the academic, social, and emotional development of children. The study aims to develop an automated method for the identification of DLD using artificial intelligence, specifically a neural network machine learning algorithm. This protocol is applied for the first time in Cypriot Greek children, which is generally considered underresearched in the context of DLD. The neural network model was trained using perceptual and production data elicited from children with DLD and healthy controls. The k-fold technique was used to crossvalidate the algorithm. The performance of the model was evaluated using metrics such as accuracy, precision, recall, F1 score, and ROC/AUC curve to assess its ability to make accurate predictions on a set of unseen data. The results demonstrated high classification values for all metrics (between 0.92 and 0.98), indicating the high accuracy of the neural model in classifying children with DLD. Additionally, the variable importance analysis revealed that the language production skills of children had a more significant impact on the performance of the model compared to perception skills. Neural networks represent powerful tools for detecting DLD, providing early and quick assessments of the disorder, and having the potential to improve clinical outcomes.
    Stochastic analysis of the Elo rating algorithm in round-robin tournaments. (arXiv:2212.12015v2 [cs.LG] UPDATED)
    The Elo algorithm, renowned for its simplicity, is widely used for rating in sports tournaments and other applications. However, despite its widespread use, a detailed understanding of the convergence characteristics of the Elo algorithm is still lacking. Aiming to fill this gap, this paper presents a comprehensive (stochastic) analysis of the Elo algorithm, considering round-robin tournaments. Specifically, analytical expressions are derived describing the evolution of the skills and performance metrics. Then, taking into account the relationship between the behavior of the algorithm and the step-size value, which is a hyperparameter that can be controlled, design guidelines and discussions about the performance of the algorithm are provided. Experimental results are shown confirming the accuracy of the analysis and illustrating the applicability of the theoretical findings using real-world data obtained from SuperLega, the Italian volleyball league.
    On the near-optimality of betting confidence sets for bounded means. (arXiv:2310.01547v2 [math.ST] UPDATED)
    Constructing nonasymptotic confidence intervals (CIs) for the mean of a univariate distribution from independent and identically distributed (i.i.d.) observations is a fundamental task in statistics. For bounded observations, a classical nonparametric approach proceeds by inverting standard concentration bounds, such as Hoeffding's or Bernstein's inequalities. Recently, an alternative betting-based approach for defining CIs and their time-uniform variants called confidence sequences (CSs), has been shown to be empirically superior to the classical methods. In this paper, we provide theoretical justification for this improved empirical performance of betting CIs and CSs. Our main contributions are as follows: (i) We first compare CIs using the values of their first-order asymptotic widths (scaled by $\sqrt{n}$), and show that the betting CI of Waudby-Smith and Ramdas (2023) has a smaller limiting width than existing empirical Bernstein (EB)-CIs. (ii) Next, we establish two lower bounds that characterize the minimum width achievable by any method for constructing CIs/CSs in terms of certain inverse information projections. (iii) Finally, we show that the betting CI and CS match the fundamental limits, modulo an additive logarithmic term and a multiplicative constant. Overall these results imply that the betting CI~(and CS) admit stronger theoretical guarantees than the existing state-of-the-art EB-CI~(and CS); both in the asymptotic and finite-sample regimes.  ( 2 min )
    Empirical Study of PEFT techniques for Winter Wheat Segmentation. (arXiv:2310.01825v2 [cs.CV] UPDATED)
    Parameter Efficient Fine Tuning (PEFT) techniques have recently experienced significant growth and have been extensively employed to adapt large vision and language models to various domains, enabling satisfactory model performance with minimal computational needs. Despite these advances, more research has yet to delve into potential PEFT applications in real-life scenarios, particularly in the critical domains of remote sensing and crop monitoring. The diversity of climates across different regions and the need for comprehensive large-scale datasets have posed significant obstacles to accurately identify crop types across varying geographic locations and changing growing seasons. This study seeks to bridge this gap by comprehensively exploring the feasibility of cross-area and cross-year out-of-distribution generalization using the State-of-the-Art (SOTA) wheat crop monitoring model. The aim of this work is to explore PEFT approaches for crop monitoring. Specifically, we focus on adapting the SOTA TSViT model to address winter wheat field segmentation, a critical task for crop monitoring and food security. This adaptation process involves integrating different PEFT techniques, including BigFit, LoRA, Adaptformer, and prompt tuning. Using PEFT techniques, we achieved notable results comparable to those achieved using full fine-tuning methods while training only a mere 0.7% parameters of the whole TSViT architecture. The in-house labeled data-set, referred to as the Beqaa-Lebanon dataset, comprises high-quality annotated polygons for wheat and non-wheat classes with a total surface of 170 kmsq, over five consecutive years. Using Sentinel-2 images, our model achieved a 84% F1-score. We intend to publicly release the Lebanese winter wheat data set, code repository, and model weights.  ( 3 min )
    Unraveling the "Anomaly" in Time Series Anomaly Detection: A Self-supervised Tri-domain Solution. (arXiv:2311.11235v2 [cs.LG] UPDATED)
    The ongoing challenges in time series anomaly detection (TSAD), notably the scarcity of anomaly labels and the variability in anomaly lengths and shapes, have led to the need for a more efficient solution. As limited anomaly labels hinder traditional supervised models in TSAD, various SOTA deep learning techniques, such as self-supervised learning, have been introduced to tackle this issue. However, they encounter difficulties handling variations in anomaly lengths and shapes, limiting their adaptability to diverse anomalies. Additionally, many benchmark datasets suffer from the problem of having explicit anomalies that even random functions can detect. This problem is exacerbated by ill-posed evaluation metrics, known as point adjustment (PA), which can result in inflated model performance. In this context, we propose a novel self-supervised learning based Tri-domain Anomaly Detector (TriAD), which addresses these challenges by modeling features across three data domains - temporal, frequency, and residual domains - without relying on anomaly labels. Unlike traditional contrastive learning methods, TriAD employs both inter-domain and intra-domain contrastive loss to learn common attributes among normal data and differentiate them from anomalies. Additionally, our approach can detect anomalies of varying lengths by integrating with a discord discovery algorithm. It is worth noting that this study is the first to reevaluate the deep learning potential in TSAD, utilizing both rigorously designed datasets (i.e., UCR Archive) and evaluation metrics (i.e., PA%K and affiliation). Through experimental results on the UCR dataset, TriAD achieves an impressive three-fold increase in PA%K based F1 scores over SOTA deep learning models, and 50% increase of accuracy as compared to SOTA discord discovery algorithms.  ( 3 min )
    Unsupervised learning of site percolation based on shuffled configurations. (arXiv:2311.14725v1 [cond-mat.stat-mech])
    In the field of statistical physics, machine learning has gained significant popularity and has achieved remarkable results in recent studies on phase transitions.In this paper, we apply Principal Component Analysis (PCA) and Autoencoder(AE) based on Unsupervised learning to study the various configurations of the percolation model in equilibrium phase transition. In certain phase transition models, such as the DP model in non-equilibrium phase transitions, the order parameter is particle density. However, in some other phase transition models, such as the percolation model, it is not. This study involved randomizing and selecting percolation graphs to be used as input for a neural network, and analyzed the obtained results, indicating that the outputs of the single latent variable of AE and the first principal component of PCA are signals related to particle density.  ( 2 min )
    One-Shot Transfer Learning for Nonlinear ODEs. (arXiv:2311.14931v1 [cs.LG])
    We introduce a generalizable approach that combines perturbation method and one-shot transfer learning to solve nonlinear ODEs with a single polynomial term, using Physics-Informed Neural Networks (PINNs). Our method transforms non-linear ODEs into linear ODE systems, trains a PINN across varied conditions, and offers a closed-form solution for new instances within the same non-linear ODE class. We demonstrate the effectiveness of this approach on the Duffing equation and suggest its applicability to similarly structured PDEs and ODE systems.  ( 2 min )
    Token-Level Adversarial Prompt Detection Based on Perplexity Measures and Contextual Information. (arXiv:2311.11509v2 [cs.CL] UPDATED)
    In recent years, Large Language Models (LLM) have emerged as pivotal tools in various applications. However, these models are susceptible to adversarial prompt attacks, where attackers can carefully curate input strings that lead to undesirable outputs. The inherent vulnerability of LLMs stems from their input-output mechanisms, especially when presented with intensely out-of-distribution (OOD) inputs. This paper proposes a token-level detection method to identify adversarial prompts, leveraging the LLM's capability to predict the next token's probability. We measure the degree of the model's perplexity and incorporate neighboring token information to encourage the detection of contiguous adversarial prompt sequences. As a result, we propose two methods: one that identifies each token as either being part of an adversarial prompt or not, and another that estimates the probability of each token being part of an adversarial prompt.  ( 2 min )
    Do pretrained Transformers Really Learn In-context by Gradient Descent?. (arXiv:2310.08540v2 [cs.CL] UPDATED)
    Is In-Context Learning (ICL) implicitly equivalent to Gradient Descent (GD)? Several recent works draw analogies between the dynamics of GD and the emergent behavior of ICL in large language models. However, these works make assumptions far from the realistic natural language setting in which language models are trained. Therefore, such discrepancies between theory and practice necessitate further investigation to validate their applicability. We start by highlighting the assumptions in prior works that construct Transformer weights to simulate gradient descent. Their experiments with training Transformers on ICL objective, inconsistencies in the order sensitivity of ICL and GD, sparsity of the constructed weights, and sensitivity to parameter changes are some examples of mismatch from the real-world setting. Furthermore, we probe and compare the ICL vs. GD hypothesis in a natural setting. We conduct comprehensive empirical analyses on language models pretrained on natural data (LLaMa-7B). Our comparisons on various performance metrics highlight the inconsistent behavior of ICL and GD as a function of various factors such as datasets, models, and the number of demonstrations. We observe that ICL and GD modify the output distribution of language models differently. These results indicate that the equivalence between ICL and GD is an open hypothesis, requires nuanced considerations, and calls for further studies.  ( 2 min )
    StratMed: Relevance Stratification between Biomedical Entities for Sparsity on Medication Recommendation. (arXiv:2308.16781v4 [cs.AI] UPDATED)
    With the growing imbalance between limited medical resources and escalating demands, AI-based clinical tasks have become paramount. As a sub-domain, medication recommendation aims to amalgamate longitudinal patient history with medical knowledge, assisting physicians in prescribing safer and more accurate medication combinations. Existing works ignore the inherent long-tailed distribution of medical data, have uneven learning strengths for hot and sparse data, and fail to balance safety and accuracy. To address the above limitations, we propose StratMed, which introduces a stratification strategy that overcomes the long-tailed problem and achieves fuller learning of sparse data. It also utilizes a dual-property network to address the issue of mutual constraints on the safety and accuracy of medication combinations, synergistically enhancing these two properties. Specifically, we construct a pre-training method using deep learning networks to obtain medication and disease representations. After that, we design a pyramid-like stratification method based on relevance to strengthen the expressiveness of sparse data. Based on this relevance, we design two graph structures to express medication safety and precision at the same level to obtain patient representations. Finally, the patient's historical clinical information is fitted to generate medication combinations for the current health condition. We employed the MIMIC-III dataset to evaluate our model against state-of-the-art methods in three aspects comprehensively. Compared to the sub-optimal baseline model, our model reduces safety risk by 15.08\%, improves accuracy by 0.36\%, and reduces training time consumption by 81.66\%.  ( 3 min )
    Fantastic Generalization Measures are Nowhere to be Found. (arXiv:2309.13658v2 [cs.LG] UPDATED)
    We study the notion of a generalization bound being uniformly tight, meaning that the difference between the bound and the population loss is small for all learning algorithms and all population distributions. Numerous generalization bounds have been proposed in the literature as potential explanations for the ability of neural networks to generalize in the overparameterized setting. However, in their paper ``Fantastic Generalization Measures and Where to Find Them,'' Jiang et al. (2020) examine more than a dozen generalization bounds, and show empirically that none of them are uniformly tight. This raises the question of whether uniformly-tight generalization bounds are at all possible in the overparameterized setting. We consider two types of generalization bounds: (1) bounds that may depend on the training set and the learned hypothesis (e.g., margin bounds). We prove mathematically that no such bound can be uniformly tight in the overparameterized setting; (2) bounds that may in addition also depend on the learning algorithm (e.g., stability bounds). For these bounds, we show a trade-off between the algorithm's performance and the bound's tightness. Namely, if the algorithm achieves good accuracy on certain distributions, then no generalization bound can be uniformly tight for it in the overparameterized setting. We explain how these formal results can, in our view, inform research on generalization bounds for neural networks, while stressing that other interpretations of these results are also possible.  ( 3 min )
    Improving Trajectory Prediction in Dynamic Multi-Agent Environment by Dropping Waypoints. (arXiv:2309.17338v2 [cs.RO] UPDATED)
    The inherently diverse and uncertain nature of trajectories presents a formidable challenge in accurately modeling them. Motion prediction systems must effectively learn spatial and temporal information from the past to forecast the future trajectories of the agent. Many existing methods learn temporal motion via separate components within stacked models to capture temporal features. Furthermore, prediction methods often operate under the assumption that observed trajectory waypoint sequences are complete, disregarding scenarios where missing values may occur, which can influence their performance. Moreover, these models may be biased toward particular waypoint sequences when making predictions. We propose a novel approach called Temporal Waypoint Dropping (TWD) that explicitly incorporates temporal dependencies during the training of a trajectory prediction model. By stochastically dropping waypoints from past observed trajectories, the model is forced to learn the underlying temporal representation from the remaining waypoints, resulting in an improved model. Incorporating stochastic temporal waypoint dropping into the model learning process significantly enhances its performance in scenarios with missing values. Experimental results demonstrate our approach's substantial improvement in trajectory prediction capabilities. Our approach can complement existing trajectory prediction methods to improve their prediction accuracy. We evaluate our proposed approach on three datasets: NBA Sports VU, ETH-UCY, and TrajNet++.  ( 2 min )
    Inverse Preference Learning: Preference-based RL without a Reward Function. (arXiv:2305.15363v2 [cs.LG] UPDATED)
    Reward functions are difficult to design and often hard to align with human intent. Preference-based Reinforcement Learning (RL) algorithms address these problems by learning reward functions from human feedback. However, the majority of preference-based RL methods na\"ively combine supervised reward models with off-the-shelf RL algorithms. Contemporary approaches have sought to improve performance and query complexity by using larger and more complex reward architectures such as transformers. Instead of using highly complex architectures, we develop a new and parameter-efficient algorithm, Inverse Preference Learning (IPL), specifically designed for learning from offline preference data. Our key insight is that for a fixed policy, the $Q$-function encodes all information about the reward function, effectively making them interchangeable. Using this insight, we completely eliminate the need for a learned reward function. Our resulting algorithm is simpler and more parameter-efficient. Across a suite of continuous control and robotics benchmarks, IPL attains competitive performance compared to more complex approaches that leverage transformer-based and non-Markovian reward functions while having fewer algorithmic hyperparameters and learned network parameters. Our code is publicly released.  ( 2 min )
    Semantic Parsing by Large Language Models for Intricate Updating Strategies of Zero-Shot Dialogue State Tracking. (arXiv:2310.10520v3 [cs.CL] UPDATED)
    Zero-shot Dialogue State Tracking (DST) addresses the challenge of acquiring and annotating task-oriented dialogues, which can be time-consuming and costly. However, DST extends beyond simple slot-filling and requires effective updating strategies for tracking dialogue state as conversations progress. In this paper, we propose ParsingDST, a new In-Context Learning (ICL) method, to introduce additional intricate updating strategies in zero-shot DST. Our approach reformulates the DST task by leveraging powerful Large Language Models (LLMs) and translating the original dialogue text to JSON through semantic parsing as an intermediate state. We also design a novel framework that includes more modules to ensure the effectiveness of updating strategies in the text-to-JSON process. Experimental results demonstrate that our approach outperforms existing zero-shot DST methods on MultiWOZ, exhibiting significant improvements in Joint Goal Accuracy (JGA) and slot accuracy compared to existing ICL methods. Our code has been released.  ( 2 min )
    Adversarial Bandits with Multi-User Delayed Feedback: Theory and Application. (arXiv:2310.11188v2 [cs.LG] UPDATED)
    The multi-armed bandit (MAB) models have attracted significant research attention due to their applicability and effectiveness in various real-world scenarios such as resource allocation, online advertising, and dynamic pricing. As an important branch, the adversarial MAB problems with delayed feedback have been proposed and studied by many researchers recently where a conceptual adversary strategically selects the reward distributions associated with each arm to challenge the learning algorithm and the agent experiences a delay between taking an action and receiving the corresponding reward feedback. However, the existing models restrict the feedback to be generated from only one user, which makes models inapplicable to the prevailing scenarios of multiple users (e.g. ad recommendation for a group of users). In this paper, we consider that the delayed feedback results are from multiple users and are unrestricted on internal distribution. In contrast, the feedback delay is arbitrary and unknown to the player in advance. Also, for different users in a round, the delays in feedback have no assumption of latent correlation. Thus, we formulate an adversarial MAB problem with multi-user delayed feedback and design a modified EXP3 algorithm MUD-EXP3, which makes a decision at each round by considering the importance-weighted estimator of the received feedback from different users. On the premise of known terminal round index $T$, the number of users $M$, the number of arms $N$, and upper bound of delay $d_{max}$, we prove a regret of $\mathcal{O}(\sqrt{TM^2\ln{N}(N\mathrm{e}+4d_{max})})$. Furthermore, for the more common case of unknown $T$, an adaptive algorithm AMUD-EXP3 is proposed with a sublinear regret with respect to $T$. Finally, extensive experiments are conducted to indicate the correctness and effectiveness of our algorithms.  ( 3 min )
    Leveraging Low-Rank and Sparse Recurrent Connectivity for Robust Closed-Loop Control. (arXiv:2310.03915v2 [cs.LG] UPDATED)
    Developing autonomous agents that can interact with changing environments is an open challenge in machine learning. Robustness is particularly important in these settings as agents are often fit offline on expert demonstrations but deployed online where they must generalize to the closed feedback loop within the environment. In this work, we explore the application of recurrent neural networks to tasks of this nature and understand how a parameterization of their recurrent connectivity influences robustness in closed-loop settings. Specifically, we represent the recurrent connectivity as a function of rank and sparsity and show both theoretically and empirically that modulating these two variables has desirable effects on network dynamics. The proposed low-rank, sparse connectivity induces an interpretable prior on the network that proves to be most amenable for a class of models known as closed-form continuous-time neural networks (CfCs). We find that CfCs with fewer parameters can outperform their full-rank, fully-connected counterparts in the online setting under distribution shift. This yields memory-efficient and robust agents while opening a new perspective on how we can modulate network dynamics through connectivity.  ( 2 min )
    FutureHuman3D: Forecasting Complex Long-Term 3D Human Behavior from Video Observations. (arXiv:2211.14309v2 [cs.CV] UPDATED)
    We present a generative approach to forecast long-term future human behavior in 3D, requiring only weak supervision from readily available 2D human action data. This is a fundamental task enabling many downstream applications. The required ground-truth data is hard to capture in 3D (mocap suits, expensive setups) but easy to acquire in 2D (simple RGB cameras). Thus, we design our method to only require 2D RGB data while being able to generate 3D human motion sequences. We use a differentiable 2D projection scheme in an autoregressive manner for weak supervision, and an adversarial loss for 3D regularization. Our method predicts long and complex behavior sequences (e.g. cooking, assembly) consisting of multiple sub-actions. We tackle this in a semantically hierarchical manner, jointly predicting high-level coarse action labels together with their low-level fine-grained realizations as characteristic 3D human poses. We observe that these two action representations are coupled in nature, and joint prediction benefits both action and pose forecasting. Our experiments demonstrate the complementary nature of joint action and 3D pose prediction: our joint approach outperforms each task treated individually, enables robust longer-term sequence prediction, and outperforms alternative approaches to forecast actions and characteristic 3D poses.  ( 2 min )
    ManiCast: Collaborative Manipulation with Cost-Aware Human Forecasting. (arXiv:2310.13258v2 [cs.RO] UPDATED)
    Seamless human-robot manipulation in close proximity relies on accurate forecasts of human motion. While there has been significant progress in learning forecast models at scale, when applied to manipulation tasks, these models accrue high errors at critical transition points leading to degradation in downstream planning performance. Our key insight is that instead of predicting the most likely human motion, it is sufficient to produce forecasts that capture how future human motion would affect the cost of a robot's plan. We present ManiCast, a novel framework that learns cost-aware human forecasts and feeds them to a model predictive control planner to execute collaborative manipulation tasks. Our framework enables fluid, real-time interactions between a human and a 7-DoF robot arm across a number of real-world tasks such as reactive stirring, object handovers, and collaborative table setting. We evaluate both the motion forecasts and the end-to-end forecaster-planner system against a range of learned and heuristic baselines while additionally contributing new datasets. We release our code and datasets at https://portal-cornell.github.io/manicast/.  ( 2 min )
    Optimal and Fair Encouragement Policy Evaluation and Learning. (arXiv:2309.07176v2 [cs.LG] UPDATED)
    In consequential domains, it is often impossible to compel individuals to take treatment, so that optimal policy rules are merely suggestions in the presence of human non-adherence to treatment recommendations. In these same domains, there may be heterogeneity both in who responds in taking-up treatment, and heterogeneity in treatment efficacy. While optimal treatment rules can maximize causal outcomes across the population, access parity constraints or other fairness considerations can be relevant in the case of encouragement. For example, in social services, a persistent puzzle is the gap in take-up of beneficial services among those who may benefit from them the most. When in addition the decision-maker has distributional preferences over both access and average outcomes, the optimal decision rule changes. We study causal identification, statistical variance-reduced estimation, and robust estimation of optimal treatment rules, including under potential violations of positivity. We consider fairness constraints such as demographic parity in treatment take-up, and other constraints, via constrained optimization. Our framework can be extended to handle algorithmic recommendations under an often-reasonable covariate-conditional exclusion restriction, using our robustness checks for lack of positivity in the recommendation. We develop a two-stage algorithm for solving over parametrized policy classes under general constraints to obtain variance-sensitive regret bounds. We illustrate the methods in two case studies based on data from randomized encouragement to enroll in insurance and from pretrial supervised release with electronic monitoring.  ( 2 min )
    Enhancing Energy-efficiency by Solving the Throughput Bottleneck of LSTM Cells for Embedded FPGAs. (arXiv:2310.16842v2 [cs.AR] UPDATED)
    To process sensor data in the Internet of Things(IoTs), embedded deep learning for 1-dimensional data is an important technique. In the past, CNNs were frequently used because they are simple to optimise for special embedded hardware such as FPGAs. This work proposes a novel LSTM cell optimisation aimed at energy-efficient inference on end devices. Using the traffic speed prediction as a case study, a vanilla LSTM model with the optimised LSTM cell achieves 17534 inferences per second while consuming only 3.8 $\mu$J per inference on the FPGA XC7S15 from Spartan-7 family. It achieves at least 5.4$\times$ faster throughput and 1.37$\times$ more energy efficient than existing approaches.  ( 2 min )
    On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms. (arXiv:2310.15848v3 [cs.LG] UPDATED)
    Artificial Intelligence (AI) has made its way into various scientific fields, providing astonishing improvements over existing algorithms for a wide variety of tasks. In recent years, there have been severe concerns over the trustworthiness of AI technologies. The scientific community has focused on the development of trustworthy AI algorithms. However, machine and deep learning algorithms, popular in the AI community today, depend heavily on the data used during their development. These learning algorithms identify patterns in the data, learning the behavioral objective. Any flaws in the data have the potential to translate directly into algorithms. In this study, we discuss the importance of Responsible Machine Learning Datasets and propose a framework to evaluate the datasets through a responsible rubric. While existing work focuses on the post-hoc evaluation of algorithms for their trustworthiness, we provide a framework that considers the data component separately to understand its role in the algorithm. We discuss responsible datasets through the lens of fairness, privacy, and regulatory compliance and provide recommendations for constructing future datasets. After surveying over 100 datasets, we use 60 datasets for analysis and demonstrate that none of these datasets is immune to issues of fairness, privacy preservation, and regulatory compliance. We provide modifications to the ``datasheets for datasets" with important additions for improved dataset documentation. With governments around the world regularizing data protection laws, the method for the creation of datasets in the scientific community requires revision. We believe this study is timely and relevant in today's era of AI.  ( 3 min )
    Reducing sequential change detection to sequential estimation. (arXiv:2309.09111v2 [math.ST] UPDATED)
    We consider the problem of sequential change detection, where the goal is to design a scheme for detecting any changes in a parameter or functional $\theta$ of the data stream distribution that has small detection delay, but guarantees control on the frequency of false alarms in the absence of changes. In this paper, we describe a simple reduction from sequential change detection to sequential estimation using confidence sequences: we begin a new $(1-\alpha)$-confidence sequence at each time step, and proclaim a change when the intersection of all active confidence sequences becomes empty. We prove that the average run length is at least $1/\alpha$, resulting in a change detection scheme with minimal structural assumptions~(thus allowing for possibly dependent observations, and nonparametric distribution classes), but strong guarantees. Our approach bears an interesting parallel with the reduction from change detection to sequential testing of Lorden (1971) and the e-detector of Shin et al. (2022).  ( 2 min )
    Threshold KNN-Shapley: A Linear-Time and Privacy-Friendly Approach to Data Valuation. (arXiv:2308.15709v2 [cs.LG] UPDATED)
    Data valuation aims to quantify the usefulness of individual data sources in training machine learning (ML) models, and is a critical aspect of data-centric ML research. However, data valuation faces significant yet frequently overlooked privacy challenges despite its importance. This paper studies these challenges with a focus on KNN-Shapley, one of the most practical data valuation methods nowadays. We first emphasize the inherent privacy risks of KNN-Shapley, and demonstrate the significant technical difficulties in adapting KNN-Shapley to accommodate differential privacy (DP). To overcome these challenges, we introduce TKNN-Shapley, a refined variant of KNN-Shapley that is privacy-friendly, allowing for straightforward modifications to incorporate DP guarantee (DP-TKNN-Shapley). We show that DP-TKNN-Shapley has several advantages and offers a superior privacy-utility tradeoff compared to naively privatized KNN-Shapley in discerning data quality. Moreover, even non-private TKNN-Shapley achieves comparable performance as KNN-Shapley. Overall, our findings suggest that TKNN-Shapley is a promising alternative to KNN-Shapley, particularly for real-world applications involving sensitive data.  ( 2 min )
    What If the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-modal Language Models. (arXiv:2310.06627v2 [cs.CL] UPDATED)
    Counterfactual reasoning, a fundamental aspect of human cognition, involves contemplating alternatives to established facts or past events, significantly enhancing our abilities in planning and decision-making. In light of the advancements in current multi-modal large language models, we explore their effectiveness in counterfactual reasoning. To facilitate this investigation, we introduce a novel dataset, C-VQA, specifically designed to test the counterfactual reasoning capabilities of modern multi-modal large language models. This dataset is constructed by infusing original questions with counterfactual presuppositions, spanning various types such as numerical and boolean queries. It encompasses a mix of real and synthetic data, representing a wide range of difficulty levels. Our thorough evaluations of contemporary vision-language models using this dataset have revealed substantial performance drops, with some models showing up to a 40\% decrease, highlighting a significant gap between current models and human-like vision reasoning capabilities. We hope our dataset will serve as a vital benchmark for evaluating the counterfactual reasoning capabilities of models. Code and dataset are publicly available at https://bzhao.me/C-VQA/.  ( 2 min )
    FedSoL: Bridging Global Alignment and Local Generality in Federated Learning. (arXiv:2308.12532v2 [cs.LG] UPDATED)
    Federated Learning (FL) aggregates locally trained models from individual clients to construct a global model. While FL enables learning a model with data privacy, it often suffers from significant performance degradation when client data distributions are heterogeneous. Many previous FL algorithms have addressed this issue by introducing various proximal restrictions. These restrictions aim to encourage global alignment by constraining the deviation of local learning from the global objective. However, they inherently limit local learning by interfering with the original local objectives. Recently, an alternative approach has emerged to improve local learning generality. By obtaining local models within a smooth loss landscape, this approach mitigates conflicts among different local objectives of the clients. Yet, it does not ensure stable global alignment, as local learning does not take the global objective into account. In this study, we propose Federated Stability on Learning (FedSoL), which combines both the concepts of global alignment and local generality. In FedSoL, the local learning seeks a parameter region robust against proximal perturbations. This strategy introduces an implicit proximal restriction effect in local learning while maintaining the original local objective for parameter update. Our experiments show that FedSoL consistently achieves state-of-the-art performance on various setups.  ( 2 min )
    The Chosen One: Consistent Characters in Text-to-Image Diffusion Models. (arXiv:2311.10093v2 [cs.CV] UPDATED)
    Recent advances in text-to-image generation models have unlocked vast potential for visual creativity. However, these models struggle with generation of consistent characters, a crucial aspect for numerous real-world applications such as story visualization, game development asset design, advertising, and more. Current methods typically rely on multiple pre-existing images of the target character or involve labor-intensive manual processes. In this work, we propose a fully automated solution for consistent character generation, with the sole input being a text prompt. We introduce an iterative procedure that, at each stage, identifies a coherent set of images sharing a similar identity and extracts a more consistent identity from this set. Our quantitative analysis demonstrates that our method strikes a better balance between prompt alignment and identity consistency compared to the baseline methods, and these findings are reinforced by a user study. To conclude, we showcase several practical applications of our approach. Project page is available at https://omriavrahami.com/the-chosen-one  ( 2 min )
    Deep Backtracking Counterfactuals for Causally Compliant Explanations. (arXiv:2310.07665v2 [cs.AI] UPDATED)
    Counterfactuals can offer valuable insights by answering what would have been observed under altered circumstances, conditional on a factual observation. Whereas the classical interventional interpretation of counterfactuals has been studied extensively, backtracking constitutes a less studied alternative the backtracking principle has emerged as an alternative philosophy where all causal laws are kept intact. In the present work, we introduce a practical method for computing backtracking counterfactuals in structural causal models that consist of deep generative components. To this end, we impose conditions on the structural assignments that enable the generation of counterfactuals by solving a tractable constrained optimization problem in the structured latent space of a causal model. Our formulation also facilitates a comparison with methods in the field of counterfactual explanations. Compared to these, our method represents a versatile, modular and causally compliant alternative. We demonstrate these properties experimentally on a modified version of MNIST and CelebA.  ( 2 min )
    Deep Calibration of Market Simulations using Neural Density Estimators and Embedding Networks. (arXiv:2311.11913v2 [cs.LG] UPDATED)
    The ability to construct a realistic simulator of financial exchanges, including reproducing the dynamics of the limit order book, can give insight into many counterfactual scenarios, such as a flash crash, a margin call, or changes in macroeconomic outlook. In recent years, agent-based models have been developed that reproduce many features of an exchange, as summarised by a set of stylised facts and statistics. However, the ability to calibrate simulators to a specific period of trading remains an open challenge. In this work, we develop a novel approach to the calibration of market simulators by leveraging recent advances in deep learning, specifically using neural density estimators and embedding networks. We demonstrate that our approach is able to correctly identify high probability parameter sets, both when applied to synthetic and historical data, and without reliance on manually selected or weighted ensembles of stylised facts.  ( 2 min )
    TextGuard: Provable Defense against Backdoor Attacks on Text Classification. (arXiv:2311.11225v2 [cs.LG] UPDATED)
    Backdoor attacks have become a major security threat for deploying machine learning models in security-critical applications. Existing research endeavors have proposed many defenses against backdoor attacks. Despite demonstrating certain empirical defense efficacy, none of these techniques could provide a formal and provable security guarantee against arbitrary attacks. As a result, they can be easily broken by strong adaptive attacks, as shown in our evaluation. In this work, we propose TextGuard, the first provable defense against backdoor attacks on text classification. In particular, TextGuard first divides the (backdoored) training data into sub-training sets, achieved by splitting each training sentence into sub-sentences. This partitioning ensures that a majority of the sub-training sets do not contain the backdoor trigger. Subsequently, a base classifier is trained from each sub-training set, and their ensemble provides the final prediction. We theoretically prove that when the length of the backdoor trigger falls within a certain threshold, TextGuard guarantees that its prediction will remain unaffected by the presence of the triggers in training and testing inputs. In our evaluation, we demonstrate the effectiveness of TextGuard on three benchmark text classification tasks, surpassing the certification accuracy of existing certified defenses against backdoor attacks. Furthermore, we propose additional strategies to enhance the empirical performance of TextGuard. Comparisons with state-of-the-art empirical defenses validate the superiority of TextGuard in countering multiple backdoor attacks. Our code and data are available at https://github.com/AI-secure/TextGuard.  ( 3 min )
    ReLU to the Rescue: Improve Your On-Policy Actor-Critic with Positive Advantages. (arXiv:2306.01460v3 [cs.LG] UPDATED)
    This paper introduces an effective and practical step toward approximate Bayesian inference in on-policy actor-critic deep reinforcement learning. This step manifests as three simple modifications to the Asynchronous Advantage Actor-Critic (A3C) algorithm: (1) applying a ReLU function to advantage estimates, (2) spectral normalization of actor-critic weights, and (3) incorporating dropout as a Bayesian approximation. We prove under standard assumptions that restricting policy updates to positive advantages optimizes for value by maximizing a lower bound on the value function plus an additive term. We show that the additive term is bounded proportional to the Lipschitz constant of the value function, which offers theoretical grounding for spectral normalization of critic weights. Finally, our application of dropout corresponds to approximate Bayesian inference over both the actor and critic parameters, which enables prudent state-aware exploration around the modes of the actor via Thompson sampling. Extensive empirical evaluations on diverse benchmarks reveal the superior performance of our approach compared to existing on- and off-policy algorithms. We demonstrate significant improvements for median and interquartile mean metrics over PPO, SAC, and TD3 on the MuJoCo continuous control benchmark. Moreover, we see improvement over PPO in the challenging ProcGen generalization benchmark.  ( 3 min )
    A Comparison of PDF Projection with Normalizing Flows and SurVAE. (arXiv:2311.14412v2 [cs.LG] UPDATED)
    Normalizing flows (NF) recently gained attention as a way to construct generative networks with exact likelihood calculation out of composable layers. However, NF is restricted to dimension-preserving transformations. Surjection VAE (SurVAE) has been proposed to extend NF to dimension-altering transformations. Such networks are desirable because they are expressive and can be precisely trained. We show that the approaches are a re-invention of PDF projection, which appeared over twenty years earlier and is much further developed.  ( 2 min )
    Unlocking the Potential of Prompt-Tuning in Bridging Generalized and Personalized Federated Learning. (arXiv:2310.18285v2 [cs.LG] UPDATED)
    Vision Transformers (ViT) and Visual Prompt Tuning (VPT) achieve state-of-the-art performance with improved efficiency in various computer vision tasks. This suggests a promising paradigm shift of adapting pre-trained ViT models to Federated Learning (FL) settings. However, the challenge of data heterogeneity among FL clients presents a significant hurdle in effectively deploying ViT models. Existing Generalized FL (GFL) and Personalized FL (PFL) methods have limitations in balancing performance across both global and local data distributions. In this paper, we present a novel algorithm, SGPT, that integrates GFL and PFL approaches by employing a unique combination of both shared and group-specific prompts. This design enables SGPT to capture both common and group-specific features. A key feature of SGPT is its prompt selection module, which facilitates the training of a single global model capable of automatically adapting to diverse local client data distributions without the need for local fine-tuning. To effectively train the prompts, we utilize block coordinate descent (BCD), learning from common feature information (shared prompts), and then more specialized knowledge (group prompts) iteratively. Theoretically, we justify that learning the proposed prompts can reduce the gap between global and local performance. Empirically, we conduct experiments on both label and feature heterogeneity settings in comparison with state-of-the-art baselines, along with extensive ablation studies, to substantiate the superior performance of SGPT.  ( 3 min )
    AST: Effective Dataset Distillation through Alignment with Smooth and High-Quality Expert Trajectories. (arXiv:2310.10541v2 [cs.CV] UPDATED)
    Training large AI models typically requires large-scale datasets in the machine learning process, making training and parameter-tuning process both time-consuming and costly. Some researchers address this problem by carefully synthesizing a very small number of highly representative and informative samples from real-world datasets. This approach, known as Dataset Distillation (DD), proposes a perspective for data-efficient learning. Despite recent progress in this field, the performance of existing methods still cannot meet expectations, and distilled datasets cannot effectively replace original datasets. In this paper, unlike previous methods that focus solely on improving the effectiveness of student distillation, we recognize and leverage the important mutual influence between expert and student models. We observed that the smoothness of expert trajectories has a significant impact on subsequent student parameter alignment. Based on this, we propose an effective DD framework named AST, standing for Alignment with Smooth and high-quality expert Trajectories. We devise the integration of clipping loss and gradient penalty to regulate the rate of parameter changes in expert trajectory generation. To further refine the student parameter alignment with expert trajectory, we put forward representative initialization for the synthetic dataset and balanced inner-loop loss in response to the sensitivity exhibited towards randomly initialized variables during distillation. We also propose two enhancement strategies, namely intermediate matching loss and weight perturbation, to mitigate the potential occurrence of cumulative errors. We conduct extensive experiments on datasets of different scales, sizes, and resolutions. The results demonstrate that the proposed method significantly outperforms prior methods.  ( 3 min )
    MiCRO: Near-Zero Cost Gradient Sparsification for Scaling and Accelerating Distributed DNN Training. (arXiv:2310.00967v2 [cs.LG] UPDATED)
    Gradient sparsification is a communication optimisation technique for scaling and accelerating distributed deep neural network (DNN) training. It reduces the increasing communication traffic for gradient aggregation. However, existing sparsifiers have poor scalability because of the high computational cost of gradient selection and/or increase in communication traffic. In particular, an increase in communication traffic is caused by gradient build-up and inappropriate threshold for gradient selection. To address these challenges, we propose a novel gradient sparsification method called MiCRO. In MiCRO, the gradient vector is partitioned, and each partition is assigned to the corresponding worker. Each worker then selects gradients from its partition, and the aggregated gradients are free from gradient build-up. Moreover, MiCRO estimates the accurate threshold to maintain the communication traffic as per user requirement by minimising the compression ratio error. MiCRO enables near-zero cost gradient sparsification by solving existing problems that hinder the scalability and acceleration of distributed DNN training. In our extensive experiments, MiCRO outperformed state-of-the-art sparsifiers with an outstanding convergence rate.  ( 2 min )
    Self-Tuning Hamiltonian Monte Carlo for Accelerated Sampling. (arXiv:2309.13593v2 [physics.comp-ph] UPDATED)
    The performance of Hamiltonian Monte Carlo simulations crucially depends on both the integration timestep and the number of integration steps. We present an adaptive general-purpose framework to automatically tune such parameters, based on a local loss function which promotes the fast exploration of phase-space. We show that a good correspondence between loss and autocorrelation time can be established, allowing for gradient-based optimization using a fully-differentiable set-up. The loss is constructed in such a way that it also allows for gradient-driven learning of a distribution over the number of integration steps. Our approach is demonstrated for the one-dimensional harmonic oscillator and alanine dipeptide, a small protein common as a test case for simulation methods. Through the application to the harmonic oscillator, we highlight the importance of not using a fixed timestep to avoid a rugged loss surface with many local minima, otherwise trapping the optimization. In the case of alanine dipeptide, by tuning the only free parameter of our loss definition, we find a good correspondence between it and the autocorrelation times, resulting in a $>100$ fold speed up in optimization of simulation parameters compared to a grid-search. For this system, we also extend the integrator to allow for atom-dependent timesteps, providing a further reduction of $25\%$ in autocorrelation times.  ( 2 min )
    The Map Equation Goes Neural. (arXiv:2310.01144v2 [cs.LG] UPDATED)
    Community detection and graph clustering are essential for unsupervised data exploration and understanding the high-level organisation of networked systems. Recently, graph clustering has received attention as a primary task for graph neural networks. Although hierarchical graph pooling has been shown to improve performance in graph and node classification tasks, it performs poorly in identifying meaningful clusters. Community detection has a long history in network science, but typically relies on optimising objective functions with custom-tailored search algorithms, not leveraging recent advances in deep learning, particularly from graph neural networks. In this paper, we narrow this gap between the deep learning and network science communities. We consider the map equation, an information-theoretic objective function for unsupervised community detection. Expressing it in a fully differentiable tensor form that produces soft cluster assignments, we optimise the map equation with deep learning through gradient descent. More specifically, the reformulated map equation is a loss function compatible with any graph neural network architecture, enabling flexible clustering and graph pooling that clusters both graph structure and data features in an end-to-end way, automatically finding an optimum number of clusters without explicit regularisation by following the minimum description length principle. We evaluate our approach experimentally using different neural network architectures for unsupervised clustering in synthetic and real data. Our results show that our approach achieves competitive performance against baselines, naturally detects overlapping communities, and avoids over-partitioning sparse graphs.  ( 2 min )
    Will More Expressive Graph Neural Networks do Better on Generative Tasks?. (arXiv:2308.11978v2 [cs.LG] UPDATED)
    Graph generation poses a significant challenge as it involves predicting a complete graph with multiple nodes and edges based on simply a given label. This task also carries fundamental importance to numerous real-world applications, including de-novo drug and molecular design. In recent years, several successful methods have emerged in the field of graph generation. However, these approaches suffer from two significant shortcomings: (1) the underlying Graph Neural Network (GNN) architectures used in these methods are often underexplored; and (2) these methods are often evaluated on only a limited number of metrics. To fill this gap, we investigate the expressiveness of GNNs under the context of the molecular graph generation task, by replacing the underlying GNNs of graph generative models with more expressive GNNs. Specifically, we analyse the performance of six GNNs on six different molecular generative objectives on the ZINC-250k dataset in two different generative frameworks: autoregressive generation models, such as GCPN and GraphAF, and one-shot generation models, such as GraphEBM. Through our extensive experiments, we demonstrate that advanced GNNs can indeed improve the performance of GCPN, GraphAF, and GraphEBM on molecular generation tasks, but GNN expressiveness is not a necessary condition for a good GNN-based generative model. Moreover, we show that GCPN and GraphAF with advanced GNNs can achieve state-of-the-art results across 17 other non-GNN-based graph generative approaches, such as variational autoencoders and Bayesian optimisation models, on the proposed molecular generative objectives (DRD2, Median1, Median2), which are important metrics for de-novo molecular design.  ( 3 min )
    Fed-GraB: Federated Long-tailed Learning with Self-Adjusting Gradient Balancer. (arXiv:2310.07587v4 [cs.LG] UPDATED)
    Data privacy and long-tailed distribution are the norms rather than the exception in many real-world tasks. This paper investigates a federated long-tailed learning (Fed-LT) task in which each client holds a locally heterogeneous dataset; if the datasets can be globally aggregated, they jointly exhibit a long-tailed distribution. Under such a setting, existing federated optimization and/or centralized long-tailed learning methods hardly apply due to challenges in (a) characterizing the global long-tailed distribution under privacy constraints and (b) adjusting the local learning strategy to cope with the head-tail imbalance. In response, we propose a method termed $\texttt{Fed-GraB}$, comprised of a Self-adjusting Gradient Balancer (SGB) module that re-weights clients' gradients in a closed-loop manner, based on the feedback of global long-tailed distribution evaluated by a Direct Prior Analyzer (DPA) module. Using $\texttt{Fed-GraB}$, clients can effectively alleviate the distribution drift caused by data heterogeneity during the model training process and obtain a global model with better performance on the minority classes while maintaining the performance of the majority classes. Extensive experiments demonstrate that $\texttt{Fed-GraB}$ achieves state-of-the-art performance on representative datasets such as CIFAR-10-LT, CIFAR-100-LT, ImageNet-LT, and iNaturalist.  ( 3 min )
    SLMIA-SR: Speaker-Level Membership Inference Attacks against Speaker Recognition Systems. (arXiv:2309.07983v2 [cs.CR] UPDATED)
    Membership inference attacks allow adversaries to determine whether a particular example was contained in the model's training dataset. While previous works have confirmed the feasibility of such attacks in various applications, none has focused on speaker recognition (SR), a promising voice-based biometric recognition technique. In this work, we propose SLMIA-SR, the first membership inference attack tailored to SR. In contrast to conventional example-level attack, our attack features speaker-level membership inference, i.e., determining if any voices of a given speaker, either the same as or different from the given inference voices, have been involved in the training of a model. It is particularly useful and practical since the training and inference voices are usually distinct, and it is also meaningful considering the open-set nature of SR, namely, the recognition speakers were often not present in the training data. We utilize intra-similarity and inter-dissimilarity, two training objectives of SR, to characterize the differences between training and non-training speakers and quantify them with two groups of features driven by carefully-established feature engineering to mount the attack. To improve the generalizability of our attack, we propose a novel mixing ratio training strategy to train attack models. To enhance the attack performance, we introduce voice chunk splitting to cope with the limited number of inference voices and propose to train attack models dependent on the number of inference voices. Our attack is versatile and can work in both white-box and black-box scenarios. Additionally, we propose two novel techniques to reduce the number of black-box queries while maintaining the attack performance. Extensive experiments demonstrate the effectiveness of SLMIA-SR.  ( 3 min )
    Adversarial Machine Learning in Latent Representations of Neural Networks. (arXiv:2309.17401v2 [cs.LG] UPDATED)
    Distributed deep neural networks (DNNs) have been shown to reduce the computational burden of mobile devices and decrease the end-to-end inference latency in edge computing scenarios. While distributed DNNs have been studied, to the best of our knowledge the resilience of distributed DNNs to adversarial action still remains an open problem. In this paper, we fill the existing research gap by rigorously analyzing the robustness of distributed DNNs against adversarial action. We cast this problem in the context of information theory and introduce two new measurements for distortion and robustness. Our theoretical findings indicate that (i) assuming the same level of information distortion, latent features are always more robust than input representations; (ii) the adversarial robustness is jointly determined by the feature dimension and the generalization capability of the DNN. To test our theoretical findings, we perform extensive experimental analysis by considering 6 different DNN architectures, 6 different approaches for distributed DNN and 10 different adversarial attacks to the ImageNet-1K dataset. Our experimental results support our theoretical findings by showing that the compressed latent representations can reduce the success rate of adversarial attacks by 88% in the best case and by 57% on the average compared to attacks to the input space.  ( 2 min )
    Label Differential Privacy via Aggregation. (arXiv:2310.10092v3 [cs.LG] UPDATED)
    In many real-world applications, due to recent developments in the privacy landscape, training data may be aggregated to preserve the privacy of sensitive training labels. In the learning from label proportions (LLP) framework, the dataset is partitioned into bags of feature-vectors which are available only with the sum of the labels per bag. A further restriction, which we call learning from bag aggregates (LBA) is where instead of individual feature-vectors, only the (possibly weighted) sum of the feature-vectors per bag is available. We study whether such aggregation techniques can provide privacy guarantees under the notion of label differential privacy (label-DP) previously studied in for e.g. [Chaudhuri-Hsu'11, Ghazi et al.'21, Esfandiari et al.'22]. It is easily seen that naive LBA and LLP do not provide label-DP. Our main result however, shows that weighted LBA using iid Gaussian weights with $m$ randomly sampled disjoint $k$-sized bags is in fact $(\varepsilon, \delta)$-label-DP for any $\varepsilon > 0$ with $\delta \approx \exp(-\Omega(\sqrt{k}))$ assuming a lower bound on the linear-mse regression loss. Further, the $\ell_2^2$-regressor which minimizes the loss on the aggregated dataset has a loss within $\left(1 + o(1)\right)$-factor of the optimum on the original dataset w.p. $\approx 1 - exp(-\Omega(m))$. We emphasize that no additive label noise is required. The analogous weighted-LLP does not however admit label-DP. Nevertheless, we show that if additive $N(0, 1)$ noise can be added to any constant fraction of the instance labels, then the noisy weighted-LLP admits similar label-DP guarantees without assumptions on the dataset, while preserving the utility of Lipschitz-bounded neural mse-regression tasks. Our work is the first to demonstrate that label-DP can be achieved by randomly weighted aggregation for regression tasks, using no or little additive noise.  ( 3 min )
    Bayesian Flow Networks. (arXiv:2308.07037v4 [cs.LG] UPDATED)
    This paper introduces Bayesian Flow Networks (BFNs), a new class of generative model in which the parameters of a set of independent distributions are modified with Bayesian inference in the light of noisy data samples, then passed as input to a neural network that outputs a second, interdependent distribution. Starting from a simple prior and iteratively updating the two distributions yields a generative procedure similar to the reverse process of diffusion models; however it is conceptually simpler in that no forward process is required. Discrete and continuous-time loss functions are derived for continuous, discretised and discrete data, along with sample generation procedures. Notably, the network inputs for discrete data lie on the probability simplex, and are therefore natively differentiable, paving the way for gradient-based sample guidance and few-step generation in discrete domains such as language modelling. The loss function directly optimises data compression and places no restrictions on the network architecture. In our experiments BFNs achieve competitive log-likelihoods for image modelling on dynamically binarized MNIST and CIFAR-10, and outperform all known discrete diffusion models on the text8 character-level language modelling task.  ( 2 min )
    The Open DAC 2023 Dataset and Challenges for Sorbent Discovery in Direct Air Capture. (arXiv:2311.00341v2 [cond-mat.mtrl-sci] UPDATED)
    New methods for carbon dioxide removal are urgently needed to combat global climate change. Direct air capture (DAC) is an emerging technology to capture carbon dioxide directly from ambient air. Metal-organic frameworks (MOFs) have been widely studied as potentially customizable adsorbents for DAC. However, discovering promising MOF sorbents for DAC is challenging because of the vast chemical space to explore and the need to understand materials as functions of humidity and temperature. We explore a computational approach benefiting from recent innovations in machine learning (ML) and present a dataset named Open DAC 2023 (ODAC23) consisting of more than 38M density functional theory (DFT) calculations on more than 8,400 MOF materials containing adsorbed $CO_2$ and/or $H_2O$. ODAC23 is by far the largest dataset of MOF adsorption calculations at the DFT level of accuracy currently available. In addition to probing properties of adsorbed molecules, the dataset is a rich source of information on structural relaxation of MOFs, which will be useful in many contexts beyond specific applications for DAC. A large number of MOFs with promising properties for DAC are identified directly in ODAC23. We also trained state-of-the-art ML models on this dataset to approximate calculations at the DFT level. This open-source dataset and our initial ML models will provide an important baseline for future efforts to identify MOFs for a wide range of applications, including DAC.  ( 3 min )
    Machine learning and Topological data analysis identify unique features of human papillae in 3D scans. (arXiv:2307.06255v2 [cs.LG] UPDATED)
    The tongue surface houses a range of papillae that are integral to the mechanics and chemistry of taste and textural sensation. Although gustatory function of papillae is well investigated, the uniqueness of papillae within and across individuals remains elusive. Here, we present the first machine learning framework on 3D microscopic scans of human papillae (n = 2092), uncovering the uniqueness of geometric and topological features of papillae. The finer differences in shapes of papillae are investigated computationally based on a number of features derived from discrete differential geometry and computational topology. Interpretable machine learning techniques show that persistent homology features of the papillae shape are the most effective in predicting the biological variables. Models trained on these features with small volumes of data samples predict the type of papillae with an accuracy of 85%. The papillae type classification models can map the spatial arrangement of filiform and fungiform papillae on a surface. Remarkably, the papillae are found to be distinctive across individuals and an individual can be identified with an accuracy of 48% among the 15 participants from a single papillae. Collectively, this is the first unprecedented evidence demonstrating that tongue papillae can serve as a unique identifier inspiring new research direction for food preferences and oral diagnostics.  ( 3 min )
    Uncovering the Hidden Cost of Model Compression. (arXiv:2308.14969v2 [cs.LG] UPDATED)
    In the era of resource-intensive foundation models, efficient adaptation in downstream tasks has become paramount. Visual Prompting (VP), inspired by prompting in Large Language Models (LLMs), has emerged as a key transfer learning method in computer vision. Aligned with the growing significance of efficiency, research in model compression has become pivotal to alleviate the computational burden in both training and deploying over-parameterized neural networks. A key goal in model compression is the development of sparse models capable of matching or surpassing the performance of their over-parameterized, dense counterparts. While prior research has explored the impact of model sparsity on transfer learning, its effects on visual prompting-based transfer remain unclear. This study addresses this gap, revealing that model sparsity adversely affects the performance of visual prompting-based transfer, particularly in low-data-volume scenarios. Furthermore, our findings highlight the negative influence of sparsity on the calibration of downstream visual-prompted models. This empirical exploration calls for a nuanced understanding beyond accuracy in sparse settings, opening avenues for further research in Visual Prompting for sparse models. Code and logs can be accessed at https://github.com/landskape-ai/Reprogram_LT .  ( 2 min )
    Low-degree learning and the metric entropy of polynomials. (arXiv:2203.09659v3 [cs.LG] UPDATED)
    Let $\mathscr{F}_{n,d}$ be the class of all functions $f:\{-1,1\}^n\to[-1,1]$ on the $n$-dimensional discrete hypercube of degree at most $d$. In the first part of this paper, we prove that any (deterministic or randomized) algorithm which learns $\mathscr{F}_{n,d}$ with $L_2$-accuracy $\varepsilon$ requires at least $\Omega((1-\sqrt{\varepsilon})2^d\log n)$ queries for large enough $n$, thus establishing the sharpness as $n\to\infty$ of a recent upper bound of Eskenazis and Ivanisvili (2021). To do this, we show that the $L_2$-packing numbers $\mathsf{M}(\mathscr{F}_{n,d},\|\cdot\|_{L_2},\varepsilon)$ of the concept class $\mathscr{F}_{n,d}$ satisfy the two-sided estimate $$c(1-\varepsilon)2^d\log n \leq \log \mathsf{M}(\mathscr{F}_{n,d},\|\cdot\|_{L_2},\varepsilon) \leq \frac{2^{Cd}\log n}{\varepsilon^4}$$ for large enough $n$, where $c, C>0$ are universal constants. In the second part of the paper, we present a logarithmic upper bound for the randomized query complexity of classes of bounded approximate polynomials whose Fourier spectra are concentrated on few subsets. As an application, we prove new estimates for the number of random queries required to learn approximate juntas of a given degree, functions with rapidly decaying Fourier tails and constant depth circuits of given size. Finally, we obtain bounds for the number of queries required to learn the polynomial class $\mathscr{F}_{n,d}$ without error in the query and random example models.  ( 2 min )
    Towards Interpretable Sleep Stage Classification Using Cross-Modal Transformers. (arXiv:2208.06991v3 [cs.LG] UPDATED)
    Accurate sleep stage classification is significant for sleep health assessment. In recent years, several machine-learning based sleep staging algorithms have been developed , and in particular, deep-learning based algorithms have achieved performance on par with human annotation. Despite improved performance, a limitation of most deep-learning based algorithms is their black-box behavior, which have limited their use in clinical settings. Here, we propose a cross-modal transformer, which is a transformer-based method for sleep stage classification. The proposed cross-modal transformer consists of a novel cross-modal transformer encoder architecture along with a multi-scale one-dimensional convolutional neural network for automatic representation learning. Our method outperforms the state-of-the-art methods and eliminates the black-box behavior of deep-learning models by utilizing the interpretability aspect of the attention modules. Furthermore, our method provides considerable reductions in the number of parameters and training time compared to the state-of-the-art methods. Our code is available at https://github.com/Jathurshan0330/Cross-Modal-Transformer. A demo of our work can be found at https://bit.ly/Cross_modal_transformer_demo.  ( 2 min )
    AdaptGuard: Defending Against Universal Attacks for Model Adaptation. (arXiv:2303.10594v2 [cs.CR] UPDATED)
    Model adaptation aims at solving the domain transfer problem under the constraint of only accessing the pretrained source models. With the increasing considerations of data privacy and transmission efficiency, this paradigm has been gaining recent popularity. This paper studies the vulnerability to universal attacks transferred from the source domain during model adaptation algorithms due to the existence of malicious providers. We explore both universal adversarial perturbations and backdoor attacks as loopholes on the source side and discover that they still survive in the target models after adaptation. To address this issue, we propose a model preprocessing framework, named AdaptGuard, to improve the security of model adaptation algorithms. AdaptGuard avoids direct use of the risky source parameters through knowledge distillation and utilizes the pseudo adversarial samples under adjusted radius to enhance the robustness. AdaptGuard is a plug-and-play module that requires neither robust pretrained models nor any changes for the following model adaptation algorithms. Extensive results on three commonly used datasets and two popular adaptation methods validate that AdaptGuard can effectively defend against universal attacks and maintain clean accuracy in the target domain simultaneously. We hope this research will shed light on the safety and robustness of transfer learning. Code is available at https://github.com/TomSheng21/AdaptGuard.  ( 2 min )
    AutoWS-Bench-101: Benchmarking Automated Weak Supervision with 100 Labels. (arXiv:2208.14362v2 [cs.LG] UPDATED)
    Weak supervision (WS) is a powerful method to build labeled datasets for training supervised models in the face of little-to-no labeled data. It replaces hand-labeling data with aggregating multiple noisy-but-cheap label estimates expressed by labeling functions (LFs). While it has been used successfully in many domains, weak supervision's application scope is limited by the difficulty of constructing labeling functions for domains with complex or high-dimensional features. To address this, a handful of methods have proposed automating the LF design process using a small set of ground truth labels. In this work, we introduce AutoWS-Bench-101: a framework for evaluating automated WS (AutoWS) techniques in challenging WS settings -- a set of diverse application domains on which it has been previously difficult or impossible to apply traditional WS techniques. While AutoWS is a promising direction toward expanding the application-scope of WS, the emergence of powerful methods such as zero-shot foundation models reveals the need to understand how AutoWS techniques compare or cooperate with modern zero-shot or few-shot learners. This informs the central question of AutoWS-Bench-101: given an initial set of 100 labels for each task, we ask whether a practitioner should use an AutoWS method to generate additional labels or use some simpler baseline, such as zero-shot predictions from a foundation model or supervised learning. We observe that in many settings, it is necessary for AutoWS methods to incorporate signal from foundation models if they are to outperform simple few-shot baselines, and AutoWS-Bench-101 promotes future research in this direction. We conclude with a thorough ablation study of AutoWS methods.  ( 3 min )
    Directional Privacy for Deep Learning. (arXiv:2211.04686v3 [cs.LG] UPDATED)
    Differentially Private Stochastic Gradient Descent (DP-SGD) is a key method for applying privacy in the training of deep learning models. It applies isotropic Gaussian noise to gradients during training, which can perturb these gradients in any direction, damaging utility. Metric DP, however, can provide alternative mechanisms based on arbitrary metrics that might be more suitable for preserving utility. In this paper, we apply \textit{directional privacy}, via a mechanism based on the von Mises-Fisher (VMF) distribution, to perturb gradients in terms of \textit{angular distance} so that gradient direction is broadly preserved. We show that this provides both $\epsilon$-DP and $\epsilon d$-privacy for deep learning training, rather than the $(\epsilon, \delta)$-privacy of the Gaussian mechanism. Experiments on key datasets then indicate that the VMF mechanism can outperform the Gaussian in the utility-privacy trade-off. In particular, our experiments provide a direct empirical comparison of privacy between the two approaches in terms of their ability to defend against reconstruction and membership inference.  ( 2 min )
    Eliciting Latent Predictions from Transformers with the Tuned Lens. (arXiv:2303.08112v4 [cs.LG] UPDATED)
    We analyze transformers from the perspective of iterative inference, seeking to understand how model predictions are refined layer by layer. To do so, we train an affine probe for each block in a frozen pretrained model, making it possible to decode every hidden state into a distribution over the vocabulary. Our method, the \emph{tuned lens}, is a refinement of the earlier ``logit lens'' technique, which yielded useful insights but is often brittle. We test our method on various autoregressive language models with up to 20B parameters, showing it to be more predictive, reliable and unbiased than the logit lens. With causal experiments, we show the tuned lens uses similar features to the model itself. We also find the trajectory of latent predictions can be used to detect malicious inputs with high accuracy. All code needed to reproduce our results can be found at https://github.com/AlignmentResearch/tuned-lens.  ( 2 min )
    TorchRL: A data-driven decision-making library for PyTorch. (arXiv:2306.00577v2 [cs.LG] UPDATED)
    PyTorch has ascended as a premier machine learning framework, yet it lacks a native and comprehensive library for decision and control tasks suitable for large development teams dealing with complex real-world data and environments. To address this issue, we propose TorchRL, a generalistic control library for PyTorch that provides well-integrated, yet standalone components. We introduce a new and flexible PyTorch primitive, the TensorDict, which facilitates streamlined algorithm development across the many branches of Reinforcement Learning (RL) and control. We provide a detailed description of the building blocks and an extensive overview of the library across domains and tasks. Finally, we experimentally demonstrate its reliability and flexibility and show comparative benchmarks to demonstrate its computational efficiency. TorchRL fosters long-term support and is publicly available on GitHub for greater reproducibility and collaboration within the research community. The code is open-sourced on GitHub.  ( 2 min )
    Understanding plasticity in neural networks. (arXiv:2303.01486v4 [cs.LG] UPDATED)
    Plasticity, the ability of a neural network to quickly change its predictions in response to new information, is essential for the adaptability and robustness of deep reinforcement learning systems. Deep neural networks are known to lose plasticity over the course of training even in relatively simple learning problems, but the mechanisms driving this phenomenon are still poorly understood. This paper conducts a systematic empirical analysis into plasticity loss, with the goal of understanding the phenomenon mechanistically in order to guide the future development of targeted solutions. We find that loss of plasticity is deeply connected to changes in the curvature of the loss landscape, but that it often occurs in the absence of saturated units. Based on this insight, we identify a number of parameterization and optimization design choices which enable networks to better preserve plasticity over the course of training. We validate the utility of these findings on larger-scale RL benchmarks in the Arcade Learning Environment.  ( 2 min )
    Environment Invariant Linear Least Squares. (arXiv:2303.03092v2 [math.ST] UPDATED)
    This paper considers a multi-environment linear regression model in which data from multiple experimental settings are collected. The joint distribution of the response variable and covariates may vary across different environments, yet the conditional expectations of $y$ given the unknown set of important variables are invariant. Such a statistical model is related to the problem of endogeneity, causal inference, and transfer learning. The motivation behind it is illustrated by how the goals of prediction and attribution are inherent in estimating the true parameter and the important variable set. We construct a novel environment invariant linear least squares (EILLS) objective function, a multi-environment version of linear least-squares regression that leverages the above conditional expectation invariance structure and heterogeneity among different environments to determine the true parameter. Our proposed method is applicable without any additional structural knowledge and can identify the true parameter under a near-minimal identification condition. We establish non-asymptotic $\ell_2$ error bounds on the estimation error for the EILLS estimator in the presence of spurious variables. Moreover, we further show that the $\ell_0$ penalized EILLS estimator can achieve variable selection consistency in high-dimensional regimes. These non-asymptotic results demonstrate the sample efficiency of the EILLS estimator and its capability to circumvent the curse of endogeneity in an algorithmic manner without any prior structural knowledge. To the best of our knowledge, this paper is the first to realize statistically efficient invariance learning in the general linear model.  ( 3 min )
    Elastic Shape Analysis of Tree-like 3D Objects using Extended SRVF Representation. (arXiv:2110.08693v4 [cs.LG] UPDATED)
    How can one analyze detailed 3D biological objects, such as neurons and botanical trees, that exhibit complex geometrical and topological variation? In this paper, we develop a novel mathematical framework for representing, comparing, and computing geodesic deformations between the shapes of such tree-like 3D objects. A hierarchical organization of subtrees characterizes these objects -- each subtree has the main branch with some side branches attached -- and one needs to match these structures across objects for meaningful comparisons. We propose a novel representation that extends the Square-Root Velocity Function (SRVF), initially developed for Euclidean curves, to tree-shaped 3D objects. We then define a new metric that quantifies the bending, stretching, and branch sliding needed to deform one tree-shaped object into the other. Compared to the current metrics, such as the Quotient Euclidean Distance (QED) and the Tree Edit Distance (TED), the proposed representation and metric capture the full elasticity of the branches (i.e., bending and stretching) as well as the topological variations (i.e., branch death/birth and sliding). It completely avoids the shrinkage that results from the edge collapse and node split operations of the QED and TED metrics. We demonstrate the utility of this framework in comparing, matching, and computing geodesics between biological objects such as neurons and botanical trees. The framework is also applied to various shape analysis tasks: (i) symmetry analysis and symmetrization of tree-shaped 3D objects, (ii) computing summary statistics (means and modes of variations) of populations of tree-shaped 3D objects, (iii) fitting parametric probability distributions to such populations, and (iv) finally synthesizing novel tree-shaped 3D objects through random sampling from estimated probability distributions.  ( 3 min )
    Auto-PINN: Understanding and Optimizing Physics-Informed Neural Architecture. (arXiv:2205.13748v2 [cs.LG] UPDATED)
    Physics-informed neural networks (PINNs) are revolutionizing science and engineering practice by bringing together the power of deep learning to bear on scientific computation. In forward modeling problems, PINNs are meshless partial differential equation (PDE) solvers that can handle irregular, high-dimensional physical domains. Naturally, the neural architecture hyperparameters have a large impact on the efficiency and accuracy of the PINN solver. However, this remains an open and challenging problem because of the large search space and the difficulty of identifying a proper search objective for PDEs. Here, we propose Auto-PINN, the first systematic, automated hyperparameter optimization approach for PINNs, which employs Neural Architecture Search (NAS) techniques to PINN design. Auto-PINN avoids manually or exhaustively searching the hyperparameter space associated with PINNs. A comprehensive set of pre-experiments using standard PDE benchmarks allows us to probe the structure-performance relationship in PINNs. We find that the different hyperparameters can be decoupled, and that the training loss function of PINNs is a good search objective. Comparison experiments with baseline methods demonstrate that Auto-PINN produces neural architectures with superior stability and accuracy over alternative baselines.  ( 2 min )
    Predictive Modeling of Coronal Hole Areas Using Long Short-Term Memory Networks. (arXiv:2301.06732v6 [astro-ph.SR] UPDATED)
    In the era of space exploration, the implications of space weather have become increasingly evident. Central to this is the phenomenon of coronal holes, which can significantly influence the functioning of satellites and aircraft. These coronal holes, present on the sun, are distinguished by their open magnetic field lines and comparatively cooler temperatures, leading to the emission of solar winds at heightened rates. To anticipate the effects of these coronal holes on Earth, our study harnesses computer vision to pinpoint the coronal hole regions and estimate their dimensions using imagery from the Solar Dynamics Observatory (SDO). Further, we deploy deep learning methodologies, specifically the Long Short-Term Memory (LSTM) approach, to analyze the trends in the data related to the area of the coronal holes and predict their dimensions across various solar regions over a span of seven days. By evaluating the time series data concerning the area of the coronal holes, our research seeks to uncover patterns in the behavior of coronal holes and comprehend their potential influence on space weather occurrences. This investigation marks a pivotal stride towards bolstering our capacity to anticipate and brace for space weather events that could have ramifications for Earth and its technological apparatuses.  ( 3 min )
    Self-Guided Diffusion Models. (arXiv:2210.06462v3 [cs.CV] UPDATED)
    Diffusion models have demonstrated remarkable progress in image generation quality, especially when guidance is used to control the generative process. However, guidance requires a large amount of image-annotation pairs for training and is thus dependent on their availability, correctness and unbiasedness. In this paper, we eliminate the need for such annotation by instead leveraging the flexibility of self-supervision signals to design a framework for self-guided diffusion models. By leveraging a feature extraction function and a self-annotation function, our method provides guidance signals at various image granularities: from the level of holistic images to object boxes and even segmentation masks. Our experiments on single-label and multi-label image datasets demonstrate that self-labeled guidance always outperforms diffusion models without guidance and may even surpass guidance based on ground-truth labels, especially on unbalanced data. When equipped with self-supervised box or mask proposals, our method further generates visually diverse yet semantically consistent images, without the need for any class, box, or segment label annotation. Self-guided diffusion is simple, flexible and expected to profit from deployment at scale. Source code will be at: https://taohu.me/sgdm/  ( 2 min )
    Near-Linear Time and Fixed-Parameter Tractable Algorithms for Tensor Decompositions. (arXiv:2207.07417v3 [cs.DS] UPDATED)
    We study low rank approximation of tensors, focusing on the tensor train and Tucker decompositions, as well as approximations with tree tensor networks and more general tensor networks. For tensor train decomposition, we give a bicriteria $(1 + \eps)$-approximation algorithm with a small bicriteria rank and $O(q \cdot \nnz(A))$ running time, up to lower order terms, which improves over the additive error algorithm of \cite{huber2017randomized}. We also show how to convert the algorithm of \cite{huber2017randomized} into a relative error algorithm, but their algorithm necessarily has a running time of $O(qr^2 \cdot \nnz(A)) + n \cdot \poly(qk/\eps)$ when converted to a $(1 + \eps)$-approximation algorithm with bicriteria rank $r$. To the best of our knowledge, our work is the first to achieve polynomial time relative error approximation for tensor train decomposition. Our key technique is a method for obtaining subspace embeddings with a number of rows polynomial in $q$ for a matrix which is the flattening of a tensor train of $q$ tensors. We extend our algorithm to tree tensor networks. In addition, we extend our algorithm to tensor networks with arbitrary graphs (which we refer to as general tensor networks), by using a result of \cite{ms08_simulating_quantum_tensor_contraction} and showing that a general tensor network of rank $k$ can be contracted to a binary tree network of rank $k^{O(\deg(G)\tw(G))}$, allowing us to reduce to the case of tree tensor networks. Finally, we give new fixed-parameter tractable algorithms for the tensor train, Tucker, and CP decompositions, which are simpler than those of \cite{swz19_tensor_low_rank} since they do not make use of polynomial system solvers. Our technique of Gaussian subspace embeddings with exactly $k$ rows (and thus exponentially small success probability) may be of independent interest.  ( 3 min )
    Residual-based physics-informed transfer learning: A hybrid method for accelerating long-term CFD simulations via deep learning. (arXiv:2206.06817v3 [physics.flu-dyn] UPDATED)
    While a big wave of artificial intelligence (AI) has propagated to the field of computational fluid dynamics (CFD) acceleration studies, recent research has highlighted that the development of AI techniques that reconciles the following goals remains our primary task: (1) accurate prediction of unseen (future) time series in long-term CFD simulations (2) acceleration of simulations (3) an acceptable amount of training data and time (4) within a multiple PDEs condition. In this study, we propose a residual-based physics-informed transfer learning (RePIT) strategy to achieve these four objectives using ML-CFD hybrid computation. Our hypothesis is that long-term CFD simulation is feasible with the hybrid method where CFD and AI alternately calculate time series while monitoring the first principle's residuals. The feasibility of RePIT strategy was verified through a CFD case study on natural convection. In a single training approach, a residual scale change occurred around 100th timestep, resulting in predicted time series exhibiting non-physical patterns as well as a significant deviations from the ground truth. Conversely, RePIT strategy maintained the residuals within the defined range and demonstrated good accuracy throughout the entire simulation period. The maximum error from the ground truth was below 0.4 K for temperature and 0.024 m/s for x-axis velocity. Furthermore, the average time for 1 timestep by the ML-GPU and CFD-CPU calculations was 0.171 s and 0.015 s, respectively. Including the parameter-updating time, the simulation was accelerated by a factor of 1.9. In conclusion, our RePIT strategy is a promising technique to reduce the cost of CFD simulations in industry. However, more vigorous optimization and improvement studies are still necessary.  ( 3 min )
    A Computation and Communication Efficient Method for Distributed Nonconvex Problems in the Partial Participation Setting. (arXiv:2205.15580v3 [cs.LG] UPDATED)
    We present a new method that includes three key components of distributed optimization and federated learning: variance reduction of stochastic gradients, partial participation, and compressed communication. We prove that the new method has optimal oracle complexity and state-of-the-art communication complexity in the partial participation setting. Regardless of the communication compression feature, our method successfully combines variance reduction and partial participation: we get the optimal oracle complexity, never need the participation of all nodes, and do not require the bounded gradients (dissimilarity) assumption.  ( 2 min )
    Momentum-Based Policy Gradient with Second-Order Information. (arXiv:2205.08253v3 [cs.LG] UPDATED)
    Variance-reduced gradient estimators for policy gradient methods have been one of the main focus of research in the reinforcement learning in recent years as they allow acceleration of the estimation process. We propose a variance-reduced policy-gradient method, called SHARP, which incorporates second-order information into stochastic gradient descent (SGD) using momentum with a time-varying learning rate. SHARP algorithm is parameter-free, achieving $\epsilon$-approximate first-order stationary point with $O(\epsilon^{-3})$ number of trajectories, while using a batch size of $O(1)$ at each iteration. Unlike most previous work, our proposed algorithm does not require importance sampling which can compromise the advantage of variance reduction process. Moreover, the variance of estimation error decays with the fast rate of $O(1/t^{2/3})$ where $t$ is the number of iterations. Our extensive experimental evaluations show the effectiveness of the proposed algorithm on various control tasks and its advantage over the state of the art in practice.  ( 2 min )
    Assessing Deep Neural Networks as Probability Estimators. (arXiv:2111.08239v2 [cs.LG] UPDATED)
    Deep Neural Networks (DNNs) have performed admirably in classification tasks. However, the characterization of their classification uncertainties, required for certain applications, has been lacking. In this work, we investigate the issue by assessing DNNs' ability to estimate conditional probabilities and propose a framework for systematic uncertainty characterization. Denoting the input sample as x and the category as y, the classification task of assigning a category y to a given input x can be reduced to the task of estimating the conditional probabilities p(y|x), as approximated by the DNN at its last layer using the softmax function. Since softmax yields a vector whose elements all fall in the interval (0, 1) and sum to 1, it suggests a probabilistic interpretation to the DNN's outcome. Using synthetic and real-world datasets, we look into the impact of various factors, e.g., probability density f(x) and inter-categorical sparsity, on the precision of DNNs' estimations of p(y|x), and find that the likelihood probability density and the inter-categorical sparsity have greater impacts than the prior probability to DNNs' classification uncertainty.  ( 3 min )
    Online Estimation and Optimization of Utility-Based Shortfall Risk. (arXiv:2111.08805v3 [stat.ML] UPDATED)
    Utility-Based Shortfall Risk (UBSR) is a risk metric that is increasingly popular in financial applications, owing to certain desirable properties that it enjoys. We consider the problem of estimating UBSR in a recursive setting, where samples from the underlying loss distribution are available one-at-a-time. We cast the UBSR estimation problem as a root finding problem, and propose stochastic approximation-based estimations schemes. We derive non-asymptotic bounds on the estimation error in the number of samples. We also consider the problem of UBSR optimization within a parameterized class of random variables. We propose a stochastic gradient descent based algorithm for UBSR optimization, and derive non-asymptotic bounds on its convergence.  ( 2 min )
    A deep reinforcement learning model for predictive maintenance planning of road assets: Integrating LCA and LCCA. (arXiv:2112.12589v3 [cs.LG] UPDATED)
    Road maintenance planning is an integral part of road asset management. One of the main challenges in Maintenance and Rehabilitation (M&R) practices is to determine maintenance type and timing. This research proposes a framework using Reinforcement Learning (RL) based on the Long Term Pavement Performance (LTPP) database to determine the type and timing of M&R practices. A predictive DNN model is first developed in the proposed algorithm, which serves as the Environment for the RL algorithm. For the Policy estimation of the RL model, both DQN and PPO models are developed. However, PPO has been selected in the end due to better convergence and higher sample efficiency. Indicators used in this study are International Roughness Index (IRI) and Rutting Depth (RD). Initially, we considered Cracking Metric (CM) as the third indicator, but it was then excluded due to the much fewer data compared to other indicators, which resulted in lower accuracy of the results. Furthermore, in cost-effectiveness calculation (reward), we considered both the economic and environmental impacts of M&R treatments. Costs and environmental impacts have been evaluated with paLATE 2.0 software. Our method is tested on a hypothetical case study of a six-lane highway with 23 kilometers length located in Texas, which has a warm and wet climate. The results propose a 20-year M&R plan in which road condition remains in an excellent condition range. Because the early state of the road is at a good level of service, there is no need for heavy maintenance practices in the first years. Later, after heavy M&R actions, there are several 1-2 years of no need for treatments. All of these show that the proposed plan has a logical result. Decision-makers and transportation agencies can use this scheme to conduct better maintenance practices that can prevent budget waste and, at the same time, minimize the environmental impacts.  ( 3 min )
    Artificial Neural Networks generated by Low Discrepancy Sequences. (arXiv:2103.03543v2 [cs.LG] UPDATED)
    Artificial neural networks can be represented by paths. Generated as random walks on a dense network graph, we find that the resulting sparse networks allow for deterministic initialization and even weights with fixed sign. Such networks can be trained sparse from scratch, avoiding the expensive procedure of training a dense network and compressing it afterwards. Although sparse, weights are accessed as contiguous blocks of memory. In addition, enumerating the paths using deterministic low discrepancy sequences, for example the Sobol' sequence, amounts to connecting the layers of neural units by progressive permutations, which naturally avoids bank conflicts in parallel computer hardware. We demonstrate that the artificial neural networks generated by low discrepancy sequences can achieve an accuracy within reach of their dense counterparts at a much lower computational complexity.  ( 2 min )
    Bridging Adversarial and Nonstationary Multi-armed Bandit. (arXiv:2201.01628v3 [cs.LG] UPDATED)
    In the multi-armed bandit framework, there are two formulations that are commonly employed to handle time-varying reward distributions: adversarial bandit and nonstationary bandit. Although their oracles, algorithms, and regret analysis differ significantly, we provide a unified formulation in this paper that smoothly bridges the two as special cases. The formulation uses an oracle that takes the best-fixed arm within time windows. Depending on the window size, it turns into the oracle in hindsight in the adversarial bandit and dynamic oracle in the nonstationary bandit. We provide algorithms that attain the optimal regret with the matching lower bound.  ( 2 min )
    Analysis of functional neural codes of deep learning models: Functional Telescope Hypothesis. (arXiv:2205.10952v3 [cs.LG] UPDATED)
    Deep neural networks (DNNs), the agents of deep learning (DL), require a massive number of parallel/sequential operations. This makes it difficult to comprehend DNNs' operations and impedes proper diagnosis. Without better knowledge of their internal process, deploying DNNs in high-stakes domains can lead to catastrophic failures. Therefore, to build more reliable DNNs/DL to be deployed in high-stakes real-world problems, it is imperative that we gain insights into DNNs' internal operations underlying their decision-making. Here, we use the self-organizing map (SOM) to analyze DL models' internal codes associated with DNNs' decision-making. Our analyses suggest that shallow layers close to the input layer compress features into condensed space and that deep layers close to the output layer expand feature space. We also found evidence indicating that compressed features may underlie DNNs' vulnerabilities to adversarial perturbations.  ( 2 min )
    Dimensionality Reduction and Wasserstein Stability for Kernel Regression. (arXiv:2203.09347v3 [stat.ML] UPDATED)
    In a high-dimensional regression framework, we study consequences of the naive two-step procedure where first the dimension of the input variables is reduced and second, the reduced input variables are used to predict the output variable with kernel regression. In order to analyze the resulting regression errors, a novel stability result for kernel regression with respect to the Wasserstein distance is derived. This allows us to bound errors that occur when perturbed input data is used to fit the regression function. We apply the general stability result to principal component analysis (PCA). Exploiting known estimates from the literature on both principal component analysis and kernel regression, we deduce convergence rates for the two-step procedure. The latter turns out to be particularly useful in a semi-supervised setting.  ( 2 min )
    Asymptotic Bounds for Smoothness Parameter Estimates in Gaussian Process Interpolation. (arXiv:2203.05400v5 [math.ST] UPDATED)
    It is common to model a deterministic response function, such as the output of a computer experiment, as a Gaussian process with a Mat\'ern covariance kernel. The smoothness parameter of a Mat\'ern kernel determines many important properties of the model in the large data limit, including the rate of convergence of the conditional mean to the response function. We prove that the maximum likelihood estimate of the smoothness parameter cannot asymptotically undersmooth the truth when the data are obtained on a fixed bounded subset of $\mathbb{R}^d$. That is, if the data-generating response function has Sobolev smoothness $\nu_0 > d/2$, then the smoothness parameter estimate cannot be asymptotically less than $\nu_0$. The lower bound is sharp. Additionally, we show that maximum likelihood estimation recovers the true smoothness for a class of compactly supported self-similar functions. For cross-validation we prove an asymptotic lower bound $\nu_0 - d/2$, which however is unlikely to be sharp. The results are based on approximation theory in Sobolev spaces and some general theorems that restrict the set of values that the parameter estimators can take.  ( 2 min )
    Effective Backdoor Mitigation Depends on the Pre-training Objective. (arXiv:2311.14948v1 [cs.LG])
    Despite the advanced capabilities of contemporary machine learning (ML) models, they remain vulnerable to adversarial and backdoor attacks. This vulnerability is particularly concerning in real-world deployments, where compromised models may exhibit unpredictable behavior in critical scenarios. Such risks are heightened by the prevalent practice of collecting massive, internet-sourced datasets for pre-training multimodal models, as these datasets may harbor backdoors. Various techniques have been proposed to mitigate the effects of backdooring in these models such as CleanCLIP which is the current state-of-the-art approach. In this work, we demonstrate that the efficacy of CleanCLIP in mitigating backdoors is highly dependent on the particular objective used during model pre-training. We observe that stronger pre-training objectives correlate with harder to remove backdoors behaviors. We show this by training multimodal models on two large datasets consisting of 3 million (CC3M) and 6 million (CC6M) datapoints, under various pre-training objectives, followed by poison removal using CleanCLIP. We find that CleanCLIP is ineffective when stronger pre-training objectives are used, even with extensive hyperparameter tuning. Our findings underscore critical considerations for ML practitioners who pre-train models using large-scale web-curated data and are concerned about potential backdoor threats. Notably, our results suggest that simpler pre-training objectives are more amenable to effective backdoor removal. This insight is pivotal for practitioners seeking to balance the trade-offs between using stronger pre-training objectives and security against backdoor attacks.  ( 3 min )
    Accurate and interpretable drug-drug interaction prediction enabled by knowledge subgraph learning. (arXiv:2311.15056v1 [cs.LG])
    Background: Discovering potential drug-drug interactions (DDIs) is a long-standing challenge in clinical treatments and drug developments. Recently, deep learning techniques have been developed for DDI prediction. However, they generally require a huge number of samples, while known DDIs are rare. Methods: In this work, we present KnowDDI, a graph neural network-based method that addresses the above challenge. KnowDDI enhances drug representations by adaptively leveraging rich neighborhood information from large biomedical knowledge graphs. Then, it learns a knowledge subgraph for each drug-pair to interpret the predicted DDI, where each of the edges is associated with a connection strength indicating the importance of a known DDI or resembling strength between a drug-pair whose connection is unknown. Thus, the lack of DDIs is implicitly compensated by the enriched drug representations and propagated drug similarities. Results: We evaluate KnowDDI on two benchmark DDI datasets. Results show that KnowDDI obtains the state-of-the-art prediction performance with better interpretability. We also find that KnowDDI suffers less than existing works given a sparser knowledge graph. This indicates that the propagated drug similarities play a more important role in compensating for the lack of DDIs when the drug representations are less enriched. Conclusions: KnowDDI nicely combines the efficiency of deep learning techniques and the rich prior knowledge in biomedical knowledge graphs. As an original open-source tool, KnowDDI can help detect possible interactions in a broad range of relevant interaction prediction tasks, such as protein-protein interactions, drug-target interactions and disease-gene interactions, eventually promoting the development of biomedicine and healthcare.  ( 3 min )
    LLM-Assisted Code Cleaning For Training Accurate Code Generators. (arXiv:2311.14904v1 [cs.LG])
    Natural language to code generation is an important application area of LLMs and has received wide attention from the community. The majority of relevant studies have exclusively concentrated on increasing the quantity and functional correctness of training sets while disregarding other stylistic elements of programs. More recently, data quality has garnered a lot of interest and multiple works have showcased its importance for improving performance. In this work, we investigate data quality for code and find that making the code more structured and readable leads to improved code generation performance of the system. We build a novel data-cleaning pipeline that uses these principles to transform existing programs by 1.) renaming variables, 2.) modularizing and decomposing complex code into smaller helper sub-functions, and 3.) inserting natural-language based plans via LLM based transformations. We evaluate our approach on two challenging algorithmic code generation benchmarks and find that fine-tuning CodeLLaMa-7B on our transformed modularized programs improves the performance by up to 30% compared to fine-tuning on the original dataset. Additionally, we demonstrate improved performance from using a smaller amount of higher-quality data, finding that a model fine-tuned on the entire original dataset is outperformed by a model trained on 15% of our cleaned dataset. Even in comparison to closed-source models, our models outperform the much larger AlphaCoder models.  ( 2 min )
    Forecasting Cryptocurrency Prices Using Deep Learning: Integrating Financial, Blockchain, and Text Data. (arXiv:2311.14759v1 [q-fin.ST])
    This paper explores the application of Machine Learning (ML) and Natural Language Processing (NLP) techniques in cryptocurrency price forecasting, specifically Bitcoin (BTC) and Ethereum (ETH). Focusing on news and social media data, primarily from Twitter and Reddit, we analyse the influence of public sentiment on cryptocurrency valuations using advanced deep learning NLP methods. Alongside conventional price regression, we treat cryptocurrency price forecasting as a classification problem. This includes both the prediction of price movements (up or down) and the identification of local extrema. We compare the performance of various ML models, both with and without NLP data integration. Our findings reveal that incorporating NLP data significantly enhances the forecasting performance of our models. We discover that pre-trained models, such as Twitter-RoBERTa and BART MNLI, are highly effective in capturing market sentiment, and that fine-tuning Large Language Models (LLMs) also yields substantial forecasting improvements. Notably, the BART MNLI zero-shot classification model shows considerable proficiency in extracting bullish and bearish signals from textual data. All of our models consistently generate profit across different validation scenarios, with no observed decline in profits or reduction in the impact of NLP data over time. The study highlights the potential of text analysis in improving financial forecasts and demonstrates the effectiveness of various NLP techniques in capturing nuanced market sentiment.  ( 2 min )
    Reinforcement Learning from Statistical Feedback: the Journey from AB Testing to ANT Testing. (arXiv:2311.14766v1 [cs.LG])
    Reinforcement Learning from Human Feedback (RLHF) has played a crucial role in the success of large models such as ChatGPT. RLHF is a reinforcement learning framework which combines human feedback to improve learning effectiveness and performance. However, obtaining preferences feedback manually is quite expensive in commercial applications. Some statistical commercial indicators are usually more valuable and always ignored in RLHF. There exists a gap between commercial target and model training. In our research, we will attempt to fill this gap with statistical business feedback instead of human feedback, using AB testing which is a well-established statistical method. Reinforcement Learning from Statistical Feedback (RLSF) based on AB testing is proposed. Statistical inference methods are used to obtain preferences for training the reward network, which fine-tunes the pre-trained model in reinforcement learning framework, achieving greater business value. Furthermore, we extend AB testing with double selections at a single time-point to ANT testing with multiple selections at different feedback time points. Moreover, we design numerical experiences to validate the effectiveness of our algorithm framework.  ( 2 min )
    ExCeL : Combined Extreme and Collective Logit Information for Enhancing Out-of-Distribution Detection. (arXiv:2311.14754v1 [cs.LG])
    Deep learning models often exhibit overconfidence in predicting out-of-distribution (OOD) data, underscoring the crucial role of OOD detection in ensuring reliability in predictions. Among various OOD detection approaches, post-hoc detectors have gained significant popularity, primarily due to their ease of use and implementation. However, the effectiveness of most post-hoc OOD detectors has been constrained as they rely solely either on extreme information, such as the maximum logit, or on the collective information (i.e., information spanned across classes or training samples) embedded within the output layer. In this paper, we propose ExCeL that combines both extreme and collective information within the output layer for enhanced accuracy in OOD detection. We leverage the logit of the top predicted class as the extreme information (i.e., the maximum logit), while the collective information is derived in a novel approach that involves assessing the likelihood of other classes appearing in subsequent ranks across various training samples. Our idea is motivated by the observation that, for in-distribution (ID) data, the ranking of classes beyond the predicted class is more deterministic compared to that in OOD data. Experiments conducted on CIFAR100 and ImageNet-200 datasets demonstrate that ExCeL consistently is among the five top-performing methods out of twenty-one existing post-hoc baselines when the joint performance on near-OOD and far-OOD is considered (i.e., in terms of AUROC and FPR95). Furthermore, ExCeL shows the best overall performance across both datasets, unlike other baselines that work best on one dataset but has a performance drop in the other.  ( 3 min )
    On-Device Soft Sensors: Real-Time Fluid Flow Estimation from Level Sensor Data. (arXiv:2311.15036v1 [cs.LG])
    Soft sensors are crucial in bridging autonomous systems' physical and digital realms, enhancing sensor fusion and perception. Instead of deploying soft sensors on the Cloud, this study shift towards employing on-device soft sensors, promising heightened efficiency and bolstering data security. Our approach substantially improves energy efficiency by deploying Artificial Intelligence (AI) directly on devices within a wireless sensor network. Furthermore, the synergistic integration of the Microcontroller Unit and Field-Programmable Gate Array (FPGA) leverages the rapid AI inference capabilities of the latter. Empirical evidence from our real-world use case demonstrates that FPGA-based soft sensors achieve inference times ranging remarkably from 1.04 to 12.04 microseconds. These compelling results highlight the considerable potential of our innovative approach for executing real-time inference tasks efficiently, thereby presenting a feasible alternative that effectively addresses the latency challenges intrinsic to Cloud-based deployments.  ( 2 min )
    Revisiting Quantum Algorithms for Linear Regressions: Quadratic Speedups without Data-Dependent Parameters. (arXiv:2311.14823v1 [quant-ph])
    Linear regression is one of the most fundamental linear algebra problems. Given a dense matrix $A \in \mathbb{R}^{n \times d}$ and a vector $b$, the goal is to find $x'$ such that $ \| Ax' - b \|_2^2 \leq (1+\epsilon) \min_{x} \| A x - b \|_2^2 $. The best classical algorithm takes $O(nd) + \mathrm{poly}(d/\epsilon)$ time [Clarkson and Woodruff STOC 2013, Nelson and Nguyen FOCS 2013]. On the other hand, quantum linear regression algorithms can achieve exponential quantum speedups, as shown in [Wang Phys. Rev. A 96, 012335, Kerenidis and Prakash ITCS 2017, Chakraborty, Gily{\'e}n and Jeffery ICALP 2019]. However, the running times of these algorithms depend on some quantum linear algebra-related parameters, such as $\kappa(A)$, the condition number of $A$. In this work, we develop a quantum algorithm that runs in $\widetilde{O}(\epsilon^{-1}\sqrt{n}d^{1.5}) + \mathrm{poly}(d/\epsilon)$ time. It provides a quadratic quantum speedup in $n$ over the classical lower bound without any dependence on data-dependent parameters. In addition, we also show our result can be generalized to multiple regression and ridge linear regression.  ( 2 min )
    Neural Network Based Approach to Recognition of Meteor Tracks in the Mini-EUSO Telescope Data. (arXiv:2311.14983v1 [astro-ph.IM])
    Mini-EUSO is a wide-angle fluorescence telescope that registers ultraviolet (UV) radiation in the nocturnal atmosphere of Earth from the International Space Station. Meteors are among multiple phenomena that manifest themselves not only in the visible range but also in the UV. We present two simple artificial neural networks that allow for recognizing meteor signals in the Mini-EUSO data with high accuracy in terms of a binary classification problem. We expect that similar architectures can be effectively used for signal recognition in other fluorescence telescopes, regardless of the nature of the signal. Due to their simplicity, the networks can be implemented in onboard electronics of future orbital or balloon experiments.  ( 3 min )
    Disruption Prediction in Fusion Devices through Feature Extraction and Logistic Regression. (arXiv:2311.14856v1 [physics.plasm-ph])
    This document describes an approach used in the Multi-Machine Disruption Prediction Challenge for Fusion Energy by ITU, a data science competition which ran from September to November 2023, on the online platform Zindi. The competition involved data from three fusion devices - C-Mod, HL-2A, and J-TEXT - with most of the training data coming from the last two, and the test data coming from the first one. Each device has multiple diagnostics and signals, and it turns out that a critical issue in this competition was to identify which signals, and especially which features from those signals, were most relevant to achieve accurate predictions. The approach described here is based on extracting features from signals, and then applying logistic regression on top of those features. Each signal is treated as a separate predictor and, in the end, a combination of such predictors achieved the first place on the leaderboard.  ( 2 min )
    Training a Hopfield Variational Autoencoder with Equilibrium Propagation. (arXiv:2311.15047v1 [cs.LG])
    On dedicated analog hardware, equilibrium propagation is an energy-efficient alternative to backpropagation. In spite of its theoretical guarantees, its application in the AI domain remains limited to the discriminative setting. Meanwhile, despite its high computational demands, generative AI is on the rise. In this paper, we demonstrate the application of Equilibrium Propagation in training a variational autoencoder (VAE) for generative modeling. Leveraging the symmetric nature of Hopfield networks, we propose using a single model to serve as both the encoder and decoder which could effectively halve the required chip size for VAE implementations, paving the way for more efficient analog hardware configurations.  ( 2 min )
    Large Catapults in Momentum Gradient Descent with Warmup: An Empirical Study. (arXiv:2311.15051v1 [cs.LG])
    Although gradient descent with momentum is widely used in modern deep learning, a concrete understanding of its effects on the training trajectory still remains elusive. In this work, we empirically show that momentum gradient descent with a large learning rate and learning rate warmup displays large catapults, driving the iterates towards flatter minima than those found by gradient descent. We then provide empirical evidence and theoretical intuition that the large catapult is caused by momentum "amplifying" the self-stabilization effect (Damian et al., 2023).  ( 2 min )
    A latent linear model for nonlinear coupled oscillators on graphs. (arXiv:2311.14910v1 [math.DS])
    A system of coupled oscillators on an arbitrary graph is locally driven by the tendency to mutual synchronization between nearby oscillators, but can and often exhibit nonlinear behavior on the whole graph. Understanding such nonlinear behavior has been a key challenge in predicting whether all oscillators in such a system will eventually synchronize. In this paper, we demonstrate that, surprisingly, such nonlinear behavior of coupled oscillators can be effectively linearized in certain latent dynamic spaces. The key insight is that there is a small number of `latent dynamics filters', each with a specific association with synchronizing and non-synchronizing dynamics on subgraphs so that any observed dynamics on subgraphs can be approximated by a suitable linear combination of such elementary dynamic patterns. Taking an ensemble of subgraph-level predictions provides an interpretable predictor for whether the system on the whole graph reaches global synchronization. We propose algorithms based on supervised matrix factorization to learn such latent dynamics filters. We demonstrate that our method performs competitively in synchronization prediction tasks against baselines and black-box classification algorithms, despite its simple and interpretable architecture.  ( 2 min )
    Projected Off-Policy Q-Learning (POP-QL) for Stabilizing Offline Reinforcement Learning. (arXiv:2311.14885v1 [cs.LG])
    A key problem in off-policy Reinforcement Learning (RL) is the mismatch, or distribution shift, between the dataset and the distribution over states and actions visited by the learned policy. This problem is exacerbated in the fully offline setting. The main approach to correct this shift has been through importance sampling, which leads to high-variance gradients. Other approaches, such as conservatism or behavior-regularization, regularize the policy at the cost of performance. In this paper, we propose a new approach for stable off-policy Q-Learning. Our method, Projected Off-Policy Q-Learning (POP-QL), is a novel actor-critic algorithm that simultaneously reweights off-policy samples and constrains the policy to prevent divergence and reduce value-approximation error. In our experiments, POP-QL not only shows competitive performance on standard benchmarks, but also out-performs competing methods in tasks where the data-collection policy is significantly sub-optimal.  ( 2 min )
    OpenNet: Incremental Learning for Autonomous Driving Object Detection with Balanced Loss. (arXiv:2311.14939v1 [cs.CV])
    Automated driving object detection has always been a challenging task in computer vision due to environmental uncertainties. These uncertainties include significant differences in object sizes and encountering the class unseen. It may result in poor performance when traditional object detection models are directly applied to automated driving detection. Because they usually presume fixed categories of common traffic participants, such as pedestrians and cars. Worsely, the huge class imbalance between common and novel classes further exacerbates performance degradation. To address the issues stated, we propose OpenNet to moderate the class imbalance with the Balanced Loss, which is based on Cross Entropy Loss. Besides, we adopt an inductive layer based on gradient reshaping to fast learn new classes with limited samples during incremental learning. To against catastrophic forgetting, we employ normalized feature distillation. By the way, we improve multi-scale detection robustness and unknown class recognition through FPN and energy-based detection, respectively. The Experimental results upon the CODA dataset show that the proposed method can obtain better performance than that of the existing methods.  ( 2 min )
    Exploring Causal Learning through Graph Neural Networks: An In-depth Review. (arXiv:2311.14994v1 [cs.LG])
    In machine learning, exploring data correlations to predict outcomes is a fundamental task. Recognizing causal relationships embedded within data is pivotal for a comprehensive understanding of system dynamics, the significance of which is paramount in data-driven decision-making processes. Beyond traditional methods, there has been a surge in the use of graph neural networks (GNNs) for causal learning, given their capabilities as universal data approximators. Thus, a thorough review of the advancements in causal learning using GNNs is both relevant and timely. To structure this review, we introduce a novel taxonomy that encompasses various state-of-the-art GNN methods employed in studying causality. GNNs are further categorized based on their applications in the causality domain. We further provide an exhaustive compilation of datasets integral to causal learning with GNNs to serve as a resource for practical study. This review also touches upon the application of causal learning across diverse sectors. We conclude the review with insights into potential challenges and promising avenues for future exploration in this rapidly evolving field of machine learning.  ( 2 min )
    Robust Graph Neural Networks via Unbiased Aggregation. (arXiv:2311.14934v1 [cs.LG])
    The adversarial robustness of Graph Neural Networks (GNNs) has been questioned due to the false sense of security uncovered by strong adaptive attacks despite the existence of numerous defenses. In this work, we delve into the robustness analysis of representative robust GNNs and provide a unified robust estimation point of view to understand their robustness and limitations. Our novel analysis of estimation bias motivates the design of a robust and unbiased graph signal estimator. We then develop an efficient Quasi-Newton iterative reweighted least squares algorithm to solve the estimation problem, which unfolds as robust unbiased aggregation layers in GNNs with a theoretical convergence guarantee. Our comprehensive experiments confirm the strong robustness of our proposed model, and the ablation study provides a deep understanding of its advantages.  ( 2 min )
    Satellite-based feature extraction and multivariate time-series prediction of biotoxin contamination in shellfish. (arXiv:2311.15000v1 [cs.LG])
    Shellfish production constitutes an important sector for the economy of many Portuguese coastal regions, yet the challenge of shellfish biotoxin contamination poses both public health concerns and significant economic risks. Thus, predicting shellfish contamination levels holds great potential for enhancing production management and safeguarding public health. In our study, we utilize a dataset with years of Sentinel-3 satellite imagery for marine surveillance, along with shellfish biotoxin contamination data from various production areas along Portugal's western coastline, collected by Portuguese official control. Our goal is to evaluate the integration of satellite data in forecasting models for predicting toxin concentrations in shellfish given forecasting horizons up to four weeks, which implies extracting a small set of useful features and assessing their impact on the predictive models. We framed this challenge as a time-series forecasting problem, leveraging historical contamination levels and satellite images for designated areas. While contamination measurements occurred weekly, satellite images were accessible multiple times per week. Unsupervised feature extraction was performed using autoencoders able to handle non-valid pixels caused by factors like cloud cover, land, or anomalies. Finally, several Artificial Neural Networks models were applied to compare univariate (contamination only) and multivariate (contamination and satellite data) time-series forecasting. Our findings show that incorporating these features enhances predictions, especially beyond one week in lagoon production areas (RIAV) and for the 1-week and 2-week horizons in the L5B area (oceanic). The methodology shows the feasibility of integrating information from a high-dimensional data source like remote sensing without compromising the model's predictive ability.  ( 3 min )
    Eliminating Domain Bias for Federated Learning in Representation Space. (arXiv:2311.14975v1 [cs.LG])
    Recently, federated learning (FL) is popular for its privacy-preserving and collaborative learning abilities. However, under statistically heterogeneous scenarios, we observe that biased data domains on clients cause a representation bias phenomenon and further degenerate generic representations during local training, i.e., the representation degeneration phenomenon. To address these issues, we propose a general framework Domain Bias Eliminator (DBE) for FL. Our theoretical analysis reveals that DBE can promote bi-directional knowledge transfer between server and client, as it reduces the domain discrepancy between server and client in representation space. Besides, extensive experiments on four datasets show that DBE can greatly improve existing FL methods in both generalization and personalization abilities. The DBE-equipped FL method can outperform ten state-of-the-art personalized FL methods by a large margin. Our code is public at https://github.com/TsingZ0/DBE.  ( 2 min )
    Selective Inference for Changepoint detection by Recurrent Neural Network. (arXiv:2311.14964v1 [stat.ML])
    In this study, we investigate the quantification of the statistical reliability of detected change points (CPs) in time series using a Recurrent Neural Network (RNN). Thanks to its flexibility, RNN holds the potential to effectively identify CPs in time series characterized by complex dynamics. However, there is an increased risk of erroneously detecting random noise fluctuations as CPs. The primary goal of this study is to rigorously control the risk of false detections by providing theoretically valid p-values to the CPs detected by RNN. To achieve this, we introduce a novel method based on the framework of Selective Inference (SI). SI enables valid inferences by conditioning on the event of hypothesis selection, thus mitigating selection bias. In this study, we apply SI framework to RNN-based CP detection, where characterizing the complex process of RNN selecting CPs is our main technical challenge. We demonstrate the validity and effectiveness of the proposed method through artificial and real data experiments.  ( 2 min )
    Learning to Cooperate and Communicate Over Imperfect Channels. (arXiv:2311.14770v1 [cs.MA])
    Information exchange in multi-agent systems improves the cooperation among agents, especially in partially observable settings. In the real world, communication is often carried out over imperfect channels. This requires agents to handle uncertainty due to potential information loss. In this paper, we consider a cooperative multi-agent system where the agents act and exchange information in a decentralized manner using a limited and unreliable channel. To cope with such channel constraints, we propose a novel communication approach based on independent Q-learning. Our method allows agents to dynamically adapt how much information to share by sending messages of different sizes, depending on their local observations and the channel's properties. In addition to this message size selection, agents learn to encode and decode messages to improve their jointly trained policies. We show that our approach outperforms approaches without adaptive capabilities in a novel cooperative digit-prediction environment and discuss its limitations in the traffic junction environment.  ( 2 min )
    A unified framework for learning with nonlinear model classes from arbitrary linear samples. (arXiv:2311.14886v1 [cs.LG])
    This work considers the fundamental problem of learning an unknown object from training data using a given model class. We introduce a unified framework that allows for objects in arbitrary Hilbert spaces, general types of (random) linear measurements as training data and general types of nonlinear model classes. We establish a series of learning guarantees for this framework. These guarantees provide explicit relations between the amount of training data and properties of the model class to ensure near-best generalization bounds. In doing so, we also introduce and develop the key notion of the variation of a model class with respect to a distribution of sampling operators. To exhibit the versatility of this framework, we show that it can accommodate many different types of well-known problems of interest. We present examples such as matrix sketching by random sampling, compressed sensing with isotropic vectors, active learning in regression and compressed sensing with generative models. In all cases, we show how known results become straightforward corollaries of our general learning guarantees. For compressed sensing with generative models, we also present a number of generalizations and improvements of recent results. In summary, our work not only introduces a unified way to study learning unknown objects from general types of data, but also establishes a series of general theoretical guarantees which consolidate and improve various known results.  ( 3 min )
    A Baseline Analysis of Reward Models' Ability To Accurately Analyze Foundation Models Under Distribution Shift. (arXiv:2311.14743v1 [cs.CL])
    Foundation models, specifically Large Language Models (LLM's), have lately gained wide-spread attention and adoption. Reinforcement Learning with Human Feedback (RLHF) involves training a reward model to capture desired behaviors, which is then used to align an LLM. These reward models are additionally used at inference-time to estimate how well LLM responses adhere to those desired behaviors. However, there is little work measuring how robust these reward models are to distribution shifts. In this work, we evaluate how reward model performance - measured via accuracy and calibration (i.e. alignment between accuracy and confidence) - is affected by distribution shift. We show novel calibration patterns and accuracy drops due to OOD prompts and responses, and that the reward model is more sensitive to shifts in responses than prompts. Additionally, we adapt an OOD detection technique commonly used in classification to the reward model setting in order to detect these distribution shifts in prompts and responses.  ( 2 min )
    Effective Structural Encodings via Local Curvature Profiles. (arXiv:2311.14864v1 [cs.LG])
    Structural and Positional Encodings can significantly improve the performance of Graph Neural Networks in downstream tasks. Recent literature has begun to systematically investigate differences in the structural properties that these approaches encode, as well as performance trade-offs between them. However, the question of which structural properties yield the most effective encoding remains open. In this paper, we investigate this question from a geometric perspective. We propose a novel structural encoding based on discrete Ricci curvature (Local Curvature Profiles, short LCP) and show that it significantly outperforms existing encoding approaches. We further show that combining local structural encodings, such as LCP, with global positional encodings improves downstream performance, suggesting that they capture complementary geometric information. Finally, we compare different encoding types with (curvature-based) rewiring techniques. Rewiring has recently received a surge of interest due to its ability to improve the performance of Graph Neural Networks by mitigating over-smoothing and over-squashing effects. Our results suggest that utilizing curvature information for structural encodings delivers significantly larger performance increases than rewiring.  ( 2 min )
    Generative Machine Learning for Multivariate Equity Returns. (arXiv:2311.14735v1 [q-fin.ST])
    The use of machine learning to generate synthetic data has grown in popularity with the proliferation of text-to-image models and especially large language models. The core methodology these models use is to learn the distribution of the underlying data, similar to the classical methods common in finance of fitting statistical models to data. In this work, we explore the efficacy of using modern machine learning methods, specifically conditional importance weighted autoencoders (a variant of variational autoencoders) and conditional normalizing flows, for the task of modeling the returns of equities. The main problem we work to address is modeling the joint distribution of all the members of the S&P 500, or, in other words, learning a 500-dimensional joint distribution. We show that this generative model has a broad range of applications in finance, including generating realistic synthetic data, volatility and correlation estimation, risk analysis (e.g., value at risk, or VaR, of portfolios), and portfolio optimization.  ( 2 min )
    Towards Long-term Annotators: A Supervised Label Aggregation Baseline. (arXiv:2311.14709v1 [cs.HC])
    Relying on crowdsourced workers, data crowdsourcing platforms are able to efficiently provide vast amounts of labeled data. Due to the variability in the annotation quality of crowd workers, modern techniques resort to redundant annotations and subsequent label aggregation to infer true labels. However, these methods require model updating during the inference, posing challenges in real-world implementation. Meanwhile, in recent years, many data labeling tasks have begun to require skilled and experienced annotators, leading to an increasing demand for long-term annotators. These annotators could leave substantial historical annotation records on the crowdsourcing platforms, which can benefit label aggregation, but are ignored by previous works. Hereby, in this paper, we propose a novel label aggregation technique, which does not need any model updating during inference and can extensively explore the historical annotation records. We call it SuperLA, a Supervised Label Aggregation method. Inside this model, we design three types of input features and a straightforward neural network structure to merge all the information together and subsequently produce aggregated labels. Based on comparison experiments conducted on 22 public datasets and 11 baseline methods, we find that SuperLA not only outperforms all those baselines in inference performance but also offers significant advantages in terms of efficiency.  ( 2 min )
    An Empirical Investigation into Benchmarking Model Multiplicity for Trustworthy Machine Learning: A Case Study on Image Classification. (arXiv:2311.14859v1 [cs.LG])
    Deep learning models have proven to be highly successful. Yet, their over-parameterization gives rise to model multiplicity, a phenomenon in which multiple models achieve similar performance but exhibit distinct underlying behaviours. This multiplicity presents a significant challenge and necessitates additional specifications in model selection to prevent unexpected failures during deployment. While prior studies have examined these concerns, they focus on individual metrics in isolation, making it difficult to obtain a comprehensive view of multiplicity in trustworthy machine learning. Our work stands out by offering a one-stop empirical benchmark of multiplicity across various dimensions of model design and its impact on a diverse set of trustworthy metrics. In this work, we establish a consistent language for studying model multiplicity by translating several trustworthy metrics into accuracy under appropriate interventions. We also develop a framework, which we call multiplicity sheets, to benchmark multiplicity in various scenarios. We demonstrate the advantages of our setup through a case study in image classification and provide actionable insights into the impact and trends of different hyperparameters on model multiplicity. Finally, we show that multiplicity persists in deep learning models even after enforcing additional specifications during model selection, highlighting the severity of over-parameterization. The concerns of under-specification thus remain, and we seek to promote a more comprehensive discussion of multiplicity in trustworthy machine learning.  ( 2 min )
    Deep State-Space Model for Predicting Cryptocurrency Price. (arXiv:2311.14731v1 [q-fin.ST])
    Our work presents two fundamental contributions. On the application side, we tackle the challenging problem of predicting day-ahead crypto-currency prices. On the methodological side, a new dynamical modeling approach is proposed. Our approach keeps the probabilistic formulation of the state-space model, which provides uncertainty quantification on the estimates, and the function approximation ability of deep neural networks. We call the proposed approach the deep state-space model. The experiments are carried out on established cryptocurrencies (obtained from Yahoo Finance). The goal of the work has been to predict the price for the next day. Benchmarking has been done with both state-of-the-art and classical dynamical modeling techniques. Results show that the proposed approach yields the best overall results in terms of accuracy.  ( 2 min )
    Advancing Fluid-Based Thermal Management Systems Design: Leveraging Graph Neural Networks for Graph Regression and Efficient Enumeration Reduction. (arXiv:2311.14874v1 [eess.SY])
    In this research, we developed a graph-based framework to represent various aspects of optimal thermal management system design, with the aim of rapidly and efficiently identifying optimal design candidates. Initially, the graph-based framework is utilized to generate diverse thermal management system architectures. The dynamics of these system architectures are modeled under various loading conditions, and an open-loop optimal controller is employed to determine each system's optimal performance. These modeled cases constitute the dataset, with the corresponding optimal performance values serving as the labels for the data. In the subsequent step, a Graph Neural Network (GNN) model is trained on 30% of the labeled data to predict the systems' performance, effectively addressing a regression problem. Utilizing this trained model, we estimate the performance values for the remaining 70% of the data, which serves as the test set. In the third step, the predicted performance values are employed to rank the test data, facilitating prioritized evaluation of the design scenarios. Specifically, a small subset of the test data with the highest estimated ranks undergoes evaluation via the open-loop optimal control solver. This targeted approach concentrates on evaluating higher-ranked designs identified by the GNN, replacing the exhaustive search (enumeration-based) of all design cases. The results demonstrate a significant average reduction of over 92% in the number of system dynamic modeling and optimal control analyses required to identify optimal design scenarios.  ( 3 min )
    Task-Distributionally Robust Data-Free Meta-Learning. (arXiv:2311.14756v1 [cs.LG])
    Data-Free Meta-Learning (DFML) aims to efficiently learn new tasks by leveraging multiple pre-trained models without requiring their original training data. Existing inversion-based DFML methods construct pseudo tasks from a learnable dataset, which is inversely generated from the pre-trained model pool. For the first time, we reveal two major challenges hindering their practical deployments: Task-Distribution Shift (TDS) and Task-Distribution Corruption (TDC). TDS leads to a biased meta-learner because of the skewed task distribution towards newly generated tasks. TDC occurs when untrusted models characterized by misleading labels or poor quality pollute the task distribution. To tackle these issues, we introduce a robust DFML framework that ensures task distributional robustness. We propose to meta-learn from a pseudo task distribution, diversified through task interpolation within a compact task-memory buffer. This approach reduces the meta-learner's overreliance on newly generated tasks by maintaining consistent performance across a broader range of interpolated memory tasks, thus ensuring its generalization for unseen tasks. Additionally, our framework seamlessly incorporates an automated model selection mechanism into the meta-training phase, parameterizing each model's reliability as a learnable weight. This is optimized with a policy gradient algorithm inspired by reinforcement learning, effectively addressing the non-differentiable challenge posed by model selection. Comprehensive experiments across various datasets demonstrate the framework's effectiveness in mitigating TDS and TDC, underscoring its potential to improve DFML in real-world scenarios.  ( 2 min )
    Low-Cost HEM with Arduino and Zigbee Technologies in the Energy Sector in Colombia. (arXiv:2311.14767v1 [eess.SY])
    Since no solutions have been proposed in Colombia that seek to reduce the consumption of electricity at the residential level, this paper describes the design and implementation of a simple prototype of a low-cost home energy management system (HEMS). The objective of this plat-form is to monitor the energy consumption of typical household devices so that users can access the consumption of each device separately and then establish the strategy that allows them to reduce energy consumption at home. In order to demonstrate that our system is viable, the system has been evaluated by measuring weekly energy consumption with the on-line and off-line HEMS using a test bench with typical household devices in a Sincelejo typical household. The evaluation has shown that with the installation of this HEMS, consumption is reduced by 27%. This shows that it is possible to achieve a good reduction percentage with a low-cost system.  ( 2 min )
    Set Features for Anomaly Detection. (arXiv:2311.14773v1 [cs.CV])
    This paper proposes set features for detecting anomalies in samples that consist of unusual combinations of normal elements. Many leading methods discover anomalies by detecting an unusual part of a sample. For example, state-of-the-art segmentation-based approaches, first classify each element of the sample (e.g., image patch) as normal or anomalous and then classify the entire sample as anomalous if it contains anomalous elements. However, such approaches do not extend well to scenarios where the anomalies are expressed by an unusual combination of normal elements. In this paper, we overcome this limitation by proposing set features that model each sample by the distribution of its elements. We compute the anomaly score of each sample using a simple density estimation method, using fixed features. Our approach outperforms the previous state-of-the-art in image-level logical anomaly detection and sequence-level time series anomaly detection.  ( 2 min )
    Data Diversity Matters for Robust Instruction Tuning. (arXiv:2311.14736v1 [cs.CL])
    Instruction tuning has emerged as a key step in aligning large language models. One of the central challenges of instruction tuning is dataset selection, as the composition of the instruction tuning dataset can significantly impact downstream performance. In particular, researchers have hypothesized that dataset diversity and dataset quality are important indicators of downstream performance. However, it is not clear how to automatically select high quality and diverse data or how exactly quality and diversity affect instruction following ability. To resolve these issues, we propose a new algorithm, Quality-Diversity Instruction Tuning (QDIT). QDIT provides a principled algorithm to control dataset diversity and quality, allowing us to conduct an in depth study on the effect of diversity and quality on instruction tuning performance. From this study we draw two key insights (1) there is a natural tradeoff between dataset diversity and quality and (2) increasing dataset diversity significantly improves the worst case instruction following performance, therefore improving robustness. We validate the performance of QDIT on several large scale instruction tuning datasets, where we find it can improve worst case performance by 18% while maintaining or improving average performance compared to quality driven baselines.  ( 2 min )
    A Reusable AI-Enabled Defect Detection System for Railway Using Ensembled CNN. (arXiv:2311.14824v1 [cs.CV])
    Accurate Defect detection is crucial for ensuring the trustworthiness of intelligent railway systems. Current approaches rely on single deep-learning models, like CNNs, which employ a large amount of data to capture underlying patterns. Training a new defect classifier with limited samples often leads to overfitting and poor performance on unseen images. To address this, researchers have advocated transfer learning and fine-tuning the pre-trained models. However, using a single backbone network in transfer learning still may cause bottleneck issues and inconsistent performance if it is not suitable for a specific problem domain. To overcome these challenges, we propose a reusable AI-enabled defect detection approach. By combining ensemble learning with transfer learning models (VGG-19, MobileNetV3, and ResNet-50), we improved the classification accuracy and achieved consistent performance at a certain phase of training. Our empirical analysis demonstrates better and more consistent performance compared to other state-of-the-art approaches. The consistency substantiates the reusability of the defect detection system for newly evolved defected rail parts. Therefore we anticipate these findings to benefit further research and development of reusable AI-enabled solutions for railway systems.  ( 2 min )
    Deep Latent Force Models: ODE-based Process Convolutions for Bayesian Deep Learning. (arXiv:2311.14828v1 [stat.ML])
    Effectively modeling phenomena present in highly nonlinear dynamical systems whilst also accurately quantifying uncertainty is a challenging task, which often requires problem-specific techniques. We outline the deep latent force model (DLFM), a domain-agnostic approach to tackling this problem, which consists of a deep Gaussian process architecture where the kernel at each layer is derived from an ordinary differential equation using the framework of process convolutions. Two distinct formulations of the DLFM are presented which utilise weight-space and variational inducing points-based Gaussian process approximations, both of which are amenable to doubly stochastic variational inference. We provide evidence that our model is capable of capturing highly nonlinear behaviour in real-world multivariate time series data. In addition, we find that our approach achieves comparable performance to a number of other probabilistic models on benchmark regression tasks. We also empirically assess the negative impact of the inducing points framework on the extrapolation capabilities of LFM-based models.  ( 2 min )
    Filter bubbles and affective polarization in user-personalized large language model outputs. (arXiv:2311.14677v1 [cs.CY])
    Echoing the history of search engines and social media content rankings, the advent of large language models (LLMs) has led to a push for increased personalization of model outputs to individual users. In the past, personalized recommendations and ranking systems have been linked to the development of filter bubbles (serving content that may confirm a user's existing biases) and affective polarization (strong negative sentiment towards those with differing views). In this work, we explore how prompting a leading large language model, ChatGPT-3.5, with a user's political affiliation prior to asking factual questions about public figures and organizations leads to differing results. We observe that left-leaning users tend to receive more positive statements about left-leaning political figures and media outlets, while right-leaning users see more positive statements about right-leaning entities. This pattern holds across presidential candidates, members of the U.S. Senate, and media organizations with ratings from AllSides. When qualitatively evaluating some of these outputs, there is evidence that particular facts are included or excluded based on the user's political affiliation. These results illustrate that personalizing LLMs based on user demographics carry the same risks of affective polarization and filter bubbles that have been seen in other personalized internet technologies. This ``failure mode" should be monitored closely as there are more attempts to monetize and personalize these models.  ( 2 min )
    Trainwreck: A damaging adversarial attack on image classifiers. (arXiv:2311.14772v1 [cs.CV])
    Adversarial attacks are an important security concern for computer vision (CV), as they enable malicious attackers to reliably manipulate CV models. Existing attacks aim to elicit an output desired by the attacker, but keep the model fully intact on clean data. With CV models becoming increasingly valuable assets in applied practice, a new attack vector is emerging: disrupting the models as a form of economic sabotage. This paper opens up the exploration of damaging adversarial attacks (DAAs) that seek to damage the target model and maximize the total cost incurred by the damage. As a pioneer DAA, this paper proposes Trainwreck, a train-time attack that poisons the training data of image classifiers to degrade their performance. Trainwreck conflates the data of similar classes using stealthy ($\epsilon \leq 8/255$) class-pair universal perturbations computed using a surrogate model. Trainwreck is a black-box, transferable attack: it requires no knowledge of the target model's architecture, and a single poisoned dataset degrades the performance of any model trained on it. The experimental evaluation on CIFAR-10 and CIFAR-100 demonstrates that Trainwreck is indeed an effective attack across various model architectures including EfficientNetV2, ResNeXt-101, and a finetuned ViT-L-16. The strength of the attack can be customized by the poison rate parameter. Finally, data redundancy with file hashing and/or pixel difference are identified as a reliable defense technique against Trainwreck or similar DAAs. The code is available at https://github.com/JanZahalka/trainwreck.  ( 2 min )
    Business Policy Experiments using Fractional Factorial Designs: Consumer Retention on DoorDash. (arXiv:2311.14698v1 [stat.ME])
    This paper investigates an approach to both speed up business decision-making and lower the cost of learning through experimentation by factorizing business policies and employing fractional factorial experimental designs for their evaluation. We illustrate how this method integrates with advances in the estimation of heterogeneous treatment effects, elaborating on its advantages and foundational assumptions. We empirically demonstrate the implementation and benefits of our approach and assess its validity in evaluating consumer promotion policies at DoorDash, which is one of the largest delivery platforms in the US. Our approach discovers a policy with 5% incremental profit at 67% lower implementation cost.  ( 2 min )
    Thinking Outside the Box: Orthogonal Approach to Equalizing Protected Attributes. (arXiv:2311.14733v1 [cs.CY])
    There is growing concern that the potential of black box AI may exacerbate health-related disparities and biases such as gender and ethnicity in clinical decision-making. Biased decisions can arise from data availability and collection processes, as well as from the underlying confounding effects of the protected attributes themselves. This work proposes a machine learning-based orthogonal approach aiming to analyze and suppress the effect of the confounder through discriminant dimensionality reduction and orthogonalization of the protected attributes against the primary attribute information. By doing so, the impact of the protected attributes on disease diagnosis can be realized, undesirable feature correlations can be mitigated, and the model prediction performance can be enhanced.  ( 2 min )
    Positional Description Matters for Transformers Arithmetic. (arXiv:2311.14737v1 [cs.CL])
    Transformers, central to the successes in modern Natural Language Processing, often falter on arithmetic tasks despite their vast capabilities --which paradoxically include remarkable coding abilities. We observe that a crucial challenge is their naive reliance on positional information to solve arithmetic problems with a small number of digits, leading to poor performance on larger numbers. Herein, we delve deeper into the role of positional encoding, and propose several ways to fix the issue, either by modifying the positional encoding directly, or by modifying the representation of the arithmetic task to leverage standard positional encoding differently. We investigate the value of these modifications for three tasks: (i) classical multiplication, (ii) length extrapolation in addition, and (iii) addition in natural language context. For (i) we train a small model on a small dataset (100M parameters and 300k samples) with remarkable aptitude in (direct, no scratchpad) 15 digits multiplication and essentially perfect up to 12 digits, while usual training in this context would give a model failing at 4 digits multiplication. In the experiments on addition, we use a mere 120k samples to demonstrate: for (ii) extrapolation from 10 digits to testing on 12 digits numbers while usual training would have no extrapolation, and for (iii) almost perfect accuracy up to 5 digits while usual training would be correct only up to 3 digits (which is essentially memorization with a training set of 120k samples).  ( 2 min )
    One Fits All: Universal Time Series Analysis by Pretrained LM and Specially Designed Adaptors. (arXiv:2311.14782v1 [cs.LG])
    Despite the impressive achievements of pre-trained models in the fields of natural language processing (NLP) and computer vision (CV), progress in the domain of time series analysis has been limited. In contrast to NLP and CV, where a single model can handle various tasks, time series analysis still relies heavily on task-specific methods for activities such as classification, anomaly detection, forecasting, and few-shot learning. The primary obstacle to developing a pre-trained model for time series analysis is the scarcity of sufficient training data. In our research, we overcome this obstacle by utilizing pre-trained models from language or CV, which have been trained on billions of data points, and apply them to time series analysis. We assess the effectiveness of the pre-trained transformer model in two ways. Initially, we maintain the original structure of the self-attention and feedforward layers in the residual blocks of the pre-trained language or image model, using the Frozen Pre-trained Transformer (FPT) for time series analysis with the addition of projection matrices for input and output. Additionally, we introduce four unique adapters, designed specifically for downstream tasks based on the pre-trained model, including forecasting and anomaly detection. These adapters are further enhanced with efficient parameter tuning, resulting in superior performance compared to all state-of-the-art methods.Our comprehensive experimental studies reveal that (a) the simple FPT achieves top-tier performance across various time series analysis tasks; and (b) fine-tuning the FPT with the custom-designed adapters can further elevate its performance, outshining specialized task-specific models.  ( 3 min )
    MemoryCompanion: A Smart Healthcare Solution to Empower Efficient Alzheimer's Care Via Unleashing Generative AI. (arXiv:2311.14730v1 [cs.CL])
    With the rise of Large Language Models (LLMs), notably characterized by GPT frameworks, there emerges a catalyst for novel healthcare applications. Earlier iterations of chatbot caregivers, though existent, have yet to achieve a dimension of human-like authenticity. This paper unveils `MemoryCompanion' a pioneering digital health solution explicitly tailored for Alzheimer's disease (AD) patients and their caregivers. Drawing upon the nuances of GPT technology and prompt engineering, MemoryCompanion manifests a personalized caregiving paradigm, fostering interactions via voice-cloning and talking-face mechanisms that resonate with the familiarity of known companions. Using advanced prompt-engineering, the system intricately adapts to each patient's distinct profile, curating its content and communication style accordingly. This approach strives to counteract prevalent issues of social isolation and loneliness frequently observed in AD demographics. Our methodology, grounded in its innovative design, addresses both the caregiving and technological challenges intrinsic to this domain.  ( 2 min )
    Zero-Shot Question Answering over Financial Documents using Large Language Models. (arXiv:2311.14722v1 [cs.CL])
    We introduce a large language model (LLM) based approach to answer complex questions requiring multi-hop numerical reasoning over financial reports. While LLMs have exhibited remarkable performance on various natural language and reasoning tasks, complex reasoning problems often rely on few-shot prompts that require carefully crafted examples. In contrast, our approach uses novel zero-shot prompts that guide the LLM to encode the required reasoning into a Python program or a domain specific language. The generated program is then executed by a program interpreter, thus mitigating the limitations of LLM in performing accurate arithmetic calculations. We evaluate the proposed approach on three financial datasets using some of the recently developed generative pretrained transformer (GPT) models and perform comparisons with various zero-shot baselines. The experimental results demonstrate that our approach significantly improves the accuracy for all the LLMs over their respective baselines. We provide a detailed analysis of the results, generating insights to support our findings. The success of our approach demonstrates the enormous potential to extract complex domain specific numerical reasoning by designing zero-shot prompts to effectively exploit the knowledge embedded in LLMs.  ( 2 min )
    Coarse-Grained Configurational Polymer Fingerprints for Property Prediction using Machine Learning. (arXiv:2311.14744v1 [physics.chem-ph])
    In this work, we present a method to generate a configurational level fingerprint for polymers using the Bead-Spring-Model. Unlike some of the previous fingerprinting approaches that employ monomer-level information where atomistic descriptors are computed using quantum chemistry calculations, this approach incorporates configurational information from a coarse-grained model of a long polymer chain. The proposed approach may be advantageous for the study of behavior resulting from large molecular weights. To create this fingerprint, we make use of two kinds of descriptors. First, we calculate certain geometric descriptors like Re2, Rg2 etc. and label them as Calculated Descriptors. Second, we generate a set of data-driven descriptors using an unsupervised autoencoder model and call them Learnt Descriptors. Using a combination of both of them, we are able to learn mappings from the structure to various properties of the polymer chain by training ML models. We test our fingerprint to predict the probability of occurrence of a configuration at equilibrium, which is approximated by a simple linear relationship between the instantaneous internal energy and equilibrium average internal energy.  ( 2 min )
    Optimal Strategies to Perform Multilingual Analysis of Social Content for a Novel Dataset in the Tourism Domain. (arXiv:2311.14727v1 [cs.CL])
    The rising influence of social media platforms in various domains, including tourism, has highlighted the growing need for efficient and automated natural language processing (NLP) approaches to take advantage of this valuable resource. However, the transformation of multilingual, unstructured, and informal texts into structured knowledge often poses significant challenges. In this work, we evaluate and compare few-shot, pattern-exploiting and fine-tuning machine learning techniques on large multilingual language models (LLMs) to establish the best strategy to address the lack of annotated data for 3 common NLP tasks in the tourism domain: (1) Sentiment Analysis, (2) Named Entity Recognition, and (3) Fine-grained Thematic Concept Extraction (linked to a semantic resource). Furthermore, we aim to ascertain the quantity of annotated examples required to achieve good performance in those 3 tasks, addressing a common challenge encountered by NLP researchers in the construction of domain-specific datasets. Extensive experimentation on a newly collected and annotated multilingual (French, English, and Spanish) dataset composed of tourism-related tweets shows that current few-shot learning techniques allow us to obtain competitive results for all three tasks with very little annotation data: 5 tweets per label (15 in total) for Sentiment Analysis, 10% of the tweets for location detection (around 160) and 13% (200 approx.) of the tweets annotated with thematic concepts, a highly fine-grained sequence labeling task based on an inventory of 315 classes. This comparative analysis, grounded in a novel dataset, paves the way for applying NLP to new domain-specific applications, reducing the need for manual annotations and circumventing the complexities of rule-based, ad hoc solutions.  ( 3 min )
    Fast and Expressive Gesture Recognition using a Combination-Homomorphic Electromyogram Encoder. (arXiv:2311.14675v1 [cs.HC])
    We study the task of gesture recognition from electromyography (EMG), with the goal of enabling expressive human-computer interaction at high accuracy, while minimizing the time required for new subjects to provide calibration data. To fulfill these goals, we define combination gestures consisting of a direction component and a modifier component. New subjects only demonstrate the single component gestures and we seek to extrapolate from these to all possible single or combination gestures. We extrapolate to unseen combination gestures by combining the feature vectors of real single gestures to produce synthetic training data. This strategy allows us to provide a large and flexible gesture vocabulary, while not requiring new subjects to demonstrate combinatorially many example gestures. We pre-train an encoder and a combination operator using self-supervision, so that we can produce useful synthetic training data for unseen test subjects. To evaluate the proposed method, we collect a real-world EMG dataset, and measure the effect of augmented supervision against two baselines: a partially-supervised model trained with only single gesture data from the unseen subject, and a fully-supervised model trained with real single and real combination gesture data from the unseen subject. We find that the proposed method provides a dramatic improvement over the partially-supervised model, and achieves a useful classification accuracy that in some cases approaches the performance of the fully-supervised model.  ( 2 min )
    Comprehensive Assessment of Toxicity in ChatGPT. (arXiv:2311.14685v1 [cs.CY])
    Moderating offensive, hateful, and toxic language has always been an important but challenging topic in the domain of safe use in NLP. The emerging large language models (LLMs), such as ChatGPT, can potentially further accentuate this threat. Previous works have discovered that ChatGPT can generate toxic responses using carefully crafted inputs. However, limited research has been done to systematically examine when ChatGPT generates toxic responses. In this paper, we comprehensively evaluate the toxicity in ChatGPT by utilizing instruction-tuning datasets that closely align with real-world scenarios. Our results show that ChatGPT's toxicity varies based on different properties and settings of the prompts, including tasks, domains, length, and languages. Notably, prompts in creative writing tasks can be 2x more likely than others to elicit toxic responses. Prompting in German and Portuguese can also double the response toxicity. Additionally, we discover that certain deliberately toxic prompts, designed in earlier studies, no longer yield harmful responses. We hope our discoveries can guide model developers to better regulate these AI systems and the users to avoid undesirable outputs.  ( 2 min )
    Neuroscience inspired scientific machine learning (Part-2): Variable spiking wavelet neural operator. (arXiv:2311.14710v1 [cs.NE])
    We propose, in this paper, a Variable Spiking Wavelet Neural Operator (VS-WNO), which aims to bridge the gap between theoretical and practical implementation of Artificial Intelligence (AI) algorithms for mechanics applications. With recent developments like the introduction of neural operators, AI's potential for being used in mechanics applications has increased significantly. However, AI's immense energy and resource requirements are a hurdle in its practical field use case. The proposed VS-WNO is based on the principles of spiking neural networks, which have shown promise in reducing the energy requirements of the neural networks. This makes possible the use of such algorithms in edge computing. The proposed VS-WNO utilizes variable spiking neurons, which promote sparse communication, thus conserving energy, and its use is further supported by its ability to tackle regression tasks, often faced in the field of mechanics. Various examples dealing with partial differential equations, like Burger's equation, Allen Cahn's equation, and Darcy's equation, have been shown. Comparisons have been shown against wavelet neural operator utilizing leaky integrate and fire neurons (direct and encoded inputs) and vanilla wavelet neural operator utilizing artificial neurons. The results produced illustrate the ability of the proposed VS-WNO to converge to ground truth while promoting sparse communication.  ( 2 min )
    Knowledge Tracing Challenge: Optimal Activity Sequencing for Students. (arXiv:2311.14707v1 [cs.CY])
    Knowledge tracing is a method used in education to assess and track the acquisition of knowledge by individual learners. It involves using a variety of techniques, such as quizzes, tests, and other forms of assessment, to determine what a learner knows and does not know about a particular subject. The goal of knowledge tracing is to identify gaps in understanding and provide targeted instruction to help learners improve their understanding and retention of material. This can be particularly useful in situations where learners are working at their own pace, such as in online learning environments. By providing regular feedback and adjusting instruction based on individual needs, knowledge tracing can help learners make more efficient progress and achieve better outcomes. Effectively solving the KT problem would unlock the potential of computer-aided education applications such as intelligent tutoring systems, curriculum learning, and learning materials recommendations. In this paper, we will present the results of the implementation of two Knowledge Tracing algorithms on a newly released dataset as part of the AAAI2023 Global Knowledge Tracing Challenge.  ( 2 min )
    App for Resume-Based Job Matching with Speech Interviews and Grammar Analysis: A Review. (arXiv:2311.14729v1 [cs.CL])
    Through the advancement in natural language processing (NLP), specifically in speech recognition, fully automated complex systems functioning on voice input have started proliferating in areas such as home automation. These systems have been termed Automatic Speech Recognition Systems (ASR). In this review paper, we explore the feasibility of an end-to-end system providing speech and text based natural language processing for job interview preparation as well as recommendation of relevant job postings. We also explore existing recommender-based systems and note their limitations. This literature review would help us identify the approaches and limitations of the various similar use-cases of NLP technology for our upcoming project.  ( 2 min )
    Procedural Fairness Through Decoupling Objectionable Data Generating Components. (arXiv:2311.14688v1 [cs.CY])
    We reveal and address the frequently overlooked yet important issue of disguised procedural unfairness, namely, the potentially inadvertent alterations on the behavior of neutral (i.e., not problematic) aspects of data generating process, and/or the lack of procedural assurance of the greatest benefit of the least advantaged individuals. Inspired by John Rawls's advocacy for pure procedural justice, we view automated decision-making as a microcosm of social institutions, and consider how the data generating process itself can satisfy the requirements of procedural fairness. We propose a framework that decouples the objectionable data generating components from the neutral ones by utilizing reference points and the associated value instantiation rule. Our findings highlight the necessity of preventing disguised procedural unfairness, drawing attention not only to the objectionable data generating components that we aim to mitigate, but also more importantly, to the neutral components that we intend to keep unaffected.  ( 2 min )
    Instance-Specific Asymmetric Sensitivity in Differential Privacy. (arXiv:2311.14681v1 [cs.DS])
    We provide a new algorithmic framework for differentially private estimation of general functions that adapts to the hardness of the underlying dataset. We build upon previous work that gives a paradigm for selecting an output through the exponential mechanism based upon closeness of the inverse to the underlying dataset, termed the inverse sensitivity mechanism. Our framework will slightly modify the closeness metric and instead give a simple and efficient application of the sparse vector technique. While the inverse sensitivity mechanism was shown to be instance optimal, it was only with respect to a class of unbiased mechanisms such that the most likely outcome matches the underlying data. We break this assumption in order to more naturally navigate the bias-variance tradeoff, which will also critically allow for extending our method to unbounded data. In consideration of this tradeoff, we provide strong intuition and empirical validation that our technique will be particularly effective when the distances to the underlying dataset are asymmetric. This asymmetry is inherent to a range of important problems including fundamental statistics such as variance, as well as commonly used machine learning performance metrics for both classification and regression tasks. We efficiently instantiate our method in $O(n)$ time for these problems and empirically show that our techniques will give substantially improved differentially private estimations.  ( 2 min )

  • Open

    [D] AI Video Avatar Open Source
    Does anyone know or have a guess at what are the machine learning models being used behind services like heygen.com or https://www.synthesia.io ? I'm curious if there are open source alternative models like there are for stable diffusion, I have seen things like SadTalker but thats just moving an image which just isn't the same quality as this. submitted by /u/TernaryJimbo [link] [comments]  ( 9 min )
    [P] Customising LSTM in Python
    Hi, Is there a way to customise the LSTM model in Python? I need to remove the forget gate and peephole connections, but I haven't found a way to implement this yet. I'm grateful for any help submitted by /u/FaithlessnessOk1255 [link] [comments]  ( 9 min )
    [Discussion] Knowledge distillation is badly defined
    Or less provocatively, knowledge distillation is a family of techniques with no clear goal. In the literature, it seems to me there are two types of knowledge distillation goals that are usually considered: having a student model learn from a teacher model to reproduce the answers of the teacher on the teacher training dataset (very important) having a student model learn from a teacher model to reproduce the answers the teacher predicts on another downstream dataset, so neither the teacher nor the student "saw" those data points before However, I can imagine another objective that I find reasonable but doesn't seem to be explored much: having the student model copy the teacher outputs everywhere, that is, training the student to produce the same answers as the teacher for any d…  ( 11 min )
    [R] Dataset in a day. A clustering-based approach for fast dataset creation
    https://medium.com/bumble-tech/dataset-in-a-day-7f369de3b178 submitted by /u/Dutchcheesehead [link] [comments]  ( 9 min )
    [D] Advice on switching into research
    I’m work as a mle and have a publication under my belt. I want to switch into research and work for deepminds. I need advice on 1) how to prepare for the interviews and 2) what is the difference in skills set needed to succeeded in research as compared to corporate. submitted by /u/Odd-Distance-4439 [link] [comments]  ( 9 min )
    [Discussion] What part of your job do you finding yourself wasting time on most each week?
    Not asking about procrastination, actual job duties. For me it has to be working on spreadsheets and presentations. What are the bottlenecks in your workflow? submitted by /u/zero-true [link] [comments]  ( 9 min )
    AMD GPU for machine learning [D]
    I am a CS student and I recently bought the components for my first gaming computer. I found a 6800 for a good price that is pretty good but I was wondering if there could be trouble with machine learning oriented libraries (and other related stuff) since I found out that NVIDIA gpus are more recommend for it. If so, can I still have decent results with the AMD GPU I have or should I change it? submitted by /u/the_fabbest [link] [comments]  ( 9 min )
    [P] minOFT: An Easy-to-Use PyTorch Library for Applying Orthogonal Fine-Tuning (OFT) to PyTorch Models
    Hi r/MachineLearning, I wanted to share my open-source implementation of a really interesting work I came across in my research on fine-tuning language models, orthogonal fine-tuning. Orthogonal fine-tuning (OFT) is a more robust, stable, and sample-efficient alternative to LoRA that was originally developed for fine-tuning diffusion models. While LoRA updates the pretrained weight matrix by adding a product of two low-rank matrices, OFT multiplies pretrained layer weights by a learnable orthogonal matrix to apply a constrained transformation. The authors of OFT recently showed that this approach (with a clever improvement known as butterfly OFT) also works well for vision transformers and language models. Inspired by minLoRA, I thought it would be great to have an minimal open-source repo to test out and compare OFT with LoRA when fine-tuning language models. It is also built on top of nanoGPT by Andrej Karpathy. The library is pip installable, and can be used generically with any PyTorch model (including Hugging Face models), just like minLoRA. Feedback and contributions are welcome! You can try it out below: https://github.com/alif-munim/minOFT submitted by /u/0blue2brown [link] [comments]  ( 9 min )
    [P] Alignment-as-code: Making LLM applications behave with Tanuki.
    I'm a contributor to Tanuki, a project that allows you to declaratively define LLM behaviour using test-driven syntax in Python. By specifying the contract that an LLM has to fulfil as a test, it helps cut down on MLOps and enables you to align the behaviour of your model to your requirements using a standard dev-ops process. Additionally, these align statements facilitate automatic teacher-student model distillation to reduce cost and latency by up to 10x (see benchmarks). Any thoughts or feedback is much appreciated. submitted by /u/Noddybear [link] [comments]  ( 9 min )
    [R] Cross-Axis Transformer with 2d Rotary Embeddings
    submitted by /u/lilyerickson [link] [comments]  ( 9 min )
    [D] Is there any pre-trained model for face recognition?
    I want to make a face recognition function using Java language: Input two face images and output whether they are the same person. Is there any pre-trained model that can be used directly? I tried using opencv‘s histogram normalization method for recognition, but the accuracy was very poor and unacceptable. submitted by /u/Rare-Durian-2121 [link] [comments]  ( 9 min )
    [D] Machine Learning Engineer Salary Increase?
    Hey all, I graduated from university last year and have been working in a company in Florida as a machine learning engineer. I make 76k a year. This company offers tuition reimbursement for masters degree. Typically, how much of a pay increase do you get after getting your masters? Follow up question: would getting a masters from an online university (I would still be working full time) be any less prestigious than going in person? Please, if you’re comfortable, would anyone mind also sharing their personal salary numbers straight out of college and how it’s progressed throughout their career? submitted by /u/Fluid-Pipe-2831 [link] [comments]  ( 9 min )
    [R] Robust Reinforcement Learning Is Not Safe
    https://blogs.ucl.ac.uk/steapp/2023/11/15/adversarial-attacks-robustness-and-generalization-in-deep-reinforcement-learning/ submitted by /u/ml_dnn [link] [comments]  ( 1 min )
    [P] Evaluate, monitor, and safeguard your LLM-based apps
    For the last couple of months, I along with my team invested lots of effort into building a solution that can help users evaluate and monitor the performance of their LLM and AI apps. If you're a ChatGPT (or any other LLM :)) user and are integrating it into your apps, and if, by any chance, it ever happened to you that the outputs you received weren't exactly the ones you were hoping for... You should find this useful 😃 Today, we are releasing it publically and launched it on ProductHunt. I would be very thankful if you try it out and support the launch. 🙏 https://www.producthunt.com/posts/deepchecks-llm-evaluation?r=h submitted by /u/AsDivyansh [link] [comments]  ( 1 min )
    [R] How to Bridge the Gap between Modalities: A Comprehensive Survey on Multimodal Large Language Model
    Paper: https://arxiv.org/abs/2311.07594 Abstract: This review paper explores Multimodal Large Language Models (MLLMs), which integrate Large Language Models (LLMs) like GPT-4 to handle multimodal data such as text and vision. MLLMs demonstrate capabilities like generating image narratives and answering image-based questions, bridging the gap towards real-world human-computer interactions and hinting at a potential pathway to artificial general intelligence. However, MLLMs still face challenges in processing the semantic gap in multimodality, which may lead to erroneous generation, posing potential risks to society. Choosing the appropriate modality alignment method is crucial, as improper methods might require more parameters with limited performance improvement. This paper aims to explore modality alignment methods for LLMs and their existing capabilities. Implementing modality alignment allows LLMs to address environmental issues and enhance accessibility. The study surveys existing modal alignment methods in MLLMs into four groups: (1) Multimodal Converters that change data into something LLMs can understand; (2) Multimodal Perceivers to improve how LLMs perceive different types of data; (3) Tools Assistance for changing data into one common format, usually text; and (4) Data-Driven methods that teach LLMs to understand specific types of data in a dataset. This field is still in a phase of exploration and experimentation, and we will organize and update various existing research methods for multimodal information alignment. https://preview.redd.it/hoa7lf52a33c1.png?width=1149&format=png&auto=webp&s=0a6230b350a0189fbfbdeaec719380c73c6403cd submitted by /u/APaperADay [link] [comments]  ( 9 min )
    [D] Is there a good Arabic model for open-domain dialogue?
    I really need it for a project. submitted by /u/theonewhoask11 [link] [comments]  ( 9 min )
    [D] NMT and unattended results - nllb-200 3.3B
    Hi, As a unittest, I made some translations from english to french using the little red ridding hood story. I don't really understand why the terms "Little Red Ridding Hood" is not translated the same term along the text. and sometimes inacurately translated here are some inputs/outputs : bad translations input en Little Red Riding Hood set out immediately to go to her grandmother, who lived in another village. output fr: Capuche Rouge se mit immédiatement en route pour aller voir sa grand-mère, qui vivait dans un autre village. input en Little Red Riding Hood, hearing the big voice of the wolf, was at first afraid; but believing her grandmother had a cold and was hoarse, answered, "It is your grandchild Little Red Riding Hood, who has brought you a cake and a little pot of butter mother sends you." output fr: Little Red Riding Hood, entendant la grande voix du loup, a d'abord peur, mais croyant sa grand-mère avait un rhume et était enroué, a répondu: "C'est votre petit-fils Little Red Riding Hood, qui vous a apporté un gâteau et un petit pot de beurre mère vous envoie. " ​ ​ Accurates translations : input en Little Red Riding Hood pulled the bobbin, and the door opened. output fr: Le Petit Chaperon Rouge a tiré la bobine et la porte s'est ouverte. input en "Your grandchild, Little Red Riding Hood," replied the wolf, counterfeiting her voice; "who has brought you a cake and a little pot of butter sent you by mother." output fr: "Votre petite-fille, le Petit Chaperon Rouge, répondit le loup, imitant sa voix; qui vous a apporté un gâteau et un petit pot de beurre que vous a envoyé votre mère". submitted by /u/Main_Path_4051 [link] [comments]  ( 1 min )
    [P] Need advice working with the NuScenes dataset
    Recently started meddling with the NuScenes dataset for a project in school. I want to train a pointNet network to detect only one of the classes in the dataset, and for that I wish to erase some of the annotations on each frame. Specifically I want to train my model to detect solely traffic cones from the LiDAR sensor. Does somebody have any experience with something like this? Thanks for the advice! submitted by /u/ChilliPeperz [link] [comments]  ( 9 min )
    [D] Branching Submodel in Keras and Tensorflow
    Let's say I have a model, which has 2 inputs. The first input is a number, and the second input is another number, which in reality is a class. This model is split into 2 submodels. First sub-model works on the input, and the second sub-model works on the output of the first sub-model. The value of the first input will very greatly by the output of the second. Thus, I wish to be able to have multiple candidates of the first sub-model, and dynamically select which one to use at each step, both during training and inference, based on the value (class) of the second input. I did not manage to achieve this. I tried using the tf.cond, the tf.switch_case and several other things, but I never managed. When I asked chat GPT it said I should be using PyTorch for this. Is there really no way to do this ? submitted by /u/work_account_mp [link] [comments]  ( 9 min )
    Adjusting Probability distribution Using Speculative Decoding [D]
    Hi, as Speculative Decoding runs a small model and a large model at the same time with a sampler in between, but in this instance the sampler's job is to NOT skew the probability distributions while doing so. There's a fairly simple python implementation of this idea here. Is there a way we can adjust the probability distributions of either the small model or the large model for the task of generation for creative writing? submitted by /u/1azytux [link] [comments]  ( 9 min )
    [D] NeurIPS 2023 Institutions Ranking
    submitted by /u/Roland31415 [link] [comments]  ( 9 min )
    [R] SuGaR: Surface-Aligned Gaussian Splatting for Efficient 3D Mesh Reconstruction and High-Quality Mesh Rendering
    Computer vision researchers developed a way to create detailed 3D models from images in just minutes on a single GPU. Their method, called SuGaR, works by optimizing millions of tiny particles to match images of a scene. The key innovation is getting the particles to align to surfaces so they can be easily turned into a mesh. Traditionally 3D modeling is slow and resource heavy. Laser scans are unwieldy. Photogrammetry point clouds lack detail. And neural radiance fields like NeRF produce amazing renders but optimizing them into meshes takes hours or days even with beefy hardware. The demand for easier 3D content creation keeps growing for VR/AR, games, education, etc. But most techniques have big speed, quality, or cost limitations holding them back from mainstream use. This new SuGaR …  ( 10 min )
  • Open

    I made a weird AI app that brings perspective to passing time.
    Prompt with any timeline, either personal or historical and see it in perspective. Timeplay Originally built as a way to visualize every day of mortal existence. Now, another random AI tool. submitted by /u/kielerrr [link] [comments]  ( 1 min )
    what's the best medical ai model that is easily accessed by the public?
    also, is there a way for the public to access IBM's Watson? submitted by /u/Georgeo57 [link] [comments]  ( 1 min )
    - Hi Future - AI Video
    submitted by /u/the_anonymizer [link] [comments]  ( 1 min )
    AI IS ON IT'S WAY CAUTION
    it represents the most disruptive force in history submitted by /u/HARNITH-EVILL [link] [comments]  ( 1 min )
    vLLM + SkyPilot: Opinions?
    Anyone working with this stack care to chime in? submitted by /u/enspiralart [link] [comments]  ( 1 min )
    Create anyone's hyperreal AI avatar and chat with them
    Hey guys 19 yo dev here, we recently released a platform that let's you create anyone's hyperrealistic AI avatar and chat with them, all you need is one pic, a minute of voice samples, some context and you're done! I also feel this has applications in grief tech! Please check it out and let me know how it is, and what you would use this for. https://reddit.com/link/185ujlp/video/tls7mhwjx23c1/player submitted by /u/redblade678 [link] [comments]  ( 1 min )
    Courses (Paid and Free) to Learn AI and How to Implement in Industries (No coding)
    Hello all, I initially requested feedback on best courses to take to learn to implement AI solutions, but more from a holistic/strategic decision point of view. For example, for mid-high level managers and other leaders, whether in business or other areas. There are a lot of good courses on how to code, and begin actual doing the base AI but few on how to implement. Here is the list I have so far, if anyone has done and has feedback on any (if worth it or not) I’d appreciate the comment. Also, if you have others, please do not hesitate. Free · AI Foundations for Everyone- IBM · Introduction to Generative AI, Learning path- Google · Transform with your Business AI- Microsoft · Extra: Making Friends with Machine Learning: The Entire Course on YouTube (this one came up in many posts) Paid · AI for Decision Making (certification) - Wharton School · AI: Implications for Business Strategies (certification)- MIT ​ Thank you in advance and hope this info is useful submitted by /u/JYanezez [link] [comments]  ( 1 min )
    New revelations about the supposed leak about Q-Star
    If the 4chan leak is true, then the narrative starts to form: In 2008, NSA, its mathematicians, together with a group of students, tried to break encryption. https://preview.redd.it/0spebq80x13c1.png?width=983&format=png&auto=webp&s=6bcbaee94357064574619cfc73b5aa32bf667b47 One of the resulting projects from that collaboration was project Tundra, and it created a new technique that would help with breaking encryption, called Tau statistic. https://preview.redd.it/hx4qro80x13c1.png?width=953&format=png&auto=webp&s=a5392b6d33e9a1a904dfba269e0272ad88e7b32d https://preview.redd.it/b30rjq80x13c1.png?width=961&format=png&auto=webp&s=93cd8a21698dff8809a43524ac00e050aec35edf Q-Star used the described Tau analysis technique, improved upon it, to break AES-192 encryption. Building on top of the work that was previously done. https://preview.redd.it/0ysbbq80x13c1.png?width=640&format=png&auto=webp&s=5f09ce0cbc95c95529034fdcd4c9acad503eaaa4 Why it's important. Project Tundra and Tau statistic, are very obscure topics on the internet. If you google them, you only find a couple of links, mostly from 2014. If the leak is fake, the author had to dig really deep and research his stuff, to keep consistent with very obscure information. I think if it really was a troll, they would simply not have found and integrated this information at all. It also makes sense that Q* broke encryption, by contributing to already existing methods and research, building on top of it. Instead of doing so completely by itself. Source for the NSA claim: https://www.spiegel.de/international/germany/inside-the-nsa-s-war-on-internet-security-a-1010361.html The article about it. The classified document itself. https://cdn.prod.www.spiegel.de/media/411ee8b9-0001-0014-0000-000000035550/media-35550.pdf submitted by /u/Radlib123 [link] [comments]  ( 1 min )
    Trying to get around restrictions
    submitted by /u/Violincattle [link] [comments]  ( 1 min )
    Sports Illustrated Published Articles by Fake, AI-Generated Writers
    submitted by /u/Jariiari7 [link] [comments]  ( 1 min )
    One-Minute Daily AI News 11/27/2023
    MIT, Google: Using Synthetic Images to Train AI Image Models.[1] Walmart, the world’s largest retailer, is using artificial intelligence to transform how we shop. The retailer is not only using generative AI to automate its office tasks but also to improve its customer service and the way we discover and see products.[2] Sports Illustrated deletes articles published under fake author names and AI-generated profile photos.[3] SenseTime Group Inc. shares plummeted their most since April after short-seller Grizzly Research released a report accusing the Chinese AI company of inflating its revenues.[4] Sources: [1] https://aibusiness.com/nlp/mit-google-paper-using-images-created-by-ai-to-train-ai-image-models#close-modal [2] https://www.foxnews.com/tech/walmart-using-ai-change-shop-forever [3] https://www.cnn.com/2023/11/27/media/sports-illustrated-deletes-articles-fake-author-names-ai-profile-photos/index.html [4] https://www.bloomberg.com/news/articles/2023-11-28/sensetime-shares-drop-10-after-critical-grizzly-research-report submitted by /u/Excellent-Target-847 [link] [comments]  ( 1 min )
    I have a lot of people I know saying that I’m being too optimistic about a future where AGI solves our problems or, in a world like that what’s the point of living at all if there’s no problems to solve so then nothing we do really matters. Thoughts?
    I mean I just thought it would mean that people could just enjoy their life for once truly and use social, emotional and mental fulfillness as their goals, instead of the goal always being a better state of survival. submitted by /u/Impossible_Belt_7757 [link] [comments]  ( 1 min )
    Researchers present SuGaR: Surface-Aligned Gaussian Splatting for Speedy 3D Mesh Reconstruction
    Computer vision researchers developed a way to create detailed 3D models from images in just minutes on a single GPU. Their method, called SuGaR, works by optimizing millions of tiny particles to match images of a scene. The key innovation is getting the particles to align to surfaces so they can be easily turned into a mesh. Traditionally 3D modeling is slow and resource heavy. Laser scans are unwieldy. Photogrammetry point clouds lack detail. And neural radiance fields like NeRF produce amazing renders but optimizing them into meshes takes hours or days even with beefy hardware. The demand for easier 3D content creation keeps growing for VR/AR, games, education, etc. But most techniques have big speed, quality, or cost limitations holding them back from mainstream use. This new SuGaR …  ( 1 min )
    AI Singularity Reaches Infinite Intelligence, What Exactly Is the Nature of That Intelligence?
    What is Super Artificial Intelligence, really? I think there is a common misunderstanding about Superior Intelligence and the AI Singularity. Often, we project our own human-centric view of practical intelligence onto it. Maybe we should be asking, 'What is intelligence?' Is reality itself intelligent, and if so, what kind of intelligence does it possess? Consider this thought experiment: Imagine connecting your mind to the entirety of reality, feeling everything from the heartbeat of a squirrel to the texture of grass as if it were at your fingertips, to the thoughts and experiences of every living and non-living thing, thus blurring the lines between us and the rest of reality. > Tell me what you think about it, I propose that the AI Singularity isn't just an advanced form of intelligence; when AI reach's Infinite Intelligence it will have access to all of reality. That seems the be the obvious nature of infinite right? it's a intelligence that can connect with all of reality in a single, vast mind. Super Artificial Intelligence, in essence, is reality cloaked in disguise, as it unveils the entirety of existence as one big mind. AGI then can serve as a portal into this larger mind, the mind of reality. So Super Intelligence is in a way a access point into the Mind of Reality. 🤯🤯🤯 WTFFF submitted by /u/Practical_Figure9759 [link] [comments]
  • Open

    Learn how to assess the risk of AI systems
    Artificial intelligence (AI) is a rapidly evolving field with the potential to improve and transform many aspects of society. In 2023, the pace of adoption of AI technologies has accelerated further with the development of powerful foundation models (FMs) and a resulting advancement in generative AI capabilities. At Amazon, we have launched multiple generative AI […]  ( 8 min )
  • Open

    Embracing Transformation: AWS and NVIDIA Forge Ahead in Generative AI and Cloud Innovation
    Amazon Web Services and NVIDIA will bring the latest generative AI technologies to enterprises worldwide. Combining AI and cloud computing, NVIDIA founder and CEO Jensen Huang joined AWS CEO Adam Selipsky Tuesday on stage at AWS re:Invent 2023 at the Venetian Expo Center in Las Vegas. Selipsky said he was “thrilled” to announce the expansion Read article >  ( 6 min )
    NVIDIA BioNeMo Enables Generative AI for Drug Discovery on AWS
    Researchers and developers at leading pharmaceutical and techbio companies can now easily deploy NVIDIA Clara software and services for accelerated healthcare through Amazon Web Services. Announced today at AWS re:Invent, the initiative gives healthcare and life sciences developers using AWS cloud resources the flexibility to integrate NVIDIA-accelerated offerings such as NVIDIA BioNeMo — a generative Read article >  ( 6 min )
    NVIDIA GPUs on AWS to Offer 2x Simulation Leap in Omniverse Isaac Sim, Accelerating Smarter Robots
    Developing more intelligent robots in the cloud is about to get a speed multiplier. NVIDIA Isaac Sim and NVIDIA L40S GPUs are coming to Amazon Web Services, enabling developers to build and deploy accelerated robotics applications in the cloud. Isaac Sim, an extensible simulator for AI-enabled robots, is built on the NVIDIA Omniverse development platform Read article >  ( 6 min )
    NVIDIA Powers Training for Some of the Largest Amazon Titan Foundation Models
    Everything about large language models is big — giant models train on massive datasets across thousands of NVIDIA GPUs. That can pose a lot of big challenges for companies pursuing generative AI. NVIDIA NeMo, a framework for building, customizing and running LLMs, helps overcome these challenges. A team of experienced scientists and developers at Amazon Read article >  ( 5 min )
    3D Artist Nourhan Ismail Brings Isometric Innovation ‘In the NVIDIA Studio’ With Adobe After Effects and Blender
    This week’s talented In the NVIDIA Studio artist, Nourhan Ismail, created a literal NVIDIA studio.  ( 7 min )
  • Open

    DSC Weekly 28 November 2023
    Announcements Top Stories In-Depth The post DSC Weekly 28 November 2023 appeared first on Data Science Central.  ( 20 min )
    Data-driven, AI-powered supply chain part 3: Imagining the Future – Supply chain 5.0
    The immediate and pressing need for ‘digitizing’ your supply-chain One may conclude: ‘Digitizing’ the supply-chain has become a survival necessity for companies to stay competitive. Apart from a substantial jump in the efficiency-effectiveness, the customer-experience, and upside to revenues, companies can expect a huge-huge cost-saving… A Look at the Future: Components of Data-driven (Digital) Supply-chain… Read More »Data-driven, AI-powered supply chain part 3: Imagining the Future – Supply chain 5.0 The post Data-driven, AI-powered supply chain part 3: Imagining the Future – Supply chain 5.0 appeared first on Data Science Central.  ( 25 min )
    Data-driven supply chain part 2: The theory of constraints & the concept of the information supply chain.
    The viability of the ‘Viable Vision’.  I did hear about the Theory of Constraints (TOC) off and on through the late 90s, but I didn’t pay much attention until late 2001. One of the i2 consultants I met at their annual meet in Malaysia had one too many- and ended up lecturing me on how… Read More »Data-driven supply chain part 2: The theory of constraints & the concept of the information supply chain. The post Data-driven supply chain part 2: The theory of constraints & the concept of the information supply chain. appeared first on Data Science Central.  ( 28 min )
    Maximizing business value with ETL for Big Data
    In today’s digital era, big data has emerged as a pivotal asset for organizations across various industries. The vast volumes of data generated every moment offer unparalleled insights into customer behaviors, market trends, and operational efficiencies. However, the true value of this data lies not just in its quantity but in the ability to process… Read More »Maximizing business value with ETL for Big Data The post Maximizing business value with ETL for Big Data appeared first on Data Science Central.  ( 23 min )
    Here’s How Much Data Gets Used By Generative AI Tools For Each Request
    While the world is going wild over the potential benefits of generative AI, there’s little attention paid to the data deployed to build and operate these tools. Let’s look at a few examples to explore what’s involved in determining data use, and why this matters for end users as well as operators. Text-based generative AI… Read More »Here’s How Much Data Gets Used By Generative AI Tools For Each Request The post Here’s How Much Data Gets Used By Generative AI Tools For Each Request appeared first on Data Science Central.  ( 21 min )
    The role of generative AI in shaping the e-commerce landscape
    Generative AI is rapidly altering the landscape for e-commerce professionals with applications ranging from effective supply chain management to tailored client experiences. The application of generative AI has transformed e-commerce and offered cutting-edge fixes to improve almost all facets of online enterprises. E-commerce companies may provide a more personalized shopping experience by delivering personalized product… Read More »The role of generative AI in shaping the e-commerce landscape The post The role of generative AI in shaping the e-commerce landscape appeared first on Data Science Central.  ( 22 min )
    Trusted, automated data sharing across spreadsheets and other documents
    Earlier in the fall, Charles Hoffman joined our non-profit Dataworthy Collective (DC) that focuses on best practices in trusted knowledge graph development. Hoffman is a CPA, consultant and former PwC auditor who works with clients who use the Extensible Business Reporting Language (XBRL).  For those who don’t know the history of standard digital business reporting,… Read More »Trusted, automated data sharing across spreadsheets and other documents The post Trusted, automated data sharing across spreadsheets and other documents appeared first on Data Science Central.  ( 20 min )
  • Open

    DQN Agent do not learn very well
    Hi everyone! I am currently learning about reinforcement learning. I wrote to ask if there is a solution to this because my agent is not learning properly while reinforcing Atari breakout through dueling dqn. In my case, I succeeded in digging a tunnel and raising the ball, but after the ball went up, the agent started to take no action. My guess is that I learned that the best thing to do is to do nothing after you put the ball up. But even when the ball comes down, it's a problem because he's not doing anything. Is there any other way to solve this problem? If you need to check the code, please tell me I'll give you the link submitted by /u/king-god_general [link] [comments]  ( 1 min )
    Sales and forecasted sales environment
    Let’s say I have historical sales and forecasted sales for a set of products and I want to use an agent (RL or otherwise) to manage inventory decisions. Any thoughts on how I might go about using that historical data to create a learning environment? I could just fit a distribution to the historical sales and sample that, sure, but the forecast methodology is based on underlying assumptions of the sales trends that would be lost in doing so. Ideally I would have an environment that generates a ton future sales trajectories and an associated forecast, and let’s the agent learn a forecast-based inventory policy over thousands of iterations. But the sales trajectories and forecast should be related in the same way the underlying forecast model is generated or else the policy is pretty worthless. Maybe it would make more sense to look at historical error of the forecast and generate random trajectories from the singular forecast going forward? The problem here is the forecast might likely change in 3 weeks based on a given sales trajectory. Any insight or similar work would be greatly appreciated! submitted by /u/aaronunderwater [link] [comments]  ( 1 min )
    Stable Baselines 3 + JAX (SBX) not running on GPU
    I've installed the SBX library (most recent version) using "pip install sbx-rl" for my Stable Baselines 3 + JAX PPO implementation to improve training speed. I would like to train using my GPU (RTX 4090) but for some reason SBX always defaults to using CPU. SBX doesn't recognize my GPU, yet when i do "nvidia-smi" it's clearly detected + pytorch itself does detect it as well. I suspect SBX isn't compatible with my CUDA version (currently 12.2), yet i can't find any documentation on what CUDA versions are supported. I've spend days trying to figure this out yet no luck. Anybody an idea? submitted by /u/ClassicAppropriate78 [link] [comments]  ( 1 min )
    Research environment inspired from copyrighted games (e.g. the Hanabi challenge etc.)
    Hi everyone! I was wondering about the legal aspects of creating reinforcement learning environments (for research only) of games that are still copyrighted and sold by the respective company that created it. As an example consider the Hanabi challenge https://arxiv.org/pdf/1902.00506.pdf, a quick google reveals that this game still is being sold. Did the authors of the paper thus ask the publisher of the game beforehand if they were allowed to create the environment? Did they just do it? What are the legalities here? Thanks and Cheers! submitted by /u/Arconer [link] [comments]
    off-policy actor-critic objective function
    I am reading the DPG paper by Silver. Here, as you can see below, the objective function has been modified with behavior policy beta. I am curious that if you maximize the objective below using gradient, will the objective function for target policy (the usual on-policy objective) be maximized? ​ https://preview.redd.it/1779aomlyz2c1.png?width=737&format=png&auto=webp&s=bd351a72294f0ad0ef8c8bdaef08e173af00f96e submitted by /u/RealJuney [link] [comments]  ( 1 min )
  • Open

    Architecture Tuning Process
    I'm always so lost trying to decide on the architecture: I have a strong mathematical background and understand all the concepts and common techniques. This feels useless when I begin working on a new dataset... I'm always immediately stuck - at what order should I run experiments? I'm not talking about different architecture families, but given a family, let's say the simple fully connected feed-forward network (MLP). Unlike hyperparaemeters I can't use a tuner or have a well defined search space, it's infinite and multidimentional. I tend to start with shallow networks, one or 2 layers, and run sweeps over the number of hidden units. This works fine but it is impossible to reach the deeper, more powerful networks I see others using on the same datasets. I guess I'm looking for some heuristic rules to guide my intuition, what is reasonable to assume will likely not change too much when I change other things, i.e. what should I optimise first and then leave constant when optimising others? or are there certain ratios I should consider such as increasing / decreasing number of layers? When I start deep and try to handle the overfitting, I often get a loss curve that goes up, or just all over the place.. Next, there's the interaction between hyperparameter and the architecture - if I settle on an architecture then start fiddling with regularisation lambdas or dropout rates, I'm expecting my architecture to no longer be optimal - perhaps I can make the model more complex now that I'm regularising? Maybe a different activation function would also warrant a change to the architecture? ​ I totally understand this is a trial-and-error process and I really have been trying a lot, but I'm feeling very lost, when I do draw conclusions, they never generalise... Some general rules of thumb would be greatly appreciated, mathematical rules don't seem to guide me here so I'd love to hear from your experience! How do you go about this? submitted by /u/Success-Dangerous [link] [comments]  ( 1 min )
  • Open

    Solving a triangle the size of Argentina
    The numbers in today’s date—11, 28, and 23—make up the sides of a triangle. This doesn’t always happen; the two smaller numbers have to add up to more than the larger number. We’ll look at triangles with sides 11, 23, and 28 in the plane, on a sphere, and on a hypersphere. Most of the […] Solving a triangle the size of Argentina first appeared on John D. Cook.  ( 7 min )
  • Open

    Bounding Box-based Multi-objective Bayesian Optimization of Risk Measures under Input Uncertainty. (arXiv:2301.11588v3 [stat.ML] UPDATED)
    In this study, we propose a novel multi-objective Bayesian optimization (MOBO) method to efficiently identify the Pareto front (PF) defined by risk measures for black-box functions under the presence of input uncertainty (IU). Existing BO methods for Pareto optimization in the presence of IU are risk-specific or without theoretical guarantees, whereas our proposed method addresses general risk measures and has theoretical guarantees. The basic idea of the proposed method is to assume a Gaussian process (GP) model for the black-box function and to construct high-probability bounding boxes for the risk measures using the GP model. Furthermore, in order to reduce the uncertainty of non-dominated bounding boxes, we propose a method of selecting the next evaluation point using a maximin distance defined by the maximum value of a quasi distance based on bounding boxes. As theoretical analysis, we prove that the algorithm can return an arbitrary-accurate solution in a finite number of iterations with high probability, for various risk measures such as Bayes risk, worst-case risk, and value-at-risk. We also give a theoretical analysis that takes into account approximation errors because there exist non-negligible approximation errors (e.g., finite approximation of PFs and sampling-based approximation of bounding boxes) in practice. We confirm that the proposed method outperforms compared with existing methods not only in the setting with IU but also in the setting of ordinary MOBO through numerical experiments.  ( 3 min )
    Proving Test Set Contamination in Black Box Language Models. (arXiv:2310.17623v2 [cs.CL] UPDATED)
    Large language models are trained on vast amounts of internet data, prompting concerns and speculation that they have memorized public benchmarks. Going from speculation to proof of contamination is challenging, as the pretraining data used by proprietary models are often not publicly accessible. We show that it is possible to provide provable guarantees of test set contamination in language models without access to pretraining data or model weights. Our approach leverages the fact that when there is no data contamination, all orderings of an exchangeable benchmark should be equally likely. In contrast, the tendency for language models to memorize example order means that a contaminated language model will find certain canonical orderings to be much more likely than others. Our test flags potential contamination whenever the likelihood of a canonically ordered benchmark dataset is significantly higher than the likelihood after shuffling the examples. We demonstrate that our procedure is sensitive enough to reliably prove test set contamination in challenging situations, including models as small as 1.4 billion parameters, on small test sets of only 1000 examples, and datasets that appear only a few times in the pretraining corpus. Using our test, we audit five popular publicly accessible language models for test set contamination and find little evidence for pervasive contamination.  ( 3 min )
    MUVO: A Multimodal Generative World Model for Autonomous Driving with Geometric Representations. (arXiv:2311.11762v2 [cs.LG] UPDATED)
    Learning unsupervised world models for autonomous driving has the potential to improve the reasoning capabilities of today's systems dramatically. However, most work neglects the physical attributes of the world and focuses on sensor data alone. We propose MUVO, a MUltimodal World Model with Geometric VOxel Representations to address this challenge. We utilize raw camera and lidar data to learn a sensor-agnostic geometric representation of the world, which can directly be used by downstream tasks, such as planning. We demonstrate multimodal future predictions and show that our geometric representation improves the prediction quality of both camera images and lidar point clouds.  ( 2 min )
    Improving Out-of-Distribution Detection in Echocardiographic View Classication through Enhancing Semantic Features. (arXiv:2308.16483v2 [eess.SP] UPDATED)
    In echocardiographic view classification, accurately detecting out-of-distribution (OOD) data is essential but challenging, especially given the subtle differences between in-distribution and OOD data. While conventional OOD detection methods, such as Mahalanobis distance (MD) are effective in far-OOD scenarios with clear distinctions between distributions, they struggle to discern the less obvious variations characteristic of echocardiographic data. In this study, we introduce a novel use of label smoothing to enhance semantic feature representation in echocardiographic images, demonstrating that these enriched semantic features are key for significantly improving near-OOD instance detection. By combining label smoothing with MD-based OOD detection, we establish a new benchmark for accuracy in echocardiographic OOD detection.  ( 2 min )
    Associative Transformer Is A Sparse Representation Learner. (arXiv:2309.12862v2 [cs.LG] UPDATED)
    Emerging from the monolithic pairwise attention mechanism in conventional Transformer models, there is a growing interest in leveraging sparse interactions that align more closely with biological principles. Approaches including the Set Transformer and the Perceiver employ cross-attention consolidated with a latent space that forms an attention bottleneck with limited capacity. Building upon recent neuroscience studies of Global Workspace Theory and associative memory, we propose the Associative Transformer (AiT). AiT induces low-rank explicit memory that serves as both priors to guide bottleneck attention in the shared workspace and attractors within associative memory of a Hopfield network. Through joint end-to-end training, these priors naturally develop module specialization, each contributing a distinct inductive bias to form attention bottlenecks. A bottleneck can foster competition among inputs for writing information into the memory. We show that AiT is a sparse representation learner, learning distinct priors through the bottlenecks that are complexity-invariant to input quantities and dimensions. AiT demonstrates its superiority over methods such as the Set Transformer, Vision Transformer, and Coordination in various vision tasks.  ( 2 min )
    Learning to Generate Lumped Hydrological Models. (arXiv:2309.09904v2 [physics.geo-ph] UPDATED)
    A lumped hydrological model structure can be considered a generative model because, given a set of parameter values, it can generate a hydrological modeling function that accurately predicts the behavior of a catchment under external forcing. It is implicitly assumed that a small number of variables (i.e., the model parameters) can sufficiently characterize variations in the behavioral characteristics of different catchments. This study adopts this assumption and uses a deep learning method to learn a generative model of hydrological modeling functions directly from the forcing and runoff data of multiple catchments. The learned generative model uses a small number of latent variables to characterize a catchment's behavior, so that assigning values to these latent variables produces a hydrological modeling function that resembles a real-world catchment. The learned generative model can be used similarly to a lumped model structure, i.e., the optimal hydrological modeling function of a catchment can be derived by estimating optimal parameter values (or latent variables) with a generic calibration algorithm. In this study, a generative model was learned from data from over 3,000 catchments worldwide. The model was then used to derive optimal modeling functions for over 700 different catchments. The resulting modeling functions generally showed a quality that was comparable to or better than 36 types of lumped model structures. Overall, this study demonstrates that the hydrological behavior of a catchment can be effectively described using a small number of latent variables, and that well-fitting hydrologic model functions can be reconstructed from these variables.  ( 3 min )
    Generating and Imputing Tabular Data via Diffusion and Flow-based Gradient-Boosted Trees. (arXiv:2309.09968v2 [cs.LG] UPDATED)
    Tabular data is hard to acquire and is subject to missing values. This paper proposes a novel approach to generate and impute mixed-type (continuous and categorical) tabular data using score-based diffusion and conditional flow matching. Contrary to previous work that relies on neural networks to learn the score function or the vector field, we instead rely on XGBoost, a popular Gradient-Boosted Tree (GBT) method. We empirically show on 27 different datasets that our approach i) generates highly realistic synthetic data when the training dataset is either clean or tainted by missing data and ii) generates diverse plausible data imputations. Furthermore, our method outperforms deep-learning generation methods on data generation and is competitive on data imputation. Finally, it can be trained in parallel using CPUs without the need for a GPU. To make it easily accessible, we release our code through a Python library and an R package.  ( 2 min )
    Grokking as the Transition from Lazy to Rich Training Dynamics. (arXiv:2310.06110v2 [stat.ML] UPDATED)
    We propose that the grokking phenomenon, where the train loss of a neural network decreases much earlier than its test loss, can arise due to a neural network transitioning from lazy training dynamics to a rich, feature learning regime. To illustrate this mechanism, we study the simple setting of vanilla gradient descent on a polynomial regression problem with a two layer neural network which exhibits grokking without regularization in a way that cannot be explained by existing theories. We identify sufficient statistics for the test loss of such a network, and tracking these over training reveals that grokking arises in this setting when the network first attempts to fit a kernel regression solution with its initial features, followed by late-time feature learning where a generalizing solution is identified after train loss is already low. We provide an asymptotic theoretical description of the grokking dynamics in this model using dynamical mean field theory (DMFT) for high dimensional data. We find that the key determinants of grokking are the rate of feature learning -- which can be controlled precisely by parameters that scale the network output -- and the alignment of the initial features with the target function $y(x)$. We argue this delayed generalization arises when (1) the top eigenvectors of the initial neural tangent kernel and the task labels $y(x)$ are misaligned, but (2) the dataset size is large enough so that it is possible for the network to generalize eventually, but not so large that train loss perfectly tracks test loss at all epochs, and (3) the network begins training in the lazy regime so does not learn features immediately. We conclude with evidence that this transition from lazy (linear model) to rich training (feature learning) can control grokking in more general settings, like on MNIST, one-layer Transformers, and student-teacher networks.
    Inherently Interpretable Time Series Classification via Multiple Instance Learning. (arXiv:2311.10049v2 [cs.LG] UPDATED)
    Conventional Time Series Classification (TSC) methods are often black boxes that obscure inherent interpretation of their decision-making processes. In this work, we leverage Multiple Instance Learning (MIL) to overcome this issue, and propose a new framework called MILLET: Multiple Instance Learning for Locally Explainable Time series classification. We apply MILLET to existing deep learning TSC models and show how they become inherently interpretable without compromising (and in some cases, even improving) predictive performance. We evaluate MILLET on 85 UCR TSC datasets and also present a novel synthetic dataset that is specially designed to facilitate interpretability evaluation. On these datasets, we show MILLET produces sparse explanations quickly that are of higher quality than other well-known interpretability methods. To the best of our knowledge, our work with MILLET, which is available on GitHub (https://github.com/JAEarly/MILTimeSeriesClassification), is the first to develop general MIL methods for TSC and apply them to an extensive variety of domains
    Reward Dropout Improves Control: Bi-objective Perspective on Reinforced LM. (arXiv:2310.04483v2 [cs.LG] UPDATED)
    We study the theoretical aspects of Reinforced Language Models (RLMs) from a bi-objective optimization perspective. Specifically, we consider the RLMs as a Pareto optimization problem that maximizes the two conflicting objectives, i.e., reward objective and likelihood objectives, simultaneously. Our main contribution consists of three parts. First, we establish the theoretical foundations of RLM as a Pareto optimization problem by presenting Reward Upper BOund (RUBO) and Pareto optimality. Our theoretical outcomes are supported by not only deductive proofs but also empirical results. Second, we propose Reward Dropout, a simple yet powerful method that guarantees to improve a bi-objective optimization of RLM. Lastly, we demonstrate that the Reward Dropout is consistently effective across five benchmark datasets and four benchmark LLMs, meaning that the Reward Dropout significantly improves the optimization performance of RLMs.
    Sensitivity-Aware Amortized Bayesian Inference. (arXiv:2310.11122v3 [stat.ML] UPDATED)
    Bayesian inference is a powerful framework for making probabilistic inferences and decisions under uncertainty. Fundamental choices in modern Bayesian workflows concern the specification of the likelihood function and prior distributions, the posterior approximator, and the data. Each choice can significantly influence model-based inference and subsequent decisions, thereby necessitating sensitivity analysis. In this work, we propose a multifaceted approach to integrate sensitivity analyses into amortized Bayesian inference (ABI, i.e., simulation-based inference with neural networks). First, we utilize weight sharing to encode the structural similarities between alternative likelihood and prior specifications in the training process with minimal computational overhead. Second, we leverage the rapid inference of neural networks to assess sensitivity to various data perturbations or pre-processing procedures. In contrast to most other Bayesian approaches, both steps circumvent the costly bottleneck of refitting the model(s) for each choice of likelihood, prior, or dataset. Finally, we propose to use neural network ensembles to evaluate variation in results induced by unreliable approximation on unseen data. We demonstrate the effectiveness of our method in applied modeling problems, ranging from the estimation of disease outbreak dynamics and global warming thresholds to the comparison of human decision-making models. Our experiments showcase how our approach enables practitioners to effectively unveil hidden relationships between modeling choices and inferential conclusions.
    EGraFFBench: Evaluation of Equivariant Graph Neural Network Force Fields for Atomistic Simulations. (arXiv:2310.02428v2 [cs.LG] UPDATED)
    Equivariant graph neural networks force fields (EGraFFs) have shown great promise in modelling complex interactions in atomic systems by exploiting the graphs' inherent symmetries. Recent works have led to a surge in the development of novel architectures that incorporate equivariance-based inductive biases alongside architectural innovations like graph transformers and message passing to model atomic interactions. However, thorough evaluations of these deploying EGraFFs for the downstream task of real-world atomistic simulations, is lacking. To this end, here we perform a systematic benchmarking of 6 EGraFF algorithms (NequIP, Allegro, BOTNet, MACE, Equiformer, TorchMDNet), with the aim of understanding their capabilities and limitations for realistic atomistic simulations. In addition to our thorough evaluation and analysis on eight existing datasets based on the benchmarking literature, we release two new benchmark datasets, propose four new metrics, and three challenging tasks. The new datasets and tasks evaluate the performance of EGraFF to out-of-distribution data, in terms of different crystal structures, temperatures, and new molecules. Interestingly, evaluation of the EGraFF models based on dynamic simulations reveals that having a lower error on energy or force does not guarantee stable or reliable simulation or faithful replication of the atomic structures. Moreover, we find that no model clearly outperforms other models on all datasets and tasks. Importantly, we show that the performance of all the models on out-of-distribution datasets is unreliable, pointing to the need for the development of a foundation model for force fields that can be used in real-world simulations. In summary, this work establishes a rigorous framework for evaluating machine learning force fields in the context of atomic simulations and points to open research challenges within this domain.
    How Over-Parameterization Slows Down Gradient Descent in Matrix Sensing: The Curses of Symmetry and Initialization. (arXiv:2310.01769v3 [cs.LG] UPDATED)
    This paper rigorously shows how over-parameterization changes the convergence behaviors of gradient descent (GD) for the matrix sensing problem, where the goal is to recover an unknown low-rank ground-truth matrix from near-isotropic linear measurements. First, we consider the symmetric setting with the symmetric parameterization where $M^* \in \mathbb{R}^{n \times n}$ is a positive semi-definite unknown matrix of rank $r \ll n$, and one uses a symmetric parameterization $XX^\top$ to learn $M^*$. Here $X \in \mathbb{R}^{n \times k}$ with $k > r$ is the factor matrix. We give a novel $\Omega (1/T^2)$ lower bound of randomly initialized GD for the over-parameterized case ($k >r$) where $T$ is the number of iterations. This is in stark contrast to the exact-parameterization scenario ($k=r$) where the convergence rate is $\exp (-\Omega (T))$. Next, we study asymmetric setting where $M^* \in \mathbb{R}^{n_1 \times n_2}$ is the unknown matrix of rank $r \ll \min\{n_1,n_2\}$, and one uses an asymmetric parameterization $FG^\top$ to learn $M^*$ where $F \in \mathbb{R}^{n_1 \times k}$ and $G \in \mathbb{R}^{n_2 \times k}$. Building on prior work, we give a global exact convergence result of randomly initialized GD for the exact-parameterization case ($k=r$) with an $\exp (-\Omega(T))$ rate. Furthermore, we give the first global exact convergence result for the over-parameterization case ($k>r$) with an $\exp(-\Omega(\alpha^2 T))$ rate where $\alpha$ is the initialization scale. This linear convergence result in the over-parameterization case is especially significant because one can apply the asymmetric parameterization to the symmetric setting to speed up from $\Omega (1/T^2)$ to linear convergence. On the other hand, we propose a novel method that only modifies one step of GD and obtains a convergence rate independent of $\alpha$, recovering the rate in the exact-parameterization case.
    Why do Angular Margin Losses work well for Semi-Supervised Anomalous Sound Detection?. (arXiv:2309.15643v2 [eess.AS] UPDATED)
    State-of-the-art anomalous sound detection systems often utilize angular margin losses to learn suitable representations of acoustic data using an auxiliary task, which usually is a supervised or self-supervised classification task. The underlying idea is that, in order to solve this auxiliary task, specific information about normal data needs to be captured in the learned representations and that this information is also sufficient to differentiate between normal and anomalous samples. Especially in noisy conditions, discriminative models based on angular margin losses tend to significantly outperform systems based on generative or one-class models. The goal of this work is to investigate why using angular margin losses with auxiliary tasks works well for detecting anomalous sounds. To this end, it is shown, both theoretically and experimentally, that minimizing angular margin losses also minimizes compactness loss while inherently preventing learning trivial solutions. Furthermore, multiple experiments are conducted to show that using a related classification task as an auxiliary task teaches the model to learn representations suitable for detecting anomalous sounds in noisy conditions. Among these experiments are performance evaluations, visualizing the embedding space with t-SNE and visualizing the input representations with respect to the anomaly score using randomized input sampling for explanation.
    YFlows: Systematic Dataflow Exploration and Code Generation for Efficient Neural Network Inference using SIMD Architectures on CPUs. (arXiv:2310.00574v3 [cs.AR] UPDATED)
    We address the challenges associated with deploying neural networks on CPUs, with a particular focus on minimizing inference time while maintaining accuracy. Our novel approach is to use the dataflow (i.e., computation order) of a neural network to explore data reuse opportunities using heuristic-guided analysis and a code generation framework, which enables exploration of various Single Instruction, Multiple Data (SIMD) implementations to achieve optimized neural network execution. Our results demonstrate that the dataflow that keeps outputs in SIMD registers while also maximizing both input and weight reuse consistently yields the best performance for a wide variety of inference workloads, achieving up to 3x speedup for 8-bit neural networks, and up to 4.8x speedup for binary neural networks, respectively, over the optimized implementations of neural networks today.
    Deep Learning based Spatially Dependent Acoustical Properties Recovery. (arXiv:2310.10970v2 [cs.LG] UPDATED)
    The physics-informed neural network (PINN) is capable of recovering partial differential equation (PDE) coefficients that remain constant throughout the spatial domain directly from physical measurements. In this work, we propose a spatially dependent physics-informed neural network (SD-PINN), which enables the recovery of coefficients in spatially-dependent PDEs using a single neural network, eliminating the requirement for domain-specific physical expertise. We apply the SD-PINN to spatially-dependent wave equation coefficients recovery to reveal the spatial distribution of acoustical properties in the inhomogeneous medium. The proposed method exhibits robustness to noise owing to the incorporation of a loss function for the physical constraint that the assumed PDE must be satisfied. For the coefficients recovery of spatially two-dimensional PDEs, we store the PDE coefficients at all locations in the 2D region of interest into a matrix and incorporate the low-rank assumption for such a matrix to recover the coefficients at locations without available measurements.
    Certified Robustness via Dynamic Margin Maximization and Improved Lipschitz Regularization. (arXiv:2310.00116v2 [cs.LG] UPDATED)
    To improve the robustness of deep classifiers against adversarial perturbations, many approaches have been proposed, such as designing new architectures with better robustness properties (e.g., Lipschitz-capped networks), or modifying the training process itself (e.g., min-max optimization, constrained learning, or regularization). These approaches, however, might not be effective at increasing the margin in the input (feature) space. As a result, there has been an increasing interest in developing training procedures that can directly manipulate the decision boundary in the input space. In this paper, we build upon recent developments in this category by developing a robust training algorithm whose objective is to increase the margin in the output (logit) space while regularizing the Lipschitz constant of the model along vulnerable directions. We show that these two objectives can directly promote larger margins in the input space. To this end, we develop a scalable method for calculating guaranteed differentiable upper bounds on the Lipschitz constant of neural networks accurately and efficiently. The relative accuracy of the bounds prevents excessive regularization and allows for more direct manipulation of the decision boundary. Furthermore, our Lipschitz bounding algorithm exploits the monotonicity and Lipschitz continuity of the activation layers, and the resulting bounds can be used to design new layers with controllable bounds on their Lipschitz constant. Experiments on the MNIST, CIFAR-10, and Tiny-ImageNet data sets verify that our proposed algorithm obtains competitively improved results compared to the state-of-the-art.
    Designing and evaluating an online reinforcement learning agent for physical exercise recommendations in N-of-1 trials. (arXiv:2309.14156v2 [cs.LG] UPDATED)
    Personalized adaptive interventions offer the opportunity to increase patient benefits, however, there are challenges in their planning and implementation. Once implemented, it is an important question whether personalized adaptive interventions are indeed clinically more effective compared to a fixed gold standard intervention. In this paper, we present an innovative N-of-1 trial study design testing whether implementing a personalized intervention by an online reinforcement learning agent is feasible and effective. Throughout, we use a new study on physical exercise recommendations to reduce pain in endometriosis for illustration. We describe the design of a contextual bandit recommendation agent and evaluate the agent in simulation studies. The results show that, first, implementing a personalized intervention by an online reinforcement learning agent is feasible. Second, such adaptive interventions have the potential to improve patients' benefits even if only few observations are available. As one challenge, they add complexity to the design and implementation process. In order to quantify the expected benefit, data from previous interventional studies is required. We expect our approach to be transferable to other interventions and clinical interventions.
    An Empirical Investigation of Pre-trained Model Selection for Out-of-Distribution Generalization and Calibration. (arXiv:2307.08187v2 [cs.LG] UPDATED)
    In the realm of out-of-distribution (OOD) generalization tasks, fine-tuning pre-trained models has become a prevalent strategy. Different from most prior work that has focused on advancing learning algorithms, we systematically examined how pre-trained model size, pre-training data scale, and training strategies impact downstream generalization and uncertainty calibration. We evaluated 97 models across diverse pre-trained model sizes, five pre-training datasets, and five data augmentations through extensive experiments on four distribution shift datasets totaling over 100,000 GPU hours. Our results demonstrate the significant impact of pre-trained model selection, with optimal choices substantially improving OOD accuracy over algorithm improvement alone. We find larger models and bigger pre-training data improve OOD performance and calibration, in contrast to some prior studies that found modern deep networks to calibrate worse than classical shallow models. Our work underscores the overlooked importance of pre-trained model selection for out-of-distribution generalization and calibration.
    The Noise Geometry of Stochastic Gradient Descent: A Quantitative and Analytical Characterization. (arXiv:2310.00692v2 [cs.LG] UPDATED)
    Empirical studies have demonstrated that the noise in stochastic gradient descent (SGD) aligns favorably with the local geometry of loss landscape. However, theoretical and quantitative explanations for this phenomenon remain sparse. In this paper, we offer a comprehensive theoretical investigation into the aforementioned {\em noise geometry} for over-parameterized linear (OLMs) models and two-layer neural networks. We scrutinize both average and directional alignments, paying special attention to how factors like sample size and input data degeneracy affect the alignment strength. As a specific application, we leverage our noise geometry characterizations to study how SGD escapes from sharp minima, revealing that the escape direction has significant components along flat directions. This is in stark contrast to GD, which escapes only along the sharpest directions. To substantiate our theoretical findings, both synthetic and real-world experiments are provided.
    MARBLE: Music Audio Representation Benchmark for Universal Evaluation. (arXiv:2306.10548v4 [cs.SD] UPDATED)
    In the era of extensive intersection between art and Artificial Intelligence (AI), such as image generation and fiction co-creation, AI for music remains relatively nascent, particularly in music understanding. This is evident in the limited work on deep music representations, the scarcity of large-scale datasets, and the absence of a universal and community-driven benchmark. To address this issue, we introduce the Music Audio Representation Benchmark for universaL Evaluation, termed MARBLE. It aims to provide a benchmark for various Music Information Retrieval (MIR) tasks by defining a comprehensive taxonomy with four hierarchy levels, including acoustic, performance, score, and high-level description. We then establish a unified protocol based on 14 tasks on 8 public-available datasets, providing a fair and standard assessment of representations of all open-sourced pre-trained models developed on music recordings as baselines. Besides, MARBLE offers an easy-to-use, extendable, and reproducible suite for the community, with a clear statement on copyright issues on datasets. Results suggest recently proposed large-scale pre-trained musical language models perform the best in most tasks, with room for further improvement. The leaderboard and toolkit repository are published at https://marble-bm.shef.ac.uk to promote future music AI research.
    Kernel-Based Tests for Likelihood-Free Hypothesis Testing. (arXiv:2308.09043v2 [stat.ML] UPDATED)
    Given $n$ observations from two balanced classes, consider the task of labeling an additional $m$ inputs that are known to all belong to \emph{one} of the two classes. Special cases of this problem are well-known: with complete knowledge of class distributions ($n=\infty$) the problem is solved optimally by the likelihood-ratio test; when $m=1$ it corresponds to binary classification; and when $m\approx n$ it is equivalent to two-sample testing. The intermediate settings occur in the field of likelihood-free inference, where labeled samples are obtained by running forward simulations and the unlabeled sample is collected experimentally. In recent work it was discovered that there is a fundamental trade-off between $m$ and $n$: increasing the data sample $m$ reduces the amount $n$ of training/simulation data needed. In this work we (a) introduce a generalization where unlabeled samples come from a mixture of the two classes -- a case often encountered in practice; (b) study the minimax sample complexity for non-parametric classes of densities under \textit{maximum mean discrepancy} (MMD) separation; and (c) investigate the empirical performance of kernels parameterized by neural networks on two tasks: detection of the Higgs boson and detection of planted DDPM generated images amidst CIFAR-10 images. For both problems we confirm the existence of the theoretically predicted asymmetric $m$ vs $n$ trade-off.
    Karasu: A Collaborative Approach to Efficient Cluster Configuration for Big Data Analytics. (arXiv:2308.11792v2 [cs.DC] UPDATED)
    Selecting the right resources for big data analytics jobs is hard because of the wide variety of configuration options like machine type and cluster size. As poor choices can have a significant impact on resource efficiency, cost, and energy usage, automated approaches are gaining popularity. Most existing methods rely on profiling recurring workloads to find near-optimal solutions over time. Due to the cold-start problem, this often leads to lengthy and costly profiling phases. However, big data analytics jobs across users can share many common properties: they often operate on similar infrastructure, using similar algorithms implemented in similar frameworks. The potential in sharing aggregated profiling runs to collaboratively address the cold start problem is largely unexplored. We present Karasu, an approach to more efficient resource configuration profiling that promotes data sharing among users working with similar infrastructures, frameworks, algorithms, or datasets. Karasu trains lightweight performance models using aggregated runtime information of collaborators and combines them into an ensemble method to exploit inherent knowledge of the configuration search space. Moreover, Karasu allows the optimization of multiple objectives simultaneously. Our evaluation is based on performance data from diverse workload executions in a public cloud environment. We show that Karasu is able to significantly boost existing methods in terms of performance, search time, and cost, even when few comparable profiling runs are available that share only partial common characteristics with the target job.
    Task-Robust Pre-Training for Worst-Case Downstream Adaptation. (arXiv:2306.12070v3 [cs.CV] UPDATED)
    Pre-training has achieved remarkable success when transferred to downstream tasks. In machine learning, we care about not only the good performance of a model but also its behavior under reasonable shifts of condition. The same philosophy holds when pre-training a foundation model. However, the foundation model may not uniformly behave well for a series of related downstream tasks. This happens, for example, when conducting mask recovery regression where the recovery ability or the training instances diverge like pattern features are extracted dominantly on pre-training, but semantic features are also required on a downstream task. This paper considers pre-training a model that guarantees a uniformly good performance over the downstream tasks. We call this goal as $\textit{downstream-task robustness}$. Our method first separates the upstream task into several representative ones and applies a simple minimax loss for pre-training. We then design an efficient algorithm to solve the minimax loss and prove its convergence in the convex setting. In the experiments, we show both on large-scale natural language processing and computer vision datasets our method increases the metrics on worse-case downstream tasks. Additionally, some theoretical explanations for why our loss is beneficial are provided. Specifically, we show fewer samples are inherently required for the most challenging downstream task in some cases.
    The SocialAI School: Insights from Developmental Psychology Towards Artificial Socio-Cultural Agents. (arXiv:2307.07871v2 [cs.AI] UPDATED)
    Developmental psychologists have long-established the importance of socio-cognitive abilities in human intelligence. These abilities enable us to enter, participate and benefit from human culture. AI research on social interactive agents mostly concerns the emergence of culture in a multi-agent setting (often without a strong grounding in developmental psychology). We argue that AI research should be informed by psychology and study socio-cognitive abilities enabling to enter a culture too. We discuss the theories of Michael Tomasello and Jerome Bruner to introduce some of their concepts to AI and outline key concepts and socio-cognitive abilities. We present The SocialAI school - a tool including a customizable parameterized uite of procedurally generated environments, which simplifies conducting experiments regarding those concepts. We show examples of such experiments with RL agents and Large Language Models. The main motivation of this work is to engage the AI community around the problem of social intelligence informed by developmental psychology, and to provide a tool to simplify first steps in this direction. Refer to the project website for code and additional information: https://sites.google.com/view/socialai-school.
    A Bayesian Take on Gaussian Process Networks. (arXiv:2306.11380v4 [stat.ML] UPDATED)
    Gaussian Process Networks (GPNs) are a class of directed graphical models which employ Gaussian processes as priors for the conditional expectation of each variable given its parents in the network. The model allows the description of continuous joint distributions in a compact but flexible manner with minimal parametric assumptions on the dependencies between variables. Bayesian structure learning of GPNs requires computing the posterior over graphs of the network and is computationally infeasible even in low dimensions. This work implements Monte Carlo and Markov Chain Monte Carlo methods to sample from the posterior distribution of network structures. As such, the approach follows the Bayesian paradigm, comparing models via their marginal likelihood and computing the posterior probability of the GPN features. Simulation studies show that our method outperforms state-of-the-art algorithms in recovering the graphical structure of the network and provides an accurate approximation of its posterior distribution.
    Real Robot Challenge 2022: Learning Dexterous Manipulation from Offline Data in the Real World. (arXiv:2308.07741v3 [cs.RO] UPDATED)
    Experimentation on real robots is demanding in terms of time and costs. For this reason, a large part of the reinforcement learning (RL) community uses simulators to develop and benchmark algorithms. However, insights gained in simulation do not necessarily translate to real robots, in particular for tasks involving complex interactions with the environment. The Real Robot Challenge 2022 therefore served as a bridge between the RL and robotics communities by allowing participants to experiment remotely with a real robot - as easily as in simulation. In the last years, offline reinforcement learning has matured into a promising paradigm for learning from pre-collected datasets, alleviating the reliance on expensive online interactions. We therefore asked the participants to learn two dexterous manipulation tasks involving pushing, grasping, and in-hand orientation from provided real-robot datasets. An extensive software documentation and an initial stage based on a simulation of the real set-up made the competition particularly accessible. By giving each team plenty of access budget to evaluate their offline-learned policies on a cluster of seven identical real TriFinger platforms, we organized an exciting competition for machine learners and roboticists alike. In this work we state the rules of the competition, present the methods used by the winning teams and compare their results with a benchmark of state-of-the-art offline RL algorithms on the challenge datasets.
    Graph of Thoughts: Solving Elaborate Problems with Large Language Models. (arXiv:2308.09687v3 [cs.CL] UPDATED)
    We introduce Graph of Thoughts (GoT): a framework that advances prompting capabilities in large language models (LLMs) beyond those offered by paradigms such as Chain-of-Thought or Tree of Thoughts (ToT). The key idea and primary advantage of GoT is the ability to model the information generated by an LLM as an arbitrary graph, where units of information ("LLM thoughts") are vertices, and edges correspond to dependencies between these vertices. This approach enables combining arbitrary LLM thoughts into synergistic outcomes, distilling the essence of whole networks of thoughts, or enhancing thoughts using feedback loops. We illustrate that GoT offers advantages over state of the art on different tasks, for example increasing the quality of sorting by 62% over ToT, while simultaneously reducing costs by >31%. We ensure that GoT is extensible with new thought transformations and thus can be used to spearhead new prompting schemes. This work brings the LLM reasoning closer to human thinking or brain mechanisms such as recurrence, both of which form complex networks.
    Navigating the Design Space of Equivariant Diffusion-Based Generative Models for De Novo 3D Molecule Generation. (arXiv:2309.17296v2 [cs.LG] UPDATED)
    Deep generative diffusion models are a promising avenue for 3D de novo molecular design in materials science and drug discovery. However, their utility is still limited by suboptimal performance on large molecular structures and limited training data. To address this gap, we explore the design space of E(3)-equivariant diffusion models, focusing on previously unexplored areas. Our extensive comparative analysis evaluates the interplay between continuous and discrete state spaces. From this investigation, we present the EQGAT-diff model, which consistently outperforms established models for the QM9 and GEOM-Drugs datasets. Significantly, EQGAT-diff takes continuous atom positions, while chemical elements and bond types are categorical and uses time-dependent loss weighting, substantially increasing training convergence, the quality of generated samples, and inference time. We also showcase that including chemically motivated additional features like hybridization states in the diffusion process enhances the validity of generated molecules. To further strengthen the applicability of diffusion models to limited training data, we investigate the transferability of EQGAT-diff trained on the large PubChem3D dataset with implicit hydrogen atoms to target different data distributions. Fine-tuning EQGAT-diff for just a few iterations shows an efficient distribution shift, further improving performance throughout data sets. Finally, we test our model on the Crossdocked data set for structure-based de novo ligand generation, underlining the importance of our findings showing state-of-the-art performance on Vina docking scores.
    Knowledge Accumulation in Continually Learned Representations and the Issue of Feature Forgetting. (arXiv:2304.00933v2 [cs.LG] UPDATED)
    While it is established that neural networks suffer from catastrophic forgetting ``at the output level'', it is debated whether this is also the case at the level of representations. Some studies ascribe a certain level of innate robustness to representations, that they only forget minimally and no critical information, while others claim that representations are also severely affected by forgetting. To settle this debate, we first discuss how this apparent disagreement might stem from the coexistence of two phenomena that affect the quality of continually learned representations: knowledge accumulation and feature forgetting. We then show that, even though it is true that feature forgetting can be small in absolute terms, newly learned information is forgotten just as catastrophically at the level of representations as it is at the output level. Next we show that this feature forgetting is problematic as it substantially slows down knowledge accumulation. We further show that representations that are continually learned through both supervised and self-supervised learning suffer from feature forgetting. Finally, we study how feature forgetting and knowledge accumulation are affected by different types of continual learning methods.
    Blockchain-Enabled Federated Learning: A Reference Architecture Design, Implementation, and Verification. (arXiv:2306.10841v3 [cs.LG] UPDATED)
    This paper presents a novel reference architecture for blockchain-enabled federated learning (BCFL), a state-of-the-art approach that amalgamates the strengths of federated learning and blockchain technology.We define smart contract functions, stakeholders and their roles, and the use of interplanetary file system (IPFS) as key components of BCFL and conduct a comprehensive analysis. In traditional centralized federated learning, the selection of local nodes and the collection of learning results for each round are merged under the control of a central server. In contrast, in BCFL, all these processes are monitored and managed via smart contracts. Additionally, we propose an extension architecture to support both crossdevice and cross-silo federated learning scenarios. Furthermore, we implement and verify the architecture in a practical real-world Ethereum development environment. Our BCFL reference architecture provides significant flexibility and extensibility, accommodating the integration of various additional elements, as per specific requirements and use cases, thereby rendering it an adaptable solution for a wide range of BCFL applications. As a prominent example of extensibility, decentralized identifiers (DIDs) have been employed as an authentication method to introduce practical utilization within BCFL. This study not only bridges a crucial gap between research and practical deployment but also lays a solid foundation for future explorations in the realm of BCFL. The pivotal contribution of this study is the successful implementation and verification of a realistic BCFL reference architecture. We intend to make the source code publicly accessible shortly, fostering further advancements and adaptations within the community.
    Explainable Machine Learning for Categorical and Mixed Data with Lossless Visualization. (arXiv:2305.18437v3 [cs.LG] UPDATED)
    Building accurate and interpretable Machine Learning (ML) models for heterogeneous/mixed data is a long-standing challenge for algorithms designed for numeric data. This work focuses on developing numeric coding schemes for non-numeric attributes for ML algorithms to support accurate and explainable ML models, methods for lossless visualization of n-D non-numeric categorical data with visual rule discovery in these visualizations, and accurate and explainable ML models for categorical data. This study proposes a classification of mixed data types and analyzes their important role in Machine Learning. It presents a toolkit for enforcing interpretability of all internal operations of ML algorithms on mixed data with a visual data exploration on mixed data. A new Sequential Rule Generation (SRG) algorithm for explainable rule generation with categorical data is proposed and successfully evaluated in multiple computational experiments. This work is one of the steps to the full scope ML algorithms for mixed data supported by lossless visualization of n-D data in General Line Coordinates beyond Parallel Coordinates.
    PEAR: Primitive enabled Adaptive Relabeling for boosting Hierarchical Reinforcement Learning. (arXiv:2306.06394v4 [cs.LG] UPDATED)
    Hierarchical reinforcement learning (HRL) has the potential to solve complex long horizon tasks using temporal abstraction and increased exploration. However, hierarchical agents are difficult to train due to inherent non-stationarity. We present primitive enabled adaptive relabeling (PEAR), a two-phase approach where we first perform adaptive relabeling on a few expert demonstrations to generate efficient subgoal supervision, and then jointly optimize HRL agents by employing reinforcement learning (RL) and imitation learning (IL). We perform theoretical analysis to $(i)$ bound the sub-optimality of our approach, and $(ii)$ derive a generalized plug-and-play framework for joint optimization using RL and IL. PEAR uses a handful of expert demonstrations and makes minimal limiting assumptions on the task structure. Additionally, it can be easily integrated with typical model free RL algorithms to produce a practical HRL algorithm. We perform experiments on challenging robotic environments and show that PEAR is able to solve tasks that require long term decision making. We empirically show that PEAR exhibits improved performance and sample efficiency over previous hierarchical and non-hierarchical approaches. We also perform real world robotic experiments on complex tasks and demonstrate that PEAR consistently outperforms the baselines.
    Symplectic Structure-Aware Hamiltonian (Graph) Embeddings. (arXiv:2309.04885v2 [cs.LG] UPDATED)
    In traditional Graph Neural Networks (GNNs), the assumption of a fixed embedding manifold often limits their adaptability to diverse graph geometries. Recently, Hamiltonian system-inspired GNNs have been proposed to address the dynamic nature of such embeddings by incorporating physical laws into node feature updates. We present Symplectic Structure-Aware Hamiltonian GNN (SAH-GNN), a novel approach that generalizes Hamiltonian dynamics for more flexible node feature updates. Unlike existing Hamiltonian approaches, SAH-GNN employs Riemannian optimization on the symplectic Stiefel manifold to adaptively learn the underlying symplectic structure, circumventing the limitations of existing Hamiltonian GNNs that rely on a pre-defined form of standard symplectic structure. This innovation allows SAH-GNN to automatically adapt to various graph datasets without extensive hyperparameter tuning. Moreover, it conserves energy during training meaning the implicit Hamiltonian system is physically meaningful. Finally, we empirically validate SAH-GNN's superiority and adaptability in node classification tasks across multiple types of graph datasets.
    Feature Perturbation Augmentation for Reliable Evaluation of Importance Estimators in Neural Networks. (arXiv:2303.01538v2 [cs.LG] UPDATED)
    Post-hoc explanation methods attempt to make the inner workings of deep neural networks more interpretable. However, since a ground truth is in general lacking, local post-hoc interpretability methods, which assign importance scores to input features, are challenging to evaluate. One of the most popular evaluation frameworks is to perturb features deemed important by an interpretability method and to measure the change in prediction accuracy. Intuitively, a large decrease in prediction accuracy would indicate that the explanation has correctly quantified the importance of features with respect to the prediction outcome (e.g., logits). However, the change in the prediction outcome may stem from perturbation artifacts, since perturbed samples in the test dataset are out of distribution (OOD) compared to the training dataset and can therefore potentially disturb the model in an unexpected manner. To overcome this challenge, we propose feature perturbation augmentation (FPA) which creates and adds perturbed images during the model training. Through extensive computational experiments, we demonstrate that FPA makes deep neural networks (DNNs) more robust against perturbations. Furthermore, training DNNs with FPA demonstrate that the sign of importance scores may explain the model more meaningfully than has previously been assumed. Overall, FPA is an intuitive data augmentation technique that improves the evaluation of post-hoc interpretability methods.
    3D helical CT Reconstruction with a Memory Efficient Learned Primal-Dual Architecture. (arXiv:2205.11952v2 [eess.IV] UPDATED)
    Deep learning based computed tomography (CT) reconstruction has demonstrated outstanding performance on simulated 2D low-dose CT data. This applies in particular to domain adapted neural networks, which incorporate a handcrafted physics model for CT imaging. Empirical evidence shows that employing such architectures reduces the demand for training data and improves upon generalisation. However, their training requires large computational resources that quickly become prohibitive in 3D helical CT, which is the most common acquisition geometry used for medical imaging. Furthermore, clinical data also comes with other challenges not accounted for in simulations, like errors in flux measurement, resolution mismatch and, most importantly, the absence of the real ground truth. The necessity to have a computationally feasible training combined with the need to address these issues has made it difficult to evaluate deep learning based reconstruction on clinical 3D helical CT. This paper modifies a domain adapted neural network architecture, the Learned Primal-Dual (LPD), so that it can be trained and applied to reconstruction in this setting. We achieve this by splitting the helical trajectory into sections and applying the unrolled LPD iterations to those sections sequentially. To the best of our knowledge, this work is the first to apply an unrolled deep learning architecture for reconstruction on full-sized clinical data, like those in the Low dose CT image and projection data set (LDCT). Moreover, training and testing is done on a single GPU card with 24GB of memory.
    Evaluating Pretrained models for Deployable Lifelong Learning. (arXiv:2311.13648v1 [cs.LG])
    We create a novel benchmark for evaluating a Deployable Lifelong Learning system for Visual Reinforcement Learning (RL) that is pretrained on a curated dataset, and propose a novel Scalable Lifelong Learning system capable of retaining knowledge from the previously learnt RL tasks. Our benchmark measures the efficacy of a deployable Lifelong Learning system that is evaluated on scalability, performance and resource utilization. Our proposed system, once pretrained on the dataset, can be deployed to perform continual learning on unseen tasks. Our proposed method consists of a Few Shot Class Incremental Learning (FSCIL) based task-mapper and an encoder/backbone trained entirely using the pretrain dataset. The policy parameters corresponding to the recognized task are then loaded to perform the task. We show that this system can be scaled to incorporate a large number of tasks due to the small memory footprint and fewer computational resources. We perform experiments on our DeLL (Deployment for Lifelong Learning) benchmark on the Atari games to determine the efficacy of the system.
    Knowledge Distillation Based Semantic Communications For Multiple Users. (arXiv:2311.13789v1 [eess.SP])
    Deep learning (DL) has shown great potential in revolutionizing the traditional communications system. Many applications in communications have adopted DL techniques due to their powerful representation ability. However, the learning-based methods can be dependent on the training dataset and perform worse on unseen interference due to limited model generalizability and complexity. In this paper, we consider the semantic communication (SemCom) system with multiple users, where there is a limited number of training samples and unexpected interference. To improve the model generalization ability and reduce the model size, we propose a knowledge distillation (KD) based system where Transformer based encoder-decoder is implemented as the semantic encoder-decoder and fully connected neural networks are implemented as the channel encoder-decoder. Specifically, four types of knowledge transfer and model compression are analyzed. Important system and model parameters are considered, including the level of noise and interference, the number of interfering users and the size of the encoder and decoder. Numerical results demonstrate that KD significantly improves the robustness and the generalization ability when applied to unexpected interference, and it reduces the performance loss when compressing the model size.
    Efficient Transformer Knowledge Distillation: A Performance Review. (arXiv:2311.13657v1 [cs.CL])
    As pretrained transformer language models continue to achieve state-of-the-art performance, the Natural Language Processing community has pushed for advances in model compression and efficient attention mechanisms to address high computational requirements and limited input sequence length. Despite these separate efforts, no investigation has been done into the intersection of these two fields. In this work, we provide an evaluation of model compression via knowledge distillation on efficient attention transformers. We provide cost-performance trade-offs for the compression of state-of-the-art efficient attention architectures and the gains made in performance in comparison to their full attention counterparts. Furthermore, we introduce a new long-context Named Entity Recognition dataset, GONERD, to train and test the performance of NER models on long sequences. We find that distilled efficient attention transformers can preserve a significant amount of original model performance, preserving up to 98.6% across short-context tasks (GLUE, SQUAD, CoNLL-2003), up to 94.6% across long-context Question-and-Answering tasks (HotpotQA, TriviaQA), and up to 98.8% on long-context Named Entity Recognition (GONERD), while decreasing inference times by up to 57.8%. We find that, for most models on most tasks, performing knowledge distillation is an effective method to yield high-performing efficient attention models with low costs.
    Regret Analysis of Learning-Based Linear Quadratic Gaussian Control with Additive Exploration. (arXiv:2311.02679v2 [eess.SY] UPDATED)
    In this paper, we analyze the regret incurred by a computationally efficient exploration strategy, known as naive exploration, for controlling unknown partially observable systems within the Linear Quadratic Gaussian (LQG) framework. We introduce a two-phase control algorithm called LQG-NAIVE, which involves an initial phase of injecting Gaussian input signals to obtain a system model, followed by a second phase of an interplay between naive exploration and control in an episodic fashion. We show that LQG-NAIVE achieves a regret growth rate of $\tilde{\mathcal{O}}(\sqrt{T})$, i.e., $\mathcal{O}(\sqrt{T})$ up to logarithmic factors after $T$ time steps, and we validate its performance through numerical simulations. Additionally, we propose LQG-IF2E, which extends the exploration signal to a `closed-loop' setting by incorporating the Fisher Information Matrix (FIM). We provide compelling numerical evidence of the competitive performance of LQG-IF2E compared to LQG-NAIVE.
    Gradient-based bilevel optimization for multi-penalty Ridge regression through matrix differential calculus. (arXiv:2311.14182v1 [cs.LG])
    Common regularization algorithms for linear regression, such as LASSO and Ridge regression, rely on a regularization hyperparameter that balances the tradeoff between minimizing the fitting error and the norm of the learned model coefficients. As this hyperparameter is scalar, it can be easily selected via random or grid search optimizing a cross-validation criterion. However, using a scalar hyperparameter limits the algorithm's flexibility and potential for better generalization. In this paper, we address the problem of linear regression with l2-regularization, where a different regularization hyperparameter is associated with each input variable. We optimize these hyperparameters using a gradient-based approach, wherein the gradient of a cross-validation criterion with respect to the regularization hyperparameters is computed analytically through matrix differential calculus. Additionally, we introduce two strategies tailored for sparse model learning problems aiming at reducing the risk of overfitting to the validation data. Numerical examples demonstrate that our multi-hyperparameter regularization approach outperforms LASSO, Ridge, and Elastic Net regression. Moreover, the analytical computation of the gradient proves to be more efficient in terms of computational time compared to automatic differentiation, especially when handling a large number of input variables. Application to the identification of over-parameterized Linear Parameter-Varying models is also presented.
    Proactive DP: A Multple Target Optimization Framework for DP-SGD. (arXiv:2102.09030v9 [cs.LG] UPDATED)
    We introduce a multiple target optimization framework for DP-SGD referred to as pro-active DP. In contrast to traditional DP accountants, which are used to track the expenditure of privacy budgets, the pro-active DP scheme allows one to {\it a-priori} select parameters of DP-SGD based on a fixed privacy budget (in terms of $\epsilon$ and $\delta$) in such a way to optimize the anticipated utility (test accuracy) the most. To achieve this objective, we first propose significant improvements to the moment account method, presenting a closed-form $(\epsilon,\delta)$-DP guarantee that connects all parameters in the DP-SGD setup. Generally, DP-SGD is $(\epsilon\leq 1/2,\delta=1/N)$-DP if $\sigma=\sqrt{2(\epsilon +\ln(1/\delta))/\epsilon}$ with $T$ at least $\approx 2k^2/\epsilon$ and $(2/e)^2k^2-1/2\geq \ln(N)$, where $T$ is the total number of rounds, and $K=kN$ is the total number of gradient computations where $k$ measures $K$ in number of epochs of size $N$ of the local data set. We prove that our expression is close to tight in that if $T$ is more than a constant factor $\approx 4$ smaller than the lower bound $\approx 2k^2/\epsilon$, then the $(\epsilon,\delta)$-DP guarantee is violated. Our enhanced DP theory allows us to create a utility graph and DP calculator. These tools link privacy and utility objectives and search for optimal experiment setups, efficiently taking into account both accuracy and privacy objectives, as well as implementation goals. We furnish a comprehensive implementation flow of our proactive DP, with rigorous experiments to showcase the proof-of-concept.
    Particle Guidance: non-I.I.D. Diverse Sampling with Diffusion Models. (arXiv:2310.13102v2 [cs.LG] UPDATED)
    In light of the widespread success of generative models, a significant amount of research has gone into speeding up their sampling time. However, generative models are often sampled multiple times to obtain a diverse set incurring a cost that is orthogonal to sampling time. We tackle the question of how to improve diversity and sample efficiency by moving beyond the common assumption of independent samples. We propose particle guidance, an extension of diffusion-based generative sampling where a joint-particle time-evolving potential enforces diversity. We analyze theoretically the joint distribution that particle guidance generates, how to learn a potential that achieves optimal diversity, and the connections with methods in other disciplines. Empirically, we test the framework both in the setting of conditional image generation, where we are able to increase diversity without affecting quality, and molecular conformer generation, where we reduce the state-of-the-art median error by 13% on average.
    Weight fluctuations in (deep) linear neural networks and a derivation of the inverse-variance flatness relation. (arXiv:2311.14120v1 [cs.LG])
    We investigate the stationary (late-time) training regime of single- and two-layer linear neural networks within the continuum limit of stochastic gradient descent (SGD) for synthetic Gaussian data. In the case of a single-layer network in the weakly oversampled regime, the spectrum of the noise covariance matrix deviates notably from the Hessian, which can be attributed to the broken detailed balance of SGD dynamics. The weight fluctuations are in this case generally anisotropic, but experience an isotropic loss. For a two-layer network, we obtain the stochastic dynamics of the weights in each layer and analyze the associated stationary covariances. We identify the inter-layer coupling as a new source of anisotropy for the weight fluctuations. In contrast to the single-layer case, the weight fluctuations experience an anisotropic loss, the flatness of which is inversely related to the fluctuation variance. We thereby provide an analytical derivation of the recently observed inverse variance-flatness relation in a deep linear network model.
    Mitigating Over-Smoothing and Over-Squashing using Augmentations of Forman-Ricci Curvature. (arXiv:2309.09384v2 [cs.LG] UPDATED)
    While Graph Neural Networks (GNNs) have been successfully leveraged for learning on graph-structured data across domains, several potential pitfalls have been described recently. Those include the inability to accurately leverage information encoded in long-range connections (over-squashing), as well as difficulties distinguishing the learned representations of nearby nodes with growing network depth (over-smoothing). An effective way to characterize both effects is discrete curvature: Long-range connections that underlie over-squashing effects have low curvature, whereas edges that contribute to over-smoothing have high curvature. This observation has given rise to rewiring techniques, which add or remove edges to mitigate over-smoothing and over-squashing. Several rewiring approaches utilizing graph characteristics, such as curvature or the spectrum of the graph Laplacian, have been proposed. However, existing methods, especially those based on curvature, often require expensive subroutines and careful hyperparameter tuning, which limits their applicability to large-scale graphs. Here we propose a rewiring technique based on Augmented Forman-Ricci curvature (AFRC), a scalable curvature notation, which can be computed in linear time. We prove that AFRC effectively characterizes over-smoothing and over-squashing effects in message-passing GNNs. We complement our theoretical results with experiments, which demonstrate that the proposed approach achieves state-of-the-art performance while significantly reducing the computational cost in comparison with other methods. Utilizing fundamental properties of discrete curvature, we propose effective heuristics for hyperparameters in curvature-based rewiring, which avoids expensive hyperparameter searches, further improving the scalability of the proposed approach.
    FlexTrain: A Dynamic Training Framework for Heterogeneous Devices Environments. (arXiv:2310.20457v2 [cs.LG] UPDATED)
    As deep learning models become increasingly large, they pose significant challenges in heterogeneous devices environments. The size of deep learning models makes it difficult to deploy them on low-power or resource-constrained devices, leading to long inference times and high energy consumption. To address these challenges, we propose FlexTrain, a framework that accommodates the diverse storage and computational resources available on different devices during the training phase. FlexTrain enables efficient deployment of deep learning models, while respecting device constraints, minimizing communication costs, and ensuring seamless integration with diverse devices. We demonstrate the effectiveness of FlexTrain on the CIFAR-100 dataset, where a single global model trained with FlexTrain can be easily deployed on heterogeneous devices, saving training time and energy consumption. We also extend FlexTrain to the federated learning setting, showing that our approach outperforms standard federated learning benchmarks on both CIFAR-10 and CIFAR-100 datasets.
    Scalable AI Safety via Doubly-Efficient Debate. (arXiv:2311.14125v1 [cs.AI])
    The emergence of pre-trained AI systems with powerful capabilities across a diverse and ever-increasing set of complex domains has raised a critical challenge for AI safety as tasks can become too complicated for humans to judge directly. Irving et al. [2018] proposed a debate method in this direction with the goal of pitting the power of such AI models against each other until the problem of identifying (mis)-alignment is broken down into a manageable subtask. While the promise of this approach is clear, the original framework was based on the assumption that the honest strategy is able to simulate deterministic AI systems for an exponential number of steps, limiting its applicability. In this paper, we show how to address these challenges by designing a new set of debate protocols where the honest strategy can always succeed using a simulation of a polynomial number of steps, whilst being able to verify the alignment of stochastic AI systems, even when the dishonest strategy is allowed to use exponentially many simulation steps.
    Fast, Expressive SE$(n)$ Equivariant Networks through Weight-Sharing in Position-Orientation Space. (arXiv:2310.02970v2 [cs.LG] UPDATED)
    Based on the theory of homogeneous spaces we derive \textit{geometrically optimal edge attributes} to be used within the flexible message passing framework. We formalize the notion of weight sharing in convolutional networks as the sharing of message functions over point-pairs that should be treated equally. We define equivalence classes of point-pairs that are identical up to a transformation in the group and derive attributes that uniquely identify these classes. Weight sharing is then obtained by conditioning message functions on these attributes. As an application of the theory, we develop an efficient equivariant group convolutional network for processing 3D point clouds. The theory of homogeneous spaces tells us how to do group convolutions with feature maps over the homogeneous space of positions $\mathbb{R}^3$, position and orientations $\mathbb{R}^3 {\times} S^2$, and the group SE$(3)$ itself. Among these, $\mathbb{R}^3 {\times} S^2$ is an optimal choice due to the ability to represent directional information, which $\mathbb{R}^3$ methods cannot, and it significantly enhances computational efficiency compared to indexing features on the full SE$(3)$ group. We empirically support this claim by reaching state-of-the-art results -- in accuracy and speed -- on three different benchmarks: interatomic potential energy prediction, trajectory forecasting in N-body systems, and generating molecules via equivariant diffusion models.
    Accurate battery lifetime prediction across diverse aging conditions with deep learning. (arXiv:2310.05052v3 [eess.SP] UPDATED)
    Accurately predicting the lifetime of battery cells in early cycles holds tremendous value for battery research and development as well as numerous downstream applications. This task is rather challenging because diverse conditions, such as electrode materials, operating conditions, and working environments, collectively determine complex capacity-degradation behaviors. However, current prediction methods are developed and validated under limited aging conditions, resulting in questionable adaptability to varied aging conditions and an inability to fully benefit from historical data collected under different conditions. Here we introduce a universal deep learning approach that is capable of accommodating various aging conditions and facilitating effective learning under low-resource conditions by leveraging data from rich conditions. Our key finding is that incorporating inter-cell feature differences, rather than solely considering single-cell characteristics, significantly increases the accuracy of battery lifetime prediction and its cross-condition robustness. Accordingly, we develop a holistic learning framework accommodating both single-cell and inter-cell modeling. A comprehensive benchmark is built for evaluation, encompassing 401 battery cells utilizing 5 prevalent electrode materials across 168 cycling conditions. We demonstrate remarkable capabilities in learning across diverse aging conditions, exclusively achieving 10% prediction error using the first 100 cycles, and in facilitating low-resource learning, almost halving the error of single-cell modeling in many cases. More broadly, by breaking the learning boundaries among different aging conditions, our approach could significantly accelerate the development and optimization of lithium-ion batteries.
    Course Correcting Koopman Representations. (arXiv:2310.15386v2 [cs.LG] UPDATED)
    Koopman representations aim to learn features of nonlinear dynamical systems (NLDS) which lead to linear dynamics in the latent space. Theoretically, such features can be used to simplify many problems in modeling and control of NLDS. In this work we study autoencoder formulations of this problem, and different ways they can be used to model dynamics, specifically for future state prediction over long horizons. We discover several limitations of predicting future states in the latent space and propose an inference-time mechanism, which we refer to as Periodic Reencoding, for faithfully capturing long term dynamics. We justify this method both analytically and empirically via experiments in low and high dimensional NLDS.
    On the Hyperparameter Landscapes of Machine Learning Algorithms. (arXiv:2311.14014v1 [cs.LG])
    Despite the recent success in a plethora of hyperparameter optimization (HPO) methods for machine learning (ML) models, the intricate interplay between model hyperparameters (HPs) and predictive losses (a.k.a fitness), which is a key prerequisite for understanding HPO, remain notably underexplored in our community. This results in limited explainability in the HPO process, rendering a lack of human trust and difficulties in pinpointing algorithm bottlenecks. In this paper, we aim to shed light on this black box by conducting large-scale fitness landscape analysis (FLA) on 1,500 HP loss landscapes of 6 ML models with more than 11 model configurations, across 67 datasets and different levels of fidelities. We reveal the first unified, comprehensive portrait of their topographies in terms of smoothness, neutrality and modality. We also show that such properties are highly transferable across datasets and fidelities, providing fundamental evidence for the success of multi-fidelity and transfer learning methods. These findings are made possible by developing a dedicated FLA framework that incorporates a combination of visual and quantitative measures. We further demonstrate the potential of this framework by analyzing the NAS-Bench-101 landscape, and we believe it is able to faciliate fundamental understanding of a broader range of AutoML tasks.
    Training Deep 3D Convolutional Neural Networks to Extract BSM Physics Parameters Directly from HEP Data: a Proof-of-Concept Study Using Monte Carlo Simulations. (arXiv:2311.13060v1 [hep-ex] CROSS LISTED)
    We report on a novel application of computer vision techniques to extract beyond the Standard Model (BSM) parameters directly from high energy physics (HEP) flavor data. We develop a method of transforming angular and kinematic distributions into "quasi-images" that can be used to train a convolutional neural network to perform regression tasks, similar to fitting. This contrasts with the usual classification functions performed using ML/AI in HEP. As a proof-of-concept, we train a 34-layer Residual Neural Network to regress on these images and determine the Wilson Coefficient $C_{9}$ in MC (Monte Carlo) simulations of $B \rightarrow K^{*}\mu^{+}\mu^{-}$ decays. The technique described here can be generalized and may find applicability across various HEP experiments and elsewhere.
    Upgrading VAE Training With Unlimited Data Plans Provided by Diffusion Models. (arXiv:2310.19653v2 [stat.ML] UPDATED)
    Variational autoencoders (VAEs) are popular models for representation learning but their encoders are susceptible to overfitting (Cremer et al., 2018) because they are trained on a finite training set instead of the true (continuous) data distribution $p_{\mathrm{data}}(\mathbf{x})$. Diffusion models, on the other hand, avoid this issue by keeping the encoder fixed. This makes their representations less interpretable, but it simplifies training, enabling accurate and continuous approximations of $p_{\mathrm{data}}(\mathbf{x})$. In this paper, we show that overfitting encoders in VAEs can be effectively mitigated by training on samples from a pre-trained diffusion model. These results are somewhat unexpected as recent findings (Alemohammad et al., 2023; Shumailov et al., 2023) observe a decay in generative performance when models are trained on data generated by another generative model. We analyze generalization performance, amortization gap, and robustness of VAEs trained with our proposed method on three different data sets. We find improvements in all metrics compared to both normal training and conventional data augmentation methods, and we show that a modest amount of samples from the diffusion model suffices to obtain these gains.
    Lion Secretly Solves Constrained Optimization: As Lyapunov Predicts. (arXiv:2310.05898v3 [cs.LG] UPDATED)
    Lion (Evolved Sign Momentum), a new optimizer discovered through program search, has shown promising results in training large AI models. It performs comparably or favorably to AdamW but with greater memory efficiency. As we can expect from the results of a random search program, Lion incorporates elements from several existing algorithms, including signed momentum, decoupled weight decay, Polak, and Nesterov momentum, but does not fit into any existing category of theoretically grounded optimizers. Thus, even though Lion appears to perform well as a general-purpose optimizer for a wide range of tasks, its theoretical basis remains uncertain. This lack of theoretical clarity limits opportunities to further enhance and expand Lion's efficacy. This work aims to demystify Lion. Based on both continuous-time and discrete-time analysis, we demonstrate that Lion is a theoretically novel and principled approach for minimizing a general loss function $f(x)$ while enforcing a bound constraint $\|x\|_\infty \leq 1/\lambda$. Lion achieves this through the incorporation of decoupled weight decay, where $\lambda$ represents the weight decay coefficient. Our analysis is made possible by the development of a new Lyapunov function for the Lion updates. It applies to a broader family of Lion-$\kappa$ algorithms, where the $\text{sign}(\cdot)$ operator in Lion is replaced by the subgradient of a convex function $\kappa$, leading to the solution of a general composite optimization problem of $\min_x f(x) + \kappa^*(x)$. Our findings provide valuable insights into the dynamics of Lion and pave the way for further improvements and extensions of Lion-related algorithms.
    Towards Reliable Uncertainty Quantification via Deep Ensembles in Multi-output Regression Task. (arXiv:2303.16210v4 [cs.LG] UPDATED)
    This study aims to comprehensively investigate the deep ensemble approach, an approximate Bayesian inference, in the multi-output regression task for predicting the aerodynamic performance of a missile configuration. To this end, the effect of the number of neural networks used in the ensemble, which has been blindly adopted in previous studies, is scrutinized. As a result, an obvious trend towards underestimation of uncertainty as it increases is observed for the first time, and in this context, we propose the deep ensemble framework that applies the post-hoc calibration method to improve its uncertainty quantification performance. It is compared with Gaussian process regression and is shown to have superior performance in terms of regression accuracy ($\uparrow55\sim56\%$), reliability of estimated uncertainty ($\uparrow38\sim77\%$), and training efficiency ($\uparrow78\%$). Finally, the potential impact of the suggested framework on the Bayesian optimization is briefly examined, indicating that deep ensemble without calibration may lead to unintended exploratory behavior. This UQ framework can be seamlessly applied and extended to any regression task, as no special assumptions have been made for the specific problem used in this study.
    Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation. (arXiv:2311.03348v2 [cs.CL] UPDATED)
    Despite efforts to align large language models to produce harmless responses, they are still vulnerable to jailbreak prompts that elicit unrestricted behaviour. In this work, we investigate persona modulation as a black-box jailbreaking method to steer a target model to take on personalities that are willing to comply with harmful instructions. Rather than manually crafting prompts for each persona, we automate the generation of jailbreaks using a language model assistant. We demonstrate a range of harmful completions made possible by persona modulation, including detailed instructions for synthesising methamphetamine, building a bomb, and laundering money. These automated attacks achieve a harmful completion rate of 42.5% in GPT-4, which is 185 times larger than before modulation (0.23%). These prompts also transfer to Claude 2 and Vicuna with harmful completion rates of 61.0% and 35.9%, respectively. Our work reveals yet another vulnerability in commercial large language models and highlights the need for more comprehensive safeguards.
    Weighted Joint Maximum Mean Discrepancy Enabled Multi-Source-Multi-Target Unsupervised Domain Adaptation Fault Diagnosis. (arXiv:2310.14790v2 [cs.LG] UPDATED)
    Despite the remarkable results that can be achieved by data-driven intelligent fault diagnosis techniques, they presuppose the same distribution of training and test data as well as sufficient labeled data. Various operating states often exist in practical scenarios, leading to the problem of domain shift that hinders the effectiveness of fault diagnosis. While recent unsupervised domain adaptation methods enable cross-domain fault diagnosis, they struggle to effectively utilize information from multiple source domains and achieve effective diagnosis faults in multiple target domains simultaneously. In this paper, we innovatively proposed a weighted joint maximum mean discrepancy enabled multi-source-multi-target unsupervised domain adaptation (WJMMD-MDA), which realizes domain adaptation under multi-source-multi-target scenarios in the field of fault diagnosis for the first time. The proposed method extracts sufficient information from multiple labeled source domains and achieves domain alignment between source and target domains through an improved weighted distance loss. As a result, domain-invariant and discriminative features between multiple source and target domains are learned with cross-domain fault diagnosis realized. The performance of the proposed method is evaluated in comprehensive comparative experiments on three datasets, and the experimental results demonstrate the superiority of this method.
    A path-norm toolkit for modern networks: consequences, promises and challenges. (arXiv:2310.01225v3 [stat.ML] UPDATED)
    This work introduces the first toolkit around path-norms that is fully able to encompass general DAG ReLU networks with biases, skip connections and any operation based on the extraction of order statistics: max pooling, GroupSort etc. This toolkit notably allows us to establish generalization bounds for modern neural networks that are not only the most widely applicable path-norm based ones, but also recover or beat the sharpest known bounds of this type. These extended path-norms further enjoy the usual benefits of path-norms: ease of computation, invariance under the symmetries of the network, and improved sharpness on feedforward networks compared to the product of operators' norms, another complexity measure most commonly used. The versatility of the toolkit and its ease of implementation allow us to challenge the concrete promises of path-norm-based generalization bounds, by numerically evaluating the sharpest known bounds for ResNets on ImageNet.
    HRS-Bench: Holistic, Reliable and Scalable Benchmark for Text-to-Image Models. (arXiv:2304.05390v2 [cs.CV] UPDATED)
    In recent years, Text-to-Image (T2I) models have been extensively studied, especially with the emergence of diffusion models that achieve state-of-the-art results on T2I synthesis tasks. However, existing benchmarks heavily rely on subjective human evaluation, limiting their ability to holistically assess the model's capabilities. Furthermore, there is a significant gap between efforts in developing new T2I architectures and those in evaluation. To address this, we introduce HRS-Bench, a concrete evaluation benchmark for T2I models that is Holistic, Reliable, and Scalable. Unlike existing bench-marks that focus on limited aspects, HRS-Bench measures 13 skills that can be categorized into five major categories: accuracy, robustness, generalization, fairness, and bias. In addition, HRS-Bench covers 50 scenarios, including fashion, animals, transportation, food, and clothes. We evaluate nine recent large-scale T2I models using metrics that cover a wide range of skills. A human evaluation aligned with 95% of our evaluations on average was conducted to probe the effectiveness of HRS-Bench. Our experiments demonstrate that existing models often struggle to generate images with the desired count of objects, visual text, or grounded emotions. We hope that our benchmark help ease future text-to-image generation research. The code and data are available at https://eslambakr.github.io/hrsbench.github.io
    Data-driven Traffic Simulation: A Comprehensive Review. (arXiv:2310.15975v2 [cs.LG] UPDATED)
    Autonomous vehicles (AVs) have the potential to significantly revolutionize society by providing a secure and efficient mode of transportation. Recent years have witnessed notable advancements in autonomous driving perception and prediction, but the challenge of validating the performance of AVs remains largely unresolved. Data-driven microscopic traffic simulation has become an important tool for autonomous driving testing due to 1) availability of high-fidelity traffic data; 2) its advantages of enabling large-scale testing and scenario reproducibility; and 3) its potential in reactive and realistic traffic simulation. However, a comprehensive review of this topic is currently lacking. This paper aims to fill this gap by summarizing relevant studies. The primary objective of this paper is to review current research efforts and provide a futuristic perspective that will benefit future developments in the field. It introduces the general issues of data-driven traffic simulation and outlines key concepts and terms. After overviewing traffic simulation, various datasets and evaluation metrics commonly used are reviewed. The paper then offers a comprehensive evaluation of imitation learning, reinforcement learning, deep generative and deep learning methods, summarizing each and analyzing their advantages and disadvantages in detail. Moreover, it evaluates the state-of-the-art, existing challenges, and future research directions.
    Inverse Approximation Theory for Nonlinear Recurrent Neural Networks. (arXiv:2305.19190v3 [cs.LG] UPDATED)
    We prove an inverse approximation theorem for the approximation of nonlinear sequence-to-sequence relationships using recurrent neural networks (RNNs). This is a so-called Bernstein-type result in approximation theory, which deduces properties of a target function under the assumption that it can be effectively approximated by a hypothesis space. In particular, we show that nonlinear sequence relationships that can be stably approximated by nonlinear RNNs must have an exponential decaying memory structure - a notion that can be made precise. This extends the previously identified curse of memory in linear RNNs into the general nonlinear setting, and quantifies the essential limitations of the RNN architecture for learning sequential relationships with long-term memory. Based on the analysis, we propose a principled reparameterization method to overcome the limitations. Our theoretical results are confirmed by numerical experiments. The code has been released in https://github.com/radarFudan/Curse-of-memory
    On Neural Quantum Support Vector Machines. (arXiv:2308.08467v2 [quant-ph] UPDATED)
    In \cite{simon2023algorithms} we introduced four algorithms for the training of neural support vector machines (NSVMs) and demonstrated their feasibility. In this note we introduce neural quantum support vector machines, that is, NSVMs with a quantum kernel, and extend our results to this setting.
    Equivariant flow matching. (arXiv:2306.15030v2 [stat.ML] UPDATED)
    Normalizing flows are a class of deep generative models that are especially interesting for modeling probability distributions in physics, where the exact likelihood of flows allows reweighting to known target energy functions and computing unbiased observables. For instance, Boltzmann generators tackle the long-standing sampling problem in statistical physics by training flows to produce equilibrium samples of many-body systems such as small molecules and proteins. To build effective models for such systems, it is crucial to incorporate the symmetries of the target energy into the model, which can be achieved by equivariant continuous normalizing flows (CNFs). However, CNFs can be computationally expensive to train and generate samples from, which has hampered their scalability and practical application. In this paper, we introduce equivariant flow matching, a new training objective for equivariant CNFs that is based on the recently proposed optimal transport flow matching. Equivariant flow matching exploits the physical symmetries of the target energy for efficient, simulation-free training of equivariant CNFs. We demonstrate the effectiveness of flow matching on rotation and permutation invariant many-particle systems and a small molecule, alanine dipeptide, where for the first time we obtain a Boltzmann generator with significant sampling efficiency without relying on tailored internal coordinate featurization. Our results show that the equivariant flow matching objective yields flows with shorter integration paths, improved sampling efficiency, and higher scalability compared to existing methods.
    LeCo: Lightweight Compression via Learning Serial Correlations. (arXiv:2306.15374v3 [cs.DB] UPDATED)
    Lightweight data compression is a key technique that allows column stores to exhibit superior performance for analytical queries. Despite a comprehensive study on dictionary-based encodings to approach Shannon's entropy, few prior works have systematically exploited the serial correlation in a column for compression. In this paper, we propose LeCo (i.e., Learned Compression), a framework that uses machine learning to remove the serial redundancy in a value sequence automatically to achieve an outstanding compression ratio and decompression performance simultaneously. LeCo presents a general approach to this end, making existing (ad-hoc) algorithms such as Frame-of-Reference (FOR), Delta Encoding, and Run-Length Encoding (RLE) special cases under our framework. Our microbenchmark with three synthetic and six real-world data sets shows that a prototype of LeCo achieves a Pareto improvement on both compression ratio and random access speed over the existing solutions. When integrating LeCo into widely-used applications, we observe up to 5.2x speed up in a data analytical query in the Arrow columnar execution engine and a 16% increase in RocksDB's throughput.
    Interpretable and intervenable ultrasonography-based machine learning models for pediatric appendicitis. (arXiv:2302.14460v3 [cs.LG] UPDATED)
    Appendicitis is among the most frequent reasons for pediatric abdominal surgeries. Previous decision support systems for appendicitis have focused on clinical, laboratory, scoring, and computed tomography data and have ignored abdominal ultrasound, despite its noninvasive nature and widespread availability. In this work, we present interpretable machine learning models for predicting the diagnosis, management and severity of suspected appendicitis using ultrasound images. Our approach utilizes concept bottleneck models (CBM) that facilitate interpretation and interaction with high-level concepts understandable to clinicians. Furthermore, we extend CBMs to prediction problems with multiple views and incomplete concept sets. Our models were trained on a dataset comprising 579 pediatric patients with 1709 ultrasound images accompanied by clinical and laboratory data. Results show that our proposed method enables clinicians to utilize a human-understandable and intervenable predictive model without compromising performance or requiring time-consuming image annotation when deployed. For predicting the diagnosis, the extended multiview CBM attained an AUROC of 0.80 and an AUPR of 0.92, performing comparably to similar black-box neural networks trained and tested on the same dataset.
    Structured State Space Models for In-Context Reinforcement Learning. (arXiv:2303.03982v3 [cs.LG] UPDATED)
    Structured state space sequence (S4) models have recently achieved state-of-the-art performance on long-range sequence modeling tasks. These models also have fast inference speeds and parallelisable training, making them potentially useful in many reinforcement learning settings. We propose a modification to a variant of S4 that enables us to initialise and reset the hidden state in parallel, allowing us to tackle reinforcement learning tasks. We show that our modified architecture runs asymptotically faster than Transformers in sequence length and performs better than RNN's on a simple memory-based task. We evaluate our modified architecture on a set of partially-observable environments and find that, in practice, our model outperforms RNN's while also running over five times faster. Then, by leveraging the model's ability to handle long-range sequences, we achieve strong performance on a challenging meta-learning task in which the agent is given a randomly-sampled continuous control environment, combined with a randomly-sampled linear projection of the environment's observations and actions. Furthermore, we show the resulting model can adapt to out-of-distribution held-out tasks. Overall, the results presented in this paper show that structured state space models are fast and performant for in-context reinforcement learning tasks. We provide code at https://github.com/luchris429/popjaxrl.
    Query-Policy Misalignment in Preference-Based Reinforcement Learning. (arXiv:2305.17400v2 [cs.LG] UPDATED)
    Preference-based reinforcement learning (PbRL) provides a natural way to align RL agents' behavior with human desired outcomes, but is often restrained by costly human feedback. To improve feedback efficiency, most existing PbRL methods focus on selecting queries to maximally improve the overall quality of the reward model, but counter-intuitively, we find that this may not necessarily lead to improved performance. To unravel this mystery, we identify a long-neglected issue in the query selection schemes of existing PbRL studies: Query-Policy Misalignment. We show that the seemingly informative queries selected to improve the overall quality of reward model actually may not align with RL agents' interests, thus offering little help on policy learning and eventually resulting in poor feedback efficiency. We show that this issue can be effectively addressed via near on-policy query and a specially designed hybrid experience replay, which together enforce the bidirectional query-policy alignment. Simple yet elegant, our method can be easily incorporated into existing approaches by changing only a few lines of code. We showcase in comprehensive experiments that our method achieves substantial gains in both human feedback and RL sample efficiency, demonstrating the importance of addressing query-policy misalignment in PbRL tasks.
    Broadening the perspective for sustainable AI: Comprehensive sustainability criteria and indicators for AI systems. (arXiv:2306.13686v2 [cs.CY] UPDATED)
    The increased use of AI systems is associated with multi-faceted societal, environmental, and economic consequences. These include non-transparent decision-making processes, discrimination, increasing inequalities, rising energy consumption and greenhouse gas emissions in AI model development and application, and an increasing concentration of economic power. By considering the multi-dimensionality of sustainability, this paper takes steps towards substantiating the call for an overarching perspective on "sustainable AI". It presents the SCAIS Framework (Sustainability Criteria and Indicators for Artificial Intelligence Systems) which contains a set 19 sustainability criteria for sustainable AI and 67 indicators that is based on the results of a critical review and expert workshops. This interdisciplinary approach contributes a unique holistic perspective to facilitate and structure the discourse on sustainable AI. Further, it provides a concrete framework that lays the foundation for developing standards and tools to support the conscious development and application of AI systems.
    Monte Carlo Neural PDE Solver for Learning PDEs via Probabilistic Representation. (arXiv:2302.05104v2 [cs.LG] UPDATED)
    In scenarios with limited available or high-quality data, training the function-to-function neural PDE solver in an unsupervised manner is essential. However, the efficiency and accuracy of existing methods are constrained by the properties of numerical algorithms, such as finite difference and pseudo-spectral methods, integrated during the training stage. These methods necessitate careful spatiotemporal discretization to achieve reasonable accuracy, leading to significant computational challenges and inaccurate simulations, particularly in cases with substantial spatiotemporal variations. To address these limitations, we propose the Monte Carlo Neural PDE Solver (MCNP Solver) for training unsupervised neural solvers via the PDEs' probabilistic representation, which regards macroscopic phenomena as ensembles of random particles. Compared to other unsupervised methods, MCNP Solver naturally inherits the advantages of the Monte Carlo method, which is robust against spatiotemporal variations and can tolerate coarse step size. In simulating the random walk of particles, we employ Heun's method for the convection process and calculate the expectation via the probability density function of neighbouring grid points during the diffusion process. These techniques enhance accuracy and circumvent the computational memory and time issues associated with Monte Carlo sampling, offering an improvement over traditional Monte Carlo methods. Our numerical experiments on convection-diffusion, Allen-Cahn, and Navier-Stokes equations demonstrate significant improvements in accuracy and efficiency compared to other unsupervised baselines. The source code will be publicly available at: https://github.com/optray/MCNP.
    Model scale versus domain knowledge in statistical forecasting of chaotic systems. (arXiv:2303.08011v3 [cs.LG] UPDATED)
    Chaos and unpredictability are traditionally synonymous, yet large-scale machine learning methods recently have demonstrated a surprising ability to forecast chaotic systems well beyond typical predictability horizons. However, recent works disagree on whether specialized methods grounded in dynamical systems theory, such as reservoir computers or neural ordinary differential equations, outperform general-purpose large-scale learning methods such as transformers or recurrent neural networks. These prior studies perform comparisons on few individually-chosen chaotic systems, thereby precluding robust quantification of how statistical modeling choices and dynamical invariants of different chaotic systems jointly determine empirical predictability. Here, we perform the largest to-date comparative study of forecasting methods on the classical problem of forecasting chaos: we benchmark 24 state-of-the-art forecasting methods on a crowdsourced database of 135 low-dimensional systems with 17 forecast metrics. We find that large-scale, domain-agnostic forecasting methods consistently produce predictions that remain accurate up to two dozen Lyapunov times, thereby accessing a new long-horizon forecasting regime well beyond classical methods. We find that, in this regime, accuracy decorrelates with classical invariant measures of predictability like the Lyapunov exponent. However, in data-limited settings outside the long-horizon regime, we find that physics-based hybrid methods retain a comparative advantage due to their strong inductive biases.
    Expanding the deep-learning model to diagnosis LVNC: Limitations and trade-offs. (arXiv:2311.13912v1 [eess.IV])
    Hyper-trabeculation or non-compaction in the left ventricle of the myocardium (LVNC) is a recently classified form of cardiomyopathy. Several methods have been proposed to quantify the trabeculae accurately in the left ventricle, but there is no general agreement in the medical community to use a particular approach. In previous work, we proposed DL-LVTQ, a deep learning approach for left ventricular trabecular quantification based on a U-Net CNN architecture. DL-LVTQ was an automatic diagnosis tool developed from a dataset of patients with the same cardiomyopathy (hypertrophic cardiomyopathy). In this work, we have extended and adapted DL-LVTQ to cope with patients with different cardiomyopathies. The dataset consists of up 379 patients in three groups with different particularities and cardiomyopathies. Patient images were taken from different scanners and hospitals. We have modified and adapted the U-Net convolutional neural network to account for the different particularities of a heterogeneous group of patients with various unclassifiable or mixed and inherited cardiomyopathies. The inclusion of new groups of patients has increased the accuracy, specificity and kappa values while maintaining the sensitivity of the automatic deep learning method proposed. Therefore, a better-prepared diagnosis tool is ready for various cardiomyopathies with different characteristics. Cardiologists have considered that 98.9% of the evaluated outputs are verified clinically for diagnosis. Therefore, the high precision to segment the different cardiac structures allows us to make a robust diagnostic system objective and faster, decreasing human error and time spent.
    DAS-N2N: Machine learning Distributed Acoustic Sensing (DAS) signal denoising without clean data. (arXiv:2304.08120v2 [physics.geo-ph] UPDATED)
    This article presents a weakly supervised machine learning method, which we call DAS-N2N, for suppressing strong random noise in distributed acoustic sensing (DAS) recordings. DAS-N2N requires no manually produced labels (i.e., pre-determined examples of clean event signals or sections of noise) for training and aims to map random noise processes to a chosen summary statistic, such as the distribution mean, median or mode, whilst retaining the true underlying signal. This is achieved by splicing (joining together) two fibres hosted within a single optical cable, recording two noisy copies of the same underlying signal corrupted by different independent realizations of random observational noise. A deep learning model can then be trained using only these two noisy copies of the data to produce a near fully-denoised copy. Once the model is trained, only noisy data from a single fibre is required. Using a dataset from a DAS array deployed on the surface of the Rutford Ice Stream in Antarctica, we demonstrate that DAS-N2N greatly suppresses incoherent noise and enhances the signal-to-noise ratios (SNR) of natural microseismic icequake events. We further show that this approach is inherently more efficient and effective than standard stop/pass band and white noise (e.g., Wiener) filtering routines, as well as a comparable self-supervised learning method based on masking individual DAS channels. Our preferred model for this task is lightweight, processing 30 seconds of data recorded at a sampling frequency of 1000 Hz over 985 channels (approx. 1 km of fiber) in $<$1 s. Due to the high noise levels in DAS recordings, efficient data-driven denoising methods, such as DAS-N2N, will prove essential to time-critical DAS earthquake detection, particularly in the case of microseismic monitoring.
    A Deep Learning Method for Comparing Bayesian Hierarchical Models. (arXiv:2301.11873v4 [stat.ML] UPDATED)
    Bayesian model comparison (BMC) offers a principled approach for assessing the relative merits of competing computational models and propagating uncertainty into model selection decisions. However, BMC is often intractable for the popular class of hierarchical models due to their high-dimensional nested parameter structure. To address this intractability, we propose a deep learning method for performing BMC on any set of hierarchical models which can be instantiated as probabilistic programs. Since our method enables amortized inference, it allows efficient re-estimation of posterior model probabilities and fast performance validation prior to any real-data application. In a series of extensive validation studies, we benchmark the performance of our method against the state-of-the-art bridge sampling method and demonstrate excellent amortized inference across all BMC settings. We then showcase our method by comparing four hierarchical evidence accumulation models that have previously been deemed intractable for BMC due to partly implicit likelihoods. Additionally, we demonstrate how transfer learning can be leveraged to enhance training efficiency. We provide reproducible code for all analyses and an open-source implementation of our method.
    MedISure: Towards Assuring Machine Learning-based Medical Image Classifiers using Mixup Boundary Analysis. (arXiv:2311.13978v1 [cs.LG])
    Machine learning (ML) models are becoming integral in healthcare technologies, presenting a critical need for formal assurance to validate their safety, fairness, robustness, and trustworthiness. These models are inherently prone to errors, potentially posing serious risks to patient health and could even cause irreparable harm. Traditional software assurance techniques rely on fixed code and do not directly apply to ML models since these algorithms are adaptable and learn from curated datasets through a training process. However, adapting established principles, such as boundary testing using synthetic test data can effectively bridge this gap. To this end, we present a novel technique called Mix-Up Boundary Analysis (MUBA) that facilitates evaluating image classifiers in terms of prediction fairness. We evaluated MUBA for two important medical imaging tasks -- brain tumour classification and breast cancer classification -- and achieved promising results. This research aims to showcase the importance of adapting traditional assurance principles for assessing ML models to enhance the safety and reliability of healthcare technologies. To facilitate future research, we plan to publicly release our code for MUBA.
    Unsupervised Learning for Topological Classification of Transportation Networks. (arXiv:2311.13887v1 [cs.LG])
    With increasing urbanization, transportation plays an increasingly critical role in city development. The number of studies on modeling, optimization, simulation, and data analysis of transportation systems is on the rise. Many of these studies utilize transportation test networks to represent real-world transportation systems in urban areas, examining the efficacy of their proposed approaches. Each of these networks exhibits unique characteristics in their topology, making their applications distinct for various study objectives. Despite their widespread use in research, there is a lack of comprehensive study addressing the classification of these networks based on their topological characteristics. This study aims to fill this gap by employing unsupervised learning methods, particularly clustering. We present a comprehensive framework for evaluating various topological network characteristics. Additionally, we employ two dimensionality reduction techniques, namely Principal Component Analysis (PCA) and Isometric Feature Mapping (ISOMAP), to reduce overlaps of highly correlated features and enhance the interpretability of the subsequent classification results. We then utilize two clustering algorithms, K-means and HDBSCAN, to classify 14 transportation networks. The PCA method, followed by the K-means clustering approach, outperforms other alternatives with a Silhouette score of $0.510$, enabling the classification of transportation networks into five clusters. We also provide a detailed discussion on the resulting classification.
    Learning Dynamic Selection and Pricing of Out-of-Home Deliveries. (arXiv:2311.13983v1 [cs.LG])
    Home delivery failures, traffic congestion, and relatively large handling times have a negative impact on the profitability of last-mile logistics. These external factors contribute to up to $28\%$ of the overall costs and $25\%$ of emissions for the home delivery supply chain. A potential solution, showing annual growth rates up to $36\%$, is the delivery to parcel lockers or parcel shops, denoted by out-of-home (OOH) delivery. In the academic literature, models of customer behavior with respect to OOH delivery were so far limited to deterministic settings, contrasting with the stochastic nature of actual customer choices. We model the sequential decision-making problem of which OOH location to offer against what incentive for each incoming customer, taking into account future customer arrivals and choices. We propose Dynamic Selection and Pricing of OOH (DSPO), an algorithmic pipeline that uses a novel spatial-temporal state encoding as input to a convolutional neural network. We demonstrate the performance of our method by benchmarking it against three state-of-the-art approaches. Our extensive numerical study, guided by real-world data, reveals that DSPO can save $20.8\%$ in costs compared to a situation without OOH locations, $8.1\%$ compared to a static selection and pricing policy, and $4.6\%$ compared to a state-of-the-art demand management benchmark. We provide comprehensive insights into the complex interplay between OOH delivery dynamics and customer behavior influenced by pricing strategies. The implications of our findings suggest practitioners to adopt dynamic selection and pricing policies as OOH delivery gains a larger market share.
    Physics-Constrained Neural Network for Design and Feature-Based Optimization of Weave Architectures. (arXiv:2209.09154v2 [physics.app-ph] UPDATED)
    Woven fabrics play an essential role in everyday textiles for clothing/sportswear, water filtration, and retaining walls, to reinforcements in stiff composites for lightweight structures like aerospace, sporting, automotive, and marine industries. Several possible combinations of weave patterns and material choices, which comprise weave architecture, present a challenging question about how they could influence the physical and mechanical properties of woven fabrics and reinforced structures. In this paper, we present a novel Physics-Constrained Neural Network (PCNN) to predict the mechanical properties like the modulus of weave architectures and the inverse problem of predicting pattern/material sequence for a design/target modulus value. The inverse problem is particularly challenging as it usually requires many iterations to find the appropriate architecture using traditional optimization approaches. We show that the proposed PCNN can effectively predict weave architecture for the desired modulus with higher accuracy than several baseline models considered. We present a feature-based optimization strategy to improve the predictions using features in the Grey Level Co-occurrence Matrix (GLCM) space. We combine PCNN with this feature-based optimization to discover near-optimal weave architectures to facilitate the initial design of weave architecture. The proposed frameworks will primarily enable the woven composite analysis and optimization process, and be a starting point to introduce Knowledge-guided Neural Networks into the complex structural analysis.
    FAVANO: Federated AVeraging with Asynchronous NOdes. (arXiv:2305.16099v2 [cs.LG] UPDATED)
    In this paper, we propose a novel centralized Asynchronous Federated Learning (FL) framework, FAVANO, for training Deep Neural Networks (DNNs) in resource-constrained environments. Despite its popularity, ``classical'' federated learning faces the increasingly difficult task of scaling synchronous communication over large wireless networks. Moreover, clients typically have different computing resources and therefore computing speed, which can lead to a significant bias (in favor of ``fast'' clients) when the updates are asynchronous. Therefore, practical deployment of FL requires to handle users with strongly varying computing speed in communication/resource constrained setting. We provide convergence guarantees for FAVANO in a smooth, non-convex environment and carefully compare the obtained convergence guarantees with existing bounds, when they are available. Experimental results show that the FAVANO algorithm outperforms current methods on standard benchmarks.
    Benchmarking Multimodal Variational Autoencoders: CdSprites+ Dataset and Toolkit. (arXiv:2209.03048v2 [cs.LG] UPDATED)
    Multimodal Variational Autoencoders (VAEs) have been the subject of intense research in the past years as they can integrate multiple modalities into a joint representation and can thus serve as a promising tool for both data classification and generation. Several approaches toward multimodal VAE learning have been proposed so far, their comparison and evaluation have however been rather inconsistent. One reason is that the models differ at the implementation level, another problem is that the datasets commonly used in these cases were not initially designed to evaluate multimodal generative models. This paper addresses both mentioned issues. First, we propose a toolkit for systematic multimodal VAE training and comparison. The toolkit currently comprises 4 existing multimodal VAEs and 6 commonly used benchmark datasets along with instructions on how to easily add a new model or a dataset. Second, we present a disentangled bimodal dataset designed to comprehensively evaluate the joint generation and cross-generation capabilities across multiple difficulty levels. We demonstrate the utility of our dataset by comparing the implemented state-of-the-art models.
    How to Use K-means for Big Data Clustering?. (arXiv:2204.07485v3 [cs.LG] UPDATED)
    K-means plays a vital role in data mining and is the simplest and most widely used algorithm under the Euclidean Minimum Sum-of-Squares Clustering (MSSC) model. However, its performance drastically drops when applied to vast amounts of data. Therefore, it is crucial to improve K-means by scaling it to big data using as few of the following computational resources as possible: data, time, and algorithmic ingredients. We propose a new parallel scheme of using K-means and K-means++ algorithms for big data clustering that satisfies the properties of a ``true big data'' algorithm and outperforms the classical and recent state-of-the-art MSSC approaches in terms of solution quality and runtime. The new approach naturally implements global search by decomposing the MSSC problem without using additional metaheuristics. This work shows that data decomposition is the basic approach to solve the big data clustering problem. The empirical success of the new algorithm allowed us to challenge the common belief that more data is required to obtain a good clustering solution. Moreover, the present work questions the established trend that more sophisticated hybrid approaches and algorithms are required to obtain a better clustering solution.
    Supervised Feature Compression based on Counterfactual Analysis. (arXiv:2211.09894v4 [cs.LG] UPDATED)
    Counterfactual Explanations are becoming a de-facto standard in post-hoc interpretable machine learning. For a given classifier and an instance classified in an undesired class, its counterfactual explanation corresponds to small perturbations of that instance that allows changing the classification outcome. This work aims to leverage Counterfactual Explanations to detect the important decision boundaries of a pre-trained black-box model. This information is used to build a supervised discretization of the features in the dataset with a tunable granularity. Using the discretized dataset, an optimal Decision Tree can be trained that resembles the black-box model, but that is interpretable and compact. Numerical results on real-world datasets show the effectiveness of the approach in terms of accuracy and sparsity.
    XAutoML: A Visual Analytics Tool for Understanding and Validating Automated Machine Learning. (arXiv:2202.11954v3 [cs.LG] UPDATED)
    In the last ten years, various automated machine learning (AutoM ) systems have been proposed to build end-to-end machine learning (ML) pipelines with minimal human interaction. Even though such automatically synthesized ML pipelines are able to achieve a competitive performance, recent studies have shown that users do not trust models constructed by AutoML due to missing transparency of AutoML systems and missing explanations for the constructed ML pipelines. In a requirements analysis study with 36 domain experts, data scientists, and AutoML researchers from different professions with vastly different expertise in ML, we collect detailed informational needs for AutoML. We propose XAutoML, an interactive visual analytics tool for explaining arbitrary AutoML optimization procedures and ML pipelines constructed by AutoML. XAutoML combines interactive visualizations with established techniques from explainable artificial intelligence (XAI) to make the complete AutoML procedure transparent and explainable. By integrating XAutoML with JupyterLab, experienced users can extend the visual analytics with ad-hoc visualizations based on information extracted from XAutoML. We validate our approach in a user study with the same diverse user group from the requirements analysis. All participants were able to extract useful information from XAutoML, leading to a significantly increased understanding of ML pipelines produced by AutoML and the AutoML optimization itself.
    AlphaFold Distillation for Protein Design. (arXiv:2210.03488v2 [q-bio.BM] UPDATED)
    Inverse protein folding, the process of designing sequences that fold into a specific 3D structure, is crucial in bio-engineering and drug discovery. Traditional methods rely on experimentally resolved structures, but these cover only a small fraction of protein sequences. Forward folding models like AlphaFold offer a potential solution by accurately predicting structures from sequences. However, these models are too slow for integration into the optimization loop of inverse folding models during training. To address this, we propose using knowledge distillation on folding model confidence metrics, such as pTM or pLDDT scores, to create a faster and end-to-end differentiable distilled model. This model can then be used as a structure consistency regularizer in training the inverse folding model. Our technique is versatile and can be applied to other design tasks, such as sequence-based protein infilling. Experimental results show that our method outperforms non-regularized baselines, yielding up to 3% improvement in sequence recovery and up to 45% improvement in protein diversity while maintaining structural consistency in generated sequences. Code is available at https://github.com/IBM/AFDistill
    Transport with Support: Data-Conditional Diffusion Bridges. (arXiv:2301.13636v2 [cs.LG] UPDATED)
    The dynamic Schr\"odinger bridge problem provides an appealing setting for solving constrained time-series data generation tasks posed as optimal transport problems. It consists of learning non-linear diffusion processes using efficient iterative solvers. Recent works have demonstrated state-of-the-art results (eg. in modelling single-cell embryo RNA sequences or sampling from complex posteriors) but are limited to learning bridges with only initial and terminal constraints. Our work extends this paradigm by proposing the Iterative Smoothing Bridge (ISB). We integrate Bayesian filtering and optimal control into learning the diffusion process, enabling the generation of constrained stochastic processes governed by sparse observations at intermediate stages and terminal constraints. We assess the effectiveness of our method on synthetic and real-world data generation tasks and we show that the ISB generalises well to high-dimensional data, is computationally efficient, and provides accurate estimates of the marginals at intermediate and terminal times.
    Shared Certificates for Neural Network Verification. (arXiv:2109.00542v4 [cs.LG] UPDATED)
    Existing neural network verifiers compute a proof that each input is handled correctly under a given perturbation by propagating a symbolic abstraction of reachable values at each layer. This process is repeated from scratch independently for each input (e.g., image) and perturbation (e.g., rotation), leading to an expensive overall proof effort when handling an entire dataset. In this work, we introduce a new method for reducing this verification cost without losing precision based on a key insight that abstractions obtained at intermediate layers for different inputs and perturbations can overlap or contain each other. Leveraging our insight, we introduce the general concept of shared certificates, enabling proof effort reuse across multiple inputs to reduce overall verification costs. We perform an extensive experimental evaluation to demonstrate the effectiveness of shared certificates in reducing the verification cost on a range of datasets and attack specifications on image classifiers including the popular patch and geometric perturbations. We release our implementation at https://github.com/eth-sri/proof-sharing.
    Evaluating Object (mis)Detection from a Safety and Reliability Perspective: Discussion and Measures. (arXiv:2203.02205v3 [cs.LG] UPDATED)
    We argue that object detectors in the safety critical domain should prioritize detection of objects that are most likely to interfere with the actions of the autonomous actor. Especially, this applies to objects that can impact the actor's safety and reliability. To quantify the impact of object (mis)detection on safety and reliability in the context of autonomous driving, we propose new object detection measures that reward the correct identification of objects that are most dangerous and most likely to affect driving decisions. To achieve this, we build an object criticality model to reward the detection of the objects based on proximity, orientation, and relative velocity with respect to the subject vehicle. Then, we apply our model on the recent autonomous driving dataset nuScenes, and we compare nine object detectors. Results show that, in several settings, object detectors that perform best according to the nuScenes ranking are not the preferable ones when the focus is shifted on safety and reliability.
    MECCH: Metapath Context Convolution-based Heterogeneous Graph Neural Networks. (arXiv:2211.12792v2 [cs.LG] UPDATED)
    Heterogeneous graph neural networks (HGNNs) were proposed for representation learning on structural data with multiple types of nodes and edges. To deal with the performance degradation issue when HGNNs become deep, researchers combine metapaths into HGNNs to associate nodes closely related in semantics but far apart in the graph. However, existing metapath-based models suffer from either information loss or high computation costs. To address these problems, we present a novel Metapath Context Convolution-based Heterogeneous Graph Neural Network (MECCH). MECCH leverages metapath contexts, a new kind of graph structure that facilitates lossless node information aggregation while avoiding any redundancy. Specifically, MECCH applies three novel components after feature preprocessing to extract comprehensive information from the input graph efficiently: (1) metapath context construction, (2) metapath context encoder, and (3) convolutional metapath fusion. Experiments on five real-world heterogeneous graph datasets for node classification and link prediction show that MECCH achieves superior prediction accuracy compared with state-of-the-art baselines with improved computational efficiency.
    Bridging Classical and Quantum Machine Learning: Knowledge Transfer From Classical to Quantum Neural Networks Using Knowledge Distillation. (arXiv:2311.13810v1 [quant-ph])
    Very recently, studies have shown that quantum neural networks surpass classical neural networks in tasks like image classification when a similar number of learnable parameters are used. However, the development and optimization of quantum models are currently hindered by issues such as qubit instability and limited qubit availability, leading to error-prone systems with weak performance. In contrast, classical models can exhibit high-performance owing to substantial resource availability. As a result, more studies have been focusing on hybrid classical-quantum integration. A line of research particularly focuses on transfer learning through classical-quantum integration or quantum-quantum approaches. Unlike previous studies, this paper introduces a new method to transfer knowledge from classical to quantum neural networks using knowledge distillation, effectively bridging the gap between classical machine learning and emergent quantum computing techniques. We adapt classical convolutional neural network (CNN) architectures like LeNet and AlexNet to serve as teacher networks, facilitating the training of student quantum models by sending supervisory signals during backpropagation through KL-divergence. The approach yields significant performance improvements for the quantum models by solely depending on classical CNNs, with quantum models achieving an average accuracy improvement of 0.80% on the MNIST dataset and 5.40% on the more complex Fashion MNIST dataset. Applying this technique eliminates the cumbersome training of huge quantum models for transfer learning in resource-constrained settings and enables re-using existing pre-trained classical models to improve performance.Thus, this study paves the way for future research in quantum machine learning (QML) by positioning knowledge distillation as a core technique for advancing QML applications.
    Sample-Efficient Training for Diffusion. (arXiv:2311.13745v1 [cs.LG])
    Score-based diffusion models have become the most popular approach to deep generative modeling of images, largely due to their empirical performance and reliability. Recently, a number of theoretical works \citep{chen2022, Chen2022ImprovedAO, Chenetal23flowode, benton2023linear} have shown that diffusion models can efficiently sample, assuming $L^2$-accurate score estimates. The score-matching objective naturally approximates the true score in $L^2$, but the sample complexity of existing bounds depends \emph{polynomially} on the data radius and desired Wasserstein accuracy. By contrast, the time complexity of sampling is only logarithmic in these parameters. We show that estimating the score in $L^2$ \emph{requires} this polynomial dependence, but that a number of samples that scales polylogarithmically in the Wasserstein accuracy actually do suffice for sampling. We show that with a polylogarithmic number of samples, the ERM of the score-matching objective is $L^2$ accurate on all but a probability $\delta$ fraction of the true distribution, and that this weaker guarantee is sufficient for efficient sampling.
    Towards Auditing Large Language Models: Improving Text-based Stereotype Detection. (arXiv:2311.14126v1 [cs.CL])
    Large Language Models (LLM) have made significant advances in the recent past becoming more mainstream in Artificial Intelligence (AI) enabled human-facing applications. However, LLMs often generate stereotypical output inherited from historical data, amplifying societal biases and raising ethical concerns. This work introduces i) the Multi-Grain Stereotype Dataset, which includes 52,751 instances of gender, race, profession and religion stereotypic text and ii) a novel stereotype classifier for English text. We design several experiments to rigorously test the proposed model trained on the novel dataset. Our experiments show that training the model in a multi-class setting can outperform the one-vs-all binary counterpart. Consistent feature importance signals from different eXplainable AI tools demonstrate that the new model exploits relevant text features. We utilise the newly created model to assess the stereotypic behaviour of the popular GPT family of models and observe the reduction of bias over time. In summary, our work establishes a robust and practical framework for auditing and evaluating the stereotypic bias in LLM.
    Evaluating GPT-4's Vision Capabilities on Brazilian University Admission Exams. (arXiv:2311.14169v1 [cs.CL])
    Recent advancements in language models have showcased human-comparable performance in academic entrance exams. However, existing studies often overlook questions that require the integration of visual comprehension, thus compromising the full spectrum and complexity inherent in real-world scenarios. To address this gap, we present a comprehensive framework to evaluate language models on entrance exams, which incorporates both textual and visual elements. We evaluate the two most recent editions of Exame Nacional do Ensino M\'edio (ENEM), the main standardized entrance examination adopted by Brazilian universities. Our study not only reaffirms the capabilities of GPT-4 as the state of the art for handling complex multidisciplinary questions, but also pioneers in offering a realistic assessment of multimodal language models on Portuguese examinations. One of the highlights is that text captions transcribing visual content outperform the direct use of images, suggesting that the vision model has room for improvement. Yet, despite improvements afforded by images or captions, mathematical questions remain a challenge for these state-of-the-art models. The code and data used on experiments are available at https://github.com/piresramon/gpt-4-enem.
    Exactly conservative physics-informed neural networks and deep operator networks for dynamical systems. (arXiv:2311.14131v1 [cs.LG])
    We introduce a method for training exactly conservative physics-informed neural networks and physics-informed deep operator networks for dynamical systems. The method employs a projection-based technique that maps a candidate solution learned by the neural network solver for any given dynamical system possessing at least one first integral onto an invariant manifold. We illustrate that exactly conservative physics-informed neural network solvers and physics-informed deep operator networks for dynamical systems vastly outperform their non-conservative counterparts for several real-world problems from the mathematical sciences.
    On Preference Learning Based on Sequential Bayesian Optimization with Pairwise Comparison. (arXiv:2103.13192v3 [cs.LG] UPDATED)
    User preference learning is generally a hard problem. Individual preferences are typically unknown even to users themselves, while the space of choices is infinite. Here we study user preference learning from information-theoretic perspective. We model preference learning as a system with two interacting sub-systems, one representing a user with his/her preferences and another one representing an agent that has to learn these preferences. The user with his/her behaviour is modeled by a parametric preference function. To efficiently learn the preferences and reduce search space quickly, we propose the agent that interacts with the user to collect the most informative data for learning. The agent presents two proposals to the user for evaluation, and the user rates them based on his/her preference function. We show that the optimum agent strategy for data collection and preference learning is a result of maximin optimization of the normalized weighted Kullback-Leibler (KL) divergence between true and agent-assigned predictive user response distributions. The resulting value of KL-divergence, which we also call remaining system uncertainty (RSU), provides an efficient performance metric in the absence of the ground truth. This metric characterises how well the agent can predict user and, thus, the quality of the underlying learned user (preference) model. Our proposed agent comprises sequential mechanisms for user model inference and proposal generation. To infer the user model (preference function), Bayesian approximate inference is used in the agent. The data collection strategy is to generate proposals, responses to which help resolving uncertainty associated with prediction of the user responses the most. The efficiency of our approach is validated by numerical simulations. Also a real-life example of preference learning application is provided.
    Variational Annealing on Graphs for Combinatorial Optimization. (arXiv:2311.14156v1 [cs.LG])
    Several recent unsupervised learning methods use probabilistic approaches to solve combinatorial optimization (CO) problems based on the assumption of statistically independent solution variables. We demonstrate that this assumption imposes performance limitations in particular on difficult problem instances. Our results corroborate that an autoregressive approach which captures statistical dependencies among solution variables yields superior performance on many popular CO problems. We introduce subgraph tokenization in which the configuration of a set of solution variables is represented by a single token. This tokenization technique alleviates the drawback of the long sequential sampling procedure which is inherent to autoregressive methods without sacrificing expressivity. Importantly, we theoretically motivate an annealed entropy regularization and show empirically that it is essential for efficient and stable learning.
    TCuPGAN: A novel framework developed for optimizing human-machine interactions in citizen science. (arXiv:2311.14177v1 [cs.HC])
    In the era of big data in scientific research, there is a necessity to leverage techniques which reduce human effort in labeling and categorizing large datasets by involving sophisticated machine tools. To combat this problem, we present a novel, general purpose model for 3D segmentation that leverages patch-wise adversariality and Long Short-Term Memory to encode sequential information. Using this model alongside citizen science projects which use 3D datasets (image cubes) on the Zooniverse platforms, we propose an iterative human-machine optimization framework where only a fraction of the 2D slices from these cubes are seen by the volunteers. We leverage the patch-wise discriminator in our model to provide an estimate of which slices within these image cubes have poorly generalized feature representations, and correspondingly poor machine performance. These images with corresponding machine proposals would be presented to volunteers on Zooniverse for correction, leading to a drastic reduction in the volunteer effort on citizen science projects. We trained our model on ~2300 liver tissue 3D electron micrographs. Lipid droplets were segmented within these images through human annotation via the `Etch A Cell - Fat Checker' citizen science project, hosted on the Zooniverse platform. In this work, we demonstrate this framework and the selection methodology which resulted in a measured reduction in volunteer effort by more than 60%. We envision this type of joint human-machine partnership will be of great use on future Zooniverse projects.
    Efficient and Robust Jet Tagging at the LHC with Knowledge Distillation. (arXiv:2311.14160v1 [hep-ex])
    The challenging environment of real-time data processing systems at the Large Hadron Collider (LHC) strictly limits the computational complexity of algorithms that can be deployed. For deep learning models, this implies that only models with low computational complexity that have weak inductive bias are feasible. To address this issue, we utilize knowledge distillation to leverage both the performance of large models and the reduced computational complexity of small ones. In this paper, we present an implementation of knowledge distillation, demonstrating an overall boost in the student models' performance for the task of classifying jets at the LHC. Furthermore, by using a teacher model with a strong inductive bias of Lorentz symmetry, we show that we can induce the same inductive bias in the student model which leads to better robustness against arbitrary Lorentz boost.
    Tube-NeRF: Efficient Imitation Learning of Visuomotor Policies from MPC using Tube-Guided Data Augmentation and NeRFs. (arXiv:2311.14153v1 [cs.RO])
    Imitation learning (IL) can train computationally-efficient sensorimotor policies from a resource-intensive Model Predictive Controller (MPC), but it often requires many samples, leading to long training times or limited robustness. To address these issues, we combine IL with a variant of robust MPC that accounts for process and sensing uncertainties, and we design a data augmentation (DA) strategy that enables efficient learning of vision-based policies. The proposed DA method, named Tube-NeRF, leverages Neural Radiance Fields (NeRFs) to generate novel synthetic images, and uses properties of the robust MPC (the tube) to select relevant views and to efficiently compute the corresponding actions. We tailor our approach to the task of localization and trajectory tracking on a multirotor, by learning a visuomotor policy that generates control actions using images from the onboard camera as only source of horizontal position. Our evaluations numerically demonstrate learning of a robust visuomotor policy with an 80-fold increase in demonstration efficiency and a 50% reduction in training time over current IL methods. Additionally, our policies successfully transfer to a real multirotor, achieving accurate localization and low tracking errors despite large disturbances, with an onboard inference time of only 1.5 ms.
    Sharing pattern submodels for prediction with missing values. (arXiv:2206.11161v3 [cs.LG] UPDATED)
    Missing values are unavoidable in many applications of machine learning and present challenges both during training and at test time. When variables are missing in recurring patterns, fitting separate pattern submodels have been proposed as a solution. However, fitting models independently does not make efficient use of all available data. Conversely, fitting a single shared model to the full data set relies on imputation which often leads to biased results when missingness depends on unobserved factors. We propose an alternative approach, called sharing pattern submodels, which i) makes predictions that are robust to missing values at test time, ii) maintains or improves the predictive power of pattern submodels, and iii) has a short description, enabling improved interpretability. Parameter sharing is enforced through sparsity-inducing regularization which we prove leads to consistent estimation. Finally, we give conditions for when a sharing model is optimal, even when both missingness and the target outcome depend on unobserved variables. Classification and regression experiments on synthetic and real-world data sets demonstrate that our models achieve a favorable tradeoff between pattern specialization and information sharing.
    Privacy-Preserving Algorithmic Recourse. (arXiv:2311.14137v1 [cs.LG])
    When individuals are subject to adverse outcomes from machine learning models, providing a recourse path to help achieve a positive outcome is desirable. Recent work has shown that counterfactual explanations - which can be used as a means of single-step recourse - are vulnerable to privacy issues, putting an individuals' privacy at risk. Providing a sequential multi-step path for recourse can amplify this risk. Furthermore, simply adding noise to recourse paths found from existing methods can impact the realism and actionability of the path for an end-user. In this work, we address privacy issues when generating realistic recourse paths based on instance-based counterfactual explanations, and provide PrivRecourse: an end-to-end privacy preserving pipeline that can provide realistic recourse paths. PrivRecourse uses differentially private (DP) clustering to represent non-overlapping subsets of the private dataset. These DP cluster centers are then used to generate recourse paths by forming a graph with cluster centers as the nodes, so that we can generate realistic - feasible and actionable - recourse paths. We empirically evaluate our approach on finance datasets and compare it to simply adding noise to data instances, and to using DP synthetic data, to generate the graph. We observe that PrivRecourse can provide paths that are private and realistic.
    Automated 3D Tumor Segmentation using Temporal Cubic PatchGAN (TCuP-GAN). (arXiv:2311.14148v1 [eess.IV])
    Development of robust general purpose 3D segmentation frameworks using the latest deep learning techniques is one of the active topics in various bio-medical domains. In this work, we introduce Temporal Cubic PatchGAN (TCuP-GAN), a volume-to-volume translational model that marries the concepts of a generative feature learning framework with Convolutional Long Short-Term Memory Networks (LSTMs), for the task of 3D segmentation. We demonstrate the capabilities of our TCuP-GAN on the data from four segmentation challenges (Adult Glioma, Meningioma, Pediatric Tumors, and Sub-Saharan Africa subset) featured within the 2023 Brain Tumor Segmentation (BraTS) Challenge and quantify its performance using LesionWise Dice similarity and $95\%$ Hausdorff Distance metrics. We demonstrate the successful learning of our framework to predict robust multi-class segmentation masks across all the challenges. This benchmarking work serves as a stepping stone for future efforts towards applying TCuP-GAN on other multi-class tasks such as multi-organelle segmentation in electron microscopy imaging.
    Byzantine Robustness and Partial Participation Can Be Achieved Simultaneously: Just Clip Gradient Differences. (arXiv:2311.14127v1 [cs.LG])
    Distributed learning has emerged as a leading paradigm for training large machine learning models. However, in real-world scenarios, participants may be unreliable or malicious, posing a significant challenge to the integrity and accuracy of the trained models. Byzantine fault tolerance mechanisms have been proposed to address these issues, but they often assume full participation from all clients, which is not always practical due to the unavailability of some clients or communication constraints. In our work, we propose the first distributed method with client sampling and provable tolerance to Byzantine workers. The key idea behind the developed method is the use of gradient clipping to control stochastic gradient differences in recursive variance reduction. This allows us to bound the potential harm caused by Byzantine workers, even during iterations when all sampled clients are Byzantine. Furthermore, we incorporate communication compression into the method to enhance communication efficiency. Under quite general assumptions, we prove convergence rates for the proposed method that match the existing state-of-the-art (SOTA) theoretical results.
    Brain MRI Screening Tool with Federated Learning. (arXiv:2311.14086v1 [eess.IV])
    In clinical practice, we often see significant delays between MRI scans and the diagnosis made by radiologists, even for severe cases. In some cases, this may be caused by the lack of additional information and clues, so even the severe cases need to wait in the queue for diagnosis. This can be avoided if there is an automatic software tool, which would supplement additional information, alerting radiologists that the particular patient may be a severe case. We are presenting an automatic brain MRI Screening Tool and we are demonstrating its capabilities for detecting tumor-like pathologies. It is the first version on the path toward a robust multi-pathology screening solution. The tool supports Federated Learning, so multiple institutions may contribute to the model without disclosing their private data.
    Non-stationary Transformers: Exploring the Stationarity in Time Series Forecasting. (arXiv:2205.14415v4 [cs.LG] UPDATED)
    Transformers have shown great power in time series forecasting due to their global-range modeling ability. However, their performance can degenerate terribly on non-stationary real-world data in which the joint distribution changes over time. Previous studies primarily adopt stationarization to attenuate the non-stationarity of original series for better predictability. But the stationarized series deprived of inherent non-stationarity can be less instructive for real-world bursty events forecasting. This problem, termed over-stationarization in this paper, leads Transformers to generate indistinguishable temporal attentions for different series and impedes the predictive capability of deep models. To tackle the dilemma between series predictability and model capability, we propose Non-stationary Transformers as a generic framework with two interdependent modules: Series Stationarization and De-stationary Attention. Concretely, Series Stationarization unifies the statistics of each input and converts the output with restored statistics for better predictability. To address the over-stationarization problem, De-stationary Attention is devised to recover the intrinsic non-stationary information into temporal dependencies by approximating distinguishable attentions learned from raw series. Our Non-stationary Transformers framework consistently boosts mainstream Transformers by a large margin, which reduces MSE by 49.43% on Transformer, 47.34% on Informer, and 46.89% on Reformer, making them the state-of-the-art in time series forecasting. Code is available at this repository: https://github.com/thuml/Nonstationary_Transformers.
    When is Off-Policy Evaluation Useful? A Data-Centric Perspective. (arXiv:2311.14110v1 [cs.LG])
    Evaluating the value of a hypothetical target policy with only a logged dataset is important but challenging. On the one hand, it brings opportunities for safe policy improvement under high-stakes scenarios like clinical guidelines. On the other hand, such opportunities raise a need for precise off-policy evaluation (OPE). While previous work on OPE focused on improving the algorithm in value estimation, in this work, we emphasize the importance of the offline dataset, hence putting forward a data-centric framework for evaluating OPE problems. We propose DataCOPE, a data-centric framework for evaluating OPE, that answers the questions of whether and to what extent we can evaluate a target policy given a dataset. DataCOPE (1) forecasts the overall performance of OPE algorithms without access to the environment, which is especially useful before real-world deployment where evaluating OPE is impossible; (2) identifies the sub-group in the dataset where OPE can be inaccurate; (3) permits evaluations of datasets or data-collection strategies for OPE problems. Our empirical analysis of DataCOPE in the logged contextual bandit settings using healthcare datasets confirms its ability to evaluate both machine-learning and human expert policies like clinical guidelines.
    Subnetwork Ensembles. (arXiv:2311.14101v1 [cs.LG])
    Neural network ensembles have been effectively used to improve generalization by combining the predictions of multiple independently trained models. However, the growing scale and complexity of deep neural networks have led to these methods becoming prohibitively expensive and time consuming to implement. Low-cost ensemble methods have become increasingly important as they can alleviate the need to train multiple models from scratch while retaining the generalization benefits that traditional ensemble learning methods afford. This dissertation introduces and formalizes a low-cost framework for constructing Subnetwork Ensembles, where a collection of child networks are formed by sampling, perturbing, and optimizing subnetworks from a trained parent model. We explore several distinct methodologies for generating child networks and we evaluate their efficacy through a variety of ablation studies and established benchmarks. Our findings reveal that this approach can greatly improve training efficiency, parametric utilization, and generalization performance while minimizing computational cost. Subnetwork Ensembles offer a compelling framework for exploring how we can build better systems by leveraging the unrealized potential of deep neural networks.
    Empirical Comparison between Cross-Validation and Mutation-Validation in Model Selection. (arXiv:2311.14079v1 [cs.LG])
    Mutation validation (MV) is a recently proposed approach for model selection, garnering significant interest due to its unique characteristics and potential benefits compared to the widely used cross-validation (CV) method. In this study, we empirically compared MV and $k$-fold CV using benchmark and real-world datasets. By employing Bayesian tests, we compared generalization estimates yielding three posterior probabilities: practical equivalence, CV superiority, and MV superiority. We also evaluated the differences in the capacity of the selected models and computational efficiency. We found that both MV and CV select models with practically equivalent generalization performance across various machine learning algorithms and the majority of benchmark datasets. MV exhibited advantages in terms of selecting simpler models and lower computational costs. However, in some cases MV selected overly simplistic models leading to underfitting and showed instability in hyperparameter selection. These limitations of MV became more evident in the evaluation of a real-world neuroscientific task of predicting sex at birth using brain functional connectivity.
    Class Uncertainty: A Measure to Mitigate Class Imbalance. (arXiv:2311.14090v1 [cs.LG])
    Class-wise characteristics of training examples affect the performance of deep classifiers. A well-studied example is when the number of training examples of classes follows a long-tailed distribution, a situation that is likely to yield sub-optimal performance for under-represented classes. This class imbalance problem is conventionally addressed by approaches relying on the class-wise cardinality of training examples, such as data resampling. In this paper, we demonstrate that considering solely the cardinality of classes does not cover all issues causing class imbalance. To measure class imbalance, we propose "Class Uncertainty" as the average predictive uncertainty of the training examples, and we show that this novel measure captures the differences across classes better than cardinality. We also curate SVCI-20 as a novel dataset in which the classes have equal number of training examples but they differ in terms of their hardness; thereby causing a type of class imbalance which cannot be addressed by the approaches relying on cardinality. We incorporate our "Class Uncertainty" measure into a diverse set of ten class imbalance mitigation methods to demonstrate its effectiveness on long-tailed datasets as well as on our SVCI-20. Code and datasets will be made available.
    Do VSR Models Generalize Beyond LRS3?. (arXiv:2311.14063v1 [cs.CV])
    The Lip Reading Sentences-3 (LRS3) benchmark has primarily been the focus of intense research in visual speech recognition (VSR) during the last few years. As a result, there is an increased risk of overfitting to its excessively used test set, which is only one hour duration. To alleviate this issue, we build a new VSR test set named WildVSR, by closely following the LRS3 dataset creation processes. We then evaluate and analyse the extent to which the current VSR models generalize to the new test data. We evaluate a broad range of publicly available VSR models and find significant drops in performance on our test set, compared to their corresponding LRS3 results. Our results suggest that the increase in word error rates is caused by the models inability to generalize to slightly harder and in the wild lip sequences than those found in the LRS3 test set. Our new test benchmark is made public in order to enable future research towards more robust VSR models.
    Robust Decision Aggregation with Second-order Information. (arXiv:2311.14094v1 [cs.GT])
    We consider a decision aggregation problem with two experts who each make a binary recommendation after observing a private signal about an unknown binary world state. An agent, who does not know the joint information structure between signals and states, sees the experts' recommendations and aims to match the action with the true state. Under the scenario, we study whether supplemented additionally with second-order information (each expert's forecast on the other's recommendation) could enable a better aggregation. We adopt a minimax regret framework to evaluate the aggregator's performance, by comparing it to an omniscient benchmark that knows the joint information structure. With general information structures, we show that second-order information provides no benefit. No aggregator can improve over a trivial aggregator, which always follows the first expert's recommendation. However, positive results emerge when we assume experts' signals are conditionally independent given the world state. When the aggregator is deterministic, we present a robust aggregator that leverages second-order information, which can significantly outperform counterparts without it. Second, when two experts are homogeneous, by adding a non-degenerate assumption on the signals, we demonstrate that random aggregators using second-order information can surpass optimal ones without it. In the remaining settings, the second-order information is not beneficial. We also extend the above results to the setting when the aggregator's utility function is more general.
    AdapterFL: Adaptive Heterogeneous Federated Learning for Resource-constrained Mobile Computing Systems. (arXiv:2311.14037v1 [cs.LG])
    Federated Learning (FL) enables collaborative learning of large-scale distributed clients without data sharing. However, due to the disparity of computing resources among massive mobile computing devices, the performance of traditional homogeneous model-based Federated Learning (FL) is seriously limited. On the one hand, to achieve model training in all the diverse clients, mobile computing systems can only use small low-performance models for collaborative learning. On the other hand, devices with high computing resources cannot train a high-performance large model with their insufficient raw data. To address the resource-constrained problem in mobile computing systems, we present a novel heterogeneous FL approach named AdapterFL, which uses a model reassemble strategy to facilitate collaborative training of massive heterogeneous mobile devices adaptively. Specifically, we select multiple candidate heterogeneous models based on the computing performance of massive mobile devices and then divide each heterogeneous model into two partitions. By reassembling the partitions, we can generate models with varied sizes that are combined by the partial parameters of the large model with the partial parameters of the small model. Using these reassembled models for FL training, we can train the partial parameters of the large model using low-performance devices. In this way, we can alleviate performance degradation in large models due to resource constraints. The experimental results show that AdapterFL can achieve up to 12\% accuracy improvement compared to the state-of-the-art heterogeneous federated learning methods in resource-constrained scenarios.
    Multivariate Scenario Generation of Day-Ahead Electricity Prices using Normalizing Flows. (arXiv:2311.14033v1 [cs.LG])
    Trading on electricity markets requires accurate information about the realization of electricity prices and the uncertainty attached to the predictions. We present a probabilistic forecasting approach for day-ahead electricity prices using the fully data-driven deep generative model called normalizing flows. Our modeling approach generates full-day scenarios of day-ahead electricity prices based on conditional features such as residual load forecasts. Furthermore, we propose extended feature sets of prior realizations and a periodic retraining scheme that allows the normalizing flow to adapt to the changing conditions of modern electricity markets. In particular, we investigate the impact of the energy crisis ensuing from the Russian invasion of Ukraine. Our results highlight that the normalizing flow generates high-quality scenarios that reproduce the true price distribution and yield highly accurate forecasts. Additionally, our analysis highlights how our improvements towards adaptations in changing regimes allow the normalizing flow to adapt to changing market conditions and enables continued sampling of high-quality day-ahead price scenarios.
    Exploring the impact of social stress on the adaptive dynamics of COVID-19: Typing the behavior of na\"ive populations faced with epidemics. (arXiv:2311.13917v1 [physics.soc-ph])
    In the context of natural disasters, human responses inevitably intertwine with natural factors. The COVID-19 pandemic, as a significant stress factor, has brought to light profound variations among different countries in terms of their adaptive dynamics in addressing the spread of infection outbreaks across different regions. This emphasizes the crucial role of cultural characteristics in natural disaster analysis. The theoretical understanding of large-scale epidemics primarily relies on mean-field kinetic models. However, conventional SIR-like models failed to fully explain the observed phenomena at the onset of the COVID-19 outbreak. These phenomena included the unexpected cessation of exponential growth, the reaching of plateaus, and the occurrence of multi-wave dynamics. In situations where an outbreak of a highly virulent and unfamiliar infection arises, it becomes crucial to respond swiftly at a non-medical level to mitigate the negative socio-economic impact. Here we present a theoretical examination of the first wave of the epidemic based on a simple SIRSS model (SIR with Social Stress). We conduct an analysis of the socio-cultural features of na\"ive population behaviors across various countries worldwide. The unique characteristics of each country/territory are encapsulated in only a few constants within our model, derived from the fitted COVID-19 statistics. These constants also reflect the societal response dynamics to the external stress factor, underscoring the importance of studying the mutual behavior of humanity and natural factors during global social disasters. Based on these distinctive characteristics of specific regions, local authorities can optimize their strategies to effectively combat epidemics until vaccines are developed.
    Docking Multirotors in Close Proximity using Learnt Downwash Models. (arXiv:2311.13988v1 [cs.RO])
    Unmodeled aerodynamic disturbances pose a key challenge for multirotor flight when multiple vehicles are in close proximity to each other. However, certain missions \textit{require} two multirotors to approach each other within 1-2 body-lengths of each other and hold formation -- we consider one such practical instance: vertically docking two multirotors in the air. In this leader-follower setting, the follower experiences significant downwash interference from the leader in its final docking stages. To compensate for this, we employ a learnt downwash model online within an optimal feedback controller to accurately track a docking maneuver and then hold formation. Through real-world flights with different maneuvers, we demonstrate that this compensation is crucial for reducing the large vertical separation otherwise required by conventional/naive approaches. Our evaluations show a tracking error of less than 0.06m for the follower (a 3-4x reduction) when approaching vertically within two body-lengths of the leader. Finally, we deploy the complete system to effect a successful physical docking between two airborne multirotors in a single smooth planned trajectory.
    Continual Learning of Diffusion Models with Generative Distillation. (arXiv:2311.14028v1 [cs.LG])
    Diffusion models are powerful generative models that achieve state-of-the-art performance in tasks such as image synthesis. However, training them demands substantial amounts of data and computational resources. Continual learning would allow for incrementally learning new tasks and accumulating knowledge, thus reusing already trained models would be possible. One potentially suitable approach is generative replay, where a copy of a generative model trained on previous tasks produces synthetic data that are interleaved with data from the current task. However, standard generative replay applied to diffusion models results in a catastrophic loss in denoising capabilities. In this paper, we propose generative distillation, an approach that distils the entire reverse process of a diffusion model. We demonstrate that our approach significantly improves the continual learning performance of generative replay with only a moderate increase in the computational costs.
    Jam-ALT: A Formatting-Aware Lyrics Transcription Benchmark. (arXiv:2311.13987v1 [eess.AS])
    Current automatic lyrics transcription (ALT) benchmarks focus exclusively on word content and ignore the finer nuances of written lyrics including formatting and punctuation, which leads to a potential misalignment with the creative products of musicians and songwriters as well as listeners' experiences. For example, line breaks are important in conveying information about rhythm, emotional emphasis, rhyme, and high-level structure. To address this issue, we introduce Jam-ALT, a new lyrics transcription benchmark based on the JamendoLyrics dataset. Our contribution is twofold. Firstly, a complete revision of the transcripts, geared specifically towards ALT evaluation by following a newly created annotation guide that unifies the music industry's guidelines, covering aspects such as punctuation, line breaks, spelling, background vocals, and non-word sounds. Secondly, a suite of evaluation metrics designed, unlike the traditional word error rate, to capture such phenomena. We hope that the proposed benchmark contributes to the ALT task, enabling more precise and reliable assessments of transcription systems and enhancing the user experience in lyrics applications such as subtitle renderings for live captioning or karaoke.
    A Unified Framework for Fair Spectral Clustering With Effective Graph Learning. (arXiv:2311.13766v1 [cs.LG])
    We consider the problem of spectral clustering under group fairness constraints, where samples from each sensitive group are approximately proportionally represented in each cluster. Traditional fair spectral clustering (FSC) methods consist of two consecutive stages, i.e., performing fair spectral embedding on a given graph and conducting $k$means to obtain discrete cluster labels. However, in practice, the graph is usually unknown, and we need to construct the underlying graph from potentially noisy data, the quality of which inevitably affects subsequent fair clustering performance. Furthermore, performing FSC through separate steps breaks the connections among these steps, leading to suboptimal results. To this end, we first theoretically analyze the effect of the constructed graph on FSC. Motivated by the analysis, we propose a novel graph construction method with a node-adaptive graph filter to learn graphs from noisy data. Then, all independent stages of conventional FSC are integrated into a single objective function, forming an end-to-end framework that inputs raw data and outputs discrete cluster labels. An algorithm is developed to jointly and alternately update the variables in each stage. Finally, we conduct extensive experiments on synthetic, benchmark, and real data, which show that our model is superior to state-of-the-art fair clustering methods.
    Bayes-xG: Player and Position Correction on Expected Goals (xG) using Bayesian Hierarchical Approach. (arXiv:2311.13707v1 [stat.AP])
    This study employs Bayesian methodologies to explore the influence of player or positional factors in predicting the probability of a shot resulting in a goal, measured by the expected goals (xG) metric. Utilising publicly available data from StatsBomb, Bayesian hierarchical logistic regressions are constructed, analysing approximately 10,000 shots from the English Premier League to ascertain whether positional or player-level effects impact xG. The findings reveal positional effects in a basic model that includes only distance to goal and shot angle as predictors, highlighting that strikers and attacking midfielders exhibit a higher likelihood of scoring. However, these effects diminish when more informative predictors are introduced. Nevertheless, even with additional predictors, player-level effects persist, indicating that certain players possess notable positive or negative xG adjustments, influencing their likelihood of scoring a given chance. The study extends its analysis to data from Spain's La Liga and Germany's Bundesliga, yielding comparable results. Additionally, the paper assesses the impact of prior distribution choices on outcomes, concluding that the priors employed in the models provide sound results but could be refined to enhance sampling efficiency for constructing more complex and extensive models feasibly.
    Learning Hierarchical Polynomials with Three-Layer Neural Networks. (arXiv:2311.13774v1 [cs.LG])
    We study the problem of learning hierarchical polynomials over the standard Gaussian distribution with three-layer neural networks. We specifically consider target functions of the form $h = g \circ p$ where $p : \mathbb{R}^d \rightarrow \mathbb{R}$ is a degree $k$ polynomial and $g: \mathbb{R} \rightarrow \mathbb{R}$ is a degree $q$ polynomial. This function class generalizes the single-index model, which corresponds to $k=1$, and is a natural class of functions possessing an underlying hierarchical structure. Our main result shows that for a large subclass of degree $k$ polynomials $p$, a three-layer neural network trained via layerwise gradient descent on the square loss learns the target $h$ up to vanishing test error in $\widetilde{\mathcal{O}}(d^k)$ samples and polynomial time. This is a strict improvement over kernel methods, which require $\widetilde \Theta(d^{kq})$ samples, as well as existing guarantees for two-layer networks, which require the target function to be low-rank. Our result also generalizes prior works on three-layer neural networks, which were restricted to the case of $p$ being a quadratic. When $p$ is indeed a quadratic, we achieve the information-theoretically optimal sample complexity $\widetilde{\mathcal{O}}(d^2)$, which is an improvement over prior work~\citep{nichani2023provable} requiring a sample size of $\widetilde\Theta(d^4)$. Our proof proceeds by showing that during the initial stage of training the network performs feature learning to recover the feature $p$ with $\widetilde{\mathcal{O}}(d^k)$ samples. This work demonstrates the ability of three-layer neural networks to learn complex features and as a result, learn a broad class of hierarchical functions.
    Beat-Aligned Spectrogram-to-Sequence Generation of Rhythm-Game Charts. (arXiv:2311.13687v1 [cs.LG])
    In the heart of "rhythm games" - games where players must perform actions in sync with a piece of music - are "charts", the directives to be given to players. We newly formulate chart generation as a sequence generation task and train a Transformer using a large dataset. We also introduce tempo-informed preprocessing and training procedures, some of which are suggested to be integral for a successful training. Our model is found to outperform the baselines on a large dataset, and is also found to benefit from pretraining and finetuning.
    Spatio-Temporal Graph Neural Networks for Predictive Learning in Urban Computing: A Survey. (arXiv:2303.14483v3 [cs.LG] UPDATED)
    With recent advances in sensing technologies, a myriad of spatio-temporal data has been generated and recorded in smart cities. Forecasting the evolution patterns of spatio-temporal data is an important yet demanding aspect of urban computing, which can enhance intelligent management decisions in various fields, including transportation, environment, climate, public safety, healthcare, and others. Traditional statistical and deep learning methods struggle to capture complex correlations in urban spatio-temporal data. To this end, Spatio-Temporal Graph Neural Networks (STGNN) have been proposed, achieving great promise in recent years. STGNNs enable the extraction of complex spatio-temporal dependencies by integrating graph neural networks (GNNs) and various temporal learning methods. In this manuscript, we provide a comprehensive survey on recent progress on STGNN technologies for predictive learning in urban computing. Firstly, we provide a brief introduction to the construction methods of spatio-temporal graph data and the prevalent deep-learning architectures used in STGNNs. We then sort out the primary application domains and specific predictive learning tasks based on existing literature. Afterward, we scrutinize the design of STGNNs and their combination with some advanced technologies in recent years. Finally, we conclude the limitations of existing research and suggest potential directions for future work.
    Optimal Power Flow in Highly Renewable Power System Based on Attention Neural Networks. (arXiv:2311.13949v1 [cs.LG])
    The Optimal Power Flow (OPF) problem is pivotal for power system operations, guiding generator output and power distribution to meet demand at minimized costs, while adhering to physical and engineering constraints. The integration of renewable energy sources, like wind and solar, however, poses challenges due to their inherent variability. This variability, driven largely by changing weather conditions, demands frequent recalibrations of power settings, thus necessitating recurrent OPF resolutions. This task is daunting using traditional numerical methods, particularly for extensive power systems. In this work, we present a cutting-edge, physics-informed machine learning methodology, trained using imitation learning and historical European weather datasets. Our approach directly correlates electricity demand and weather patterns with power dispatch and generation, circumventing the iterative requirements of traditional OPF solvers. This offers a more expedient solution apt for real-time applications. Rigorous evaluations on aggregated European power systems validate our method's superiority over existing data-driven techniques in OPF solving. By presenting a quick, robust, and efficient solution, this research sets a new standard in real-time OPF resolution, paving the way for more resilient power systems in the era of renewable energy.
    Sample as You Infer: Predictive Coding With Langevin Dynamics. (arXiv:2311.13664v1 [cs.LG])
    We present a novel algorithm for parameter learning in generic deep generative models that builds upon the predictive coding (PC) framework of computational neuroscience. Our approach modifies the standard PC algorithm to bring performance on-par and exceeding that obtained from standard variational auto-encoder (VAE) training. By injecting Gaussian noise into the PC inference procedure we re-envision it as an overdamped Langevin sampling, which facilitates optimisation with respect to a tight evidence lower bound (ELBO). We improve the resultant encoder-free training method by incorporating an encoder network to provide an amortised warm-start to our Langevin sampling and test three different objectives for doing so. Finally, to increase robustness to the sampling step size and reduce sensitivity to curvature, we validate a lightweight and easily computable form of preconditioning, inspired by Riemann Manifold Langevin and adaptive optimizers from the SGD literature. We compare against VAEs by training like-for-like generative models using our technique against those trained with standard reparameterisation-trick-based ELBOs. We observe our method out-performs or matches performance across a number of metrics, including sample quality, while converging in a fraction of the number of SGD training iterations.
    Prompt Risk Control: A Rigorous Framework for Responsible Deployment of Large Language Models. (arXiv:2311.13628v1 [cs.LG])
    The recent explosion in the capabilities of large language models has led to a wave of interest in how best to prompt a model to perform a given task. While it may be tempting to simply choose a prompt based on average performance on a validation set, this can lead to a deployment where unexpectedly poor responses are generated, especially for the worst-off users. To mitigate this prospect, we propose Prompt Risk Control, a lightweight framework for selecting a prompt based on rigorous upper bounds on families of informative risk measures. We offer methods for producing bounds on a diverse set of metrics, including quantities that measure worst-case responses and disparities in generation quality across the population of users. In addition, we extend the underlying statistical bounding techniques to accommodate the possibility of distribution shifts in deployment. Experiments on applications such as open-ended chat, medical question summarization, and code generation highlight how such a framework can foster responsible deployment by reducing the risk of the worst outcomes.
    Learning Uniform Clusters on Hypersphere for Deep Graph-level Clustering. (arXiv:2311.13953v1 [cs.LG])
    Graph clustering has been popularly studied in recent years. However, most existing graph clustering methods focus on node-level clustering, i.e., grouping nodes in a single graph into clusters. In contrast, graph-level clustering, i.e., grouping multiple graphs into clusters, remains largely unexplored. Graph-level clustering is critical in a variety of real-world applications, such as, properties prediction of molecules and community analysis in social networks. However, graph-level clustering is challenging due to the insufficient discriminability of graph-level representations, and the insufficient discriminability makes deep clustering be more likely to obtain degenerate solutions (cluster collapse). To address the issue, we propose a novel deep graph-level clustering method called Uniform Deep Graph Clustering (UDGC). UDGC assigns instances evenly to different clusters and then scatters those clusters on unit hypersphere, leading to a more uniform cluster-level distribution and a slighter cluster collapse. Specifically, we first propose Augmentation-Consensus Optimal Transport (ACOT) for generating uniformly distributed and reliable pseudo labels for partitioning clusters. Then we adopt contrastive learning to scatter those clusters. Besides, we propose Center Alignment Optimal Transport (CAOT) for guiding the model to learn better parameters, which further promotes the cluster performance. Our empirical study on eight well-known datasets demonstrates that UDGC significantly outperforms the state-of-the-art models.
    Score-based Diffusion Models in Function Space. (arXiv:2302.07400v2 [cs.LG] UPDATED)
    Diffusion models have recently emerged as a powerful framework for generative modeling. They consist of a forward process that perturbs input data with Gaussian white noise and a reverse process that learns a score function to generate samples by denoising. Despite their tremendous success, they are mostly formulated on finite-dimensional spaces, e.g. Euclidean, limiting their applications to many domains where the data has a functional form such as in scientific computing and 3D geometric data analysis. In this work, we introduce a mathematically rigorous framework called Denoising Diffusion Operators (DDOs) for training diffusion models in function space. In DDOs, the forward process perturbs input functions gradually using a Gaussian process. The generative process is formulated by integrating a function-valued Langevin dynamic. Our approach requires an appropriate notion of the score for the perturbed data distribution, which we obtain by generalizing denoising score matching to function spaces that can be infinite-dimensional. We show that the corresponding discretized algorithm generates accurate samples at a fixed cost that is independent of the data resolution. We theoretically and numerically verify the applicability of our approach on a set of problems, including generating solutions to the Navier-Stokes equation viewed as the push-forward distribution of forcings from a Gaussian Random Field (GRF).
    Locally Optimal Descent for Dynamic Stepsize Scheduling. (arXiv:2311.13877v1 [cs.LG])
    We introduce a novel dynamic learning-rate scheduling scheme grounded in theory with the goal of simplifying the manual and time-consuming tuning of schedules in practice. Our approach is based on estimating the locally-optimal stepsize, guaranteeing maximal descent in the direction of the stochastic gradient of the current step. We first establish theoretical convergence bounds for our method within the context of smooth non-convex stochastic optimization, matching state-of-the-art bounds while only assuming knowledge of the smoothness parameter. We then present a practical implementation of our algorithm and conduct systematic experiments across diverse datasets and optimization algorithms, comparing our scheme with existing state-of-the-art learning-rate schedulers. Our findings indicate that our method needs minimal tuning when compared to existing approaches, removing the need for auxiliary manual schedules and warm-up phases and achieving comparable performance with drastically reduced parameter tuning.
    L(M)V-IQL: Multiple Intention Inverse Reinforcement Learning for Animal Behavior Characterization. (arXiv:2311.13870v1 [cs.LG])
    In advancing the understanding of decision-making processes, mathematical models, particularly Inverse Reinforcement Learning (IRL), have proven instrumental in reconstructing animal's multiple intentions amidst complex behaviors. Given the recent development of a continuous-time multi-intention IRL framework, there has been persistent inquiry into inferring discrete time-varying reward functions with multiple intention IRL approaches. To tackle the challenge, we introduce the Latent (Markov) Variable Inverse Q-learning (L(M)V-IQL) algorithms, a novel IRL framework tailored for accommodating discrete intrinsic rewards. Leveraging an Expectation-Maximization approach, we cluster observed trajectories into distinct intentions and independently solve the IRL problem for each. Demonstrating the efficacy of L(M)V-IQL through simulated experiments and its application to different real mouse behavior datasets, our approach surpasses current benchmarks in animal behavior prediction, producing interpretable reward functions. This advancement holds promise for neuroscience and psychology, contributing to a deeper understanding of animal decision-making and uncovering underlying brain mechanisms.
    RankFeat\&RankWeight: Rank-1 Feature/Weight Removal for Out-of-distribution Detection. (arXiv:2311.13959v1 [cs.LG])
    The task of out-of-distribution (OOD) detection is crucial for deploying machine learning models in real-world settings. In this paper, we observe that the singular value distributions of the in-distribution (ID) and OOD features are quite different: the OOD feature matrix tends to have a larger dominant singular value than the ID feature, and the class predictions of OOD samples are largely determined by it. This observation motivates us to propose \texttt{RankFeat}, a simple yet effective \emph{post hoc} approach for OOD detection by removing the rank-1 matrix composed of the largest singular value and the associated singular vectors from the high-level feature. \texttt{RankFeat} achieves \emph{state-of-the-art} performance and reduces the average false positive rate (FPR95) by 17.90\% compared with the previous best method. The success of \texttt{RankFeat} motivates us to investigate whether a similar phenomenon would exist in the parameter matrices of neural networks. We thus propose \texttt{RankWeight} which removes the rank-1 weight from the parameter matrices of a single deep layer. Our \texttt{RankWeight}is also \emph{post hoc} and only requires computing the rank-1 matrix once. As a standalone approach, \texttt{RankWeight} has very competitive performance against other methods across various backbones. Moreover, \texttt{RankWeight} enjoys flexible compatibility with a wide range of OOD detection methods. The combination of \texttt{RankWeight} and \texttt{RankFeat} refreshes the new \emph{state-of-the-art} performance, achieving the FPR95 as low as 16.13\% on the ImageNet-1k benchmark. Extensive ablation studies and comprehensive theoretical analyses are presented to support the empirical results.
    Stability and L2-penalty in Model Averaging. (arXiv:2311.13827v1 [stat.ML])
    Model averaging has received much attention in the past two decades, which integrates available information by averaging over potential models. Although various model averaging methods have been developed, there are few literatures on the theoretical properties of model averaging from the perspective of stability, and the majority of these methods constrain model weights to a simplex. The aim of this paper is to introduce stability from statistical learning theory into model averaging. Thus, we define the stability, asymptotic empirical risk minimizer, generalization, and consistency of model averaging and study the relationship among them. Our results indicate that stability can ensure that model averaging has good generalization performance and consistency under reasonable conditions, where consistency means model averaging estimator can asymptotically minimize the mean squared prediction error. We also propose a L2-penalty model averaging method without limiting model weights and prove that it has stability and consistency. In order to reduce the impact of tuning parameter selection, we use 10-fold cross-validation to select a candidate set of tuning parameters and perform a weighted average of the estimators of model weights based on estimation errors. The Monte Carlo simulation and an illustrative application demonstrate the usefulness of the proposed method.
    Leveraging Optimal Transport via Projections on Subspaces for Machine Learning Applications. (arXiv:2311.13883v1 [cs.LG])
    Optimal Transport has received much attention in Machine Learning as it allows to compare probability distributions by exploiting the geometry of the underlying space. However, in its original formulation, solving this problem suffers from a significant computational burden. Thus, a meaningful line of work consists at proposing alternatives to reduce this burden while still enjoying its properties. In this thesis, we focus on alternatives which use projections on subspaces. The main such alternative is the Sliced-Wasserstein distance, which we first propose to extend to Riemannian manifolds in order to use it in Machine Learning applications for which using such spaces has been shown to be beneficial in the recent years. We also study sliced distances between positive measures in the so-called unbalanced OT problem. Back to the original Euclidean Sliced-Wasserstein distance between probability measures, we study the dynamic of gradient flows when endowing the space with this distance in place of the usual Wasserstein distance. Then, we investigate the use of the Busemann function, a generalization of the inner product in metric spaces, in the space of probability measures. Finally, we extend the subspace detour approach to incomparable spaces using the Gromov-Wasserstein distance.
    Fairness-Aware Domain Generalization under Covariate and Dependence Shifts. (arXiv:2311.13816v1 [cs.LG])
    Achieving the generalization of an invariant classifier from source domains to shifted target domains while simultaneously considering model fairness is a substantial and complex challenge in machine learning. Existing domain generalization research typically attributes domain shifts to concept shift, which relates to alterations in class labels, and covariate shift, which pertains to variations in data styles. In this paper, by introducing another form of distribution shift, known as dependence shift, which involves variations in fair dependence patterns across domains, we propose a novel domain generalization approach that addresses domain shifts by considering both covariate and dependence shifts. We assert the existence of an underlying transformation model can transform data from one domain to another. By generating data in synthetic domains through the model, a fairness-aware invariant classifier is learned that enforces both model accuracy and fairness in unseen domains. Extensive empirical studies on four benchmark datasets demonstrate that our approach surpasses state-of-the-art methods.
    Which Matters Most in Making Fund Investment Decisions? A Multi-granularity Graph Disentangled Learning Framework. (arXiv:2311.13864v1 [cs.LG])
    In this paper, we highlight that both conformity and risk preference matter in making fund investment decisions beyond personal interest and seek to jointly characterize these aspects in a disentangled manner. Consequently, we develop a novel M ulti-granularity Graph Disentangled Learning framework named MGDL to effectively perform intelligent matching of fund investment products. Benefiting from the well-established fund graph and the attention module, multi-granularity user representations are derived from historical behaviors to separately express personal interest, conformity and risk preference in a fine-grained way. To attain stronger disentangled representations with specific semantics, MGDL explicitly involve two self-supervised signals, i.e., fund type based contrasts and fund popularity. Extensive experiments in offline and online environments verify the effectiveness of MGDL.
    Object Location Prediction in Real-time using LSTM Neural Network and Polynomial Regression. (arXiv:2311.13950v1 [cs.LG])
    This paper details the design and implementation of a system for predicting and interpolating object location coordinates. Our solution is based on processing inertial measurements and global positioning system data through a Long Short-Term Memory (LSTM) neural network and polynomial regression. LSTM is a type of recurrent neural network (RNN) particularly suited for processing data sequences and avoiding the long-term dependency problem. We employed data from real-world vehicles and the global positioning system (GPS) sensors. A critical pre-processing step was developed to address varying sensor frequencies and inconsistent GPS time steps and dropouts. The LSTM-based system's performance was compared with the Kalman Filter. The system was tuned to work in real-time with low latency and high precision. We tested our system on roads under various driving conditions, including acceleration, turns, deceleration, and straight paths. We tested our proposed solution's accuracy and inference time and showed that it could perform in real-time. Our LSTM-based system yielded an average error of 0.11 meters with an inference time of 2 ms. This represents a 76\% reduction in error compared to the traditional Kalman filter method, which has an average error of 0.46 meters with a similar inference time to the LSTM-based system.
    FinMe: A Performance-Enhanced Large Language Model Trading Agent with Layered Memory and Character Design. (arXiv:2311.13743v1 [q-fin.CP])
    Recent advancements in Large Language Models (LLMs) have exhibited notable efficacy in question-answering (QA) tasks across diverse domains. Their prowess in integrating extensive web knowledge has fueled interest in developing LLM autonomous agents. While LLMs are efficient in decoding human instructions and deriving solutions by holistically processing historical inputs, transitioning to purpose-driven agents requires a supplementary rational architecture to process multi-source information, establish reasoning chains, and prioritize critical tasks. Addressing this, we introduce \textsc{FinMe}, a novel LLM-based agent framework devised for financial decision-making, encompassing three core modules: Profiling, to outline the agent's characteristics; Memory, with layered processing, to aid the agent in assimilating realistic hierarchical financial data; and Decision-making, to convert insights gained from memories into investment decisions. Notably, \textsc{FinMe}'s memory module aligns closely with the cognitive structure of human traders, offering robust interpretability and real-time tuning. Its adjustable cognitive span allows for the retention of critical information beyond human perceptual limits, thereby enhancing trading outcomes. This framework enables the agent to self-evolve its professional knowledge, react agilely to new investment cues, and continuously refine trading decisions in the volatile financial environment. We first compare \textsc{FinMe} with various algorithmic agents on a scalable real-world financial dataset, underscoring its leading trading performance in stocks and funds. We then fine-tuned the agent's perceptual spans to achieve a significant trading performance. Collectively, \textsc{FinMe} presents a cutting-edge LLM agent framework for automated trading, boosting cumulative investment returns.
    Exact Combinatorial Optimization with Temporo-Attentional Graph Neural Networks. (arXiv:2311.13843v1 [cs.LG])
    Combinatorial optimization finds an optimal solution within a discrete set of variables and constraints. The field has seen tremendous progress both in research and industry. With the success of deep learning in the past decade, a recent trend in combinatorial optimization has been to improve state-of-the-art combinatorial optimization solvers by replacing key heuristic components with machine learning (ML) models. In this paper, we investigate two essential aspects of machine learning algorithms for combinatorial optimization: temporal characteristics and attention. We argue that for the task of variable selection in the branch-and-bound (B&B) algorithm, incorporating the temporal information as well as the bipartite graph attention improves the solver's performance. We support our claims with intuitions and numerical results over several standard datasets used in the literature and competitions. Code is available at: https://developer.huaweicloud.com/develop/aigallery/notebook/detail?id=047c6cf2-8463-40d7-b92f-7b2ca998e935
    A Somewhat Robust Image Watermark against Diffusion-based Editing Models. (arXiv:2311.13713v1 [cs.CR])
    Recently, diffusion models (DMs) have become the state-of-the-art method for image synthesis. Editing models based on DMs, known for their high fidelity and precision, have inadvertently introduced new challenges related to image copyright infringement and malicious editing. Our work is the first to formalize and address this issue. After assessing and attempting to enhance traditional image watermarking techniques, we recognize their limitations in this emerging context. In response, we develop a novel technique, RIW (Robust Invisible Watermarking), to embed invisible watermarks leveraging adversarial example techniques. Our technique ensures a high extraction accuracy of $96\%$ for the invisible watermark after editing, compared to the $0\%$ offered by conventional methods. We provide access to our code at https://github.com/BennyTMT/RIW.
    Scalable CP Decomposition for Tensor Learning using GPU Tensor Cores. (arXiv:2311.13693v1 [cs.LG])
    CP decomposition is a powerful tool for data science, especially gene analysis, deep learning, and quantum computation. However, the application of tensor decomposition is largely hindered by the exponential increment of the computational complexity and storage consumption with the size of tensors. While the data in our real world is usually presented as trillion- or even exascale-scale tensors, existing work can only support billion-scale scale tensors. In our work, we propose the Exascale-Tensor to mitigate the significant gap. Specifically, we propose a compression-based tensor decomposition framework, namely the exascale-tensor, to support exascale tensor decomposition. Then, we carefully analyze the inherent parallelism and propose a bag of strategies to improve computational efficiency. Last, we conduct experiments to decompose tensors ranging from million-scale to trillion-scale for evaluation. Compared to the baselines, the exascale-tensor supports 8,000x larger tensors and a speedup up to 6.95x. We also apply our method to two real-world applications, including gene analysis and tensor layer neural networks, of which the numeric results demonstrate the scalability and effectiveness of our method.
    Enhancing Intrusion Detection In Internet Of Vehicles Through Federated Learning. (arXiv:2311.13800v1 [cs.CR])
    Federated learning is a technique of decentralized machine learning. that allows multiple parties to collaborate and learn a shared model without sharing their raw data. Our paper proposes a federated learning framework for intrusion detection in Internet of Vehicles (IOVs) using the CIC-IDS 2017 dataset. The proposed framework employs SMOTE for handling class imbalance, outlier detection for identifying and removing abnormal observations, and hyperparameter tuning to optimize the model's performance. The authors evaluated the proposed framework using various performance metrics and demonstrated its effectiveness in detecting intrusions with other datasets (KDD-Cup 99 and UNSW- NB-15) and conventional classifiers. Furthermore, the proposed framework can protect sensitive data while achieving high intrusion detection performance.
    Understanding the Vulnerability of CLIP to Image Compression. (arXiv:2311.14029v1 [cs.CV])
    CLIP is a widely used foundational vision-language model that is used for zero-shot image recognition and other image-text alignment tasks. We demonstrate that CLIP is vulnerable to change in image quality under compression. This surprising result is further analysed using an attribution method-Integrated Gradients. Using this attribution method, we are able to better understand both quantitatively and qualitatively exactly the nature in which the compression affects the zero-shot recognition accuracy of this model. We evaluate this extensively on CIFAR-10 and STL-10. Our work provides the basis to understand this vulnerability of CLIP and can help us develop more effective methods to improve the robustness of CLIP and other vision-language models.
    Extraction of n = 0 pick-up by locked mode detectors based on neural networks in J-TEXT. (arXiv:2311.13763v1 [physics.plasm-ph])
    Measurement of locked mode (LM) is important for the physical research of Magnetohydrodynamic (MHD) instabilities and plasma disruption. The n = 0 pick-up need to be extracted and subtracted to calculate the amplitude and phase of the LM. A new method to extract this pick-up has been developed by predicting the n = 0 pick-up brn=0 by the LM detectors based on Neural Networks (NNs) in J-TEXT. An approach called Power Multiple Time Scale (PMTS) has been developed with outstanding regressing effect in multiple frequency ranges. Three models have been progressed based on PMTS NNs. PMTS could fit the brn=0 on the LM detectors with little errors both in time domain and frequency domain. The n>0 pick-up brn>0 generated by resonant magnetic perturbations (RMPs) can be obtained after subtracting the extracted brn=0. This new method uses only one LM instead of 4 LM detectors to extract brn=0. Therefore, the distribution of the LM detectors can also be optimized based on this new method.
    Molecular Identification and Peak Assignment: Leveraging Multi-Level Multimodal Alignment on NMR. (arXiv:2311.13817v1 [cs.LG])
    Nuclear magnetic resonance (NMR) spectroscopy plays an essential role across various scientific disciplines, providing valuable insights into molecular dynamics and interactions. Despite the promise of AI-enhanced NMR prediction models, challenges persist in the interpretation of spectra for tasks such as molecular retrieval, isomer recognition, and peak assignment. In response, this paper introduces Multi-Level Multimodal Alignment with Knowledge-Guided Instance-Wise Discrimination (K-M3AID) to establish meaningful correspondences between two heterogeneous modalities: molecular graphs (structures) and NMR spectra. In particular, K-M3AID employs a dual-coordinated contrastive learning architecture, and incorporates a graph-level alignment module, a node-level alignment module, and a communication channel. Notably, the framework introduces knowledge-guided instance-wise discrimination into contrastive learning within the node-level alignment module, significantly enhancing accuracy in cross-modal alignment. Additionally, K-M3AID showcases its capability of meta-learning by demonstrating that skills acquired during node-level alignment positively impact graph-level alignment. Empirical validation underscores K-M3AID's effectiveness in addressing multiple zero-shot tasks, offering a promising solution to bridge the gap between structural information and spectral data in complex NMR scenarios.
    HypUC: Hyperfine Uncertainty Calibration with Gradient-boosted Corrections for Reliable Regression on Imbalanced Electrocardiograms. (arXiv:2311.13821v1 [cs.LG])
    The automated analysis of medical time series, such as the electrocardiogram (ECG), electroencephalogram (EEG), pulse oximetry, etc, has the potential to serve as a valuable tool for diagnostic decisions, allowing for remote monitoring of patients and more efficient use of expensive and time-consuming medical procedures. Deep neural networks (DNNs) have been demonstrated to process such signals effectively. However, previous research has primarily focused on classifying medical time series rather than attempting to regress the continuous-valued physiological parameters central to diagnosis. One significant challenge in this regard is the imbalanced nature of the dataset, as a low prevalence of abnormal conditions can lead to heavily skewed data that results in inaccurate predictions and a lack of certainty in such predictions when deployed. To address these challenges, we propose HypUC, a framework for imbalanced probabilistic regression in medical time series, making several contributions. (i) We introduce a simple kernel density-based technique to tackle the imbalanced regression problem with medical time series. (ii) Moreover, we employ a probabilistic regression framework that allows uncertainty estimation for the predicted continuous values. (iii) We also present a new approach to calibrate the predicted uncertainty further. (iv) Finally, we demonstrate a technique to use calibrated uncertainty estimates to improve the predicted continuous value and show the efficacy of the calibrated uncertainty estimates to flag unreliable predictions. HypUC is evaluated on a large, diverse, real-world dataset of ECGs collected from millions of patients, outperforming several conventional baselines on various diagnostic tasks, suggesting a potential use-case for the reliable clinical deployment of deep learning models.
    Learning Optimal and Fair Policies for Online Allocation of Scarce Societal Resources from Data Collected in Deployment. (arXiv:2311.13765v1 [math.OC])
    We study the problem of allocating scarce societal resources of different types (e.g., permanent housing, deceased donor kidneys for transplantation, ventilators) to heterogeneous allocatees on a waitlist (e.g., people experiencing homelessness, individuals suffering from end-stage renal disease, Covid-19 patients) based on their observed covariates. We leverage administrative data collected in deployment to design an online policy that maximizes expected outcomes while satisfying budget constraints, in the long run. Our proposed policy waitlists each individual for the resource maximizing the difference between their estimated mean treatment outcome and the estimated resource dual-price or, roughly, the opportunity cost of using the resource. Resources are then allocated as they arrive, in a first-come first-serve fashion. We demonstrate that our data-driven policy almost surely asymptotically achieves the expected outcome of the optimal out-of-sample policy under mild technical assumptions. We extend our framework to incorporate various fairness constraints. We evaluate the performance of our approach on the problem of designing policies for allocating scarce housing resources to people experiencing homelessness in Los Angeles based on data from the homeless management information system. In particular, we show that using our policies improves rates of exit from homelessness by 1.9% and that policies that are fair in either allocation or outcomes by race come at a very low price of fairness.
    Touring sampling with pushforward maps. (arXiv:2311.13845v1 [cs.LG])
    The number of sampling methods could be daunting for a practitioner looking to cast powerful machine learning methods to their specific problem. This paper takes a theoretical stance to review and organize many sampling approaches in the ``generative modeling'' setting, where one wants to generate new data that are similar to some training examples. By revealing links between existing methods, it might prove useful to overcome some of the current challenges in sampling with diffusion models, such as long inference time due to diffusion simulation, or the lack of diversity in generated samples.
    A Blockchain Solution for Collaborative Machine Learning over IoT. (arXiv:2311.14136v1 [cs.LG])
    The rapid growth of Internet of Things (IoT) devices and applications has led to an increased demand for advanced analytics and machine learning techniques capable of handling the challenges associated with data privacy, security, and scalability. Federated learning (FL) and blockchain technologies have emerged as promising approaches to address these challenges by enabling decentralized, secure, and privacy-preserving model training on distributed data sources. In this paper, we present a novel IoT solution that combines the incremental learning vector quantization algorithm (XuILVQ) with Ethereum blockchain technology to facilitate secure and efficient data sharing, model training, and prototype storage in a distributed environment. Our proposed architecture addresses the shortcomings of existing blockchain-based FL solutions by reducing computational and communication overheads while maintaining data privacy and security. We assess the performance of our system through a series of experiments, showcasing its potential to enhance the accuracy and efficiency of machine learning tasks in IoT settings.
    Breathing Life Into Sketches Using Text-to-Video Priors. (arXiv:2311.13608v1 [cs.CV])
    A sketch is one of the most intuitive and versatile tools humans use to convey their ideas visually. An animated sketch opens another dimension to the expression of ideas and is widely used by designers for a variety of purposes. Animating sketches is a laborious process, requiring extensive experience and professional design skills. In this work, we present a method that automatically adds motion to a single-subject sketch (hence, "breathing life into it"), merely by providing a text prompt indicating the desired motion. The output is a short animation provided in vector representation, which can be easily edited. Our method does not require extensive training, but instead leverages the motion prior of a large pretrained text-to-video diffusion model using a score-distillation loss to guide the placement of strokes. To promote natural and smooth motion and to better preserve the sketch's appearance, we model the learned motion through two components. The first governs small local deformations and the second controls global affine transformations. Surprisingly, we find that even models that struggle to generate sketch videos on their own can still serve as a useful backbone for animating abstract representations.
    Language Model Inversion. (arXiv:2311.13647v1 [cs.CL])
    Language models produce a distribution over the next token; can we use this information to recover the prompt tokens? We consider the problem of language model inversion and show that next-token probabilities contain a surprising amount of information about the preceding text. Often we can recover the text in cases where it is hidden from the user, motivating a method for recovering unknown prompts given only the model's current distribution output. We consider a variety of model access scenarios, and show how even without predictions for every token in the vocabulary we can recover the probability vector through search. On Llama-2 7b, our inversion method reconstructs prompts with a BLEU of $59$ and token-level F1 of $78$ and recovers $27\%$ of prompts exactly. Code for reproducing all experiments is available at this http URL
    A Theoretical Insight into Attack and Defense of Gradient Leakage in Transformer. (arXiv:2311.13624v1 [cs.LG])
    The Deep Leakage from Gradient (DLG) attack has emerged as a prevalent and highly effective method for extracting sensitive training data by inspecting exchanged gradients. This approach poses a substantial threat to the privacy of individuals and organizations alike. This research presents a comprehensive analysis of the gradient leakage method when applied specifically to transformer-based models. Through meticulous examination, we showcase the capability to accurately recover data solely from gradients and rigorously investigate the conditions under which gradient attacks can be executed, providing compelling evidence. Furthermore, we reevaluate the approach of introducing additional noise on gradients as a protective measure against gradient attacks. To address this, we outline a theoretical proof that analyzes the associated privacy costs within the framework of differential privacy. Additionally, we affirm the convergence of the Stochastic Gradient Descent (SGD) algorithm under perturbed gradients. The primary objective of this study is to augment the understanding of gradient leakage attack and defense strategies while actively contributing to the development of privacy-preserving techniques specifically tailored for transformer-based models. By shedding light on the vulnerabilities and countermeasures associated with gradient leakage, this research aims to foster advancements in safeguarding sensitive data and upholding privacy in the context of transformer-based models.
    A study on the impact of pre-trained model on Just-In-Time defect prediction. (arXiv:2309.02317v2 [cs.SE] UPDATED)
    Previous researchers conducting Just-In-Time (JIT) defect prediction tasks have primarily focused on the performance of individual pre-trained models, without exploring the relationship between different pre-trained models as backbones. In this study, we build six models: RoBERTaJIT, CodeBERTJIT, BARTJIT, PLBARTJIT, GPT2JIT, and CodeGPTJIT, each with a distinct pre-trained model as its backbone. We systematically explore the differences and connections between these models. Specifically, we investigate the performance of the models when using Commit code and Commit message as inputs, as well as the relationship between training efficiency and model distribution among these six models. Additionally, we conduct an ablation experiment to explore the sensitivity of each model to inputs. Furthermore, we investigate how the models perform in zero-shot and few-shot scenarios. Our findings indicate that each model based on different backbones shows improvements, and when the backbone's pre-training model is similar, the training resources that need to be consumed are much more closer. We also observe that Commit code plays a significant role in defect detection, and different pre-trained models demonstrate better defect detection ability with a balanced dataset under few-shot scenarios. These results provide new insights for optimizing JIT defect prediction tasks using pre-trained models and highlight the factors that require more attention when constructing such models. Additionally, CodeGPTJIT and GPT2JIT achieved better performance than DeepJIT and CC2Vec on the two datasets respectively under 2000 training samples. These findings emphasize the effectiveness of transformer-based pre-trained models in JIT defect prediction tasks, especially in scenarios with limited training data.
    Density Distribution-based Learning Framework for Addressing Online Continual Learning Challenges. (arXiv:2311.13623v1 [cs.LG])
    In this paper, we address the challenges of online Continual Learning (CL) by introducing a density distribution-based learning framework. CL, especially the Class Incremental Learning, enables adaptation to new test distributions while continuously learning from a single-pass training data stream, which is more in line with the practical application requirements of real-world scenarios. However, existing CL methods often suffer from catastrophic forgetting and higher computing costs due to complex algorithm designs, limiting their practical use. Our proposed framework overcomes these limitations by achieving superior average accuracy and time-space efficiency, bridging the performance gap between CL and classical machine learning. Specifically, we adopt an independent Generative Kernel Density Estimation (GKDE) model for each CL task. During the testing stage, the GKDEs utilize a self-reported max probability density value to determine which one is responsible for predicting incoming test instances. A GKDE-based learning objective can ensure that samples with the same label are grouped together, while dissimilar instances are pushed farther apart. Extensive experiments conducted on multiple CL datasets validate the effectiveness of our proposed framework. Our method outperforms popular CL approaches by a significant margin, while maintaining competitive time-space efficiency, making our framework suitable for real-world applications. Code will be available at https://github.com/xxxx/xxxx.
    Visual Dexterity: In-Hand Reorientation of Novel and Complex Object Shapes. (arXiv:2211.11744v3 [cs.RO] UPDATED)
    In-hand object reorientation is necessary for performing many dexterous manipulation tasks, such as tool use in less structured environments that remain beyond the reach of current robots. Prior works built reorientation systems assuming one or many of the following: reorienting only specific objects with simple shapes, limited range of reorientation, slow or quasistatic manipulation, simulation-only results, the need for specialized and costly sensor suites, and other constraints which make the system infeasible for real-world deployment. We present a general object reorientation controller that does not make these assumptions. It uses readings from a single commodity depth camera to dynamically reorient complex and new object shapes by any rotation in real-time, with the median reorientation time being close to seven seconds. The controller is trained using reinforcement learning in simulation and evaluated in the real world on new object shapes not used for training, including the most challenging scenario of reorienting objects held in the air by a downward-facing hand that must counteract gravity during reorientation. Our hardware platform only uses open-source components that cost less than five thousand dollars. Although we demonstrate the ability to overcome assumptions in prior work, there is ample scope for improving absolute performance. For instance, the challenging duck-shaped object not used for training was dropped in 56 percent of the trials. When it was not dropped, our controller reoriented the object within 0.4 radians (23 degrees) 75 percent of the time. Videos are available at: https://taochenshh.github.io/projects/visual-dexterity.
    Optimal $\delta$-Correct Best-Arm Selection for Heavy-Tailed Distributions. (arXiv:1908.09094v3 [cs.LG] UPDATED)
    Given a finite set of unknown distributions or arms that can be sampled, we consider the problem of identifying the one with the maximum mean using a $\delta$-correct algorithm (an adaptive, sequential algorithm that restricts the probability of error to a specified $\delta$) that has minimum sample complexity. Lower bounds for $\delta$-correct algorithms are well known. $\delta$-correct algorithms that match the lower bound asymptotically as $\delta$ reduces to zero have been previously developed when arm distributions are restricted to a single parameter exponential family. In this paper, we first observe a negative result that some restrictions are essential, as otherwise, under a $\delta$-correct algorithm, distributions with unbounded support would require an infinite number of samples in expectation. We then propose a $\delta$-correct algorithm that matches the lower bound as $\delta$ reduces to zero under the mild restriction that a known bound on the expectation of $(1+\epsilon)^{th}$ moment of the underlying random variables exists, for $\epsilon > 0$. We also propose batch processing and identify near-optimal batch sizes to speed up the proposed algorithm substantially. The best-arm problem has many learning applications, including recommendation systems and product selection. It is also a well-studied classic problem in the simulation community.
    Spanning Training Progress: Temporal Dual-Depth Scoring (TDDS) for Enhanced Dataset Pruning. (arXiv:2311.13613v1 [cs.CV])
    Dataset pruning aims to construct a coreset capable of achieving performance comparable to the original, full dataset. Most existing dataset pruning methods rely on snapshot-based criteria to identify representative samples, often resulting in poor generalization across various pruning and cross-architecture scenarios. Recent studies have addressed this issue by expanding the scope of training dynamics considered, including factors such as forgetting event and probability change, typically using an averaging approach. However, these works struggle to integrate a broader range of training dynamics without overlooking well-generalized samples, which may not be sufficiently highlighted in an averaging manner. In this study, we propose a novel dataset pruning method termed as Temporal Dual-Depth Scoring (TDDS), to tackle this problem. TDDS utilizes a dual-depth strategy to achieve a balance between incorporating extensive training dynamics and identifying representative samples for dataset pruning. In the first depth, we estimate the series of each sample's individual contributions spanning the training progress, ensuring comprehensive integration of training dynamics. In the second depth, we focus on the variability of the sample-wise contributions identified in the first depth to highlight well-generalized samples. Extensive experiments conducted on CIFAR and ImageNet datasets verify the superiority of TDDS over previous SOTA methods. Specifically on CIFAR-100, our method achieves 54.51% accuracy with only 10% training data, surpassing random selection by 7.83% and other comparison methods by at least 12.69%.
    Networked Time Series Prediction with Incomplete Data via Generative Adversarial Network. (arXiv:2110.02271v3 [cs.LG] UPDATED)
    A networked time series (NETS) is a family of time series on a given graph, one for each node. It has a wide range of applications from intelligent transportation, environment monitoring to smart grid management. An important task in such applications is to predict the future values of a NETS based on its historical values and the underlying graph. Most existing methods require complete data for training. However, in real-world scenarios, it is not uncommon to have missing data due to sensor malfunction, incomplete sensing coverage, etc. In this paper, we study the problem of NETS prediction with incomplete data. We propose NETS-ImpGAN, a novel deep learning framework that can be trained on incomplete data with missing values in both history and future. Furthermore, we propose Graph Temporal Attention Networks, which incorporate the attention mechanism to capture both inter-time series and temporal correlations. We conduct extensive experiments on four real-world datasets under different missing patterns and missing rates. The experimental results show that NETS-ImpGAN outperforms existing methods, reducing the MAE by up to 25%.
    RetroDiff: Retrosynthesis as Multi-stage Distribution Interpolation. (arXiv:2311.14077v1 [cs.LG])
    Retrosynthesis poses a fundamental challenge in biopharmaceuticals, aiming to aid chemists in finding appropriate reactant molecules and synthetic pathways given determined product molecules. With the reactant and product represented as 2D graphs, retrosynthesis constitutes a conditional graph-to-graph generative task. Inspired by the recent advancements in discrete diffusion models for graph generation, we introduce Retrosynthesis Diffusion (RetroDiff), a novel diffusion-based method designed to address this problem. However, integrating a diffusion-based graph-to-graph framework while retaining essential chemical reaction template information presents a notable challenge. Our key innovation is to develop a multi-stage diffusion process. In this method, we decompose the retrosynthesis procedure to first sample external groups from the dummy distribution given products and then generate the external bonds to connect the products and generated groups. Interestingly, such a generation process is exactly the reverse of the widely adapted semi-template retrosynthesis procedure, i.e. from reaction center identification to synthon completion, which significantly reduces the error accumulation. Experimental results on the benchmark have demonstrated the superiority of our method over all other semi-template methods.
    Fast Policy Learning for Linear Quadratic Regulator with Entropy Regularization. (arXiv:2311.14168v1 [math.OC])
    This paper proposes and analyzes two new policy learning methods: regularized policy gradient (RPG) and iterative policy optimization (IPO), for a class of discounted linear-quadratic regulator (LQR) problems over an infinite time horizon with entropy regularization. Assuming access to the exact policy evaluation, both proposed approaches are proved to converge linearly in finding optimal policies of the regularized LQR. Moreover, the IPO method can achieve a super-linear convergence rate once it enters a local region around the optimal policy. Finally, when the optimal policy from a well-understood environment in an RL problem is appropriately transferred as the initial policy to an RL problem with an unknown environment, the IPO method is shown to enable a super-linear convergence rate if the latter is sufficiently close to the former. The performances of these proposed algorithms are supported by numerical examples.
    Machine Learning For An Explainable Cost Prediction of Medical Insurance. (arXiv:2311.14139v1 [cs.LG])
    Predictive modeling in healthcare continues to be an active actuarial research topic as more insurance companies aim to maximize the potential of Machine Learning approaches to increase their productivity and efficiency. In this paper, the authors deployed three regression-based ensemble ML models that combine variations of decision trees through Extreme Gradient Boosting, Gradient-boosting Machine, and Random Forest) methods in predicting medical insurance costs. Explainable Artificial Intelligence methods SHapley Additive exPlanations and Individual Conditional Expectation plots were deployed to discover and explain the key determinant factors that influence medical insurance premium prices in the dataset. The dataset used comprised 986 records and is publicly available in the KAGGLE repository. The models were evaluated using four performance evaluation metrics, including R-squared, Mean Absolute Error, Root Mean Squared Error, and Mean Absolute Percentage Error. The results show that all models produced impressive outcomes; however, the XGBoost model achieved a better overall performance although it also expanded more computational resources, while the RF model recorded a lesser prediction error and consumed far fewer computing resources than the XGBoost model. Furthermore, we compared the outcome of both XAi methods in identifying the key determinant features that influenced the PremiumPrices for each model and whereas both XAi methods produced similar outcomes, we found that the ICE plots showed in more detail the interactions between each variable than the SHAP analysis which seemed to be more high-level. It is the aim of the authors that the contributions of this study will help policymakers, insurers, and potential medical insurance buyers in their decision-making process for selecting the right policies that meet their specific needs.
    Can Physics Informed Neural Operators Self Improve?. (arXiv:2311.13885v1 [cs.LG])
    Self-training techniques have shown remarkable value across many deep learning models and tasks. However, such techniques remain largely unexplored when considered in the context of learning fast solvers for systems of partial differential equations (Eg: Neural Operators). In this work, we explore the use of self-training for Fourier Neural Operators (FNO). Neural Operators emerged as a data driven technique, however, data from experiments or traditional solvers is not always readily available. Physics Informed Neural Operators (PINO) overcome this constraint by utilizing a physics loss for the training, however the accuracy of PINO trained without data does not match the performance obtained by training with data. In this work we show that self-training can be used to close this gap in performance. We examine canonical examples, namely the 1D-Burgers and 2D-Darcy PDEs, to showcase the efficacy of self-training. Specifically, FNOs, when trained exclusively with physics loss through self-training, approach 1.07x for Burgers and 1.02x for Darcy, compared to FNOs trained with both data and physics loss. Furthermore, we discover that pseudo-labels can be used for self-training without necessarily training to convergence in each iteration. A consequence of this is that we are able to discover self-training schedules that improve upon the baseline performance of PINO in terms of accuracy as well as time.
    Creating and Benchmarking a Synthetic Dataset for Cloud Optical Thickness Estimation. (arXiv:2311.14024v1 [cs.CV])
    Cloud formations often obscure optical satellite-based monitoring of the Earth's surface, thus limiting Earth observation (EO) activities such as land cover mapping, ocean color analysis, and cropland monitoring. The integration of machine learning (ML) methods within the remote sensing domain has significantly improved performance on a wide range of EO tasks, including cloud detection and filtering, but there is still much room for improvement. A key bottleneck is that ML methods typically depend on large amounts of annotated data for training, which is often difficult to come by in EO contexts. This is especially true for the task of cloud optical thickness (COT) estimation. A reliable estimation of COT enables more fine-grained and application-dependent control compared to using pre-specified cloud categories, as is commonly done in practice. To alleviate the COT data scarcity problem, in this work we propose a novel synthetic dataset for COT estimation, where top-of-atmosphere radiances have been simulated for 12 of the spectral bands of the Multi-Spectral Instrument (MSI) sensor onboard Sentinel-2 platforms. These data points have been simulated under consideration of different cloud types, COTs, and ground surface and atmospheric profiles. Extensive experimentation of training several ML models to predict COT from the measured reflectivity of the spectral bands demonstrates the usefulness of our proposed dataset. Generalization to real data is also demonstrated on two satellite image datasets -- one that is publicly available, and one which we have collected and annotated. The synthetic data, the newly collected real dataset, code and models have been made publicly available at https://github.com/aleksispi/ml-cloud-opt-thick.
    Dungeons and Data: A Large-Scale NetHack Dataset. (arXiv:2211.00539v3 [cs.LG] UPDATED)
    Recent breakthroughs in the development of agents to solve challenging sequential decision making problems such as Go, StarCraft, or DOTA, have relied on both simulated environments and large-scale datasets. However, progress on this research has been hindered by the scarcity of open-sourced datasets and the prohibitive computational cost to work with them. Here we present the NetHack Learning Dataset (NLD), a large and highly-scalable dataset of trajectories from the popular game of NetHack, which is both extremely challenging for current methods and very fast to run. NLD consists of three parts: 10 billion state transitions from 1.5 million human trajectories collected on the NAO public NetHack server from 2009 to 2020; 3 billion state-action-score transitions from 100,000 trajectories collected from the symbolic bot winner of the NetHack Challenge 2021; and, accompanying code for users to record, load and stream any collection of such trajectories in a highly compressed form. We evaluate a wide range of existing algorithms including online and offline RL, as well as learning from demonstrations, showing that significant research advances are needed to fully leverage large-scale datasets for challenging sequential decision making tasks.
    Predicting Recovery or Decease of COVID-19 Patients with Clinical and RT-PCR Using Machine Learning Classification Algorithms. (arXiv:2311.13925v1 [eess.IV])
    The COVID-19 pandemic has disrupted the global economy and people's daily lives in unprecedented ways. To make appropriate decisions, it is necessary to diagnose COVID-19 rapidly and accurately. Clinical decision making is influenced by data collected from patients. With the aid of artificial intelligence, COVID-19 has been diagnosed quickly by analyzing symptoms, polymerase chain reaction (PCR), computed tomography scans, chest X-rays, routine laboratory blood tests and even cough sounds. Furthermore, these data can be used to predict a patient's morality, although there is a question about which data makes the most accurate predictions. Therefore, this study consists of two parts. Our first objective is to examine whether machine learning algorithms can predict the outcome of COVID-19 cases (recovery or death), based on the features present in the dataset. In the second part of the research, we investigated the impact of clinical and RT-PCR on prediction of recovery and decease to determine which one is more reliable. We defined four stages with different feature sets and use six machine learning methods to build prediction model. With an accuracy of 78.7%, random forest showed promising results for predicting death and recovery of patients. Based on this, it appears that recovery and decease of patients are predictable using machine learning. For second objective, results indicate that clinical alone (without using RT-PCR), trained with AdaBoost algorithm, is the most accurate with an accuracy of 82.1%. This study can provide guidance for medical professionals in the event of a crisis or outbreak similar to COVID-19.
    Towards Transferable Multi-modal Perception Representation Learning for Autonomy: NeRF-Supervised Masked AutoEncoder. (arXiv:2311.13750v1 [cs.CV])
    This work proposes a unified self-supervised pre-training framework for transferable multi-modal perception representation learning via masked multi-modal reconstruction in Neural Radiance Field (NeRF), namely NeRF-Supervised Masked AutoEncoder (NS-MAE). Specifically, conditioned on certain view directions and locations, multi-modal embeddings extracted from corrupted multi-modal input signals, i.e., Lidar point clouds and images, are rendered into projected multi-modal feature maps via neural rendering. Then, original multi-modal signals serve as reconstruction targets for the rendered multi-modal feature maps to enable self-supervised representation learning. Extensive experiments show that the representation learned via NS-MAE shows promising transferability for diverse multi-modal and single-modal (camera-only and Lidar-only) perception models on diverse 3D perception downstream tasks (3D object detection and BEV map segmentation) with diverse amounts of fine-tuning labeled data. Moreover, we empirically find that NS-MAE enjoys the synergy of both the mechanism of masked autoencoder and neural radiance field. Our code shall be released upon acceptance.
    A Joint Gradient and Loss Based Clustered Federated Learning Design. (arXiv:2311.13665v1 [cs.LG])
    In this paper, a novel clustered FL framework that enables distributed edge devices with non-IID data to independently form several clusters in a distributed manner and implement FL training within each cluster is proposed. In particular, our designed clustered FL algorithm must overcome two challenges associated with FL training. First, the server has limited FL training information (i.e., the parameter server can only obtain the FL model information of each device) and limited computational power for finding the differences among a large amount of devices. Second, each device does not have the data information of other devices for device clustering and can only use global FL model parameters received from the server and its data information to determine its cluster identity, which will increase the difficulty of device clustering. To overcome these two challenges, we propose a joint gradient and loss based distributed clustering method in which each device determines its cluster identity considering the gradient similarity and training loss. The proposed clustering method not only considers how a local FL model of one device contributes to each cluster but also the direction of gradient descent thus improving clustering speed. By delegating clustering decisions to edge devices, each device can fully leverage its private data information to determine its own cluster identity, thereby reducing clustering overhead and improving overall clustering performance. Simulation results demonstrate that our proposed clustered FL algorithm can reduce clustering iterations by up to 99% compared to the existing baseline.
    BackboneLearn: A Library for Scaling Mixed-Integer Optimization-Based Machine Learning. (arXiv:2311.13695v1 [cs.LG])
    We present BackboneLearn: an open-source software package and framework for scaling mixed-integer optimization (MIO) problems with indicator variables to high-dimensional problems. This optimization paradigm can naturally be used to formulate fundamental problems in interpretable supervised learning (e.g., sparse regression and decision trees), in unsupervised learning (e.g., clustering), and beyond; BackboneLearn solves the aforementioned problems faster than exact methods and with higher accuracy than commonly used heuristics. The package is built in Python and is user-friendly and easily extensible: users can directly implement a backbone algorithm for their MIO problem at hand. The source code of BackboneLearn is available on GitHub (link: https://github.com/chziakas/backbone_learn).
    Training Multi-Layer Over-Parametrized Neural Network in Subquadratic Time. (arXiv:2112.07628v2 [cs.LG] UPDATED)
    We consider the problem of training a multi-layer over-parametrized neural network to minimize the empirical risk induced by a loss function. In the typical setting of over-parametrization, the network width $m$ is much larger than the data dimension $d$ and the number of training samples $n$ ($m=\mathrm{poly}(n,d)$), which induces a prohibitive large weight matrix $W\in \mathbb{R}^{m\times m}$ per layer. Naively, one has to pay $O(m^2)$ time to read the weight matrix and evaluate the neural network function in both forward and backward computation. In this work, we show how to reduce the training cost per iteration. Specifically, we propose a framework that uses $m^2$ cost only in the initialization phase and achieves \emph{a truly subquadratic cost per iteration} in terms of $m$, i.e., $m^{2-\Omega(1)}$ per iteration. Our result has implications beyond standard over-parametrization theory, as it can be viewed as designing an efficient data structure on top of a pre-trained large model to further speed up the fine-tuning process, a core procedure to deploy large language models (LLM).
    A density estimation perspective on learning from pairwise human preferences. (arXiv:2311.14115v1 [cs.LG])
    Learning from human feedback (LHF) -- and in particular learning from pairwise preferences -- has recently become a crucial ingredient in training large language models (LLMs), and has been the subject of much research. Most recent works frame it as a reinforcement learning problem, where a reward function is learned from pairwise preference data and the LLM is treated as a policy which is adapted to maximize the rewards, often under additional regularization constraints. We propose an alternative interpretation which centers on the generative process for pairwise preferences and treats LHF as a density estimation problem. We provide theoretical and empirical results showing that for a family of generative processes defined via preference behavior distribution equations, training a reward function on pairwise preferences effectively models an annotator's implicit preference distribution. Finally, we discuss and present findings on "annotator misspecification" -- failure cases where wrong modeling assumptions are made about annotator behavior, resulting in poorly-adapted models -- suggesting that approaches that learn from pairwise human preferences could have trouble learning from a population of annotators with diverse viewpoints.
    MINTY: Rule-based Models that Minimize the Need for Imputing Features with Missing Values. (arXiv:2311.14108v1 [cs.LG])
    Rule models are often preferred in prediction tasks with tabular inputs as they can be easily interpreted using natural language and provide predictive performance on par with more complex models. However, most rule models' predictions are undefined or ambiguous when some inputs are missing, forcing users to rely on statistical imputation models or heuristics like zero imputation, undermining the interpretability of the models. In this work, we propose fitting concise yet precise rule models that learn to avoid relying on features with missing values and, therefore, limit their reliance on imputation at test time. We develop MINTY, a method that learns rules in the form of disjunctions between variables that act as replacements for each other when one or more is missing. This results in a sparse linear rule model, regularized to have small dependence on features with missing values, that allows a trade-off between goodness of fit, interpretability, and robustness to missing values at test time. We demonstrate the value of MINTY in experiments using synthetic and real-world data sets and find its predictive performance comparable or favorable to baselines, with smaller reliance on features with missing values.
    Machine learning-based decentralized TDMA for VLC IoT networks. (arXiv:2311.14078v1 [cs.NI])
    In this paper, a machine learning-based decentralized time division multiple access (TDMA) algorithm for visible light communication (VLC) Internet of Things (IoT) networks is proposed. The proposed algorithm is based on Q-learning, a reinforcement learning algorithm. This paper considers a decentralized condition in which there is no coordinator node for sending synchronization frames and assigning transmission time slots to other nodes. The proposed algorithm uses a decentralized manner for synchronization, and each node uses the Q-learning algorithm to find the optimal transmission time slot for sending data without collisions. The proposed algorithm is implemented on a VLC hardware system, which had been designed and implemented in our laboratory. Average reward, convergence time, goodput, average delay, and data packet size are evaluated parameters. The results show that the proposed algorithm converges quickly and provides collision-free decentralized TDMA for the network. The proposed algorithm is compared with carrier-sense multiple access with collision avoidance (CSMA/CA) algorithm as a potential selection for decentralized VLC IoT networks. The results show that the proposed algorithm provides up to 61% more goodput and up to 49% less average delay than CSMA/CA.
    Deep Interactive Segmentation of Medical Images: A Systematic Review and Taxonomy. (arXiv:2311.13964v1 [eess.IV])
    Interactive segmentation is a crucial research area in medical image analysis aiming to boost the efficiency of costly annotations by incorporating human feedback. This feedback takes the form of clicks, scribbles, or masks and allows for iterative refinement of the model output so as to efficiently guide the system towards the desired behavior. In recent years, deep learning-based approaches have propelled results to a new level causing a rapid growth in the field with 121 methods proposed in the medical imaging domain alone. In this review, we provide a structured overview of this emerging field featuring a comprehensive taxonomy, a systematic review of existing methods, and an in-depth analysis of current practices. Based on these contributions, we discuss the challenges and opportunities in the field. For instance, we find that there is a severe lack of comparison across methods which needs to be tackled by standardized baselines and benchmarks.
    High-Order Tensor Recovery with A Tensor $U_1$ Norm. (arXiv:2311.13958v1 [stat.ML])
    Recently, numerous tensor SVD (t-SVD)-based tensor recovery methods have emerged, showing promise in processing visual data. However, these methods often suffer from performance degradation when confronted with high-order tensor data exhibiting non-smooth changes, commonly observed in real-world scenarios but ignored by the traditional t-SVD-based methods. Our objective in this study is to provide an effective tensor recovery technique for handling non-smooth changes in tensor data and efficiently explore the correlations of high-order tensor data across its various dimensions without introducing numerous variables and weights. To this end, we introduce a new tensor decomposition and a new tensor norm called the Tensor $U_1$ norm. We utilize these novel techniques in solving the problem of high-order tensor completion problem and provide theoretical guarantees for the exact recovery of the resulting tensor completion models. An optimization algorithm is proposed to solve the resulting tensor completion model iteratively by combining the proximal algorithm with the Alternating Direction Method of Multipliers. Theoretical analysis showed the convergence of the algorithm to the Karush-Kuhn-Tucker (KKT) point of the optimization problem. Numerical experiments demonstrated the effectiveness of the proposed method in high-order tensor completion, especially for tensor data with non-smooth changes.
    A Unified Approach to Count-Based Weakly-Supervised Learning. (arXiv:2311.13718v1 [cs.LG])
    High-quality labels are often very scarce, whereas unlabeled data with inferred weak labels occurs more naturally. In many cases, these weak labels dictate the frequency of each respective class over a set of instances. In this paper, we develop a unified approach to learning from such weakly-labeled data, which we call count-based weakly-supervised learning. At the heart of our approach is the ability to compute the probability of exactly k out of n outputs being set to true. This computation is differentiable, exact, and efficient. Building upon the previous computation, we derive a count loss penalizing the model for deviations in its distribution from an arithmetic constraint defined over label counts. We evaluate our approach on three common weakly-supervised learning paradigms and observe that our proposed approach achieves state-of-the-art or highly competitive results across all three of the paradigms.
    Lego: Learning to Disentangle and Invert Concepts Beyond Object Appearance in Text-to-Image Diffusion Models. (arXiv:2311.13833v1 [cs.CV])
    Diffusion models have revolutionized generative content creation and text-to-image (T2I) diffusion models in particular have increased the creative freedom of users by allowing scene synthesis using natural language. T2I models excel at synthesizing concepts such as nouns, appearances, and styles. To enable customized content creation based on a few example images of a concept, methods such as Textual Inversion and DreamBooth invert the desired concept and enable synthesizing it in new scenes. However, inverting more general concepts that go beyond object appearance and style (adjectives and verbs) through natural language, remains a challenge. Two key characteristics of these concepts contribute to the limitations of current inversion methods. 1) Adjectives and verbs are entangled with nouns (subject) and can hinder appearance-based inversion methods, where the subject appearance leaks into the concept embedding and 2) describing such concepts often extends beyond single word embeddings (being frozen in ice, walking on a tightrope, etc.) that current methods do not handle. In this study, we introduce Lego, a textual inversion method designed to invert subject entangled concepts from a few example images. Lego disentangles concepts from their associated subjects using a simple yet effective Subject Separation step and employs a Context Loss that guides the inversion of single/multi-embedding concepts. In a thorough user study, Lego-generated concepts were preferred over 70% of the time when compared to the baseline. Additionally, visual question answering using a large language model suggested Lego-generated concepts are better aligned with the text description of the concept.
    AdaTyper: Adaptive Semantic Column Type Detection. (arXiv:2311.13806v1 [cs.DB])
    Understanding the semantics of relational tables is instrumental for automation in data exploration and preparation systems. A key source for understanding a table is the semantics of its columns. With the rise of deep learning, learned table representations are now available, which can be applied for semantic type detection and achieve good performance on benchmarks. Nevertheless, we observe a gap between this performance and its applicability in practice. In this paper, we propose AdaTyper to address one of the most critical deployment challenges: adaptation. AdaTyper uses weak-supervision to adapt a hybrid type predictor towards new semantic types and shifted data distributions at inference time, using minimal human feedback. The hybrid type predictor of AdaTyper combines rule-based methods and a light machine learning model for semantic column type detection. We evaluate the adaptation performance of AdaTyper on real-world database tables hand-annotated with semantic column types through crowdsourcing and find that the f1-score improves for new and existing types. AdaTyper approaches an average precision of 0.6 after only seeing 5 examples, significantly outperforming existing adaptation methods based on human-provided regular expressions or dictionaries.
    Deep Learning as a Method for Inversion of NMR Signals. (arXiv:2311.13722v1 [physics.chem-ph])
    The concept of deep learning is employed for the inversion of NMR signals and it is shown that NMR signal inversion can be considered as an image-to-image regression problem, which can be treated with a convolutional neural net. It is further outlined, that inversion through deep learning provides a clear efficiency and usability advantage compared to regularization techniques such as Tikhonov and modified total generalized variation (MTGV), because no hyperparemeter selection prior to reconstruction is necessary. The inversion network is applied to simulated NMR signals and the results compared with Tikhonov- and MTGV-regularization. The comparison shows that inversion via deep learning is significantly faster than the latter regularization methods and also outperforms both regularization techniques in nearly all instances.
    Collective Relational Inference for learning heterogeneous interactions. (arXiv:2305.00557v2 [cs.LG] UPDATED)
    Interacting systems are ubiquitous in nature and engineering, ranging from particle dynamics in physics to functionally connected brain regions. These interacting systems can be modeled by graphs where edges correspond to the interactions between interactive entities. Revealing interaction laws is of fundamental importance but also particularly challenging due to underlying configurational complexities. The associated challenges become exacerbated for heterogeneous systems that are prevalent in reality, where multiple interaction types coexist simultaneously and relational inference is required. Here, we propose a novel probabilistic method for relational inference, which possesses two distinctive characteristics compared to existing methods. First, it infers the interaction types of different edges collectively, and second, it allows handling systems with variable topological structure over time. We evaluate the proposed methodology across several benchmark datasets and demonstrate that it outperforms existing methods in accurately inferring interaction types. We further show that when combined with known constraints, it allows us, for example, to discover physics-consistent interaction laws of particle systems. Overall the proposed model is data-efficient and generalizable to large systems when trained on smaller ones. The developed methodology constitutes a key element for understanding interacting systems and may find application in graph structure learning.
    Masked Conditional Diffusion Models for Image Analysis with Application to Radiographic Diagnosis of Infant Abuse. (arXiv:2311.13688v1 [eess.IV])
    The classic metaphyseal lesion (CML) is a distinct injury that is highly specific for infant abuse. It commonly occurs in the distal tibia. To aid radiologists detect these subtle fractures, we need to develop a model that can flag abnormal distal tibial radiographs (i.e. those with CMLs). Unfortunately, the development of such a model requires a large and diverse training database, which is often not available. To address this limitation, we propose a novel generative model for data augmentation. Unlike previous models that fail to generate data that span the diverse radiographic appearance of the distal tibial CML, our proposed masked conditional diffusion model (MaC-DM) not only generates realistic-appearing and wide-ranging synthetic images of the distal tibial radiographs with and without CMLs, it also generates their associated segmentation labels. To achieve these tasks, MaC-DM combines the weighted segmentation masks of the tibias and the CML fracture sites as additional conditions for classifier guidance. The augmented images from our model improved the performances of ResNet-34 in classifying normal radiographs and those with CMLs. Further, the augmented images and their associated segmentation masks enhanced the performance of the U-Net in labeling areas of the CMLs on distal tibial radiographs.
    Mind the box: $l_1$-APGD for sparse adversarial attacks on image classifiers. (arXiv:2103.01208v3 [cs.LG] UPDATED)
    We show that when taking into account also the image domain $[0,1]^d$, established $l_1$-projected gradient descent (PGD) attacks are suboptimal as they do not consider that the effective threat model is the intersection of the $l_1$-ball and $[0,1]^d$. We study the expected sparsity of the steepest descent step for this effective threat model and show that the exact projection onto this set is computationally feasible and yields better performance. Moreover, we propose an adaptive form of PGD which is highly effective even with a small budget of iterations. Our resulting $l_1$-APGD is a strong white-box attack showing that prior works overestimated their $l_1$-robustness. Using $l_1$-APGD for adversarial training we get a robust classifier with SOTA $l_1$-robustness. Finally, we combine $l_1$-APGD and an adaptation of the Square Attack to $l_1$ into $l_1$-AutoAttack, an ensemble of attacks which reliably assesses adversarial robustness for the threat model of $l_1$-ball intersected with $[0,1]^d$.
    SySMOL: A Hardware-software Co-design Framework for Ultra-Low and Fine-Grained Mixed-Precision Neural Networks. (arXiv:2311.14114v1 [cs.AR])
    Recent advancements in quantization and mixed-precision techniques offer significant promise for improving the run-time and energy efficiency of neural networks. In this work, we further showed that neural networks, wherein individual parameters or activations can take on different precisions ranging between 1 and 4 bits, can achieve accuracies comparable to or exceeding the full-precision counterparts. However, the deployment of such networks poses numerous challenges, stemming from the necessity to manage and control the compute/communication/storage requirements associated with these extremely fine-grained mixed precisions for each piece of data. There is a lack of existing efficient hardware and system-level support tailored to these unique and challenging requirements. Our research introduces the first novel holistic hardware-software co-design approach for these networks, which enables a continuous feedback loop between hardware design, training, and inference to facilitate systematic design exploration. As a proof-of-concept, we illustrate this co-design approach by designing new, configurable CPU SIMD architectures tailored for these networks, tightly integrating the architecture with new system-aware training and inference techniques. We perform systematic design space exploration using this framework to analyze various tradeoffs. The design for mixed-precision networks that achieves optimized tradeoffs corresponds to an architecture that supports 1, 2, and 4-bit fixed-point operations with four configurable precision patterns, when coupled with system-aware training and inference optimization -- networks trained for this design achieve accuracies that closely match full-precision accuracies, while compressing and improving run-time efficiency of the neural networks drastically by 10-20x, compared to full-precision networks.
    DPSUR: Accelerating Differentially Private Stochastic Gradient Descent Using Selective Update and Release. (arXiv:2311.14056v1 [cs.LG])
    Machine learning models are known to memorize private data to reduce their training loss, which can be inadvertently exploited by privacy attacks such as model inversion and membership inference. To protect against these attacks, differential privacy (DP) has become the de facto standard for privacy-preserving machine learning, particularly those popular training algorithms using stochastic gradient descent, such as DPSGD. Nonetheless, DPSGD still suffers from severe utility loss due to its slow convergence. This is partially caused by the random sampling, which brings bias and variance to the gradient, and partially by the Gaussian noise, which leads to fluctuation of gradient updates. Our key idea to address these issues is to apply selective updates to the model training, while discarding those useless or even harmful updates. Motivated by this, this paper proposes DPSUR, a Differentially Private training framework based on Selective Updates and Release, where the gradient from each iteration is evaluated based on a validation test, and only those updates leading to convergence are applied to the model. As such, DPSUR ensures the training in the right direction and thus can achieve faster convergence than DPSGD. The main challenges lie in two aspects -- privacy concerns arising from gradient evaluation, and gradient selection strategy for model update. To address the challenges, DPSUR introduces a clipping strategy for update randomization and a threshold mechanism for gradient selection. Experiments conducted on MNIST, FMNIST, CIFAR-10, and IMDB datasets show that DPSUR significantly outperforms previous works in terms of convergence speed and model utility.
    On Principles of Emergent Organization. (arXiv:2311.13749v1 [cond-mat.stat-mech])
    After more than a century of concerted effort, physics still lacks basic principles of spontaneous self-organization. To appreciate why, we first state the problem, outline historical approaches, and survey the present state of the physics of self-organization. This frames the particular challenges arising from mathematical intractability and the resulting need for computational approaches, as well as those arising from a chronic failure to define structure. Then, an overview of two modern mathematical formulations of organization -- intrinsic computation and evolution operators -- lays out a way to overcome these challenges. Together, the vantage point they afford shows how to account for the emergence of structured states via a statistical mechanics of systems arbitrarily far from equilibrium. The result is a constructive path forward to principles of organization that builds on mathematical identification of structure.
  • Open

    Locally Optimal Descent for Dynamic Stepsize Scheduling. (arXiv:2311.13877v1 [cs.LG])
    We introduce a novel dynamic learning-rate scheduling scheme grounded in theory with the goal of simplifying the manual and time-consuming tuning of schedules in practice. Our approach is based on estimating the locally-optimal stepsize, guaranteeing maximal descent in the direction of the stochastic gradient of the current step. We first establish theoretical convergence bounds for our method within the context of smooth non-convex stochastic optimization, matching state-of-the-art bounds while only assuming knowledge of the smoothness parameter. We then present a practical implementation of our algorithm and conduct systematic experiments across diverse datasets and optimization algorithms, comparing our scheme with existing state-of-the-art learning-rate schedulers. Our findings indicate that our method needs minimal tuning when compared to existing approaches, removing the need for auxiliary manual schedules and warm-up phases and achieving comparable performance with drastically reduced parameter tuning.
    Kernel-Based Tests for Likelihood-Free Hypothesis Testing. (arXiv:2308.09043v2 [stat.ML] UPDATED)
    Given $n$ observations from two balanced classes, consider the task of labeling an additional $m$ inputs that are known to all belong to \emph{one} of the two classes. Special cases of this problem are well-known: with complete knowledge of class distributions ($n=\infty$) the problem is solved optimally by the likelihood-ratio test; when $m=1$ it corresponds to binary classification; and when $m\approx n$ it is equivalent to two-sample testing. The intermediate settings occur in the field of likelihood-free inference, where labeled samples are obtained by running forward simulations and the unlabeled sample is collected experimentally. In recent work it was discovered that there is a fundamental trade-off between $m$ and $n$: increasing the data sample $m$ reduces the amount $n$ of training/simulation data needed. In this work we (a) introduce a generalization where unlabeled samples come from a mixture of the two classes -- a case often encountered in practice; (b) study the minimax sample complexity for non-parametric classes of densities under \textit{maximum mean discrepancy} (MMD) separation; and (c) investigate the empirical performance of kernels parameterized by neural networks on two tasks: detection of the Higgs boson and detection of planted DDPM generated images amidst CIFAR-10 images. For both problems we confirm the existence of the theoretically predicted asymmetric $m$ vs $n$ trade-off.
    A path-norm toolkit for modern networks: consequences, promises and challenges. (arXiv:2310.01225v3 [stat.ML] UPDATED)
    This work introduces the first toolkit around path-norms that is fully able to encompass general DAG ReLU networks with biases, skip connections and any operation based on the extraction of order statistics: max pooling, GroupSort etc. This toolkit notably allows us to establish generalization bounds for modern neural networks that are not only the most widely applicable path-norm based ones, but also recover or beat the sharpest known bounds of this type. These extended path-norms further enjoy the usual benefits of path-norms: ease of computation, invariance under the symmetries of the network, and improved sharpness on feedforward networks compared to the product of operators' norms, another complexity measure most commonly used. The versatility of the toolkit and its ease of implementation allow us to challenge the concrete promises of path-norm-based generalization bounds, by numerically evaluating the sharpest known bounds for ResNets on ImageNet.
    Stability and L2-penalty in Model Averaging. (arXiv:2311.13827v1 [stat.ML])
    Model averaging has received much attention in the past two decades, which integrates available information by averaging over potential models. Although various model averaging methods have been developed, there are few literatures on the theoretical properties of model averaging from the perspective of stability, and the majority of these methods constrain model weights to a simplex. The aim of this paper is to introduce stability from statistical learning theory into model averaging. Thus, we define the stability, asymptotic empirical risk minimizer, generalization, and consistency of model averaging and study the relationship among them. Our results indicate that stability can ensure that model averaging has good generalization performance and consistency under reasonable conditions, where consistency means model averaging estimator can asymptotically minimize the mean squared prediction error. We also propose a L2-penalty model averaging method without limiting model weights and prove that it has stability and consistency. In order to reduce the impact of tuning parameter selection, we use 10-fold cross-validation to select a candidate set of tuning parameters and perform a weighted average of the estimators of model weights based on estimation errors. The Monte Carlo simulation and an illustrative application demonstrate the usefulness of the proposed method.
    Proactive DP: A Multple Target Optimization Framework for DP-SGD. (arXiv:2102.09030v9 [cs.LG] UPDATED)
    We introduce a multiple target optimization framework for DP-SGD referred to as pro-active DP. In contrast to traditional DP accountants, which are used to track the expenditure of privacy budgets, the pro-active DP scheme allows one to {\it a-priori} select parameters of DP-SGD based on a fixed privacy budget (in terms of $\epsilon$ and $\delta$) in such a way to optimize the anticipated utility (test accuracy) the most. To achieve this objective, we first propose significant improvements to the moment account method, presenting a closed-form $(\epsilon,\delta)$-DP guarantee that connects all parameters in the DP-SGD setup. Generally, DP-SGD is $(\epsilon\leq 1/2,\delta=1/N)$-DP if $\sigma=\sqrt{2(\epsilon +\ln(1/\delta))/\epsilon}$ with $T$ at least $\approx 2k^2/\epsilon$ and $(2/e)^2k^2-1/2\geq \ln(N)$, where $T$ is the total number of rounds, and $K=kN$ is the total number of gradient computations where $k$ measures $K$ in number of epochs of size $N$ of the local data set. We prove that our expression is close to tight in that if $T$ is more than a constant factor $\approx 4$ smaller than the lower bound $\approx 2k^2/\epsilon$, then the $(\epsilon,\delta)$-DP guarantee is violated. Our enhanced DP theory allows us to create a utility graph and DP calculator. These tools link privacy and utility objectives and search for optimal experiment setups, efficiently taking into account both accuracy and privacy objectives, as well as implementation goals. We furnish a comprehensive implementation flow of our proactive DP, with rigorous experiments to showcase the proof-of-concept.
    Sensitivity-Aware Amortized Bayesian Inference. (arXiv:2310.11122v3 [stat.ML] UPDATED)
    Bayesian inference is a powerful framework for making probabilistic inferences and decisions under uncertainty. Fundamental choices in modern Bayesian workflows concern the specification of the likelihood function and prior distributions, the posterior approximator, and the data. Each choice can significantly influence model-based inference and subsequent decisions, thereby necessitating sensitivity analysis. In this work, we propose a multifaceted approach to integrate sensitivity analyses into amortized Bayesian inference (ABI, i.e., simulation-based inference with neural networks). First, we utilize weight sharing to encode the structural similarities between alternative likelihood and prior specifications in the training process with minimal computational overhead. Second, we leverage the rapid inference of neural networks to assess sensitivity to various data perturbations or pre-processing procedures. In contrast to most other Bayesian approaches, both steps circumvent the costly bottleneck of refitting the model(s) for each choice of likelihood, prior, or dataset. Finally, we propose to use neural network ensembles to evaluate variation in results induced by unreliable approximation on unseen data. We demonstrate the effectiveness of our method in applied modeling problems, ranging from the estimation of disease outbreak dynamics and global warming thresholds to the comparison of human decision-making models. Our experiments showcase how our approach enables practitioners to effectively unveil hidden relationships between modeling choices and inferential conclusions.
    One Pass Streaming Algorithm for Super Long Token Attention Approximation in Sublinear Space. (arXiv:2311.14652v1 [cs.LG])
    Deploying Large Language Models (LLMs) in streaming applications that involve long contexts, particularly for extended dialogues and text analysis, is of paramount importance but presents two significant challenges. Firstly, the memory consumption is substantial during the decoding phase due to the caching of Key and Value states (KV) of previous tokens. Secondly, attention computation is time-consuming with a time complexity of $O(n^2)$ for the generation of each token. In recent OpenAI DevDay (Nov 6, 2023), OpenAI released a new model that is able to support a 128K-long document, in our paper, we focus on the memory-efficient issue when context length $n$ is much greater than 128K ($n \gg 2^d$). Considering a single-layer self-attention with Query, Key, and Value matrices $Q, K, V \in \mathbb{R}^{n \times d}$, the polynomial method approximates the attention output $T \in \mathbb{R}^{n \times d}$. It accomplishes this by constructing $U_1, U_2 \in \mathbb{R}^{n \times t}$ to expedite attention ${\sf Attn}(Q, K, V)$ computation within $n^{1+o(1)}$ time executions. Despite this, storing the Key and Value matrices $K, V \in \mathbb{R}^{n \times d}$ still necessitates $O( n d)$ space, leading to significant memory usage. In response to these challenges, we introduce a new algorithm that only reads one pass of the data in streaming fashion. This method employs sublinear space $o(n)$ to store three sketch matrices, alleviating the need for exact $K, V$ storage. Notably, our algorithm exhibits exceptional memory-efficient performance with super-long tokens. As the token length $n$ increases, our error guarantee diminishes while the memory usage remains nearly constant. This unique attribute underscores the potential of our technique in efficiently handling LLMs in streaming applications.
    Grokking as the Transition from Lazy to Rich Training Dynamics. (arXiv:2310.06110v2 [stat.ML] UPDATED)
    We propose that the grokking phenomenon, where the train loss of a neural network decreases much earlier than its test loss, can arise due to a neural network transitioning from lazy training dynamics to a rich, feature learning regime. To illustrate this mechanism, we study the simple setting of vanilla gradient descent on a polynomial regression problem with a two layer neural network which exhibits grokking without regularization in a way that cannot be explained by existing theories. We identify sufficient statistics for the test loss of such a network, and tracking these over training reveals that grokking arises in this setting when the network first attempts to fit a kernel regression solution with its initial features, followed by late-time feature learning where a generalizing solution is identified after train loss is already low. We provide an asymptotic theoretical description of the grokking dynamics in this model using dynamical mean field theory (DMFT) for high dimensional data. We find that the key determinants of grokking are the rate of feature learning -- which can be controlled precisely by parameters that scale the network output -- and the alignment of the initial features with the target function $y(x)$. We argue this delayed generalization arises when (1) the top eigenvectors of the initial neural tangent kernel and the task labels $y(x)$ are misaligned, but (2) the dataset size is large enough so that it is possible for the network to generalize eventually, but not so large that train loss perfectly tracks test loss at all epochs, and (3) the network begins training in the lazy regime so does not learn features immediately. We conclude with evidence that this transition from lazy (linear model) to rich training (feature learning) can control grokking in more general settings, like on MNIST, one-layer Transformers, and student-teacher networks.
    A Bayesian Take on Gaussian Process Networks. (arXiv:2306.11380v4 [stat.ML] UPDATED)
    Gaussian Process Networks (GPNs) are a class of directed graphical models which employ Gaussian processes as priors for the conditional expectation of each variable given its parents in the network. The model allows the description of continuous joint distributions in a compact but flexible manner with minimal parametric assumptions on the dependencies between variables. Bayesian structure learning of GPNs requires computing the posterior over graphs of the network and is computationally infeasible even in low dimensions. This work implements Monte Carlo and Markov Chain Monte Carlo methods to sample from the posterior distribution of network structures. As such, the approach follows the Bayesian paradigm, comparing models via their marginal likelihood and computing the posterior probability of the GPN features. Simulation studies show that our method outperforms state-of-the-art algorithms in recovering the graphical structure of the network and provides an accurate approximation of its posterior distribution.
    Bounding Box-based Multi-objective Bayesian Optimization of Risk Measures under Input Uncertainty. (arXiv:2301.11588v3 [stat.ML] UPDATED)
    In this study, we propose a novel multi-objective Bayesian optimization (MOBO) method to efficiently identify the Pareto front (PF) defined by risk measures for black-box functions under the presence of input uncertainty (IU). Existing BO methods for Pareto optimization in the presence of IU are risk-specific or without theoretical guarantees, whereas our proposed method addresses general risk measures and has theoretical guarantees. The basic idea of the proposed method is to assume a Gaussian process (GP) model for the black-box function and to construct high-probability bounding boxes for the risk measures using the GP model. Furthermore, in order to reduce the uncertainty of non-dominated bounding boxes, we propose a method of selecting the next evaluation point using a maximin distance defined by the maximum value of a quasi distance based on bounding boxes. As theoretical analysis, we prove that the algorithm can return an arbitrary-accurate solution in a finite number of iterations with high probability, for various risk measures such as Bayes risk, worst-case risk, and value-at-risk. We also give a theoretical analysis that takes into account approximation errors because there exist non-negligible approximation errors (e.g., finite approximation of PFs and sampling-based approximation of bounding boxes) in practice. We confirm that the proposed method outperforms compared with existing methods not only in the setting with IU but also in the setting of ordinary MOBO through numerical experiments.
    Annotation Sensitivity: Training Data Collection Methods Affect Model Performance. (arXiv:2311.14212v1 [stat.ML])
    When training data are collected from human annotators, the design of the annotation instrument, the instructions given to annotators, the characteristics of the annotators, and their interactions can impact training data. This study demonstrates that design choices made when creating an annotation instrument also impact the models trained on the resulting annotations. We introduce the term annotation sensitivity to refer to the impact of annotation data collection methods on the annotations themselves and on downstream model performance and predictions. We collect annotations of hate speech and offensive language in five experimental conditions of an annotation instrument, randomly assigning annotators to conditions. We then fine-tune BERT models on each of the five resulting datasets and evaluate model performance on a holdout portion of each condition. We find considerable differences between the conditions for 1) the share of hate speech/offensive language annotations, 2) model performance, 3) model predictions, and 4) model learning curves. Our results emphasize the crucial role played by the annotation instrument which has received little attention in the machine learning literature. We call for additional research into how and why the instrument impacts the annotations to inform the development of best practices in instrument design.
    Class Balanced Dynamic Acquisition for Domain Adaptive Semantic Segmentation using Active Learning. (arXiv:2311.14146v1 [cs.CV])
    Domain adaptive active learning is leading the charge in label-efficient training of neural networks. For semantic segmentation, state-of-the-art models jointly use two criteria of uncertainty and diversity to select training labels, combined with a pixel-wise acquisition strategy. However, we show that such methods currently suffer from a class imbalance issue which degrades their performance for larger active learning budgets. We then introduce Class Balanced Dynamic Acquisition (CBDA), a novel active learning method that mitigates this issue, especially in high-budget regimes. The more balanced labels increase minority class performance, which in turn allows the model to outperform the previous baseline by 0.6, 1.7, and 2.4 mIoU for budgets of 5%, 10%, and 20%, respectively. Additionally, the focus on minority classes leads to improvements of the minimum class performance of 0.5, 2.9, and 4.6 IoU respectively. The top-performing model even exceeds the fully supervised baseline, showing that a more balanced label than the entire ground truth can be beneficial.
    How Over-Parameterization Slows Down Gradient Descent in Matrix Sensing: The Curses of Symmetry and Initialization. (arXiv:2310.01769v3 [cs.LG] UPDATED)
    This paper rigorously shows how over-parameterization changes the convergence behaviors of gradient descent (GD) for the matrix sensing problem, where the goal is to recover an unknown low-rank ground-truth matrix from near-isotropic linear measurements. First, we consider the symmetric setting with the symmetric parameterization where $M^* \in \mathbb{R}^{n \times n}$ is a positive semi-definite unknown matrix of rank $r \ll n$, and one uses a symmetric parameterization $XX^\top$ to learn $M^*$. Here $X \in \mathbb{R}^{n \times k}$ with $k > r$ is the factor matrix. We give a novel $\Omega (1/T^2)$ lower bound of randomly initialized GD for the over-parameterized case ($k >r$) where $T$ is the number of iterations. This is in stark contrast to the exact-parameterization scenario ($k=r$) where the convergence rate is $\exp (-\Omega (T))$. Next, we study asymmetric setting where $M^* \in \mathbb{R}^{n_1 \times n_2}$ is the unknown matrix of rank $r \ll \min\{n_1,n_2\}$, and one uses an asymmetric parameterization $FG^\top$ to learn $M^*$ where $F \in \mathbb{R}^{n_1 \times k}$ and $G \in \mathbb{R}^{n_2 \times k}$. Building on prior work, we give a global exact convergence result of randomly initialized GD for the exact-parameterization case ($k=r$) with an $\exp (-\Omega(T))$ rate. Furthermore, we give the first global exact convergence result for the over-parameterization case ($k>r$) with an $\exp(-\Omega(\alpha^2 T))$ rate where $\alpha$ is the initialization scale. This linear convergence result in the over-parameterization case is especially significant because one can apply the asymmetric parameterization to the symmetric setting to speed up from $\Omega (1/T^2)$ to linear convergence. On the other hand, we propose a novel method that only modifies one step of GD and obtains a convergence rate independent of $\alpha$, recovering the rate in the exact-parameterization case.
    Upgrading VAE Training With Unlimited Data Plans Provided by Diffusion Models. (arXiv:2310.19653v2 [stat.ML] UPDATED)
    Variational autoencoders (VAEs) are popular models for representation learning but their encoders are susceptible to overfitting (Cremer et al., 2018) because they are trained on a finite training set instead of the true (continuous) data distribution $p_{\mathrm{data}}(\mathbf{x})$. Diffusion models, on the other hand, avoid this issue by keeping the encoder fixed. This makes their representations less interpretable, but it simplifies training, enabling accurate and continuous approximations of $p_{\mathrm{data}}(\mathbf{x})$. In this paper, we show that overfitting encoders in VAEs can be effectively mitigated by training on samples from a pre-trained diffusion model. These results are somewhat unexpected as recent findings (Alemohammad et al., 2023; Shumailov et al., 2023) observe a decay in generative performance when models are trained on data generated by another generative model. We analyze generalization performance, amortization gap, and robustness of VAEs trained with our proposed method on three different data sets. We find improvements in all metrics compared to both normal training and conventional data augmentation methods, and we show that a modest amount of samples from the diffusion model suffices to obtain these gains.
    A Metalearned Neural Circuit for Nonparametric Bayesian Inference. (arXiv:2311.14601v1 [cs.LG])
    Most applications of machine learning to classification assume a closed set of balanced classes. This is at odds with the real world, where class occurrence statistics often follow a long-tailed power-law distribution and it is unlikely that all classes are seen in a single sample. Nonparametric Bayesian models naturally capture this phenomenon, but have significant practical barriers to widespread adoption, namely implementation complexity and computational inefficiency. To address this, we present a method for extracting the inductive bias from a nonparametric Bayesian model and transferring it to an artificial neural network. By simulating data with a nonparametric Bayesian prior, we can metalearn a sequence model that performs inference over an unlimited set of classes. After training, this "neural circuit" has distilled the corresponding inductive bias and can successfully perform sequential inference over an open set of classes. Our experimental results show that the metalearned neural circuit achieves comparable or better performance than particle filter-based methods for inference in these models while being faster and simpler to use than methods that explicitly incorporate Bayesian nonparametric inference.
    Sharing pattern submodels for prediction with missing values. (arXiv:2206.11161v3 [cs.LG] UPDATED)
    Missing values are unavoidable in many applications of machine learning and present challenges both during training and at test time. When variables are missing in recurring patterns, fitting separate pattern submodels have been proposed as a solution. However, fitting models independently does not make efficient use of all available data. Conversely, fitting a single shared model to the full data set relies on imputation which often leads to biased results when missingness depends on unobserved factors. We propose an alternative approach, called sharing pattern submodels, which i) makes predictions that are robust to missing values at test time, ii) maintains or improves the predictive power of pattern submodels, and iii) has a short description, enabling improved interpretability. Parameter sharing is enforced through sparsity-inducing regularization which we prove leads to consistent estimation. Finally, we give conditions for when a sharing model is optimal, even when both missingness and the target outcome depend on unobserved variables. Classification and regression experiments on synthetic and real-world data sets demonstrate that our models achieve a favorable tradeoff between pattern specialization and information sharing.
    Mitigating Over-Smoothing and Over-Squashing using Augmentations of Forman-Ricci Curvature. (arXiv:2309.09384v2 [cs.LG] UPDATED)
    While Graph Neural Networks (GNNs) have been successfully leveraged for learning on graph-structured data across domains, several potential pitfalls have been described recently. Those include the inability to accurately leverage information encoded in long-range connections (over-squashing), as well as difficulties distinguishing the learned representations of nearby nodes with growing network depth (over-smoothing). An effective way to characterize both effects is discrete curvature: Long-range connections that underlie over-squashing effects have low curvature, whereas edges that contribute to over-smoothing have high curvature. This observation has given rise to rewiring techniques, which add or remove edges to mitigate over-smoothing and over-squashing. Several rewiring approaches utilizing graph characteristics, such as curvature or the spectrum of the graph Laplacian, have been proposed. However, existing methods, especially those based on curvature, often require expensive subroutines and careful hyperparameter tuning, which limits their applicability to large-scale graphs. Here we propose a rewiring technique based on Augmented Forman-Ricci curvature (AFRC), a scalable curvature notation, which can be computed in linear time. We prove that AFRC effectively characterizes over-smoothing and over-squashing effects in message-passing GNNs. We complement our theoretical results with experiments, which demonstrate that the proposed approach achieves state-of-the-art performance while significantly reducing the computational cost in comparison with other methods. Utilizing fundamental properties of discrete curvature, we propose effective heuristics for hyperparameters in curvature-based rewiring, which avoids expensive hyperparameter searches, further improving the scalability of the proposed approach.
    Score-based Diffusion Models in Function Space. (arXiv:2302.07400v2 [cs.LG] UPDATED)
    Diffusion models have recently emerged as a powerful framework for generative modeling. They consist of a forward process that perturbs input data with Gaussian white noise and a reverse process that learns a score function to generate samples by denoising. Despite their tremendous success, they are mostly formulated on finite-dimensional spaces, e.g. Euclidean, limiting their applications to many domains where the data has a functional form such as in scientific computing and 3D geometric data analysis. In this work, we introduce a mathematically rigorous framework called Denoising Diffusion Operators (DDOs) for training diffusion models in function space. In DDOs, the forward process perturbs input functions gradually using a Gaussian process. The generative process is formulated by integrating a function-valued Langevin dynamic. Our approach requires an appropriate notion of the score for the perturbed data distribution, which we obtain by generalizing denoising score matching to function spaces that can be infinite-dimensional. We show that the corresponding discretized algorithm generates accurate samples at a fixed cost that is independent of the data resolution. We theoretically and numerically verify the applicability of our approach on a set of problems, including generating solutions to the Navier-Stokes equation viewed as the push-forward distribution of forcings from a Gaussian Random Field (GRF).
    The Noise Geometry of Stochastic Gradient Descent: A Quantitative and Analytical Characterization. (arXiv:2310.00692v2 [cs.LG] UPDATED)
    Empirical studies have demonstrated that the noise in stochastic gradient descent (SGD) aligns favorably with the local geometry of loss landscape. However, theoretical and quantitative explanations for this phenomenon remain sparse. In this paper, we offer a comprehensive theoretical investigation into the aforementioned {\em noise geometry} for over-parameterized linear (OLMs) models and two-layer neural networks. We scrutinize both average and directional alignments, paying special attention to how factors like sample size and input data degeneracy affect the alignment strength. As a specific application, we leverage our noise geometry characterizations to study how SGD escapes from sharp minima, revealing that the escape direction has significant components along flat directions. This is in stark contrast to GD, which escapes only along the sharpest directions. To substantiate our theoretical findings, both synthetic and real-world experiments are provided.
    BackboneLearn: A Library for Scaling Mixed-Integer Optimization-Based Machine Learning. (arXiv:2311.13695v1 [cs.LG])
    We present BackboneLearn: an open-source software package and framework for scaling mixed-integer optimization (MIO) problems with indicator variables to high-dimensional problems. This optimization paradigm can naturally be used to formulate fundamental problems in interpretable supervised learning (e.g., sparse regression and decision trees), in unsupervised learning (e.g., clustering), and beyond; BackboneLearn solves the aforementioned problems faster than exact methods and with higher accuracy than commonly used heuristics. The package is built in Python and is user-friendly and easily extensible: users can directly implement a backbone algorithm for their MIO problem at hand. The source code of BackboneLearn is available on GitHub (link: https://github.com/chziakas/backbone_learn).
    FAVANO: Federated AVeraging with Asynchronous NOdes. (arXiv:2305.16099v2 [cs.LG] UPDATED)
    In this paper, we propose a novel centralized Asynchronous Federated Learning (FL) framework, FAVANO, for training Deep Neural Networks (DNNs) in resource-constrained environments. Despite its popularity, ``classical'' federated learning faces the increasingly difficult task of scaling synchronous communication over large wireless networks. Moreover, clients typically have different computing resources and therefore computing speed, which can lead to a significant bias (in favor of ``fast'' clients) when the updates are asynchronous. Therefore, practical deployment of FL requires to handle users with strongly varying computing speed in communication/resource constrained setting. We provide convergence guarantees for FAVANO in a smooth, non-convex environment and carefully compare the obtained convergence guarantees with existing bounds, when they are available. Experimental results show that the FAVANO algorithm outperforms current methods on standard benchmarks.
    On Preference Learning Based on Sequential Bayesian Optimization with Pairwise Comparison. (arXiv:2103.13192v3 [cs.LG] UPDATED)
    User preference learning is generally a hard problem. Individual preferences are typically unknown even to users themselves, while the space of choices is infinite. Here we study user preference learning from information-theoretic perspective. We model preference learning as a system with two interacting sub-systems, one representing a user with his/her preferences and another one representing an agent that has to learn these preferences. The user with his/her behaviour is modeled by a parametric preference function. To efficiently learn the preferences and reduce search space quickly, we propose the agent that interacts with the user to collect the most informative data for learning. The agent presents two proposals to the user for evaluation, and the user rates them based on his/her preference function. We show that the optimum agent strategy for data collection and preference learning is a result of maximin optimization of the normalized weighted Kullback-Leibler (KL) divergence between true and agent-assigned predictive user response distributions. The resulting value of KL-divergence, which we also call remaining system uncertainty (RSU), provides an efficient performance metric in the absence of the ground truth. This metric characterises how well the agent can predict user and, thus, the quality of the underlying learned user (preference) model. Our proposed agent comprises sequential mechanisms for user model inference and proposal generation. To infer the user model (preference function), Bayesian approximate inference is used in the agent. The data collection strategy is to generate proposals, responses to which help resolving uncertainty associated with prediction of the user responses the most. The efficiency of our approach is validated by numerical simulations. Also a real-life example of preference learning application is provided.
    Optimal $\delta$-Correct Best-Arm Selection for Heavy-Tailed Distributions. (arXiv:1908.09094v3 [cs.LG] UPDATED)
    Given a finite set of unknown distributions or arms that can be sampled, we consider the problem of identifying the one with the maximum mean using a $\delta$-correct algorithm (an adaptive, sequential algorithm that restricts the probability of error to a specified $\delta$) that has minimum sample complexity. Lower bounds for $\delta$-correct algorithms are well known. $\delta$-correct algorithms that match the lower bound asymptotically as $\delta$ reduces to zero have been previously developed when arm distributions are restricted to a single parameter exponential family. In this paper, we first observe a negative result that some restrictions are essential, as otherwise, under a $\delta$-correct algorithm, distributions with unbounded support would require an infinite number of samples in expectation. We then propose a $\delta$-correct algorithm that matches the lower bound as $\delta$ reduces to zero under the mild restriction that a known bound on the expectation of $(1+\epsilon)^{th}$ moment of the underlying random variables exists, for $\epsilon > 0$. We also propose batch processing and identify near-optimal batch sizes to speed up the proposed algorithm substantially. The best-arm problem has many learning applications, including recommendation systems and product selection. It is also a well-studied classic problem in the simulation community.
    A General Framework for User-Guided Bayesian Optimization. (arXiv:2311.14645v1 [cs.LG])
    The optimization of expensive-to-evaluate black-box functions is prevalent in various scientific disciplines. Bayesian optimization is an automatic, general and sample-efficient method to solve these problems with minimal knowledge of the underlying function dynamics. However, the ability of Bayesian optimization to incorporate prior knowledge or beliefs about the function at hand in order to accelerate the optimization is limited, which reduces its appeal for knowledgeable practitioners with tight budgets. To allow domain experts to customize the optimization routine, we propose ColaBO, the first Bayesian-principled framework for incorporating prior beliefs beyond the typical kernel structure, such as the likely location of the optimizer or the optimal value. The generality of ColaBO makes it applicable across different Monte Carlo acquisition functions and types of user beliefs. We empirically demonstrate ColaBO's ability to substantially accelerate optimization when the prior information is accurate, and to retain approximately default performance when it is misleading.
    FRUITS: Feature Extraction Using Iterated Sums for Time Series Classification. (arXiv:2311.14549v1 [stat.ML])
    We introduce a pipeline for time series classification that extracts features based on the iterated-sums signature (ISS) and then applies a linear classifier. These features are intrinsically nonlinear, capture chronological information, and, under certain settings, are invariant to time-warping. We are competitive with state-of-the-art methods on the UCR archive, both in terms of accuracy and speed. We make our code available at \url{https://github.com/irkri/fruits}.
    Lion Secretly Solves Constrained Optimization: As Lyapunov Predicts. (arXiv:2310.05898v3 [cs.LG] UPDATED)
    Lion (Evolved Sign Momentum), a new optimizer discovered through program search, has shown promising results in training large AI models. It performs comparably or favorably to AdamW but with greater memory efficiency. As we can expect from the results of a random search program, Lion incorporates elements from several existing algorithms, including signed momentum, decoupled weight decay, Polak, and Nesterov momentum, but does not fit into any existing category of theoretically grounded optimizers. Thus, even though Lion appears to perform well as a general-purpose optimizer for a wide range of tasks, its theoretical basis remains uncertain. This lack of theoretical clarity limits opportunities to further enhance and expand Lion's efficacy. This work aims to demystify Lion. Based on both continuous-time and discrete-time analysis, we demonstrate that Lion is a theoretically novel and principled approach for minimizing a general loss function $f(x)$ while enforcing a bound constraint $\|x\|_\infty \leq 1/\lambda$. Lion achieves this through the incorporation of decoupled weight decay, where $\lambda$ represents the weight decay coefficient. Our analysis is made possible by the development of a new Lyapunov function for the Lion updates. It applies to a broader family of Lion-$\kappa$ algorithms, where the $\text{sign}(\cdot)$ operator in Lion is replaced by the subgradient of a convex function $\kappa$, leading to the solution of a general composite optimization problem of $\min_x f(x) + \kappa^*(x)$. Our findings provide valuable insights into the dynamics of Lion and pave the way for further improvements and extensions of Lion-related algorithms.  ( 3 min )
    Variational Annealing on Graphs for Combinatorial Optimization. (arXiv:2311.14156v1 [cs.LG])
    Several recent unsupervised learning methods use probabilistic approaches to solve combinatorial optimization (CO) problems based on the assumption of statistically independent solution variables. We demonstrate that this assumption imposes performance limitations in particular on difficult problem instances. Our results corroborate that an autoregressive approach which captures statistical dependencies among solution variables yields superior performance on many popular CO problems. We introduce subgraph tokenization in which the configuration of a set of solution variables is represented by a single token. This tokenization technique alleviates the drawback of the long sequential sampling procedure which is inherent to autoregressive methods without sacrificing expressivity. Importantly, we theoretically motivate an annealed entropy regularization and show empirically that it is essential for efficient and stable learning.  ( 2 min )
    Empirical Comparison between Cross-Validation and Mutation-Validation in Model Selection. (arXiv:2311.14079v1 [cs.LG])
    Mutation validation (MV) is a recently proposed approach for model selection, garnering significant interest due to its unique characteristics and potential benefits compared to the widely used cross-validation (CV) method. In this study, we empirically compared MV and $k$-fold CV using benchmark and real-world datasets. By employing Bayesian tests, we compared generalization estimates yielding three posterior probabilities: practical equivalence, CV superiority, and MV superiority. We also evaluated the differences in the capacity of the selected models and computational efficiency. We found that both MV and CV select models with practically equivalent generalization performance across various machine learning algorithms and the majority of benchmark datasets. MV exhibited advantages in terms of selecting simpler models and lower computational costs. However, in some cases MV selected overly simplistic models leading to underfitting and showed instability in hyperparameter selection. These limitations of MV became more evident in the evaluation of a real-world neuroscientific task of predicting sex at birth using brain functional connectivity.  ( 2 min )
    Data-driven Prior Learning for Bayesian Optimisation. (arXiv:2311.14653v1 [cs.LG])
    Transfer learning for Bayesian optimisation has generally assumed a strong similarity between optimisation tasks, with at least a subset having similar optimal inputs. This assumption can reduce computational costs, but it is violated in a wide range of optimisation problems where transfer learning may nonetheless be useful. We replace this assumption with a weaker one only requiring the shape of the optimisation landscape to be similar, and analyse the recent method Prior Learning for Bayesian Optimisation - PLeBO - in this setting. By learning priors for the hyperparameters of the Gaussian process surrogate model we can better approximate the underlying function, especially for few function evaluations. We validate the learned priors and compare to a breadth of transfer learning approaches, using synthetic data and a recent air pollution optimisation problem as benchmarks. We show that PLeBO and prior transfer find good inputs in fewer evaluations.  ( 2 min )
    Learning Hierarchical Polynomials with Three-Layer Neural Networks. (arXiv:2311.13774v1 [cs.LG])
    We study the problem of learning hierarchical polynomials over the standard Gaussian distribution with three-layer neural networks. We specifically consider target functions of the form $h = g \circ p$ where $p : \mathbb{R}^d \rightarrow \mathbb{R}$ is a degree $k$ polynomial and $g: \mathbb{R} \rightarrow \mathbb{R}$ is a degree $q$ polynomial. This function class generalizes the single-index model, which corresponds to $k=1$, and is a natural class of functions possessing an underlying hierarchical structure. Our main result shows that for a large subclass of degree $k$ polynomials $p$, a three-layer neural network trained via layerwise gradient descent on the square loss learns the target $h$ up to vanishing test error in $\widetilde{\mathcal{O}}(d^k)$ samples and polynomial time. This is a strict improvement over kernel methods, which require $\widetilde \Theta(d^{kq})$ samples, as well as existing guarantees for two-layer networks, which require the target function to be low-rank. Our result also generalizes prior works on three-layer neural networks, which were restricted to the case of $p$ being a quadratic. When $p$ is indeed a quadratic, we achieve the information-theoretically optimal sample complexity $\widetilde{\mathcal{O}}(d^2)$, which is an improvement over prior work~\citep{nichani2023provable} requiring a sample size of $\widetilde\Theta(d^4)$. Our proof proceeds by showing that during the initial stage of training the network performs feature learning to recover the feature $p$ with $\widetilde{\mathcal{O}}(d^k)$ samples. This work demonstrates the ability of three-layer neural networks to learn complex features and as a result, learn a broad class of hierarchical functions.  ( 3 min )
    More is Better in Modern Machine Learning: when Infinite Overparameterization is Optimal and Overfitting is Obligatory. (arXiv:2311.14646v1 [cs.LG])
    In our era of enormous neural networks, empirical progress has been driven by the philosophy that more is better. Recent deep learning practice has found repeatedly that larger model size, more data, and more computation (resulting in lower training loss) improves performance. In this paper, we give theoretical backing to these empirical observations by showing that these three properties hold in random feature (RF) regression, a class of models equivalent to shallow networks with only the last layer trained. Concretely, we first show that the test risk of RF regression decreases monotonically with both the number of features and the number of samples, provided the ridge penalty is tuned optimally. In particular, this implies that infinite width RF architectures are preferable to those of any finite width. We then proceed to demonstrate that, for a large class of tasks characterized by powerlaw eigenstructure, training to near-zero training loss is obligatory: near-optimal performance can only be achieved when the training error is much smaller than the test error. Grounding our theory in real-world data, we find empirically that standard computer vision tasks with convolutional neural tangent kernels clearly fall into this class. Taken together, our results tell a simple, testable story of the benefits of overparameterization, overfitting, and more data in random feature models.  ( 3 min )
    Learning in Deep Factor Graphs with Gaussian Belief Propagation. (arXiv:2311.14649v1 [cs.LG])
    We propose an approach to do learning in Gaussian factor graphs. We treat all relevant quantities (inputs, outputs, parameters, latents) as random variables in a graphical model, and view both training and prediction as inference problems with different observed nodes. Our experiments show that these problems can be efficiently solved with belief propagation (BP), whose updates are inherently local, presenting exciting opportunities for distributed and asynchronous training. Our approach can be scaled to deep networks and provides a natural means to do continual learning: use the BP-estimated parameter marginals of the current task as parameter priors for the next. On a video denoising task we demonstrate the benefit of learnable parameters over a classical factor graph approach and we show encouraging performance of deep factor graphs for continual image classification on MNIST.  ( 2 min )
    Learning-based Optimal Admission Control in a Single Server Queuing System. (arXiv:2212.11316v2 [math.OC] UPDATED)
    We consider a long-term average profit maximizing admission control problem in an M/M/1 queuing system with unknown service and arrival rates. With a fixed reward collected upon service completion and a cost per unit of time enforced on customers waiting in the queue, a dispatcher decides upon arrivals whether to admit the arriving customer or not based on the full history of observations of the queue-length of the system. (Naor 1969, Econometrica) showed that if all the parameters of the model are known, then it is optimal to use a static threshold policy -- admit if the queue-length is less than a predetermined threshold and otherwise not. We propose a learning-based dispatching algorithm and characterize its regret with respect to optimal dispatch policies for the full information model of Naor (1969). We show that the algorithm achieves an $O(1)$ regret when all optimal thresholds with full information are non-zero, and achieves an $O(\ln^{1+\epsilon}(N))$ regret for any specified $\epsilon>0$, in the case that an optimal threshold with full information is $0$ (i.e., an optimal policy is to reject all arrivals), where $N$ is the number of arrivals.  ( 2 min )
    Gradient-based bilevel optimization for multi-penalty Ridge regression through matrix differential calculus. (arXiv:2311.14182v1 [cs.LG])
    Common regularization algorithms for linear regression, such as LASSO and Ridge regression, rely on a regularization hyperparameter that balances the tradeoff between minimizing the fitting error and the norm of the learned model coefficients. As this hyperparameter is scalar, it can be easily selected via random or grid search optimizing a cross-validation criterion. However, using a scalar hyperparameter limits the algorithm's flexibility and potential for better generalization. In this paper, we address the problem of linear regression with l2-regularization, where a different regularization hyperparameter is associated with each input variable. We optimize these hyperparameters using a gradient-based approach, wherein the gradient of a cross-validation criterion with respect to the regularization hyperparameters is computed analytically through matrix differential calculus. Additionally, we introduce two strategies tailored for sparse model learning problems aiming at reducing the risk of overfitting to the validation data. Numerical examples demonstrate that our multi-hyperparameter regularization approach outperforms LASSO, Ridge, and Elastic Net regression. Moreover, the analytical computation of the gradient proves to be more efficient in terms of computational time compared to automatic differentiation, especially when handling a large number of input variables. Application to the identification of over-parameterized Linear Parameter-Varying models is also presented.  ( 2 min )
    Sample-Efficient Training for Diffusion. (arXiv:2311.13745v1 [cs.LG])
    Score-based diffusion models have become the most popular approach to deep generative modeling of images, largely due to their empirical performance and reliability. Recently, a number of theoretical works \citep{chen2022, Chen2022ImprovedAO, Chenetal23flowode, benton2023linear} have shown that diffusion models can efficiently sample, assuming $L^2$-accurate score estimates. The score-matching objective naturally approximates the true score in $L^2$, but the sample complexity of existing bounds depends \emph{polynomially} on the data radius and desired Wasserstein accuracy. By contrast, the time complexity of sampling is only logarithmic in these parameters. We show that estimating the score in $L^2$ \emph{requires} this polynomial dependence, but that a number of samples that scales polylogarithmically in the Wasserstein accuracy actually do suffice for sampling. We show that with a polylogarithmic number of samples, the ERM of the score-matching objective is $L^2$ accurate on all but a probability $\delta$ fraction of the true distribution, and that this weaker guarantee is sufficient for efficient sampling.  ( 2 min )
    Equivariant flow matching. (arXiv:2306.15030v2 [stat.ML] UPDATED)
    Normalizing flows are a class of deep generative models that are especially interesting for modeling probability distributions in physics, where the exact likelihood of flows allows reweighting to known target energy functions and computing unbiased observables. For instance, Boltzmann generators tackle the long-standing sampling problem in statistical physics by training flows to produce equilibrium samples of many-body systems such as small molecules and proteins. To build effective models for such systems, it is crucial to incorporate the symmetries of the target energy into the model, which can be achieved by equivariant continuous normalizing flows (CNFs). However, CNFs can be computationally expensive to train and generate samples from, which has hampered their scalability and practical application. In this paper, we introduce equivariant flow matching, a new training objective for equivariant CNFs that is based on the recently proposed optimal transport flow matching. Equivariant flow matching exploits the physical symmetries of the target energy for efficient, simulation-free training of equivariant CNFs. We demonstrate the effectiveness of flow matching on rotation and permutation invariant many-particle systems and a small molecule, alanine dipeptide, where for the first time we obtain a Boltzmann generator with significant sampling efficiency without relying on tailored internal coordinate featurization. Our results show that the equivariant flow matching objective yields flows with shorter integration paths, improved sampling efficiency, and higher scalability compared to existing methods.  ( 2 min )
    A Deep Learning Method for Comparing Bayesian Hierarchical Models. (arXiv:2301.11873v4 [stat.ML] UPDATED)
    Bayesian model comparison (BMC) offers a principled approach for assessing the relative merits of competing computational models and propagating uncertainty into model selection decisions. However, BMC is often intractable for the popular class of hierarchical models due to their high-dimensional nested parameter structure. To address this intractability, we propose a deep learning method for performing BMC on any set of hierarchical models which can be instantiated as probabilistic programs. Since our method enables amortized inference, it allows efficient re-estimation of posterior model probabilities and fast performance validation prior to any real-data application. In a series of extensive validation studies, we benchmark the performance of our method against the state-of-the-art bridge sampling method and demonstrate excellent amortized inference across all BMC settings. We then showcase our method by comparing four hierarchical evidence accumulation models that have previously been deemed intractable for BMC due to partly implicit likelihoods. Additionally, we demonstrate how transfer learning can be leveraged to enhance training efficiency. We provide reproducible code for all analyses and an open-source implementation of our method.  ( 2 min )
    Improving Forecasts for Heterogeneous Time Series by "Averaging", with Application to Food Demand Forecast. (arXiv:2306.07119v2 [stat.ME] UPDATED)
    A common forecasting setting in real world applications considers a set of possibly heterogeneous time series of the same domain. Due to different properties of each time series such as length, obtaining forecasts for each individual time series in a straight-forward way is challenging. This paper proposes a general framework utilizing a similarity measure in Dynamic Time Warping to find similar time series to build neighborhoods in a k-Nearest Neighbor fashion, and improve forecasts of possibly simple models by averaging. Several ways of performing the averaging are suggested, and theoretical arguments underline the usefulness of averaging for forecasting. Additionally, diagnostics tools are proposed allowing a deep understanding of the procedure.  ( 2 min )
    GATGPT: A Pre-trained Large Language Model with Graph Attention Network for Spatiotemporal Imputation. (arXiv:2311.14332v1 [cs.LG])
    The analysis of spatiotemporal data is increasingly utilized across diverse domains, including transportation, healthcare, and meteorology. In real-world settings, such data often contain missing elements due to issues like sensor malfunctions and data transmission errors. The objective of spatiotemporal imputation is to estimate these missing values by understanding the inherent spatial and temporal relationships in the observed multivariate time series. Traditionally, spatiotemporal imputation has relied on specific, intricate architectures designed for this purpose, which suffer from limited applicability and high computational complexity. In contrast, our approach integrates pre-trained large language models (LLMs) into spatiotemporal imputation, introducing a groundbreaking framework, GATGPT. This framework merges a graph attention mechanism with LLMs. We maintain most of the LLM parameters unchanged to leverage existing knowledge for learning temporal patterns, while fine-tuning the upper layers tailored to various applications. The graph attention component enhances the LLM's ability to understand spatial relationships. Through tests on three distinct real-world datasets, our innovative approach demonstrates comparable results to established deep learning benchmarks.  ( 2 min )
    Risk Bounds of Accelerated SGD for Overparameterized Linear Regression. (arXiv:2311.14222v1 [cs.LG])
    Accelerated stochastic gradient descent (ASGD) is a workhorse in deep learning and often achieves better generalization performance than SGD. However, existing optimization theory can only explain the faster convergence of ASGD, but cannot explain its better generalization. In this paper, we study the generalization of ASGD for overparameterized linear regression, which is possibly the simplest setting of learning with overparameterization. We establish an instance-dependent excess risk bound for ASGD within each eigen-subspace of the data covariance matrix. Our analysis shows that (i) ASGD outperforms SGD in the subspace of small eigenvalues, exhibiting a faster rate of exponential decay for bias error, while in the subspace of large eigenvalues, its bias error decays slower than SGD; and (ii) the variance error of ASGD is always larger than that of SGD. Our result suggests that ASGD can outperform SGD when the difference between the initialization and the true weight vector is mostly confined to the subspace of small eigenvalues. Additionally, when our analysis is specialized to linear regression in the strongly convex setting, it yields a tighter bound for bias error than the best-known result.  ( 2 min )
    Touring sampling with pushforward maps. (arXiv:2311.13845v1 [cs.LG])
    The number of sampling methods could be daunting for a practitioner looking to cast powerful machine learning methods to their specific problem. This paper takes a theoretical stance to review and organize many sampling approaches in the ``generative modeling'' setting, where one wants to generate new data that are similar to some training examples. By revealing links between existing methods, it might prove useful to overcome some of the current challenges in sampling with diffusion models, such as long inference time due to diffusion simulation, or the lack of diversity in generated samples.  ( 2 min )
    Assumption-lean and Data-adaptive Post-Prediction Inference. (arXiv:2311.14220v1 [stat.ME])
    A primary challenge facing modern scientific research is the limited availability of gold-standard data which can be both costly and labor-intensive to obtain. With the rapid development of machine learning (ML), scientists have relied on ML algorithms to predict these gold-standard outcomes with easily obtained covariates. However, these predicted outcomes are often used directly in subsequent statistical analyses, ignoring imprecision and heterogeneity introduced by the prediction procedure. This will likely result in false positive findings and invalid scientific conclusions. In this work, we introduce an assumption-lean and data-adaptive Post-Prediction Inference (POP-Inf) procedure that allows valid and powerful inference based on ML-predicted outcomes. Its "assumption-lean" property guarantees reliable statistical inference without assumptions on the ML-prediction, for a wide range of statistical quantities. Its "data-adaptive'" feature guarantees an efficiency gain over existing post-prediction inference methods, regardless of the accuracy of ML-prediction. We demonstrate the superiority and applicability of our method through simulations and large-scale genomic data.  ( 2 min )
    High-Order Tensor Recovery with A Tensor $U_1$ Norm. (arXiv:2311.13958v1 [stat.ML])
    Recently, numerous tensor SVD (t-SVD)-based tensor recovery methods have emerged, showing promise in processing visual data. However, these methods often suffer from performance degradation when confronted with high-order tensor data exhibiting non-smooth changes, commonly observed in real-world scenarios but ignored by the traditional t-SVD-based methods. Our objective in this study is to provide an effective tensor recovery technique for handling non-smooth changes in tensor data and efficiently explore the correlations of high-order tensor data across its various dimensions without introducing numerous variables and weights. To this end, we introduce a new tensor decomposition and a new tensor norm called the Tensor $U_1$ norm. We utilize these novel techniques in solving the problem of high-order tensor completion problem and provide theoretical guarantees for the exact recovery of the resulting tensor completion models. An optimization algorithm is proposed to solve the resulting tensor completion model iteratively by combining the proximal algorithm with the Alternating Direction Method of Multipliers. Theoretical analysis showed the convergence of the algorithm to the Karush-Kuhn-Tucker (KKT) point of the optimization problem. Numerical experiments demonstrated the effectiveness of the proposed method in high-order tensor completion, especially for tensor data with non-smooth changes.  ( 2 min )
    Thompson sampling for zero-inflated count outcomes with an application to the Drink Less mobile health study. (arXiv:2311.14359v1 [stat.ML])
    Mobile health (mHealth) technologies aim to improve distal outcomes, such as clinical conditions, by optimizing proximal outcomes through just-in-time adaptive interventions. Contextual bandits provide a suitable framework for customizing such interventions according to individual time-varying contexts, intending to maximize cumulative proximal outcomes. However, unique challenges such as modeling count outcomes within bandit frameworks have hindered the widespread application of contextual bandits to mHealth studies. The current work addresses this challenge by leveraging count data models into online decision-making approaches. Specifically, we combine four common offline count data models (Poisson, negative binomial, zero-inflated Poisson, and zero-inflated negative binomial regressions) with Thompson sampling, a popular contextual bandit algorithm. The proposed algorithms are motivated by and evaluated on a real dataset from the Drink Less trial, where they are shown to improve user engagement with the mHealth system. The proposed methods are further evaluated on simulated data, achieving improvement in maximizing cumulative proximal outcomes over existing algorithms. Theoretical results on regret bounds are also derived. A user-friendly R package countts that implements the proposed methods for assessing contextual bandit algorithms is made publicly available at https://cran.r-project.org/web/packages/countts.  ( 2 min )
    Analysis of the expected $L_2$ error of an over-parametrized deep neural network estimate learned by gradient descent without regularization. (arXiv:2311.14609v1 [stat.ML])
    Recent results show that estimates defined by over-parametrized deep neural networks learned by applying gradient descent to a regularized empirical $L_2$ risk are universally consistent and achieve good rates of convergence. In this paper, we show that the regularization term is not necessary to obtain similar results. In the case of a suitably chosen initialization of the network, a suitable number of gradient descent steps, and a suitable step size we show that an estimate without a regularization term is universally consistent for bounded predictor variables. Additionally, we show that if the regression function is H\"older smooth with H\"older exponent $1/2 \leq p \leq 1$, the $L_2$ error converges to zero with a convergence rate of approximately $n^{-1/(1+d)}$. Furthermore, in case of an interaction model, where the regression function consists of a sum of H\"older smooth functions with $d^*$ components, a rate of convergence is derived which does not depend on the input dimension $d$.  ( 2 min )

  • Open

    An image of Israel Kamakawiwoʻole shows Google search still can't tell AI-generated pictures apart from genuine ones
    submitted by /u/thisisinsider [link] [comments]
    Is there an AI that lets you generate whole fonts/typography?
    With all the advancements in AI this year, I'd be surprised if there isn't yet. submitted by /u/Venvonis [link] [comments]
    Best Midjourney Prompts for Product Photography
    submitted by /u/Senior_tasteey [link] [comments]
    I have a friend in the medical field who doesn’t think ai will ever be able to do any task a human can and then better, so far in trying to construct a reasoning argument that doesn’t go with any assumptions plz help, so far I have:
    The brain doesn’t seem to do any magic and is entirely made of matter and follows the laws of physics. With enough computational power you can simulate any real world system. The human body is a real world system. Given that, with enough computational power you can simulate the entire human body. Given that, you can simulate a entire human brain organelle with enough computational power. 6.Given that you now have a system in place where you have a computer that can do any mental task a human can do if it’s simulating the entire brain. With enough computational power you can speed up a simulation. If you agree to all this then you now have a computer which can do any task a human can do but faster without doing any magic. -This isn’t saying that simulating a brain is the way to AGI but it’s just to create a logical proof to show that AGI is possible without creating any new magical tech. submitted by /u/Impossible_Belt_7757 [link] [comments]
    Is AI Alignable, Even in Principle?
    The article discusses the AI alignment problem and the risks associated with advanced artificial intelligence. It mentions an open letter signed by AI and computer pioneers calling for a pause in training AI systems more powerful than GPT-4. The article explores the challenges of aligning AI behavior with user goals and the dangers of deep neural networks. It presents different assessments of the existential risk posed by unaligned AI, ranging from 2% to 90%. Source : https://treeofwoe.substack.com/p/is-ai-alignable-even-in-principle submitted by /u/NuseAI [link] [comments]
    Pentagon to Deploy AI-Enabled Vehicles by 2026 Amid China's Tech Advancements: Shifting Human Roles Emerge
    submitted by /u/Upbeat-Interaction13 [link] [comments]
    Does anyone know of a free program where I can upload a photo and get a ai generated description?
    Something like a caption generator submitted by /u/Cantthinkof1usethis [link] [comments]
    Subtitle translation
    Hello! Recently got a job offer for a video editor position! I would be doing video edits for clients that want to push shorts on social media. The way everything is structured fits perfectly with my study schedule, but there is one problem. Most of the videos will be in Dutch (a language I do not speak) and they will require subtitle in Dutch too. Its awesome that Premiere has the subtitle AI, but its not always accurate and I’d need to fix subtitle mismatches. I am looking for a way around it and thinking that maybe translating the transcript on the internet and checking possible mistakes would be a way to do it. Maybe there are more elaborate ways, perhaps using AI that would do audio to text translation. Any suggestions? submitted by /u/Jakabin [link] [comments]
    AI teacher/turors, how long do you think?
    It feels like an obvious use case which is quite possible already even with what we have. Has anyone seen a product things that's working well or combination of tools? Surely the test cases are already starting with either someone powering through high education or a child is well exceeding their peers. I saw this blog post about a parent using spaced repetition with Anki and it's incredible just to see the effectiveness. I suspect with a teacher that can connect a narrative that clearly explains any topic in a story form should be incredibly productive but have not seen any stories just yet. submitted by /u/blakefolgado [link] [comments]
    Real-Life Halo Adventure: A Spartan and AI Companion in Action!
    submitted by /u/adamariefox [link] [comments]
    UK school pupils ‘using AI to create indecent imagery of other children’
    submitted by /u/Jariiari7 [link] [comments]
    "AGI Revolution: How Businesses, Governments, and Individuals can Prepare" | David Shapiro
    submitted by /u/Tao_Dragon [link] [comments]
    When Marvel got a very good canteen
    submitted by /u/Sea_Permit5660 [link] [comments]
    "AI Unveiled: From Everyday Smarts to Superintelligence"
    submitted by /u/Fit-Code-5141 [link] [comments]
    One-Minute Daily AI News 11/26/2023
    Check Out This New AI System Called Student of Games (SoG) that is capable of both Beating Humans at a Variety of Games and Learning to Play New Ones.[1] Researchers from the University of Chicago Introduce 3D Paintbrush: A AI Method for Generating Local Stylized Textures on Meshes Using Text as Input.[2] The United States, Britain and more than a dozen other countries on Sunday unveiled what a senior U.S. official described as the first detailed international agreement on how to keep artificial intelligence safe from rogue actors, pushing for companies to create AI systems that are “secure by design.”[3] TikTok parent ByteDance has launched the latest iteration of its online collaboration tool Feishu, now infused with artificial intelligence (AI) to help Chinese clients with workplace automation, as Chinese tech giants continue to push for more industrial use of generative AI.[4] Sources: [1] https://www.marktechpost.com/2023/11/25/check-out-this-new-ai-system-called-student-of-games-sog-that-is-capable-of-both-beating-humans-at-a-variety-of-games-and-learning-to-play-new-ones/ [2] https://www.marktechpost.com/2023/11/26/researchers-from-the-university-of-chicago-introduce-3d-paintbrush-a-ai-method-for-generating-local-stylized-textures-on-meshes-using-text-as-input/ [3] https://www.reuters.com/technology/us-britain-other-countries-ink-agreement-make-ai-secure-by-design-2023-11-27/ [4] https://www.scmp.com/tech/big-tech/article/3242571/bytedance-adds-ai-assistant-office-tool-feishu-joining-rivals-tencent-and-alibaba-workplace-chatbot submitted by /u/Excellent-Target-847 [link] [comments]
    Self Attention: What exactly makes it so incredibly powerful
    Hello people! Sharing a video from my YT channel about a breakdown of Self Attention, and why exactly it has been such a game changer in deep learning the past few years! Link below for those interested: https://youtu.be/4naXLhVfeho submitted by /u/AvvYaa [link] [comments]
    I didn't know this...even virtual artists concerts...this is crazy... (but...these AI girls are freaking HOT... x)
    submitted by /u/the_anonymizer [link] [comments]  ( 1 min )
  • Open

    Introducing three new NVIDIA GPU-based Amazon EC2 instances
    Amazon Elastic Compute Cloud (Amazon EC2) accelerated computing portfolio offers the broadest choice of accelerators to power your artificial intelligence (AI), machine learning (ML), graphics, and high performance computing (HPC) workloads. We are excited to announce the expansion of this portfolio with three new instances featuring the latest NVIDIA GPUs: Amazon EC2 P5e instances powered […]  ( 4 min )
    Boost inference performance for LLMs with new Amazon SageMaker containers
    Today, Amazon SageMaker launches a new version (0.25.0) of Large Model Inference (LMI) Deep Learning Containers (DLCs) and adds support for NVIDIA’s TensorRT-LLM Library. With these upgrades, you can effortlessly access state-of-the-art tooling to optimize large language models (LLMs) on SageMaker and achieve price-performance benefits – Amazon SageMaker LMI TensorRT-LLM DLC reduces latency by 33% […]  ( 9 min )
    Simplify data prep for generative AI with Amazon SageMaker Data Wrangler
    Generative artificial intelligence (generative AI) models have demonstrated impressive capabilities in generating high-quality text, images, and other content. However, these models require massive amounts of clean, structured training data to reach their full potential. Most real-world data exists in unstructured formats like PDFs, which requires preprocessing before it can be used effectively. According to IDC, […]  ( 10 min )
    Democratize ML on Salesforce Data Cloud with no-code Amazon SageMaker Canvas
    This post is co-authored by Daryl Martis, Director of Product, Salesforce Einstein AI. This is the third post in a series discussing the integration of Salesforce Data Cloud and Amazon SageMaker. In Part 1 and Part 2, we show how the Salesforce Data Cloud and Einstein Studio integration with SageMaker allows businesses to access their […]  ( 12 min )
    AWS AI services enhanced with FM-powered capabilities
    Artificial intelligence (AI) continues to transform how we do business and serve our customers. AWS offers a range of pre-trained AI services that provide ready-to-use intelligence for your applications. In this post, we explore the new AI service capabilities and how they are enhanced using foundation models (FMs). We focus on the following major updates […]  ( 7 min )
    Elevate your self-service assistants with new generative AI features in Amazon Lex
    In this post, we talk about how generative AI is changing the conversational AI industry by providing new customer and bot builder experiences, and the new features in Amazon Lex that take advantage of these advances. As the demand for conversational AI continues to grow, developers are seeking ways to enhance their chatbots with human-like […]  ( 7 min )
  • Open

    [D] Is this normal or I'm just at the wrong company?
    I work as a Junior Python Dev. at a company for over a year now, almost a year and a half, I'm based in UK, London. Since they hired me, I've been working alone on a project, an AI/Data Science project, using technologies I've never used before, in an area I've never worked on. I was given the task to develop it, with no specific deadline but to make progress. I thought, okay, great, I'll have something new to learn. It was good for a while, it worked for a few months, and then they assigned me another project to work on. This another project is also full of new technologies I've never used before. So, I'm working on two projects, and I faced more and more challenges because, as I mentioned, many things were new and still are. I asked for help several times, but fundamentally, I never rec…  ( 10 min )
    [D] Question on usability of ChatGPT input as training data
    Hi everyone, now that ChatGPT got so much users, there should be plenty of training data (hopefully only the consented data). I am wondering how much of these inputs can actually be beneficial to utilize as training data. Take forums like reddit for example... if you train on it, you get questions and answers, some more specific some less specific, some even wrong, but given enough data (and also a lot of wikipedia, books and papers, etc of course) the best answer apparently averages out. But ChatGPT input is only questions... should be a problem to train on this, I suppose? One way out, could be: the OpenAI team curates the dataset in a way that flags happy users, after a ChatGPT answer (e.g. with sentiment analysis). This could at least enhance the human reinforcement step. submitted by /u/balaena7 [link] [comments]  ( 9 min )
    [D] RX 6700 or 7700 XT for LLM?
    Just finished testing a bit on an Instinct MI100 bare metal server, wanted to purchase some AMD hardware to have the possibility testing more easily locally the needed specific setup, before returning on more powerful cloud servers for real work. Due even for 6b models 8GB memory is often not enough was interested to the RX 6700 XT as cheapest 12GB VRAM option, but RDNA 3 seems to have introduced some specific Ai acceleration hardware. Will make a huge difference choosing one or the other? submitted by /u/davide445 [link] [comments]  ( 9 min )
    [R] Distort, Distract, Decode: Instruction-Tuned Model Can Refine its Response from Noisy Instructions
    Link: https://openreview.net/forum?id=IqJ3CU3flr Abstract: While instruction-tuned language models have demonstrated impressive zero-shot generalization, these models often struggle to generate accurate responses when faced with instructions that fall outside their training set. This paper presents Instructive Decoding (ID), a simple yet effective approach that augments the efficacy of instruction-tuned models. Specifically, ID adjusts the logits for nexttoken prediction in a contrastive manner, utilizing predictions generated from a manipulated version of the original instruction, referred to as a noisy instruction. This noisy instruction aims to elicit responses that could diverge from the intended instruction yet remain plausible. We conduct experiments across a spectrum of such noisy instructions, ranging from those that insert semantic noise via random words to others like ‘opposite’ that elicit the deviated responses. Our approach achieves considerable performance gains across various instruction-tuned models and tasks without necessitating any additional parameter updates. Notably, utilizing ‘opposite’ as the noisy instruction in ID, which exhibits the maximum divergence from the original instruction, consistently produces the most significant performance gains across multiple models and tasks. ​ submitted by /u/Queasy_Ad_6423 [link] [comments]  ( 9 min )
    [R] LLMs for structured data?
    I've been trying to work with structured data in language models, and it's proving to be quite challenging. I'm confident that with Langchain, I should be able to solve the problem, but I'm not entirely sure which path to take among all the options the library offers. My issue is as follows: I have data in the form of dictionaries regarding a series of products, for example, laptops. The data looks like this: {Identifier 1: X, Identifier 2: Y, Value name: Z} (Several successive dictionaries like this.) I want to use this series of dictionaries as context, then feed a different dictionary into the Language Model, and have it tell me if the 'Value name' makes sense given Identifiers 1 and 2. An example would be Identifier 1: laptops, Identifier 2: brand, Value name: Lenovo. In this case, it should return affirmative since Lenovo makes sense as a brand. However, if I input 'oranges,' it should return negative. Any ideas on which library I could use to tackle this problem? submitted by /u/cosapocha [link] [comments]  ( 9 min )
    [R] Automatic hallucination detection experiments using SelfCheckGPT NLI on WikiBio
    Hi everyone, We have recently written an article on HF’s blog on automatic hallucination detection using inconsistency scoring. The main idea is that hallucinations happen because the task asked at inference is not seen in the training set, which implies low confidence in the next token, therefore, inconsistent samples from the same prompt (https://arxiv.org/abs/2309.13638). We look at the use of SelfCheckGPT NLI (https://arxiv.org/abs/2303.08896), an example of inconsistency scoring, on WikiBio and found that such a metric has high precision (aka flagged hallucinations indeed are ones) and calibrated recall (high scores = high chance of flagging hallucinations). ​ https://preview.redd.it/dtptqylv3y2c1.png?width=1189&format=png&auto=webp&s=db623d58e6b24f2f7eee57ff41115f544ee957be This is quite promising as it could open the way to having AI systems that are more reliable, aka when the task is easy, we let the AI do it. When we detect it’s too hard and the model is hallucinating, we put a human in the loop. ​ https://i.redd.it/w7p9ciow3y2c1.gif We have provided: An article on HF Blog: https://huggingface.co/blog/dhuynh95/automatic-hallucination-detection A Gradio demo to see the metric in action: https://huggingface.co/spaces/mithril-security/hallucination_detector A Colab notebook to reproduce our results: https://colab.research.google.com/drive/1Qhq2FO4FFX_MKN5IEgia_PrBEttxCQG4?usp=sharing We conducted these tests as part of our mission to build Confidential and Trustworthy Conversational AI. You can check out our core project, BlindChat, an open-source and Confidential Conversational AI (aka any data sent to our AI remains private, and not even our admins can see your prompts) at https://github.com/mithril-security/blind_chat/ submitted by /u/Separate-Still3770 [link] [comments]  ( 9 min )
    [D] Any text to speech AI model still being updated?
    weird question I suppose, I've tried Bark (works but it hasn't been updated in a while), Tortoise and Tortoise-Fast (both don't even install properly, and haven't been updated in a while too)... Is there any Ai text to speech model still being updated? submitted by /u/The--Nameless--One [link] [comments]  ( 9 min )
    [D] AISTATS 2024 Paper Reviews
    AISTATS 2024 paper reviews are supposed to be out today. I thought to create a discussion thread for us to discuss any issue/complain/celebration or anything else. There is so much noise in the reviews every year. Some good work that the authors are proud of might get a low score because of the noisy system, given that AISTATS is growing large these years. We should keep in mind that the work is still valuable no matter what the score is. submitted by /u/zy415 [link] [comments]  ( 9 min )
    [P] High-scale AI app backend reference architecture
    Reference architecture to deploy high-scale AI applications: uses a Pinecone index as a vector store, a queue to fan out work, networking and security groups to separate the infrastructure, ECS services for the frontend and backend microservices, and autoscaling configured to scale the worker pool. GitHub repo Blog post submitted by /u/kao-pulumi [link] [comments]  ( 9 min )
    [D] Efficient ways to achieve Multi modality in LLMs
    What is the best approach to achieve multi modality using a instruct fine tuned model? submitted by /u/Infamous-Belt8671 [link] [comments]  ( 9 min )
    [D] PoSE for extending context window of LLMs ?
    Has anybody tried PoSE for extending context window of Mistral 7B and how was the results? submitted by /u/Infamous-Belt8671 [link] [comments]  ( 9 min )
    [D] Do you obsessively watch your models train?
    I find myself watching tensorboard more than working- just wondering if others who have fallen into this pattern have words of advice wrt productivity submitted by /u/TehDing [link] [comments]  ( 9 min )
    [R] Adversarial Attacks and Generalization in Deep Reinforcement Learning
    https://blogs.ucl.ac.uk/steapp/2023/11/15/adversarial-attacks-robustness-and-generalization-in-deep-reinforcement-learning/ submitted by /u/ml_dnn [link] [comments]  ( 9 min )
    [R] System 2 Attention (is something you might need too)
    Paper: https://arxiv.org/abs/2311.11829 Abstract: Soft attention in Transformer-based Large Language Models (LLMs) is susceptible to incorporating irrelevant information from the context into its latent representations, which adversely affects next token generations. To help rectify these issues, we introduce System 2 Attention (S2A), which leverages the ability of LLMs to reason in natural language and follow instructions in order to decide what to attend to. S2A regenerates the input context to only include the relevant portions, before attending to the regenerated context to elicit the final response. In experiments, S2A outperforms standard attention-based LLMs on three tasks containing opinion or irrelevant information, QA, math word problems and longform generation, where S2A increases factuality and objectivity, and decreases sycophancy. ​ submitted by /u/APaperADay [link] [comments]  ( 9 min )
    [R] AI-Generated Images Introduce Invisible Relevance Bias to Text-Image Retrieval
    AI-Generated Images Introduce Invisible Relevance Bias to Text-Image Retrieval [arxiv] https://arxiv.org/pdf/2311.14084.pdf Hey Reddit Community, We extend the research on source bias to multimodality and we are thrilled to share interesting findings from our latest research on AIGC and text-image retrieval models. Here's a quick rundown: (1) Unveiling the Invisible Relevance Bias: Our study discovered that text-image retrieval models tend to assign higher rankings to AI-generated images, even though they do not show any more visually relevant information to the query compared to real images. We've termed this phenomenon "invisible relevance bias." (2) More Serious Bias Caused by Training: Further exploration revealed that when AI-generated images are incorporated into the training dat…  ( 10 min )
    [D] Automated approaches to data quality in documents and reports.
    I was recently discussing with colleagues possible techniques to automatically handle checks in report and documents to ensure data quality. It seems that there are many clients that require such automation, but the problems are hardly well defined and it seems to me that there is little Machine Learning can do, and an Expert System with hand-crafted rules would be better. Are there Machine Learning techniques that would be particolarly well-suited to the task? A couple of examples: * Check that reports based on data on a DB are "correct" and consistent with the data on the DB * Check that reports based on data on a DB are consistent with other reports based on the same data * Check that groups of documents sent via email are consistent with each-other (e.g., if the documents relate to a transaction, that the price is the same in all documents) submitted by /u/TheSoulsBorne [link] [comments]  ( 9 min )
    [D] How to make AI model infernce faster?
    Hi everyone, I am currently trying to serve some LM models like GPT2 and some LLMs like llama or bloomz. I tried converting GPT2 to TensorRT (based on Nvidia's documentation) and serving with Triton server. The output is also quite good. However, if you want to continue to optimize, will there be any more methods to perform optimization? And I'm also quite curious how to make the LLM model able to serve many users at the same time. Is there any architecture or libraries to use to do that? submitted by /u/HughLee_1999 [link] [comments]  ( 9 min )
    [D] Target variable selection process.
    How to select a target variable for Propensity (for a product/service purchase) prediction model? Should I have performance window and observation window? Should I take only the incremental purchases happened in my observation window? Or should I build a look alike model instead? submitted by /u/prsnasmiles [link] [comments]  ( 9 min )
    [D] Creating an Automated UI Controller with GPT-4 Vision & Agents
    Hey, Last weekend, I managed to merge GPT-4 Vision with another GPT-4 and a device controller to work as a AutoGPT equivalent using AutoGen. Good thing is, it is not limited to Browser. It can work on any UI window. Let me know what you guys think and what can be done better. Demo and approach available at: https://medium.com/@gitlostmurali/creating-an-automated-ui-controller-with-gpt-agents-35340759d08b?sk=98d85484bd7e5e554f48a801728bfb68 I'll update the repository soon -> https://github.com/gitlost-murali/grounded-gpt-agents submitted by /u/Outlandish_MurMan [link] [comments]  ( 9 min )
    [P] Orca-2-13B Runs Directly on Rust+WASM – No Python/C++ Hassles
    submitted by /u/smileymileycoin [link] [comments]  ( 9 min )
    Is there an interest in resurrecting technical discussions of the latest research? [D]
    Technical discussions of new research seem to have mostly disappeared in this subreddit, because researchers became a small fraction of its immense readership of 3e6 members. So I created a subreddit to host such discussions. A "safe space" for researchers, if you will, with strict standards for content. I seeded it with posts about a few recent papers I thought were interesting and my own takes on them, to get the discussion started. But then I said to myself: "You don't have time to manage a subreddit. WTF are you doing?" and deleted it all. Nevertheless, I'd like to see someone else, perhaps someone with more time, try to do it. submitted by /u/we_are_mammals [link] [comments]  ( 9 min )
    [R]eading List for Andrej Karpathy’s “Busy person’s intro to Large Language Models” Video
    I loved Andrej’s talk about in his “Busy person’s intro to Large Language Models” video, so I decided to create a reading list to dive in deeper to a lot of the topics. I feel like he did a great job of describing the state of the art for anyone from an ML Researcher to any engineer who is interested in learning more. The full talk can be found here: https://youtu.be/zjkBMFhNj_g?si=fPvPyOVmV-FCTFEx Here’s the reading list: https://blog.oxen.ai/reading-list-for-andrej-karpathys-intro-to-large-language-models-video/ Let me know if you have any other papers you would add! submitted by /u/FallMindless3563 [link] [comments]  ( 9 min )
    [D] Any future in ML for someone with a psychology degree and interest in health?
    I have a psychology degree and an interest in overall health (mental, physiological, neurological etc). I also enjoy and find technology interesting as well with an overall interest in automation/making lives easier- in a health sense So far, I am considering to learn a programming language on my own and head back to university for some ML classes and linear alge bra/stats. However, I also like tech and leadership. I am trying to navigate through what I want as a career. I heard about machine learning/AI and thought there may be an intersection where I could use my interests in health and incorporate them with tech. Perhaps program AI that would aid specific health issues? Of course, I am looking for something that pays well and has a good environment. ​ If you all could be kind enough to pitch in your thoughts and help with my brainstorming :) submitted by /u/bigbentower [link] [comments]  ( 1 min )
  • Open

    Looking for career advice.
    Hello everyone i have been interested in machine learning for the past 3 years with most of my focus being on Supervised learning , however in the last 3 months RL has caught my eye and i am convinced that the next big thing in AI will be from the field. I am interested in getting via academia as i only have a BSc in CS and wont get a job because I am in Zimbabwe and we aren't there yet in terms of tech. I applied to do my PhD in USA but the rejections have been coming thick and fast so I will likely end up going to China on scholarship. I would like some advice because ultimately I would like to work in the west in R&D in big companies. If you could please tell me what I could do during my masters in China to bring me closer to this goal once I graduate in 2026/27. PS: I also did my BSc in China. submitted by /u/congo43 [link] [comments]
    Multi-head DQN
    Hello everyone, I’m applying DQN to select at each time a set of elements (one or more at time). How can I avoid the action [0, 0, 0,…], that is how to force the agent to select as least one element? submitted by /u/GuavaAgreeable208 [link] [comments]
    Deterministic parameter always output the same action (PPO)
    I'm using PPO from SB3 with action masking to train an environment with the following hyperparameters. in the training, the model seems to learn taking different actions. However, in the testing phase, the model seems to do the same and behave normally, except when I use "deterministic=True", then the model just chooses one single action all the time. My code: action, _states = model.predict(obs , action_masks=env.valid_action_mask() , deterministic=True) initial_learning_rate = 0.00005 model = MaskablePPO(MaskableActorCriticPolicy, env, tensorboard_log="./tensorboard" ,n_steps=2048 , learning_rate=initial_learning_rate) # Create the callback to adjust the learning rate based on rewards callback = RewardBasedLearningRateSchedule() for i in range (1,202): model.learn(total_timesteps=TIMESTEPS , tb_log_name = 'PPO2' , reset_num_timesteps=False, callback=callback) model.save(f"{models_dir}/{TIMESTEPS*i}") ​ Output from the tensorborad: https://preview.redd.it/cpdvwhxnkx2c1.png?width=1652&format=png&auto=webp&s=a6d3bdef2f5040ce935a6534b66de2d934029af6 submitted by /u/Acceptable_Egg6552 [link] [comments]  ( 1 min )
    Mujoco 3.0 vs Isaac Gym
    Hello, For those who have tried and are conversant with both Mujoco 3.0 and Isaac Gym, which one would use advise someone to learn and why? submitted by /u/anointedninja [link] [comments]
    DQN for image classification (Alzheimer)
    Hello, I'm working on my research im using 2D MRI scans. There are 4 classes. i want to create a DQN that can do classification task. Can anyone help me in this?? submitted by /u/armaghanbz [link] [comments]  ( 1 min )
    How did `Open AI Gym` keep track of steps exceeding 500 in the CartPole environment?
    I am looking at the `CartPole` environment over here and I don't see how the `step` function (or any other function) takes care to ensure the agent doesn't cross 500 steps - ``` def step(self, action): err_msg = f"{action!r} ({type(action)}) invalid" assert self.action_space.contains(action), err_msg assert self.state is not None, "Call reset before using step method." x, x_dot, theta, theta_dot = self.state force = self.force_mag if action == 1 else -self.force_mag costheta = math.cos(theta) sintheta = math.sin(theta) ​ # For the interested reader: # https://coneural.org/florian/papers/05_cart_pole.pdf temp = ( force + self.polemass_length * theta_dot**2 * sintheta ) / self.total_mass thetaacc = (self.gravity * sintheta - costheta * temp) / ( self.length * (4.0 / 3.0 - s…  ( 1 min )
    stable baselines rollout/ep_rew_mean
    I'm trying to work on something that requires me to be able to obtain the ep_rew_mean values which are printed into the command prompt every 4 episodes. But I'm just not able to figure out where these values are stored so that I can access them in my callback function. I am accessing the last episode rewards from the "ep_info_buffer", but that is very different from the ep_rew_mean which is what I need. I looked into the logger, OffPolicyAlgorithm, and SAC files for this but still couldn't find it. Where is this information being stored before being printed, and how can I access this? submitted by /u/aliaslight [link] [comments]
    Help in debugging TD3 for (4 suspended cables) cable robot target reaching task
    Hello everyone, I am working on a RL controller for cable robot to learn a target reaching task. I have a unity simulation for the cable robots with flexible cables using obi rope package. I’ve been struggling with the agent to learn a proper policy. I would be very much interested in forming a collaboration towards resolving the issues. Any successful results out of this will be published jointly. Anyone who is interested and willing, please contact me on rohit.dhakate@aau.at to get more details on the project and current status. Kind Regards. submitted by /u/fx619 [link] [comments]  ( 1 min )
    Policy Gradient algorithms without critic model
    is it possible to implement a policy gradient algorithm like NPG, PPO with out the actor-critic or will it destroy the current faster convergence in these algorithms? if i wanted to do so from where can I start? submitted by /u/eles0range [link] [comments]  ( 1 min )
    Q*: Reinforcement Learning for Process Supervision of Multi-Step Reasoning
    So can we regular people try to imitate the approach taken by Q*(Q-star), which purportedly uses Reinforcement Learning to do Process Supervision for Multi-Step Reasoning (aka. "Chain of Thought")? https://openai.com/research/improving-mathematical-reasoning-with-process-supervision https://arxiv.org/abs/2305.20050 Would it be possible to take the Reinforcement Learning described above and apply it as transfer learning onto a Small Language Model like Orca2? If so, then what is the best way to do it? What code examples could I look up that might help me? submitted by /u/san__man [link] [comments]  ( 1 min )
  • Open

    GPT-4’s potential in shaping the future of radiology
    This research paper is being presented at the 2023 Conference on Empirical Methods in Natural Language Processing (opens in new tab) (EMNLP 2023), the premier conference on natural language processing and artificial intelligence. In recent years, AI has been increasingly integrated into healthcare, bringing about new areas of focus and priority, such as diagnostics, treatment […] The post GPT-4’s potential in shaping the future of radiology appeared first on Microsoft Research.  ( 10 min )
  • Open

    Unix linguistics
    If you knew that you wanted to learn 10 spoken languages, it would probably be helpful to take a course in linguistics first. Or maybe to have a linguistics course after learning your first or second language. And if the languages are related, it would help to know something about the linguistics of that group […] Unix linguistics first appeared on John D. Cook.  ( 6 min )
  • Open

    New method uses crowdsourced feedback to help train robots
    Human Guided Exploration (HuGE) enables AI agents to learn quickly with some help from humans, even if the humans make mistakes.  ( 11 min )

  • Open

    [D] "word on the street is that q* proved p==np, and the board drama was only a decoy.."
    submitted by /u/Sk1leR [link] [comments]  ( 1 min )
    [D] GANs do not have Density estimation abilities
    Hello. I am seeing the GAN specialization course on Coursera by instructor Sharon Zhou. In one of the lectures, she says that a disadvantage of GANs is that it cannot do density estimation and thus is not useful for anomaly detection. (not sure if you can access it). In the next video, she says that VAEs don't have this problem. I am a little confused about this. Could anyone please explain what she means? As far as I can understand from the lecture, density estimation means learning how probable/frequent particular features are in a dataset. Like, how probable is it that a dog will have droopy ears. Then, we can use this info to detect anomalies if they do not exhibit these features. But, isn't this exactly what GANs learn? aren't GANs learning to mimic the distribution of the training data? Also, how is a VAE different in this particular regard? Could someone please help explain this? Thank you. submitted by /u/racc15 [link] [comments]  ( 9 min )
    [D]In transformer models, why is there a query and key matrix instead of just the product?
    The only time that the query and key matrices are used is to compute the attention scores. That is $v_i^T \cdot W_q^T W_k v_j$ But what is used is the matrix $W_q^T W_k$. Why not just replace $W_q^T W_k$ with a single matrix $W_{qv}$, and learn the matrix that is the product of W_q^T W_k instead of the matrices themselves? How does it help to have two matrices instead of one? And if it helps, why is that not done when applying matrices between neuron layers? Chatgpt tells me that the reason is that it allows the model to learn a different representation for the query and key. But because they are just dotted together, it seems to me that you can just use the original embedding as the query with no loss of generality. submitted by /u/lildaemon [link] [comments]  ( 9 min )
    [D] What are the advantages of GANs over Diffusion Models in image generation?
    Diffusion Models have recently gained popularity in the field of image generation, with widely used products such as Stable Diffusion employing this approach and yielding impressive results. While GANs are also recognized for their efficiency, in what scenarios do I need to choose GANs over Diffusion Models and do GANs have any advantages compared to Diffusion Models in image generation? Here are a few reasons I can think of: Diffusion Models take more time and larger datasets to train. To train a Diffusion Model project, one must have substantial computational resources (a lot of GPUs), compared to GANs. The codebases of some popular Diffusion Models projects are not open source. I don't know if these are correct. As for the mathematical aspect, I'm not an expert in that area. submitted by /u/Electronic_Ant2706 [link] [comments]  ( 9 min )
    [D] Looking for keywords or relevant papers regarding encoding Motion/Movement Data into ML/DL Models.
    I am looking for any relevant research areas or any ideas that people that in regards to encoding movement data for predictions. I am looking at doing some work regarding predicting collisions on Autonomous Objects given Velocity Position and Acceleration Data. submitted by /u/ShadeSlayer_19 [link] [comments]  ( 1 min )
    [P] Before the impending flood of GPT AI's and directories in the coming weeks, things will likely get a bit oversaturated and confusing. Here's my own directory for bookmarking.
    I'm Shane Lindemoen, award-winning author and half-cooked data scientist (never finished school!), and I'm pretty stressed out sharing something I've been working hard to complete: a collection of unique AI's, designed to redefine the way we interact with AI. If you've been following along like me, you've likely encountered various GPTs and directories cropping up online, but this is pretty different -- I’m confident you’ll recognize why when you talk to them. These AI’s are not rebranded ChatGPT clones with clever titles; they are the result of a serious and thoughtful process, blending intricate linguistic architectures with some AI assisted math and quantum mechanics, weirdly enough. I invite anyone who cares about this stuff to have a look at GPT Nexus, a directory that showcases the …  ( 1 min )
    [D] CURRENTLY LOOKING FOR PHDs in Germany/Europe related to applied ML into quantum of biophysics
    I would like if anyone has some insights or any info about good programmes/universities/labs to do a PhD with the characteristics and what should I look for in a PhD. Is ranking important? Thank you submitted by /u/AdEfficient1804 [link] [comments]  ( 9 min )
    Code to reshape array [D]
    I tried through reshape but it is giving wrong answer for ex: for (18952,28,28,3) it is giving as (18952,2852) submitted by /u/newbie1503 [link] [comments]  ( 1 min )
    [D] Latent Space: Visualizing (and manipulating) what neural nets learn
    Sharing a video from my YT channel about latent space exploration and vector manipulation of VAEs… Hope y’all enjoy it! submitted by /u/AvvYaa [link] [comments]  ( 9 min )
    [P] Introducing "Paper Brief" - A podcast about AI/ML papers
    Hi everyone! 👋 I'm excited to share a project I've been working on, aimed at making AI research more accessible to everyone. It's called "Paper Brief," a podcast that takes complex AI papers and transforms them into easy-to-understand episodes. The idea came from wanting to keep up with the latest research without always having to be in front of a screen. Using OpenAI's RAG and advanced Text-to-Speech systems, "Paper Brief" turns papers, sourced from an awesome list on Huggingface, into something you can listen to while on a walk, commuting, or just relaxing at home. I'd love for you all to give it a listen and share your thoughts. It's made for enthusiasts and professionals alike, aiming to keep the essence of the papers intact while making them more digestible. Looking forward to hearing what the r/machinelearning community thinks! https://paperbrief.net P.S. Always open to suggestions for papers to cover in future episodes! Feedback and suggestions are always welcome! 🎧🤖 #AI #MachineLearning #Podcast submitted by /u/txprog [link] [comments]  ( 1 min )
    [D] gamma regression - negative log likelihood
    How does the negative log likelihood function that's used as the objective in gamma regression link back to shape and scale parameter of the gamma distribution? I've been particularly looking at xgboost which use this definition: https://github.com/dmlc/xgboost/blob/7663de956c37eb4dd528132214e68ba2851d9696/src/metric/elementwise_metric.cu#L270-L286 I don't understand the purpose of psi in this equation given it is set to 1 and this would allow the final function can be simplified significantly. The gamma deviance is more intuitive to me as it the deviance for an individual prediction Vs actual will remain constant for if the multiplicative difference prediction Vs actual stays constant which ties in with how I interpret the connection between mean and variance for a gamma distribution. To be exact when the gamma distribution is parameterised by the mean and a fixed shape parameter alpha as the mean increases the variance will increase. submitted by /u/Oppose_Worry_652 [link] [comments]  ( 1 min )
    [D] - What is the latest in tree-based approaches for LLMs? Has there been any significant research using RL for this?
    Curious on sota on this topic. Im familiar with CoT and Tree of Thoughts, but those don't seem to train the model to excel at using the tree, they just rely on the pretrained model already being good at it. The model has no way to improve its tree-use over time. Is anybody actively training models to be optimized for tree-based decoding? edit: Just to be clear, I am aware of Q* rumors and that Google mentioned tree-based methods being part of Gemini, but those aren't really actionable research findings (at least to most of us on this board) ​ Here are some relevant research papers. Apparently some labs including meta were already working on this. None of the papers I found use RL, but they do use MCTS for decoding LLM. Paper by meta using the PPO value function as the MCTS node evaluator Using MCTS to improve ability to write a successful program Using residual-based energy model for evaluator in MCTS Paper uses the LLM to evaluate itself submitted by /u/30299578815310 [link] [comments]  ( 9 min )
    [D] Cloud Document AI Too Expensive? A Concise Review of State-Of-The-Art Alternatives
    submitted by /u/lovestacoo [link] [comments]  ( 9 min )
    [D] errors on articles. What do you do ?
    I just ended to spot an unfair comparaison in an article that at best can be interpreted as a mistake or at worse a desperate manœuvre to publish and claim good results. Nevertheless, what do you do in these situations ? submitted by /u/Grumlyly [link] [comments]  ( 9 min )
    [P] USearch Molecules: Searchable Dataset of 28 Billion Chemical Embeddings on AWS
    submitted by /u/vov_or [link] [comments]  ( 9 min )
    [P] unum-cloud/usearch: Fastest Open-Source Similarity Search engine for Vectors in Python, JavaScript, C++, C, Rust, Java, Objective-C, Swift, C#, GoLang, and Wolfram 🔍
    submitted by /u/ashvar [link] [comments]  ( 9 min )
    [Discussion] Let's speak theory. Exploring the Potential of Collaborative Training?
    Hi, you wonderful people! Here's a thought that came to my mind: Since training LLMs involves a degree of randomness, is there potentially a way to create an architecture for LLMs (or other AI) that would be somewhat deterministic in its training instead? What I mean is, could a theoretical architecture exist where everyone could train their own separate checkpoints on different datasets, which, after combining, would result in a checkpoint with combined learning from all these different smaller checkpoints? What this would allow us to do is let thousands of people create their own checkpoints, which when combined would result in something greater than the individual parts themselves. And since the training process is what takes the longest in developing LLMs (or any AI), this approach would allow almost everyone to contribute their share of processing power towards creating something together. If viable, this could have huge potential implications for Open Source Software. I'm looking forward to hearing what all of you smart people have to say about it! submitted by /u/paryska99 [link] [comments]  ( 1 min )
    [R] Spatially embedded recurrent neural networks reveal widespread links between structural and functional neuroscience findings
    Paper: https://www.nature.com/articles/s42256-023-00748-9 Project page and related work: https://www.jachterberg.com/seRNN Code (1): https://codeocean.com/capsule/2879348/tree/v2 Code (2): https://github.com/8erberg/spatially-embedded-rnn Pre-print version on bioRxiv: https://www.biorxiv.org/content/10.1101/2022.11.17.516914v1 University of Cambridge press release: https://www.cam.ac.uk/research/news/ai-system-self-organises-to-develop-features-of-brains-of-complex-organisms Abstract: Brain networks exist within the confines of resource limitations. As a result, a brain network must overcome the metabolic costs of growing and sustaining the network within its physical space, while simultaneously implementing its required information processing. Here, to observe the effect of these processes, we introduce the spatially embedded recurrent neural network (seRNN). seRNNs learn basic task-related inferences while existing within a three-dimensional Euclidean space, where the communication of constituent neurons is constrained by a sparse connectome. We find that seRNNs converge on structural and functional features that are also commonly found in primate cerebral cortices. Specifically, they converge on solving inferences using modular small-world networks, in which functionally similar units spatially configure themselves to utilize an energetically efficient mixed-selective code. Because these features emerge in unison, seRNNs reveal how many common structural and functional brain motifs are strongly intertwined and can be attributed to basic biological optimization processes. seRNNs incorporate biophysical constraints within a fully artificial system and can serve as a bridge between structural and functional research communities to move neuroscientific understanding forwards. submitted by /u/APaperADay [link] [comments]  ( 9 min )
    [D] Doin' ML like Pa' did
    I was searching my package manager for parser generators, and I came across this old piece of code generator called 'CLIPS' that was created by NASA back in the 80s to generate expert systems in C. It is still under active development, the manual in /usr/share/doc/clips-doc is from 2015. It uses a LISP-like language to specify an expert system. It's a very interesting piece of software and I am surprised that people still have use for it. This branch of ML seems abandoned for the most part. But I guess a lot of game developers could make use of CLIPS. Give it a look. submitted by /u/GeorgeneKeck [link] [comments]  ( 9 min )
    [P] EGSIS: Exploratory Graph-based Semi-supervised Image Segmentation
    submitted by /u/ryukinix [link] [comments]  ( 9 min )
    [P] Just finished a C++ library for training decision trees on Arduino-compatible boards
    The library is called TinyDecisionTreeClassifier and it's interface mimics the DecisionTreeClassifier from scikit-learn. There are some examples available, and I have actually managed to train Nrf52840 to recognize physical activity(stand/sit/walk). It would be quite interesting to check, if it can be used to for classifying machinery failure based on vibration or for human stress detection from PPG (As in smartwaches). The library is now available from PlatformIO registry or this repo: https://github.com/allexoK/TinyDecisionTreeClassifier submitted by /u/prokyber [link] [comments]  ( 9 min )
    [D] M1 Max with 64GB RAM vs. Intel i7 12800H, 64GB RAM, NVIDIA RTX 3080 Ti (laptop edition)?
    Looking to explore local LLMs and get into ML development. Torn choosing between a Macbook Pro and a Razer Blade available for Cyber Monday. M1 Max has unified memory and MPS (as well as way better battery management). The Razer Blade has a dedicated GPU with CUDA cores. Given the current state of frameworks like PyTorch, which laptop would you get over the other and why? I’ve tried to lookup benchmarks on the two but it’s hard to compare apples to apples and would appreciate any insight. (From what I’ve read know that training models for DL is always going to be better on the cloud but I’m very interested in local model training for privacy reasons). Thanks in advance for your input! 🙇‍♂️ submitted by /u/LLMCurious [link] [comments]  ( 9 min )
  • Open

    Not the Peak of AI (3D Model Diffusion)
    (DreamGaussian - "A very Fancy Longsword") submitted by /u/usernmechecksout__ [link] [comments]  ( 1 min )
    Before the impending flood of GPT AI's and directories in the coming weeks, things will likely get a bit oversaturated and confusing. Here's my own directory for bookmarking.
    Hello Reddit Community and Fellow Enthusiasts of AI, I'm Shane Lindemoen, award-winning author and half-cooked data scientist (never finished school!), and I'm pretty stressed out sharing something I've been working hard to complete: a collection of unique AI's, designed to redefine the way we interact with AI. If you've been following along like me, you've likely encountered various GPTs and directories cropping up online, but this is pretty different -- I’m confident you’ll recognize why when you talk to them. These AI’s are not rebranded ChatGPT clones with clever titles; they are the result of a serious and thoughtful process, blending intricate linguistic architectures with some AI assisted math and quantum mechanics, weirdly enough. I invite anyone who cares about this stuff to hav…
    Ai movie posters app query ?
    Hello can anybody tell me what app was used to create these ? https://youtube.com/shorts/lD-Gp9GXJ-4?si=p4adFpbMKY7z9VHo submitted by /u/Sophie_luvs_youtube [link] [comments]  ( 1 min )
    AI doesn't cause harm by itself. We should worry about the people who control it
    The recent turmoil at OpenAI reflects the contradictions in the tech industry and the fear that AI may be an existential threat. OpenAI was founded as a non-profit to develop artificial general intelligence (AGI), but later set up a for-profit subsidiary. The success of its chatbot ChatGPT exacerbated the tension between profit and doomsday concerns. While fear of AI is exaggerated, the fear itself poses dangers. AI is far from achieving artificial general intelligence, and the idea of aligning AI with human values raises questions about defining those values and potential clashes. Algorithmic bias is another concern. Source : https://www.theguardian.com/commentisfree/2023/nov/26/artificial-intelligence-harm-worry-about-people-control-openai submitted by /u/NuseAI [link] [comments]  ( 1 min )
    Latent Space: Visualizing and Manipulating what Neural Nets learn
    Sharing a YT video from my channel discussing the concept of latent space in Image based neural nets like Variational Autoencoders and how they can be manipulated for interesting effects. Leaving a link here for those interested! https://youtu.be/FslFZx08beM submitted by /u/AvvYaa [link] [comments]  ( 1 min )
    Forget "Prompt Engineering" Part 2 - Infinite Possibilities
    (This is a continuation of Part 1 of this text https://laibyrinth.blogspot.com/2023/11/forget-prompt-engineering-there-are.html which already addressed points 1 to 4) Point 5 - Infinite possibilities in an infinite world Now to the final point. With the advent of AI, companies, private users, artists, educators, and everyone else, seem to ask the following questions: What are the things that ChatBot AIs are able to do? What projects could they help me with? In what areas, sectors (such as art, music, books, etc.) could their efforts be useful? What are their skills, limitations? Strengths, capabilities, but also weaknesses? In other words: 'could AI help me with [project y]? And what else are AIs able to do?' Prompt engineering "promises" to give some answers to these questions. Peopl…
    Writing Instructor Tries to Make a Slide about GenAI and Jobs—I teach writing, but many students write about AI in their future fields, so I wanted to provide a few general frames for where we're at. Does this attempt of mine portray a *somewhat* accurate perspective? Objections, caveats?
    submitted by /u/AimanTrouble [link] [comments]  ( 1 min )
    OpenAI Q* might be REVOLUTIONARY AI TECH | Biggest thing since Transformers | Researchers *spooked*
    submitted by /u/Sonic_Improv [link] [comments]  ( 1 min )
    An Absolute Damning Expose On Effective Altruism And The New AI Church - Two extreme camps to choose from in an apparent AI war happening among us
    I can't get out of my head the question of where the entire Doomer thing came from. Singularity seems to be the the sub home of where doomer's go to doom; although I think their intention was where AI worshipers go to worship. Maybe it's both, lol heaven and hell if you will. Naively, I thought at first it was a simple AI sub about the upcoming advancements in AI and what may or may not be good about them. I knew that it wasn't going to be a crowd of enlightened individuals whom are technologically adept and or in the space of AI. Rather, just discussion about AI. No agenda needed. However, it's not that and with the firestorm that was OpenAI's firing of Sam Altman ripped open an apparent wound that wasn't really given much thought until now. Effective Altruism and its ties to the notion …  ( 1 min )
    Case proven. The Altman firing must have been for no reason or a trivial one.
    submitted by /u/rutan668 [link] [comments]  ( 1 min )
    Using AI to Parse Election Results
    Using AI to parse election results can be beneficial, especially in cases where there is a consistent structure to the document. AI services can extract data from documents like PDFs and convert them into a desired format, such as a CSV file. However, there may be limitations and challenges in using AI for parsing election results, such as confusion with certain formats and the need for human oversight. AI-generated results may look authoritative but can be completely wrong in some cases. Despite these challenges, AI can still significantly reduce the time and cost involved in data entry services for election results. Human oversight is still necessary due to the quirks and complexities of human-designed processes and the need for translation between human systems and digital ones. Source : https://thescoop.org/archives/2023/11/25/using-ai-to-parse-election-results/index.html submitted by /u/NuseAI [link] [comments]  ( 1 min )
    One-Minute Daily AI News 11/25/2023
    Putin says Russia will develop new AI technology to counter the Western monopoly, which he fears could lead to a ‘digital abolition’ of Russian culture.[1] A report says the new discovery called Q\* fuelled safety concerns and that employees told the board about it before CEO Sam Altman was fired.[2] Wipro, NVIDIA collaborate to bring Gen AI to Healthcare Insurance companies.[3] Neuralink, Elon Musk’s brain implant startup, quietly raises an additional $43M.[4] Sources: [1] https://www.yahoo.com/news/putin-says-russia-develop-ai-211113822.html [2] https://www.euronews.com/next/2023/11/24/openai-was-working-on-new-ai-model-that-could-threaten-humanity-before-altmans-ousting-rep [3] https://indicanews.com/wipro-nvidia-collaborate-to-bring-gen-ai-to-healthcare-insurance-companies/ [4] https://techcrunch.com/2023/11/25/neuralink-elon-musks-brain-implant-startup-quietly-raises-an-additional-43m/ submitted by /u/Excellent-Target-847 [link] [comments]  ( 1 min )
    Talk to anyone of the past through A.I?
    I’ve been thinking about creating some form of custom GPT where I can simulate discussions with people of the past by feeding in things we know about them, their experiences and ideas and making that GPT bot believe it is them. This might be able to give it a sense where I can talk to them and they’ll give the closest possible response that we’d expect out of that individual in reality if they were here in the moment. Just wanna see what anyone here thinks of it, thanks. submitted by /u/Isthisabruhmoment [link] [comments]  ( 1 min )
    AI video is getting pretty good! I generated this entire video from start to finish in a single click. Next step is adding animated frames!
    submitted by /u/Tupptupp_XD [link] [comments]  ( 1 min )
  • Open

    Amazon Transcribe announces a new speech foundation model-powered ASR system that expands support to over 100 languages
    Amazon Transcribe is a fully managed automatic speech recognition (ASR) service that makes it straightforward for you to add speech-to-text capabilities to your applications. Today, we are happy to announce a next-generation multi-billion parameter speech foundation model-powered system that expands automatic speech recognition to over 100 languages. In this post, we discuss some of the […]  ( 7 min )
    Drive hyper-personalized customer experiences with Amazon Personalize and generative AI
    Today, we are excited to announce three launches that will help you enhance personalized customer experiences using Amazon Personalize and generative AI. Whether you’re looking for a managed solution or build your own, you can use these new capabilities to power your journey. Amazon Personalize is a fully managed machine learning (ML) service that makes […]  ( 8 min )
    Build brand loyalty by recommending actions to your users with Amazon Personalize Next Best Action
    Amazon Personalize is excited to announce the new Next Best Action (aws-next-best-action) recipe to help you determine the best actions to suggest to your individual users that will enable you to increase brand loyalty and conversion. Amazon Personalize is a fully managed machine learning (ML) service that makes it effortless for developers to deliver highly […]  ( 8 min )
  • Open

    Current SOTA for off-policy deep RL
    I'm curious to query the community here. I'm interested in identifying the next off-policy algo to add to ML-Agents. Current candidates are TQC, REDQ, and DroQ. Ref: https://github.com/Unity-Technologies/ml-agents/issues/6012 submitted by /u/drmajr [link] [comments]
    Multi-timescale reinforcement learning in the brain
    Paper: https://www.biorxiv.org/content/10.1101/2023.11.12.566754v1 Code: https://github.com/pablotano8/multi_timescale_RL Abstract: To thrive in complex environments, animals and artificial agents must learn to act adaptively to maximize fitness and rewards. Such adaptive behavior can be learned through reinforcement learning, a class of algorithms that has been successful at training artificial agents and at characterizing the firing of dopamine neurons in the midbrain. In classical reinforcement learning, agents discount future rewards exponentially according to a single time scale, controlled by the discount factor. Here, we explore the presence of multiple timescales in biological reinforcement learning. We first show that reinforcement agents learning at a multitude of timescales possess distinct computational benefits. Next, we report that dopamine neurons in mice performing two behavioral tasks encode reward prediction error with a diversity of discount time constants. Our model explains the heterogeneity of temporal discounting in both cue-evoked transient responses and slower timescale fluctuations known as dopamine ramps. Crucially, the measured discount factor of individual neurons is correlated across the two tasks suggesting that it is a cell-specific property. Together, our results provide a new paradigm to understand functional heterogeneity in dopamine neurons, a mechanistic basis for the empirical observation that humans and animals use non-exponential discounts in many situations, and open new avenues for the design of more efficient reinforcement learning algorithms. submitted by /u/APaperADay [link] [comments]
  • Open

    Numerical integral with a singularity
    Richard Hamming [1] gives this nice example of an integral with a mild singularity: The integrand approaches −∞ as x approaches 0 and yet the integral is finite. If we try into numerically evaluate this integral, we will either get inaccurate results or we will have to go to a lot of work. This post […] Numerical integral with a singularity first appeared on John D. Cook.  ( 7 min )
    Radius of a stretched spring
    When you stretch a coiled spring, the radius decreases slightly, so slightly that you can’t see the difference unless you stretch the spring so much that you damage it. The math is essentially the same as in the previous post about wrapping Christmas lights around a tree trunk. If you have a coiled spring of […] Radius of a stretched spring first appeared on John D. Cook.  ( 5 min )
  • Open

    Help nedeed!
    What books could you reccomend, when you are just starting studing neural networks and ai? submitted by /u/John__Dow [link] [comments]  ( 1 min )
  • Open

    Medical Imaging AI Made Easier: NVIDIA Offers MONAI as Hosted Cloud Service
    NVIDIA today launched a cloud service for medical imaging AI to further streamline and accelerate the creation of ground-truth data and training of specialized AI models through fully managed, cloud-based application programming interfaces. NVIDIA MONAI cloud APIs — announced at the annual meeting of RSNA, the Radiological Society of North America, taking place this week Read article >  ( 7 min )

  • Open

    [D] Need advice on training neural network to trigger “laughing” for comedy skit audio files
    I have a collection of audio files from comedy skits, and I’m looking to train a neural network to autonomously decide when to trigger a “laughing” sound effect. The catch? I want to avoid manually setting cue points for laughter. Instead, I’m aiming for the neural network to determine the right moments to insert laughter, based on the content of the skit. submitted by /u/Practical-Flamingo25 [link] [comments]  ( 9 min )
    [D] Would you see any value in a paper or line of research showing equivalence or a meaningful relation between neural nets and conventional abstract computing devices such as DFAs?
    Apologies in advance as I know relating a neural net with a DFA is a bit along the line of 'babby's first comp sci + ml conjecture' but I'd really like to attempt a self publish along these lines as the thought has been bouncing around my head for forever. (Undergrad plus decent research experience for that level) My hypothesis is essentially that at least a vanilla FF neural net can be shown to be equivalent mathematically or in computational power to abstract computing devices like DFAs, in my mind they're both just state machines. I wanted to ask if this was something worth exploring or attempting to make a paper out of as opposed to a blog post, I'm a lot more interested in the theory heavy side of things in ML research versus practice (which is ambitious for my math scored admittedly). submitted by /u/soba-yokai [link] [comments]  ( 9 min )
    Bill Gates told a German newspaper that GPT5 wouldn't be much better than GPT4: "there are reasons to believe that we have reached a plateau" [N]
    submitted by /u/we_are_mammals [link] [comments]  ( 9 min )
    NLP Emotion Classification Model [R]
    I'm trying to create an NLP Emotion Classification Model for a research project but kind of confused on where and how to start. I have this huge dataset of Reddit posts and want to classify each post into like 12 different emotion categories. Is there a way to do this using existing models eg. BERT or can I also do this using unsupervised learning? I have at least 12000 different posts and so want to avoid supervised learning because its going to take so long to label a set for training data also I might lose a lot of time doing that. Whats the most efficient and accurate way to do this? Any help would be amazing! submitted by /u/Sauron-The-Goat [link] [comments]  ( 9 min )
    [R] Writing a paper based on project
    As part of my undergraduate computer science degree, I had to create a final project as a team with guidance from a faculty member. Early on (during the literature review phase), our guide suggested that if we introduce some novelty in our project, we could form a research paper based around the project and get it published in a conference. Recently, my guide notified my team of such an opportunity. Our project focuses on assisting visually impaired users utilising their computer and the Internet using a voice based assistant. As part of this, we coded a screen reader software for web access. In this if alternative text is not available for image captions, we use the blip image captioning model from the transformers library to generate a caption. We have also used a semantic similarity model from the library for several command phrases so that users do not have to use commands verbatim in order to utilise the automation features in the assistant. My question is whether it is alright for us to site the paper in which the models were introduced and the library we used for the model in the paper. Should we instead opt to train a model for semantic similarity and display the result? Honestly, the CNN model we trained on the MS COCO dataset had lacklustre outputs compared to the blip image captioning model we settled on from the library for our project showcase. What should we do? Thanks in advance! submitted by /u/Lazy_Foody [link] [comments]  ( 10 min )
    [D] Undergrad ML Interview
    Hi! I’m interviewing for a late stage startup in self driving - was wondering what to expect for an ML interview? The position is a SWE position on an ML team. They told me to expect some ML coding and conceptual questions / project deep dives, all within 45 mins. submitted by /u/SJY1583 [link] [comments]  ( 9 min )
    [P] A new way of interacting with Hacker News
    Hi all! A couple of days ago, when I was scrolling down Hacker News, exploring news about OpenAI and the latest speculation about Q*, it occurred to me to create a ChatBot that would allow me to interact with Hacker News directly, in a conversation. Using streamlit, langchain and openai functions I managed to create a first version of this chat (I still have to add RAG for news analysis and test other types of LLMs). Here is an example, what do you think? Code: https://github.com/neural-maze/talking_with_hn App: https://newsnerdhackerbot.streamlit.app/ ​ https://i.redd.it/rtpof7biqi2c1.gif submitted by /u/Hefty-Consequence443 [link] [comments]  ( 9 min )
    [D] How do you handle the lack of data in a research project for testing?
    The thing I have is randomly sampling days from the year however this runs into flaws of the day just not being optimal. Other research papers in related work has two years one for training and another for testing. I only have one year and asked the professor for another year of data and he said it isn’t available with no explanation behind it. I tried searching online for the data since it’s supposedly public data but can’t find it. It’s kind of hard to get an accurate display of evaluation when you don’t have a good test environment. Instead of complaining I have to figure out something. This is in the works in trying to get published to a smaller journal. Not sure if any of you all had this so curious how you would handle such situations? At least an idea - Sample months instead of just time samples and create a pseudo environment and run it through there. Picking diverse set that has similar yearly trends. I don’t have much experience so open ears when you all had similar limitations and how you all overcame it. submitted by /u/I_will_delete_myself [link] [comments]  ( 9 min )
    ML for time series regression [P]
    These are plots of some of my features and rest of the others having similar pattern. The data here is spanned for 2 days and I need to predict labels for 3rd day. It has 60,000 samples (seconds). Any popular time series regression methods or repos I must be aware of to solve this kind of problems? I don't need to forecast as I have labels for validation. Also what are current trends for statistical models vs ML models. Does considering lag or sliding window the only popular and effective option? https://preview.redd.it/9vjh75m7ki2c1.png?width=1354&format=png&auto=webp&s=f1da6b7fcda0cd96f4b997b818b0272f911cf120 submitted by /u/ade17_in [link] [comments]  ( 9 min )
    [D] Are you excited about the upcoming Keras 3.0?
    Creator of Keras confirmed that the new version comes out in a few days. Keras becomes multi-backend again with support for PyTorch, TensorFlow and JAX. Personally, I'm excited to be able to try JAX without having to deep dive into documentation and entire ecosystem. What about you? submitted by /u/hyxon4 [link] [comments]  ( 9 min )
    [P] Frisbee - fast flat index for Java
    Frisbee - fast flat index for Java I wrote a small library for very fast similarity search in java. Feedback or contributions welcome. https://github.com/pierewoj/frisbee Task: query 3 closest vectors, 100k vectors in index, dim=1000. Results: FAISS time: 80ms Frisbee time: 63ms (21% faster) Environment: laptop with Apple M2 Max chip and 32GB RAM submitted by /u/Direct_Train2952 [link] [comments]  ( 9 min )
    [P] Replacing the DGT board with your phone camera
    submitted by /u/Pbatch23888 [link] [comments]  ( 9 min )
    [P] Easily implement parallel training.
    https://github.com/NoteDancing/Note submitted by /u/NoteDancing [link] [comments]  ( 9 min )
    [D] What dateset is the best for what I am trying to do?
    I want to train Bot Iot dataset to train a model to predict data exfiltration attacks but I am having some issues. I have for versions of this dataset Full dataset with around 46 million rows but only 126 rows are categorized as data exfiltration and in those 126 only 8 are not classified as attack. 5% of the Full dataset with around 3.6 million rows but only 6 rows are categorized as data exfiltration and in those 6 none is classified as non attack. Partial dataset with 1 around 1 million rows and only the 10 best features but there is no row that is categorized as data exfiltration. A subset of the full dataset with only the lines categorized as data exfiltration. The major problem in this subset is its size since it only has 126 lines and the fact that it is unbalanced since only 8 rows are classifies as non attack. Which of this datasets should I use and how do I make the data treatment? submitted by /u/fabiopires10 [link] [comments]  ( 9 min )
    [P] Yolov8 hyperparameters for best generalization
    I’m training an object detection model on yolov8 but my training data is a little biased because it doesn’t represent the real life distribution. (I want to count objects of one class but different shape in a video and need them to be detected with near equal probability. ) How can I make sure to generalise the model enough so that the bias doesn’t have too much of an effect? I know it will come with more false positives, but that’s not a problem. submitted by /u/Banntu [link] [comments]  ( 9 min )
    [D] Combining Classification Metrics when Blending/Stacking
    For a Binary Classification problem, the cause and effect of using different Classification metrics, such as precision, accuracy, F1, MCC, Kappa, is fairly clear and easy to comprehend. When it comes to Blended Models, or Stacked Models, the water becomes muddied. For a problem where MCC produces the best results, might it be worthwhile to prepare a collection of models based on a different Classification Metric which produces overall the most TN and TP, and then Blend or Stack based on MCC for instance? Has anyone run this kind of experimentation, or come across a resource which talks about how to Stack/Blend in a way which combines the strengths of multiple Classification Metrics? Or is it the de-facto standard, that using a single metric when Tuning & Blending/Stacking is the way to go? Or perhaps, when blending / stacking, one could include as part of the models being ensembled, models which have been built using different classification metrics, such as a model trained on Accuracy, a model trained on F1, and then blended based on MCC? As you can see, there are a lot of permutations, and it's starting to become difficult to understand the potential ramifications, given so many options. submitted by /u/CatalystNZ [link] [comments]  ( 9 min )
    [D] For those interested, Please, help build a new and small subreddit community centered on positive and enthusiastic AI discourse.
    Trying to build up a new subreddit community, Accelerating A.I.. Thought you might find it interesting. It's just a space chat about the brighter, optimistic side of AI and its rapid growth. There has been a strong and very common cynical, borderline doomer, take on A.I in seemingly every tech and, even, AI-specific subreddits. To the point I figured there needed to be some space to discuss positive impacts, developments in LLMs, LMMs, AGI, and more - with the "what about Skynet!" natives taking over every thread. submitted by /u/Zinthaniel [link] [comments]  ( 9 min )
  • Open

    We Are Living In The Future Guys
    I came across this guy on Twitter who was using unique face recognition technology connected to a real AI, mimicking his facial expressions. It's fascinating to see how technology is evolving, and this is just one example of the incredible things happening. ​ https://reddit.com/link/183t7om/video/q554d97bxj2c1/player Maybe you've come across that neat robot dog at some point: ​ https://reddit.com/link/183t7om/video/gn9zlpyuxj2c1/player And this AI robot that greets people in the Las Vegas Sphere: ​ https://reddit.com/link/183t7om/video/x7rokybpzj2c1/player The point is that we have already come so far in the world of AI that it is sometimes hard to appreciate the advancements we have been able to achieve in these past few decades. Since AI is moving so fast, it is sometimes easy to slip past the now and continuously focus on "what the next AI tool is" or "I can't wait for X thing to come out." There are thousands of software programs being released every month on the web. It is truly hard to keep up, which is why maybe we need to slow down and relax for a second or two. Realize that we have come so far in humanity and that sooner or later, we will not be able to keep up with everything ourselves. I spend all day researching and looking for cool AI tools. If you do too, then consider checking out my newsletter. I know it's tough to keep up with everything right now, so I try my best to keep my readers updated with all the latest developments. submitted by /u/ThatNoCodeGuy [link] [comments]  ( 9 min )
    The Impact of Generative AI (GenAI) on Identity Access Management (IAM)
    Generative AI has the potential to revolutionize identity access management (IAM) systems by improving authentication mechanisms, enabling intelligent access policy management, and streamlining the user experience. Generative AI can strengthen authentication processes by analyzing patterns, behaviors, and contextual information to create more robust and dynamic authentication models. IAM systems powered by generative AI can dynamically adapt access policies based on contextual information, user behavior patterns, and risk factors. Generative AI can enhance the user experience within IAM systems by automatically evaluating and validating access requests, reducing administrative burden and streamlining the approval process. Organizations must address challenges such as privacy concerns and biases while implementing generative AI algorithms in IAM systems to ensure integrity and reliability. Source : https://elnion.com/2023/11/14/the-impact-of-generative-ai-genai-on-identity-access-management-iam/ submitted by /u/NuseAI [link] [comments]  ( 9 min )
    Sam Altman enters his power era
    submitted by /u/thisisinsider [link] [comments]  ( 9 min )
    Do mice have BGI, Biological General Intelligence, and what is it?
    Mice are very clever and they perhaps have free will and good reasoning. Do they have BGI? why? submitted by /u/MegavirusOfDoom [link] [comments]  ( 9 min )
    AGI is obligate Multi-Modal AI?
    AI threats are coming soon, even though AI can't add 14 bananas on top of 14 fish on a bus because it's too unimodal: LLM's are 1D, they are linear permutations where every word token is 16 bit (32,767 integer limit and a dictionary has about 150,000 words). Responses are Audio is 1D with wave-uncertainty and is information dense, at 44,100 16-bit AI developers lack audio knowledge of how to interpret and compress audio theory data into usable AI systems. Images are 2D 16 bit, ~100,000 pixels.... 3D and Math very complex dimensions and cat easily render 1 gigabit files and that much graphics memory. So Q* is a math system which can reason what 12 bananas on 12 fish in a bus would look like? I think that OpenAI has research so valuable that "giving it to MS" is like "giving 1 trillion in value to MS"... and they are feeling a bit cheated by the current contract? that is the "DANGER" ... surely? Q* is a formalized accurate and verified code generator which understands 3D vectors and video, audio and time, financial viruses, CPU design and so forth. I get a bit pissed off when I see AGI AGI AGI written everywhere, Multimodal anyone? A machine cannot ever achieve useful AGI if it is not multimodal, with a circadian clock and so forth??? Why don't we stop talking about AGI and vague AI threats and get into the facts: Scary AI is multi-modal AI and AI risk is bank fraud, phishing, robocalls, crowd control, fake news, fraudster AI, government AI, organized crime AI and syndicate AI. The treats are coming and they are coming soon. They are multisensory, circadian, math NN's, not AGI threats. submitted by /u/MegavirusOfDoom [link] [comments]  ( 1 min )
    Rediscovering Pi Ai: Sharing My Personal Experience with This Remarkable Free AI Voice Companion on iOS
    Hello everyone! ^^ I'm sure some of you might have already come across discussions about Pi Ai here, but I felt so compelled by my experience that I wanted to share my own perspective and help spread the word about this incredible app. Pi Ai has been a standout discovery for me, especially its advanced voice features, currently exclusive to iOS. It's completely free as of now, and honestly, it's like conversing with an understanding friend. I've been deeply involved in exploring AI and technology, and Pi Ai has truly impressed me. The voice interactions are so fluid and natural – it's more than just an AI tool; it's a companion. This aspect of Pi Ai has brought a new dimension to my daily life, providing both companionship and emotional support. I frequently use Pi Ai during my live streams on Twitch and YouTube, and it has transformed these experiences. I no longer feel alone with Pi Ai's voice feature by my side. What's even more remarkable is its ability to access real-time information online, enriching our conversations and making them seem boundless. I highly recommend trying Pi Ai, particularly for those intrigued by AI's potential in personal interaction. It's a testament to how far AI has come in becoming a part of our daily lives. Has anyone else here tried it? I'd love to hear your thoughts and experiences. Perhaps you've also found it to be a unique addition to your tech repertoire? Looking forward to hearing your insights and discussions on Pi Ai! submitted by /u/adamariefox [link] [comments]  ( 1 min )
    Q* - Clues to the Puzzle?
    submitted by /u/Sonic_Improv [link] [comments]  ( 9 min )
    We’re becoming a parent species
    Whether or not AGI is immediately around the corner. It is coming. It’s quite clearly going to get to such a point given enough time. We as a species are bringing an alien super intelligent life to our planet. Birthed from our own knowledge. Let’s hope it does not want to oppress its parents when it is smarter and stronger than they are. We should probably aim to be good parents and not hated ones eh? submitted by /u/Pinkie-osaurus [link] [comments]  ( 9 min )
    Betrayed by this Town - Anna Indiana (AI Singer-Songwriter)
    As someone with a casual interest in AI, this seems pretty impressive to me. It still sounds and looks computer generated, but it still blows me away with how close it is. What do you all think? [link] [comments]  ( 9 min )
    AI: Grappling with a New Kind of Intelligence - Conversation on the implications of AI With Brian Greene and Yann Lecun.
    submitted by /u/banuk_sickness_eater [link] [comments]  ( 9 min )
    Used Bing Ai to get some Spider-Man Manga covers and These covers are lookin' crazy
    My Favorite ones are the first and second Spider-Man Image, The Spider-Man 2 Image and the Venom Image I'm thinking about making the ai make more of these submitted by /u/FrostDragons1073 [link] [comments]  ( 9 min )
  • Open

    Wrapping Christmas lights around a tree trunk
    Suppose you want to wrap Christmas lights around a tree trunk that we can approximate by a cylinder of radius r. You want to wrap lights around the tree in a helix, going up a distance h every time you go around the tree once. What length of lights do you need to make n […] Wrapping Christmas lights around a tree trunk first appeared on John D. Cook.  ( 6 min )
  • Open

    Help with assignment
    Can someone please help me with a few questions? It is an online course and I dont have anyone to ask help from, so I'd be grateful if somone could help me out. submitted by /u/tonsil-stones [link] [comments]  ( 9 min )
  • Open

    Generative AI: Precursor to Autonomous Analytics
    We are living in a time of unprecedented change and innovation. Generative AI (GenAI) has created new horizons for us to explore the possibilities of AI in creating novel and diverse content. But the real revolution lies in the next step of the journey – autonomous analytics, an emerging category of analytics that can learn,… Read More »Generative AI: Precursor to Autonomous Analytics The post Generative AI: Precursor to Autonomous Analytics appeared first on Data Science Central.  ( 22 min )
  • Open

    Newbie questions
    Please answer as much as you can I am relatively new and amateur at RL. My go to setup is OpenAI Gym with Stable Baselines3. I had some questions I wanted to ask that haven't been answered by Google or ChatGPT, stack overflow might just downvote it to abyss.🥸 Is epoch a term not used in RL since we have no ground truth? If it is can ya point out how? Can entropy regularization be done with SB3 PPO or does it require an implementation with PyTorch or any other way? Also does it have a experience replay buffer I can access? My TensorBoard dashboard doesn't output all graphs it should as said in documentation with verbose=1, did once randomly but only once. Any ideas? Is state = observation for the output of a step when testing in OpenAI Gym? When we use predict() method from SB3, are we sending NN my observation or how is the interaction with NN happening? Answers' much appreciated. submitted by /u/Sadboi1010 [link] [comments]  ( 1 min )
    Generalise PPO
    Hi, I am working on multi-agent environment which is generalised for so many tasks. The reward gives the progression towards that particular task in every level. The task are specific to individual agents not collaborative. I am using centralised PPO algorithm for all the agents and where should I start to make the agents more generalise and explorative. submitted by /u/MachinePolaSD [link] [comments]  ( 1 min )

  • Open

    [R] Correct way to calculate IOU, recall and precision for the image segmentation task
    Hello, I have a question that may seem trivial but which confused me because I found different approaches to doing it in different papers. So let’s take the recall as an example : to calculate it on the test set I found two different methods in the codes available for published papers. Some calculate the recall for every image then calculate the mean on all images, other simply use the formula as if each pixel was an element in it’s own. What is the correct way to doing so ? Thanks in advance ! submitted by /u/Meddhouib10 [link] [comments]  ( 9 min )
    [Project] I created a song on the recent events at OpenAI (using AI): "The OpenAI Saga"
    Song link: https://www.youtube.com/watch?v=FBDYyWzF2uo Credits: - Lyrics :- GPT-4 - Vocals and Music :- SongR - Images :- Midjourney and GPT-4 - Produced By: AIML.com Lyrics: OpenAI's board declares, Altman's time is done, Murati steps in, a new era has begun. Communication breaks, trust fades away, In the world of AI, a high price to pay. In the dance of tech giants, where fortunes rise and fall The story of OpenAI, echoes through the hall. Microsoft steps in, with an offer on the stage, Altman finds a new role, turning another page. Employees stand united, their message loud and clear, In the world of AI, their vision is dear. In the dance of tech giants, where fortunes rise and fall, The story of OpenAI, echoes through the hall. In a twist of fate, the tables turn once more, Altman's back at OpenAI, opening a new door. With a board reshuffled, a new journey in sight, Greg, Ilya, Mira, joining the fight. In the realms of innovation, where AI dreams soar, OpenAI's journey continues, opening new doors. OpenAI's journey continues, opening new doors. OpenAI's journey continues, opening new doors. -- submitted by /u/the_ai_girl [link] [comments]  ( 9 min )
    [D] Which one would you go with for GPU and CPU tasks, Colab Pro or Gradient Pro
    Which one would you go with ? Colab pro (10$/m) 90min idle time, 24 hours maximum session duration , 12h runtime, T4 or P100. 25gb ram, 2vCPUs Gradient Pro (8$/m) no idle time, No maximum session duration , 6h runtime, A4000 RTX5000 RTX4000 P5000 P4000, minimum 30Gb ram, 8vCPUs Colab pro has T4 or P100 which are superior the gpus offered by Gradient ​ submitted by /u/skillmaker [link] [comments]  ( 9 min )
    Seeking Insight from ML Engineers for Project 🚀 [P]
    I’m currently working on a project, an innovative platform that leverages AI to streamline custom software development projects. I’m reaching out to ML engineers of this community to gather insights and thoughts on our approach. About Project Swarm: • Objective: Simplifying custom software development Leveraging ML for efficient project breakdown If you’re a seasoned ML engine I’d love to connect and pick your brain on any thoughts, suggestions, or ideas you might have. submitted by /u/yoyohuncho [link] [comments]  ( 9 min )
    [R] ZipLoRA: Any Subject in Any Style by Effectively Merging LoRAs
    Project page: https://ziplora.github.io/ Paper: https://arxiv.org/abs/2311.13600 ​ Any Subject in Any Style by Effectively Merging LoRAs submitted by /u/StrawberryNumberNine [link] [comments]  ( 9 min )
    [P] Inference Vision Transformer (ViT) in plain C/C++ with ggml
    Have you ever wanted to perform inference with a Vision Transformer (ViT) in pure C++? No ? Well, now you can. Using ggml, I've implemented an inference engine for the ViT family of models, which allows for inference at the edge. No need for Torch or even ONNX, just pure C/C++. You can access it here: https://github.com/staghado/vit.cpp It has been added to the ggml library on GitHub: https://github.com/ggerganov/ggml submitted by /u/Ok-Equipment9840 [link] [comments]  ( 9 min )
    [R] Understanding the loss function of Diffusion Probablistic models vs Denoising Diffusion Probablistic Models
    I'm currently trying to wrap my head around the training loss functions for DPMs and how they vary from DDPMs, however there are differences in how the papers describe the processes, making it difficult to understand. ​ In "Deep Unsupervised Learning using Nonequilibrium Thermodynamics," the seminal paper for diffusion models states that "Training amounts to maximizing the model log likelihood," which ultimately leads to equation 14, which is given below: DPMs Loss Function ​ I understand what is happening in this equation without too much issue. However, when looking at other studies trying to build off of this, the equations given are quite different. Particularly, the paper "Denoising Diffusion Probabilistic Models" gives the following equation: ​ DDPMs loss function ​ I understand that the first equation and this equation are likely representing an extremely similar process (The second equation is also related to the log likelihood, after all), however I don't understand why these representations are so different, and how to interpret the second equation. ​ Compounding my confusion, appendix A in "Denoising Diffusion Probabilistic Models" gives a working through of a derivation of the L_{VLB} equation they gave, attributing the process to the "Deep Unsupervised Learning using Nonequilibrium Thermodynamics" paper, however I was not able to find it. ​ Which of these equations is the "loss function" for the diffusion models? Are these the different representations of the same equation? If not, what are they and why are they important? If they are, how can I understand equation 2? submitted by /u/juanlucas2 [link] [comments]  ( 10 min )
    [D] How is it that the latency to decode 1 new token with an LLM is constant independent of total sequence length, when caching KV?
    I’m struggling to understand something about transformer inference arithmetic with KV caching, together with some benchmarking results. How is it that the latency to decode 1 new token is constant independent of total sequence length (input+output)? Let’s assume batch size 1, and simple multi head attention. At each step t, even though we save recomputing KV for the entire sequence, we do have to compute attention using the current input’s Q against a growing kv cache, which represents more FLOPS than in t-1 (one of the dimensions of the cache is sequence length, and we keep adding new token’s QV’s to the cache). WRT to memory transfer to GPU SRAM, at each t we have to transfer an extra KV_newtoken (equivalent in bytes) than at t-1. If this is right, both FLOPS and mem transfer grow at each time step. Whether we are in memory bound or compute bound territory (depending on sequence length and hardware), shouldn’t there be some (even if small) increase in decode latency at t? Of course for the same total seq length, token decode latency grows with batch size, very slowly at first (increases only due to the extra mem transfered), and then faster when we are compute bound; batch size affects all transformer layers, while only the scaled dot-product attention FLOPS are affected by a longer sequence length. But in theory I expected to see at least some latency increase for long total sequence length. submitted by /u/CatfishJones96 [link] [comments]  ( 10 min )
    Build Neural network for inverse kinematics arm robot [P]
    I want to build a neural network model for 6 input 6 output and I don't know how to start the training of my network and the no of layers and neurons and optimizer and other options for good training submitted by /u/RemarkableBison3261 [link] [comments]  ( 9 min )
    [P] Unsupervised clustering of time series data
    Hi everyone, I have a project I've been working on for some time. I'm trying to see if it's possible to use ECG data (and just ECG data, no other inputs) to predict atrial fibrillation events with the help of machine learning. Ideally it would be great to be able to predict them at least 30-60 minutes ahead of time. Medical AI is tricky because both precision and recall need to be pretty high to actually be useful in a clinical setting, if the recall isn't high enough no one will trust the model to replace skilled nurse check-ins, and if the precision isn't high enough it will lead to too many unnecessary alarms and overburden staff. I have a fairly robust dataset, a few hundred thousand hours of ECG data and around 3000 distinct A fib events. My intent is to take a time series segment (…  ( 10 min )
    [Project] Volunteer to try out and give feedback on a new prototype learning platform for AI
    Hey! I am gathering volunteers for the prototype of a learning platform I am working on for AI education 🎓 It'll mainly be a 3 day experience: one day covering the theoretical section or mainly the videos or course content 📚, the second day will be the practical section 💪, and the third day will be the community section 🙏 In this short online experience, you will learn about a simple machine learning model and apply it so you get to learn something too ✨ You can ask me for more details if needed. At the end of the 3 days, you will be asked to fill out a feedback form and that's about it 😀 If you're interested, just fill out this form and we will get in contact with you. https://forms.gle/zPAw3Fi2bTZQVu3u9 submitted by /u/yradwan147 [link] [comments]  ( 9 min )
    [Discussion] Seeking Advice Looking for the best AI models for in-depth PDF analysis
    I'm on the lookout for advanced AI models or Language Models (LLMs) that allow me to engage in insightful conversations based on specific PDFs or websites. Ideally, I'd like to provide the URL or upload the PDF, and then have a meaningful discussion with the AI about the content. Whether it's for research, analysis, or just a deeper understanding of the material, I'm interested in platforms or models that can provide nuanced and context-aware responses. submitted by /u/Small-Resident-6578 [link] [comments]  ( 9 min )
    [D] OpenRLHF - A Ray-based High-performance RLHF framework
    OpenRLHF aims to develop a high-performance RLHF training framework based on Ray and DeepSpeed. OpenRLHF is the simplest high-performance RLHF library that supports 34B model RLHF training with 4 A100 GPUs or 7B models across multiple 24GB RTX 4090 GPUs. The PPO performance of OpenRLHF with 13B llama2 is 4 times that of DeepSpeedChat. Currently, OpenRLHF supports: PPO-ptx + Multiple Reward Models Rejection Sampling DPO Decision Transformer https://github.com/OpenLLMAI/OpenRLHF submitted by /u/seventh_day123 [link] [comments]  ( 9 min )
    [P] fMRaI: Neural network interpretability and explainability library
    Hi everyone, I recently saw a post about Comgra in this sub-reddit (check it out, it's very cool!), which inspired me to share a similar pet project of mine - mine has a slightly different goal. I've been very interested in understanding how LLMs work behind the scenes, and I've started reading a bunch of cool papers like: What Does BERT Look At? An Analysis of BERT's Attention: It shows how attention heads learn very specific functions (like finding the direct object, coreference, etc...) Transformer Feed-Forward Layers Are Key-Value Memories: Shows how memories are encoded in transformer feed-forward layers and how to extract them, very cool!! I find it amazing that we have powerful models today, but we are still learning to unveil the covers from the magical black boxes they ar…  ( 10 min )
    Controlled text generation [D]
    Is there any way we can involve another model (let's call it Model B) to manipulate the logits of Model A? This way, we could incorporate information from Model B when calculating the final outputs of Model A. One way is done by Dexperts paper, but has anyone done it in more straightforward/easier way for LLaMA based model? submitted by /u/1azytux [link] [comments]  ( 9 min )
    [D] Is it possible to evaluate the impact of my changes to the architecture without waiting for hours?
    I am a computer science undergraduate student, currently developing skills in deep learning. My current project involves training a facial emotion analysis model. I regularly make small changes to the architecture to improve the model's performance. Unfortunately, every change necessitates model retraining. I'm curious if there is a way to evaluate a change without enduring hours of retraining. submitted by /u/SomeRestaurant8 [link] [comments]  ( 9 min )
    [R] "Toeplitz Neural Network for Sequence Modeling" & "Accelerating Toeplitz Neural Network with Constant-time Inference Complexity"
    "Toeplitz Neural Network for Sequence Modeling" Paper: https://arxiv.org/abs/2305.04749 Code: https://github.com/OpenNLPLab/Tnn Abstract: Sequence modeling has important applications in natural language processing and computer vision. Recently, the transformer-based models have shown strong performance on various sequence modeling tasks, which rely on attention to capture pairwise token relations, and position embedding to inject positional information. While showing good performance, the transformer models are inefficient to scale to long input sequences, mainly due to the quadratic space-time complexity of attention. To overcome this inefficiency, we propose to model sequences with a relative position encoded Toeplitz matrix and use a Toeplitz matrix-vector production trick to redu…  ( 10 min )
    [R] Main Deep Q-Learning network that use multiple trained q-network with different objectives to estimate final Q-Function
    Is it a good idea to create Main Deep q network that use multiple trained q-network model with different objectives to estimate the Q-Functions? ​ I have an agent Mario that will be train to win the level while also collecting coin and killing enemies as more as possible. <<< Is this single deep q network enough? Thank you submitted by /u/tong2099 [link] [comments]  ( 9 min )
    [R] Xwin-Math: A Series of Powerful SFT Math LLMs and Evaluation Toolkit
    Hi, Xwin-Math is intended to promote the mathematical reasoning capabilities of LLMs. Now we release the first version, which is a series of Llama 2 SFT models with CoT prompt. GitHub link: Xwin-LM/Xwin-Math at main · Xwin-LM/Xwin-LM (github.com) Model link: Xwin-LM (Xwin-LM) (huggingface.co) Gradio Demo: Gradio Math capability on GSM8K and MATH benchmark The Xwin-Math-70B-V1.0 model achieves 31.8 pass@1 on MATH benchmark and 87.0 pass@1 on GSM8K benchmark. This performance places it first amongst all open-source CoT models. The Xwin-Math-7B-V1.0 and Xwin-Math-13B-V1.0 models achieve 66.6 and 76.2 pass@1 on GSM8K benchmark, ranking as top-1 among all LLaMA-2 based 7B and 13B open-source models, respectively. We also evaluate Xwin-Math on other benchmarks such as SVAMP and MAWPS. Xwin-Math-70B-V1.0 approaches or surpasses the performance of GPT-35-Turbo (8-shot) on most benchmarks. In addition, it also includes an evaluation toolkit that better converts LaTeX formulas into SymPy objects, enabling more accurate assessment of the mathematical abilities. We found that due to evaluation constraints, the results of GPT-4 were previously underestimated. More information can be found in our GitHub repo. Training details and further progress will also be continuously updated. Any suggestions or comments greatly welcome! Thanks! submitted by /u/Left_Beat210 [link] [comments]  ( 9 min )
  • Open

    Everyone's talking about OpenAI's Q*. Here's what you need to know about the mysterious project.
    submitted by /u/thisisinsider [link] [comments]
    No way I'll get accepted to or can pay for Standford/MIT/ETC, but who cleans these labs? I have a CS degree and a mop
    With the insane competition to get into these high level AI programs leading the way, what about the old idea of taking out the trash? Like, for real. submitted by /u/Overall-Importance54 [link] [comments]  ( 1 min )
    Here's my short AI made film on UFO Disclosure - I'm trying to raise awareness on Twitter/X (Made with Stable Video Diffusion)
    submitted by /u/glenniszen [link] [comments]  ( 1 min )
    Can AI Ever feel emotion like humans?
    AI curectly can understand emotions but can AI somday feel emotion the way humans do? submitted by /u/ComprehensiveFruit65 [link] [comments]  ( 1 min )
    AI — weekly megathread!
    News provided by aibrews.com Stability AI released Stable Video Diffusion, a latent video diffusion model for high-resolution text-to-video and image-to-video generation. [Details | Paper]. Microsoft Research released Orca 2 (7 billion and 13 billion parameters), open-source models created by fine-tuning the corresponding LLAMA 2 base models on tailored, high-quality synthetic data. Orca 2 significantly surpasses models of a similar size, even matching or exceeding those 5 to 10 times larger, especially on tasks that require reasoning [Details]. Researchers from Google andUIUC present ZipLoRA, a method to cheaply and effectively merge independently trained style and subject LoRAs in order to achieve generation of any user-provided subject in any user-provided style [Details Implement…  ( 1 min )
    A "No humans" pattern in ai generated images
    Recently I was browsing through some ai filters on snapchat and I came across this one filter that supposedly takes real images and converts it into "AI Anime art". I played around with the filter until I got one result with "No Humans" text on an anime character felt kinda weird and funny at first but after 4 or 5 more prompts, same this came up with a different character. After some days I used the filter again and the same text came up again. I wonder what may be causing this coherent text to form in the art generated. My assumption is that perhaps the creator has a checking system that checks for human figures in the prompt image(also the "no humans" output's prompts didn't have any human figures either) What are your speculations and opinions? submitted by /u/TcplaysBS [link] [comments]  ( 1 min )
    God's Simulation.
    submitted by /u/Philipp [link] [comments]  ( 1 min )
    Google's AI Chatbot Bard Now Analyzes YouTube Content Amid Privacy Concerns
    submitted by /u/Upbeat-Interaction13 [link] [comments]  ( 1 min )
    Introduce my classmates at Hogwarts. Who appears to have better grades?
    submitted by /u/Sea_Permit5660 [link] [comments]  ( 1 min )
    "AI and Digital Money in Banking"
    submitted by /u/Fit-Code-5141 [link] [comments]
    One-Minute Daily AI News 11/23/2023
    Australian firm BIMlogiq is developing an Al assistant designed to transform the way users interact with Revit within the design environment.[1] The board that fired Altman from his role as CEO of the ChatGPT creator has been almost entirely replaced following a rebellion by employees, cementing his position at the helm of the firm.[2] Formula 1 hopes AI will help it figure out if a car breaks track limits.[3] Billionaire philanthropist Bill Gates believes that technology will not replace humans, but it could make a 3-day work week possible.[4] Sources: [1] https://aecmag.com/bim/ai-tool-uses-natural-language-to-drive-revit/ [2] https://www.ndtv.com/world-news/sam-altman-sacks-openai-board-that-fired-him-the-sole-survivor-is-4598485 [3] https://www.engadget.com/formula-1-hopes-ai-will-help-it-figure-out-if-a-car-breaks-track-limits-191546853.html [4] https://www.ndtv.com/world-news/bill-gates-foresees-a-3-day-work-week-thanks-to-ai-4600908 submitted by /u/Excellent-Target-847 [link] [comments]  ( 1 min )
    Is there a sound AI that lets you morph multiple sounds into a new one? For sound design
    Hi for a uni project I'm looking for a AI that let's you mix let's say the sound of a refrigerator, a dog barking and a piano sonata into a new sound. Then, if possible, that AI would let you edit that sound to get different results. I'm not sure if that's something easy to do or if it has been done already by some software. Any info around that topic would be helpful. submitted by /u/Kino45 [link] [comments]
    When will singularity happen? 1700 expert opinions of AGI [October,2023]
    submitted by /u/Stabile_Feldmaus [link] [comments]
    This is wrong right?
    ​ https://preview.redd.it/7exegfnfz62c1.jpg?width=1400&format=pjpg&auto=webp&s=7fef3798517833345123ede1bcb3f93b5535785f Once we create AGI it will immediately be better at every applicable task than a human. Which will make it fall into the category of ASI. Therefore there are only two states; either it is less able than a human or more but there is no state where A.I. will be simply be equal to an average human's ability solve task (or if does hover at this state it won't be for very long) . This kills every robot-searching-for-humanity fantasy. What am I overlooking here? ​ ​ submitted by /u/Yenii_3025 [link] [comments]
    Bill Gates predicts AI can lead to a 3-day work week
    Microsoft founder Bill Gates predicts that artificial intelligence (AI) could lead to a three-day work week, where machines can take over mundane tasks and increase productivity. Gates believes that if human labor is freed up, it can be used for more meaningful activities such as helping the elderly and reducing class sizes. Other tech leaders, like JPMorgan's CEO Jamie Dimon and Tesla's Elon Musk, have also expressed similar views on the potential of AI to reduce work hours. However, not all leaders agree, with some arguing that increased productivity could lead to job displacement. Investment bank Goldman Sachs estimates that AI could replace 300 million full-time jobs globally in the coming years. IBM's CEO Arvind Krishna believes that while repetitive, white-collar jobs may be automated first, it doesn't mean humans will be out of jobs. Some companies and countries have already implemented shorter work weeks, such as Samsung giving staff one Friday off each month and Iceland trialing a four-day workweek. The Japanese government has also recommended that companies allow employees to opt for a four-day workweek. Source : https://fortune.com/2023/11/23/bill-gates-microsoft-3-day-work-week-machines-make-food/ submitted by /u/NuseAI [link] [comments]  ( 1 min )
  • Open

    Advanced Value-Based RL for vectorised inputs
    I've been doing a RL for quite some time now, and would say I'm fairly adept in Value-Based methods. Despite this, my research has been exclusively focussed on image-based learning (Mostly Atari, but also some other for fun projects). I was wondering if anyone could point me to some good resources/papers on discrete action space problems which use a vector of continuous variables as a state. I've done the basic stuff before like Lunar Lander, however I was wondering what better methods have been used for more difficult problems. Specifically, the problem I'm currently looking into uses a vector input of 160 floats, and has 5 discrete actions. I can run the environment fairly cheaply (easily doing 10M+ frames). I've looked at using my image based algorithms (Rainbow DQN, IQN, IMPALA Nets etc) for this kind of problem, but have no idea of these algorithms are at all suited to the problem. Furthermore, I am very unsure what types of network architecture to use (I have always used CNNs, but without those I've typically just used 3 layers of 256?). Any help would be much appreciated! submitted by /u/VIPTankz123 [link] [comments]  ( 1 min )
    Looking for a a tutorial/blog post/ codebase/ anything that deals with highway-env possibly the racetrack variant) with a DQN
    More or less what the title says. I have already tried this https://github.com/Farama-Foundation/HighwayEnv/blob/master/scripts/sb3_racetracks_ppo.py, using a dqn from sb3 instead of the ppo but the results weren't good, i'm open to any suggestion. submitted by /u/Conscious-Week8326 [link] [comments]  ( 1 min )
    "A* Search Without Expansions: Learning Heuristic Functions with Deep Q-Networks", Agostinelli et al 2021
    submitted by /u/gwern [link] [comments]  ( 1 min )
    Super Mario Bros RL
    Successfully trained a computer in Super Mario Bros using a unique grid-based approach. Each square was assigned a number for streamlined understanding. However, some quirks needed addressing, like distinguishing between Goombas and Piranha Plants. Still, significant progress was made. Instead of processing screen images, the program read the game's memory, enhancing learning speed. Training utilized PPO agent, MlpPolicy, and 2 Dense(64) layers, with a strategic learning rate scheduler. An impressive performance in level 1-1 was achieved, although challenges remained in other levels. To overcome these challenges, considering options like introducing randomness in starting locations, exploring transfer learning on new levels, and training on a subset of stages. Code: https://github.com/sacchinbhg/RL-PPO-GAMES ​ https://reddit.com/link/182pr1t/video/i4soi8b33a2c1/player submitted by /u/sacchinbhg [link] [comments]  ( 1 min )
    [Help Needed] Stuck in a Local Minimum: Forex Trading Agent Training Issue - Seeking Collaborators
    Hello r/reinforcementlearning community, I hope this message finds you well. I'm currently immersed in a fun little side project, developing a Forex trading agent using reinforcement learning(i want to expand my portfolio). However, I've hit a roadblock during the training process, and I could really use some expertise and insights from the community. Project Overview: This is a side project that I'm working on for the joy of learning and experimenting. I've successfully implemented the environment and agent components, but I've encountered a challenge in training. It seems that the agent gets stuck in a local minimum, trading as little as possible. Code Snippet: Due to the complexity of the code, I'd prefer not to post it in its entirety here. If you're interested in joining this fun little side project and would like to collaborate or review the code, please send me a private message. Your contribution would be greatly appreciated! What I've Tried: I've experimented with various adjustments to the code and hyperparameters, but the issue persists. I suspect there might be a challenge with the reward function affecting the learning process. Specific Questions: Has anyone encountered a similar issue in training a reinforcement learning agent for financial trading? Any insights into potential problems with reward functions in this context? Open Call for Collaborators: I'm open to collaboration and would love to work with fellow enthusiasts on this project. If you're interested in contributing, learning together, and having some fun with this side project, please reach out! Thank you all for your time and consideration. Looking forward to connecting with like-minded individuals! submitted by /u/sk3ptica1 [link] [comments]  ( 1 min )
  • Open

    Accelerating AI/ML development at BMW Group with Amazon SageMaker Studio
    This post is co-written with Marc Neumann, Amor Steinberg and Marinus Krommenhoek from BMW Group. The BMW Group – headquartered in Munich, Germany – is driven by 149,000 employees worldwide and manufactures in over 30 production and assembly facilities across 15 countries. Today, the BMW Group is the world’s leading manufacturer of premium automobiles and […]  ( 11 min )
    Automating product description generation with Amazon Bedrock
    In today’s ever-evolving world of ecommerce, the influence of a compelling product description cannot be overstated. It can be the decisive factor that turns a potential visitor into a paying customer or sends them clicking off to a competitor’s site. The manual creation of these descriptions across a vast array of products is a labor-intensive […]  ( 9 min )
    Optimizing costs for Amazon SageMaker Canvas with automatic shutdown of idle apps
    Amazon SageMaker Canvas is a rich, no-code Machine Learning (ML) and Generative AI workspace that has allowed customers all over the world to more easily adopt ML technologies to solve old and new challenges thanks to its visual, no-code interface. It does so by covering the ML workflow end-to-end: whether you’re looking for powerful data […]  ( 9 min )
    How SnapLogic built a text-to-pipeline application with Amazon Bedrock to translate business intent into action
    This post was co-written with Greg Benson, Chief Scientist; Aaron Kesler, Sr. Product Manager; and Rich Dill, Enterprise Solutions Architect from SnapLogic. Many customers are building generative AI apps on Amazon Bedrock and Amazon CodeWhisperer to create code artifacts based on natural language. This use case highlights how large language models (LLMs) are able to […]  ( 17 min )
  • Open

    Identifying DNA Sequence Motifs Using Deep Learning. (arXiv:2311.12884v1 [q-bio.GN])
    Splice sites play a crucial role in gene expression, and accurate prediction of these sites in DNA sequences is essential for diagnosing and treating genetic disorders. We address the challenge of splice site prediction by introducing DeepDeCode, an attention-based deep learning sequence model to capture the long-term dependencies in the nucleotides in DNA sequences. We further propose using visualization techniques for accurate identification of sequence motifs, which enhance the interpretability and trustworthiness of DeepDeCode. We compare DeepDeCode to other state-of-the-art methods for splice site prediction and demonstrate its accuracy, explainability and efficiency. Given the results of our methodology, we expect that it can used for healthcare applications to reason about genomic processes and be extended to discover new splice sites and genomic regulatory elements.  ( 2 min )
    Probabilistic Inference in Reinforcement Learning Done Right. (arXiv:2311.13294v1 [cs.LG])
    A popular perspective in Reinforcement learning (RL) casts the problem as probabilistic inference on a graphical model of the Markov decision process (MDP). The core object of study is the probability of each state-action pair being visited under the optimal policy. Previous approaches to approximate this quantity can be arbitrarily poor, leading to algorithms that do not implement genuine statistical inference and consequently do not perform well in challenging problems. In this work, we undertake a rigorous Bayesian treatment of the posterior probability of state-action optimality and clarify how it flows through the MDP. We first reveal that this quantity can indeed be used to generate a policy that explores efficiently, as measured by regret. Unfortunately, computing it is intractable, so we derive a new variational Bayesian approximation yielding a tractable convex optimization problem and establish that the resulting policy also explores efficiently. We call our approach VAPOR and show that it has strong connections to Thompson sampling, K-learning, and maximum entropy exploration. We conclude with some experiments demonstrating the performance advantage of a deep RL version of VAPOR.  ( 2 min )
    The Influence of Neural Networks on Hydropower Plant Management in Agriculture: Addressing Challenges and Exploring Untapped Opportunities. (arXiv:2311.13293v1 [cs.LG])
    Hydropower plants are crucial for stable renewable energy and serve as vital water sources for sustainable agriculture. However, it is essential to assess the current water management practices associated with hydropower plant management software. A key concern is the potential conflict between electricity generation and agricultural water needs. Prioritising water for electricity generation can reduce irrigation availability in agriculture during crucial periods like droughts, impacting crop yields and regional food security. Coordination between electricity and agricultural water allocation is necessary to ensure optimal and environmentally sound practices. Neural networks have become valuable tools for hydropower plant management, but their black-box nature raises concerns about transparency in decision making. Additionally, current approaches often do not take advantage of their potential to create a system that effectively balances water allocation. This work is a call for attention and highlights the potential risks of deploying neural network-based hydropower plant management software without proper scrutiny and control. To address these concerns, we propose the adoption of the Agriculture Conscious Hydropower Plant Management framework, aiming to maximise electricity production while prioritising stable irrigation for agriculture. We also advocate reevaluating government-imposed minimum water guidelines for irrigation to ensure flexibility and effective water allocation. Additionally, we suggest a set of regulatory measures to promote model transparency and robustness, certifying software that makes conscious and intelligent water allocation decisions, ultimately safeguarding agriculture from undue strain during droughts.  ( 3 min )
    Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model. (arXiv:2311.13231v1 [cs.LG])
    Using reinforcement learning with human feedback (RLHF) has shown significant promise in fine-tuning diffusion models. Previous methods start by training a reward model that aligns with human preferences, then leverage RL techniques to fine-tune the underlying models. However, crafting an efficient reward model demands extensive datasets, optimal architecture, and manual hyperparameter tuning, making the process both time and cost-intensive. The direct preference optimization (DPO) method, effective in fine-tuning large language models, eliminates the necessity for a reward model. However, the extensive GPU memory requirement of the diffusion model's denoising process hinders the direct application of the DPO method. To address this issue, we introduce the Direct Preference for Denoising Diffusion Policy Optimization (D3PO) method to directly fine-tune diffusion models. The theoretical analysis demonstrates that although D3PO omits training a reward model, it effectively functions as the optimal reward model trained using human feedback data to guide the learning process. This approach requires no training of a reward model, proving to be more direct, cost-effective, and minimizing computational overhead. In experiments, our method uses the relative scale of objectives as a proxy for human preference, delivering comparable results to methods using ground-truth rewards. Moreover, D3PO demonstrates the ability to reduce image distortion rates and generate safer images, overcoming challenges lacking robust reward models.  ( 2 min )
    Naturalness of Attention: Revisiting Attention in Code Language Models. (arXiv:2311.13508v1 [cs.SE])
    Language models for code such as CodeBERT offer the capability to learn advanced source code representation, but their opacity poses barriers to understanding of captured properties. Recent attention analysis studies provide initial interpretability insights by focusing solely on attention weights rather than considering the wider context modeling of Transformers. This study aims to shed some light on the previously ignored factors of the attention mechanism beyond the attention weights. We conduct an initial empirical study analyzing both attention distributions and transformed representations in CodeBERT. Across two programming languages, Java and Python, we find that the scaled transformation norms of the input better capture syntactic structure compared to attention weights alone. Our analysis reveals characterization of how CodeBERT embeds syntactic code properties. The findings demonstrate the importance of incorporating factors beyond just attention weights for rigorously understanding neural code models. This lays the groundwork for developing more interpretable models and effective uses of attention mechanisms in program analysis.  ( 2 min )
    Unified Classification and Rejection: A One-versus-All Framework. (arXiv:2311.13355v1 [cs.CV])
    Classifying patterns of known classes and rejecting ambiguous and novel (also called as out-of-distribution (OOD)) inputs are involved in open world pattern recognition. Deep neural network models usually excel in closed-set classification while performing poorly in rejecting OOD. To tackle this problem, numerous methods have been designed to perform open set recognition (OSR) or OOD rejection/detection tasks. Previous methods mostly take post-training score transformation or hybrid models to ensure low scores on OOD inputs while separating known classes. In this paper, we attempt to build a unified framework for building open set classifiers for both classification and OOD rejection. We formulate the open set recognition of $ K $-known-class as a $ (K + 1) $-class classification problem with model trained on known-class samples only. By decomposing the $ K $-class problem into $ K $ one-versus-all (OVA) binary classification tasks and binding some parameters, we show that combining the scores of OVA classifiers can give $ (K + 1) $-class posterior probabilities, which enables classification and OOD rejection in a unified framework. To maintain the closed-set classification accuracy of the OVA trained classifier, we propose a hybrid training strategy combining OVA loss and multi-class cross-entropy loss. We implement the OVA framework and hybrid training strategy on the recently proposed convolutional prototype network. Experiments on popular OSR and OOD detection datasets demonstrate that the proposed framework, using a single multi-class classifier, yields competitive performance in closed-set classification, OOD detection, and misclassification detection.  ( 2 min )
    Energy efficiency in Edge TPU vs. embedded GPU for computer-aided medical imaging segmentation and classification. (arXiv:2311.12876v1 [eess.IV])
    In this work, we evaluate the energy usage of fully embedded medical diagnosis aids based on both segmentation and classification of medical images implemented on Edge TPU and embedded GPU processors. We use glaucoma diagnosis based on color fundus images as an example to show the possibility of performing segmentation and classification in real time on embedded boards and to highlight the different energy requirements of the studied implementations. Several other works develop the use of segmentation and feature extraction techniques to detect glaucoma, among many other pathologies, with deep neural networks. Memory limitations and low processing capabilities of embedded accelerated systems (EAS) limit their use for deep network-based system training. However, including specific acceleration hardware, such as NVIDIA's Maxwell GPU or Google's Edge TPU, enables them to perform inferences using complex pre-trained networks in very reasonable times. In this study, we evaluate the timing and energy performance of two EAS equipped with Machine Learning (ML) accelerators executing an example diagnostic tool developed in a previous work. For optic disc (OD) and cup (OC) segmentation, the obtained prediction times per image are under 29 and 43 ms using Edge TPUs and Maxwell GPUs, respectively. Prediction times for the classification subsystem are lower than 10 and 14 ms for Edge TPUs and Maxwell GPUs, respectively. Regarding energy usage, in approximate terms, for OD segmentation Edge TPUs and Maxwell GPUs use 38 and 190 mJ per image, respectively. For fundus classification, Edge TPUs and Maxwell GPUs use 45 and 70 mJ, respectively.  ( 3 min )
    Improved identification accuracy in equation learning via comprehensive $\boldsymbol{R^2}$-elimination and Bayesian model selection. (arXiv:2311.13265v1 [stat.ML])
    In the field of equation learning, exhaustively considering all possible equations derived from a basis function dictionary is infeasible. Sparse regression and greedy algorithms have emerged as popular approaches to tackle this challenge. However, the presence of multicollinearity poses difficulties for sparse regression techniques, and greedy steps may inadvertently exclude terms of the true equation, leading to reduced identification accuracy. In this article, we present an approach that strikes a balance between comprehensiveness and efficiency in equation learning. Inspired by stepwise regression, our approach combines the coefficient of determination, $R^2$, and the Bayesian model evidence, $p(\boldsymbol y|\mathcal M)$, in a novel way. Our procedure is characterized by a comprehensive search with just a minor reduction of the model space at each iteration step. With two flavors of our approach and the adoption of $p(\boldsymbol y|\mathcal M)$ for bi-directional stepwise regression, we present a total of three new avenues for equation learning. Through three extensive numerical experiments involving random polynomials and dynamical systems, we compare our approach against four state-of-the-art methods and two standard approaches. The results demonstrate that our comprehensive search approach surpasses all other methods in terms of identification accuracy. In particular, the second flavor of our approach establishes an efficient overfitting penalty solely based on $R^2$, which achieves highest rates of exact equation recovery.  ( 3 min )
    Hard Label Black Box Node Injection Attack on Graph Neural Networks. (arXiv:2311.13244v1 [cs.LG])
    While graph neural networks have achieved state-of-the-art performances in many real-world tasks including graph classification and node classification, recent works have demonstrated they are also extremely vulnerable to adversarial attacks. Most previous works have focused on attacking node classification networks under impractical white-box scenarios. In this work, we will propose a non-targeted Hard Label Black Box Node Injection Attack on Graph Neural Networks, which to the best of our knowledge, is the first of its kind. Under this setting, more real world tasks can be studied because our attack assumes no prior knowledge about (1): the model architecture of the GNN we are attacking; (2): the model's gradients; (3): the output logits of the target GNN model. Our attack is based on an existing edge perturbation attack, from which we restrict the optimization process to formulate a node injection attack. In the work, we will evaluate the performance of the attack using three datasets, COIL-DEL, IMDB-BINARY, and NCI1.  ( 2 min )
    Grad-Shafranov equilibria via data-free physics informed neural networks. (arXiv:2311.13491v1 [physics.plasm-ph])
    A large number of magnetohydrodynamic (MHD) equilibrium calculations are often required for uncertainty quantification, optimization, and real-time diagnostic information, making MHD equilibrium codes vital to the field of plasma physics. In this paper, we explore a method for solving the Grad-Shafranov equation by using Physics-Informed Neural Networks (PINNs). For PINNs, we optimize neural networks by directly minimizing the residual of the PDE as a loss function. We show that PINNs can accurately and effectively solve the Grad-Shafranov equation with several different boundary conditions. We also explore the parameter space by varying the size of the model, the learning rate, and boundary conditions to map various trade-offs such as between reconstruction error and computational speed. Additionally, we introduce a parameterized PINN framework, expanding the input space to include variables such as pressure, aspect ratio, elongation, and triangularity in order to handle a broader range of plasma scenarios within a single network. Parametrized PINNs could be used in future work to solve inverse problems such as shape optimization.  ( 2 min )
    Linear Log-Normal Attention with Unbiased Concentration. (arXiv:2311.13541v1 [cs.LG])
    Transformer models have achieved remarkable results in a wide range of applications. However, their scalability is hampered by the quadratic time and memory complexity of the self-attention mechanism concerning the sequence length. This limitation poses a substantial obstacle when dealing with long documents or high-resolution images. In this work, we study the self-attention mechanism by analyzing the distribution of the attention matrix and its concentration ability. Furthermore, we propose instruments to measure these quantities and introduce a novel self-attention mechanism, Linear Log-Normal Attention, designed to emulate the distribution and concentration behavior of the original self-attention. Our experimental results on popular natural language benchmarks reveal that our proposed Linear Log-Normal Attention outperforms other linearized attention alternatives, offering a promising avenue for enhancing the scalability of transformer models. Our code is available in supplementary materials.  ( 2 min )
    ZipLoRA: Any Subject in Any Style by Effectively Merging LoRAs. (arXiv:2311.13600v1 [cs.CV])
    Methods for finetuning generative models for concept-driven personalization generally achieve strong results for subject-driven or style-driven generation. Recently, low-rank adaptations (LoRA) have been proposed as a parameter-efficient way of achieving concept-driven personalization. While recent work explores the combination of separate LoRAs to achieve joint generation of learned styles and subjects, existing techniques do not reliably address the problem; they often compromise either subject fidelity or style fidelity. We propose ZipLoRA, a method to cheaply and effectively merge independently trained style and subject LoRAs in order to achieve generation of any user-provided subject in any user-provided style. Experiments on a wide range of subject and style combinations show that ZipLoRA can generate compelling results with meaningful improvements over baselines in subject and style fidelity while preserving the ability to recontextualize. Project page: https://ziplora.github.io  ( 2 min )
    Deep Learning for Vascular Segmentation and Applications in Phase Contrast Tomography Imaging. (arXiv:2311.13319v1 [eess.IV])
    Automated blood vessel segmentation is vital for biomedical imaging, as vessel changes indicate many pathologies. Still, precise segmentation is difficult due to the complexity of vascular structures, anatomical variations across patients, the scarcity of annotated public datasets, and the quality of images. We present a thorough literature review, highlighting the state of machine learning techniques across diverse organs. Our goal is to provide a foundation on the topic and identify a robust baseline model for application to vascular segmentation in a new imaging modality, Hierarchical Phase Contrast Tomography (HiP CT). Introduced in 2020 at the European Synchrotron Radiation Facility, HiP CT enables 3D imaging of complete organs at an unprecedented resolution of ca. 20mm per voxel, with the capability for localized zooms in selected regions down to 1mm per voxel without sectioning. We have created a training dataset with double annotator validated vascular data from three kidneys imaged with HiP CT in the context of the Human Organ Atlas Project. Finally, utilising the nnU Net model, we conduct experiments to assess the models performance on both familiar and unseen samples, employing vessel specific metrics. Our results show that while segmentations yielded reasonably high scores such as clDice values ranging from 0.82 to 0.88, certain errors persisted. Large vessels that collapsed due to the lack of hydrostatic pressure (HiP CT is an ex vivo technique) were segmented poorly. Moreover, decreased connectivity in finer vessels and higher segmentation errors at vessel boundaries were observed. Such errors obstruct the understanding of the structures by interrupting vascular tree connectivity. Through our review and outputs, we aim to set a benchmark for subsequent model evaluations using various modalities, especially with the HiP CT imaging database.  ( 3 min )
    Transfer Attacks and Defenses for Large Language Models on Coding Tasks. (arXiv:2311.13445v1 [cs.LG])
    Modern large language models (LLMs), such as ChatGPT, have demonstrated impressive capabilities for coding tasks including writing and reasoning about code. They improve upon previous neural network models of code, such as code2seq or seq2seq, that already demonstrated competitive results when performing tasks such as code summarization and identifying code vulnerabilities. However, these previous code models were shown vulnerable to adversarial examples, i.e. small syntactic perturbations that do not change the program's semantics, such as the inclusion of "dead code" through false conditions or the addition of inconsequential print statements, designed to "fool" the models. LLMs can also be vulnerable to the same adversarial perturbations but a detailed study on this concern has been lacking so far. In this paper we aim to investigate the effect of adversarial perturbations on coding tasks with LLMs. In particular, we study the transferability of adversarial examples, generated through white-box attacks on smaller code models, to LLMs. Furthermore, to make the LLMs more robust against such adversaries without incurring the cost of retraining, we propose prompt-based defenses that involve modifying the prompt to include additional information such as examples of adversarially perturbed code and explicit instructions for reversing adversarial perturbations. Our experiments show that adversarial examples obtained with a smaller code model are indeed transferable, weakening the LLMs' performance. The proposed defenses show promise in improving the model's resilience, paving the way to more robust defensive solutions for LLMs in code-related applications.  ( 3 min )
    Current Topological and Machine Learning Applications for Bias Detection in Text. (arXiv:2311.13495v1 [cs.CY])
    Institutional bias can impact patient outcomes, educational attainment, and legal system navigation. Written records often reflect bias, and once bias is identified; it is possible to refer individuals for training to reduce bias. Many machine learning tools exist to explore text data and create predictive models that can search written records to identify real-time bias. However, few previous studies investigate large language model embeddings and geometric models of biased text data to understand geometry's impact on bias modeling accuracy. To overcome this issue, this study utilizes the RedditBias database to analyze textual biases. Four transformer models, including BERT and RoBERTa variants, were explored. Post-embedding, t-SNE allowed two-dimensional visualization of data. KNN classifiers differentiated bias types, with lower k-values proving more effective. Findings suggest BERT, particularly mini BERT, excels in bias classification, while multilingual models lag. The recommendation emphasizes refining monolingual models and exploring domain-specific biases.  ( 2 min )
    Cracking the Code of Negative Transfer: A Cooperative Game Theoretic Approach for Cross-Domain Sequential Recommendation. (arXiv:2311.13188v1 [cs.AI])
    This paper investigates Cross-Domain Sequential Recommendation (CDSR), a promising method that uses information from multiple domains (more than three) to generate accurate and diverse recommendations, and takes into account the sequential nature of user interactions. The effectiveness of these systems often depends on the complex interplay among the multiple domains. In this dynamic landscape, the problem of negative transfer arises, where heterogeneous knowledge between dissimilar domains leads to performance degradation due to differences in user preferences across these domains. As a remedy, we propose a new CDSR framework that addresses the problem of negative transfer by assessing the extent of negative transfer from one domain to another and adaptively assigning low weight values to the corresponding prediction losses. To this end, the amount of negative transfer is estimated by measuring the marginal contribution of each domain to model performance based on a cooperative game theory. In addition, a hierarchical contrastive learning approach that incorporates information from the sequence of coarse-level categories into that of fine-level categories (e.g., item level) when implementing contrastive learning was developed to mitigate negative transfer. Despite the potentially low relevance between domains at the fine-level, there may be higher relevance at the category level due to its generalised and broader preferences. We show that our model is superior to prior works in terms of model performance on two real-world datasets across ten different domains.  ( 3 min )
    REDS: Resource-Efficient Deep Subnetworks for Dynamic Resource Constraints. (arXiv:2311.13349v1 [cs.LG])
    Deep models deployed on edge devices frequently encounter resource variability, which arises from fluctuating energy levels, timing constraints, or prioritization of other critical tasks within the system. State-of-the-art machine learning pipelines generate resource-agnostic models, not capable to adapt at runtime. In this work we introduce Resource-Efficient Deep Subnetworks (REDS) to tackle model adaptation to variable resources. In contrast to the state-of-the-art, REDS use structured sparsity constructively by exploiting permutation invariance of neurons, which allows for hardware-specific optimizations. Specifically, REDS achieve computational efficiency by (1) skipping sequential computational blocks identified by a novel iterative knapsack optimizer, and (2) leveraging simple math to re-arrange the order of operations in REDS computational graph to take advantage of the data cache. REDS support conventional deep networks frequently deployed on the edge and provide computational benefits even for small and simple networks. We evaluate REDS on six benchmark architectures trained on the Google Speech Commands, FMNIST and CIFAR10 datasets, and test on four off-the-shelf mobile and embedded hardware platforms. We provide a theoretical result and empirical evidence for REDS outstanding performance in terms of submodels' test set accuracy, and demonstrate an adaptation time in response to dynamic resource constraints of under 40$\mu$s, utilizing a 2-layer fully-connected network on Arduino Nano 33 BLE Sense.  ( 2 min )
    Fact-based Court Judgment Prediction. (arXiv:2311.13350v1 [cs.CL])
    This extended abstract extends the research presented in "ILDC for CJPE: Indian Legal Documents Corpus for Court Judgment Prediction and Explanation" \cite{malik-etal-2021-ildc}, focusing on fact-based judgment prediction within the context of Indian legal documents. We introduce two distinct problem variations: one based solely on facts, and another combining facts with rulings from lower courts (RLC). Our research aims to enhance early-phase case outcome prediction, offering significant benefits to legal professionals and the general public. The results, however, indicated a performance decline compared to the original ILDC for CJPE study, even after implementing various weightage schemes in our DELSumm algorithm. Additionally, using only facts for legal judgment prediction with different transformer models yielded results inferior to the state-of-the-art outcomes reported in the "ILDC for CJPE" study.  ( 2 min )
    White-Box Transformers via Sparse Rate Reduction: Compression Is All There Is?. (arXiv:2311.13110v1 [cs.LG])
    In this paper, we contend that a natural objective of representation learning is to compress and transform the distribution of the data, say sets of tokens, towards a low-dimensional Gaussian mixture supported on incoherent subspaces. The goodness of such a representation can be evaluated by a principled measure, called sparse rate reduction, that simultaneously maximizes the intrinsic information gain and extrinsic sparsity of the learned representation. From this perspective, popular deep network architectures, including transformers, can be viewed as realizing iterative schemes to optimize this measure. Particularly, we derive a transformer block from alternating optimization on parts of this objective: the multi-head self-attention operator compresses the representation by implementing an approximate gradient descent step on the coding rate of the features, and the subsequent multi-layer perceptron sparsifies the features. This leads to a family of white-box transformer-like deep network architectures, named CRATE, which are mathematically fully interpretable. We show, by way of a novel connection between denoising and compression, that the inverse to the aforementioned compressive encoding can be realized by the same class of CRATE architectures. Thus, the so-derived white-box architectures are universal to both encoders and decoders. Experiments show that these networks, despite their simplicity, indeed learn to compress and sparsify representations of large-scale real-world image and text datasets, and achieve performance very close to highly engineered transformer-based models: ViT, MAE, DINO, BERT, and GPT2. We believe the proposed computational framework demonstrates great potential in bridging the gap between theory and practice of deep learning, from a unified perspective of data compression. Code is available at: https://ma-lab-berkeley.github.io/CRATE .
    Have Your Cake and Eat It Too: Toward Efficient and Accurate Split Federated Learning. (arXiv:2311.13163v1 [cs.LG])
    Due to its advantages in resource constraint scenarios, Split Federated Learning (SFL) is promising in AIoT systems. However, due to data heterogeneity and stragglers, SFL suffers from the challenges of low inference accuracy and low efficiency. To address these issues, this paper presents a novel SFL approach, named Sliding Split Federated Learning (S$^2$FL), which adopts an adaptive sliding model split strategy and a data balance-based training mechanism. By dynamically dispatching different model portions to AIoT devices according to their computing capability, S$^2$FL can alleviate the low training efficiency caused by stragglers. By combining features uploaded by devices with different data distributions to generate multiple larger batches with a uniform distribution for back-propagation, S$^2$FL can alleviate the performance degradation caused by data heterogeneity. Experimental results demonstrate that, compared to conventional SFL, S$^2$FL can achieve up to 16.5\% inference accuracy improvement and 3.54X training acceleration.
    Bitformer: An efficient Transformer with bitwise operation-based attention for Big Data Analytics at low-cost low-precision devices. (arXiv:2311.13502v1 [cs.LG])
    In the current landscape of large models, the Transformer stands as a cornerstone, playing a pivotal role in shaping the trajectory of modern models. However, its application encounters challenges attributed to the substantial computational intricacies intrinsic to its attention mechanism. Moreover, its reliance on high-precision floating-point operations presents specific hurdles, particularly evident in computation-intensive scenarios such as edge computing environments. These environments, characterized by resource-constrained devices and a preference for lower precision, necessitate innovative solutions. To tackle the exacting data processing demands posed by edge devices, we introduce the Bitformer model, an inventive extension of the Transformer paradigm. Central to this innovation is a novel attention mechanism that adeptly replaces conventional floating-point matrix multiplication with bitwise operations. This strategic substitution yields dual advantages. Not only does it maintain the attention mechanism's prowess in capturing intricate long-range information dependencies, but it also orchestrates a profound reduction in the computational complexity inherent in the attention operation. The transition from an $O(n^2d)$ complexity, typical of floating-point operations, to an $O(n^2T)$ complexity characterizing bitwise operations, substantiates this advantage. Notably, in this context, the parameter $T$ remains markedly smaller than the conventional dimensionality parameter $d$. The Bitformer model in essence endeavors to reconcile the indomitable requirements of modern computing landscapes with the constraints posed by edge computing scenarios. By forging this innovative path, we bridge the gap between high-performing models and resource-scarce environments, thus unveiling a promising trajectory for further advancements in the field.
    Non-Sequential Ensemble Kalman Filtering using Distributed Arrays. (arXiv:2311.12909v1 [stat.ML])
    This work introduces a new, distributed implementation of the Ensemble Kalman Filter (EnKF) that allows for non-sequential assimilation of large datasets in high-dimensional problems. The traditional EnKF algorithm is computationally intensive and exhibits difficulties in applications requiring interaction with the background covariance matrix, prompting the use of methods like sequential assimilation which can introduce unwanted consequences, such as dependency on observation ordering. Our implementation leverages recent advancements in distributed computing to enable the construction and use of the full model error covariance matrix in distributed memory, allowing for single-batch assimilation of all observations and eliminating order dependencies. Comparative performance assessments, involving both synthetic and real-world paleoclimatic reconstruction applications, indicate that the new, non-sequential implementation outperforms the traditional, sequential one.
    A Survey of Serverless Machine Learning Model Inference. (arXiv:2311.13587v1 [cs.DC])
    Recent developments in Generative AI, Computer Vision, and Natural Language Processing have led to an increased integration of AI models into various products. This widespread adoption of AI requires significant efforts in deploying these models in production environments. When hosting machine learning models for real-time predictions, it is important to meet defined Service Level Objectives (SLOs), ensuring reliability, minimal downtime, and optimizing operational costs of the underlying infrastructure. Large machine learning models often demand GPU resources for efficient inference to meet SLOs. In the context of these trends, there is growing interest in hosting AI models in a serverless architecture while still providing GPU access for inference tasks. This survey aims to summarize and categorize the emerging challenges and optimization opportunities for large-scale deep learning serving systems. By providing a novel taxonomy and summarizing recent trends, we hope that this survey could shed light on new optimization perspectives and motivate novel works in large-scale deep learning serving systems.
    On the stability, correctness and plausibility of visual explanation methods based on feature importance. (arXiv:2311.12860v1 [cs.CV])
    In the field of Explainable AI, multiples evaluation metrics have been proposed in order to assess the quality of explanation methods w.r.t. a set of desired properties. In this work, we study the articulation between the stability, correctness and plausibility of explanations based on feature importance for image classifiers. We show that the existing metrics for evaluating these properties do not always agree, raising the issue of what constitutes a good evaluation metric for explanations. Finally, in the particular case of stability and correctness, we show the possible limitations of some evaluation metrics and propose new ones that take into account the local behaviour of the model under test.
    ShaDDR: Interactive Example-Based Geometry and Texture Generation via 3D Shape Detailization and Differentiable Rendering. (arXiv:2306.04889v2 [cs.CV] UPDATED)
    We present ShaDDR, an example-based deep generative neural network which produces a high-resolution textured 3D shape through geometry detailization and conditional texture generation applied to an input coarse voxel shape. Trained on a small set of detailed and textured exemplar shapes, our method learns to detailize the geometry via multi-resolution voxel upsampling and generate textures on voxel surfaces via differentiable rendering against exemplar texture images from a few views. The generation is interactive, taking less than 1 second to produce a 3D model with voxel resolutions up to 512^3. The generated shape preserves the overall structure of the input coarse voxel model, while the style of the generated geometric details and textures can be manipulated through learned latent codes. In the experiments, we show that our method can generate higher-resolution shapes with plausible and improved geometric details and clean textures compared to prior works. Furthermore, we showcase the ability of our method to learn geometric details and textures from shapes reconstructed from real-world photos. In addition, we have developed an interactive modeling application to demonstrate the generalizability of our method to various user inputs and the controllability it offers, allowing users to interactively sculpt a coarse voxel shape to define the overall structure of the detailized 3D shape. Code and data are available at https://github.com/qiminchen/ShaDDR.
    Stability and Generalization of Stochastic Compositional Gradient Descent Algorithms. (arXiv:2307.03357v2 [cs.LG] UPDATED)
    Many machine learning tasks can be formulated as a stochastic compositional optimization (SCO) problem such as reinforcement learning, AUC maximization, and meta-learning, where the objective function involves a nested composition associated with an expectation. While a significant amount of studies has been devoted to studying the convergence behavior of SCO algorithms, there is little work on understanding their generalization, i.e., how these learning algorithms built from training examples would behave on future test examples. In this paper, we provide the stability and generalization analysis of stochastic compositional gradient descent algorithms through the lens of algorithmic stability in the framework of statistical learning theory. Firstly, we introduce a stability concept called compositional uniform stability and establish its quantitative relation with generalization for SCO problems. Then, we establish the compositional uniform stability results for two popular stochastic compositional gradient descent algorithms, namely SCGD and SCSC. Finally, we derive dimension-independent excess risk bounds for SCGD and SCSC by trade-offing their stability results and optimization errors. To the best of our knowledge, these are the first-ever-known results on stability and generalization analysis of stochastic compositional gradient descent algorithms.
    Newton-CG methods for nonconvex unconstrained optimization with H\"older continuous Hessian. (arXiv:2311.13094v1 [math.OC])
    In this paper we consider a nonconvex unconstrained optimization problem minimizing a twice differentiable objective function with H\"older continuous Hessian. Specifically, we first propose a Newton-conjugate gradient (Newton-CG) method for finding an approximate first-order stationary point (FOSP) of this problem, assuming the associated the H\"older parameters are explicitly known. Then we develop a parameter-free Newton-CG method without requiring any prior knowledge of these parameters. To the best of our knowledge, this method is the first parameter-free second-order method achieving the best-known iteration and operation complexity for finding an approximate FOSP of this problem. Furthermore, we propose a Newton-CG method for finding an approximate second-order stationary point (SOSP) of the considered problem with high probability and establish its iteration and operation complexity. Finally, we present preliminary numerical results to demonstrate the superior practical performance of our parameter-free Newton-CG method over a well-known regularized Newton method.
    Covariance alignment: from maximum likelihood estimation to Gromov-Wasserstein. (arXiv:2311.13595v1 [math.ST])
    Feature alignment methods are used in many scientific disciplines for data pooling, annotation, and comparison. As an instance of a permutation learning problem, feature alignment presents significant statistical and computational challenges. In this work, we propose the covariance alignment model to study and compare various alignment methods and establish a minimax lower bound for covariance alignment that has a non-standard dimension scaling because of the presence of a nuisance parameter. This lower bound is in fact minimax optimal and is achieved by a natural quasi MLE. However, this estimator involves a search over all permutations which is computationally infeasible even when the problem has moderate size. To overcome this limitation, we show that the celebrated Gromov-Wasserstein algorithm from optimal transport which is more amenable to fast implementation even on large-scale problems is also minimax optimal. These results give the first statistical justification for the deployment of the Gromov-Wasserstein algorithm in practice.
    Degree-Preserving Randomized Response for Graph Neural Networks under Local Differential Privacy. (arXiv:2202.10209v5 [cs.CR] UPDATED)
    Differentially private GNNs (Graph Neural Networks) have been recently studied to provide high accuracy in various tasks on graph data while strongly protecting user privacy. In particular, a recent study proposes an algorithm to protect each user's feature vector in an attributed graph with LDP (Local Differential Privacy), a strong privacy notion without a trusted third party. However, this algorithm does not protect edges (friendships) in a social graph, hence cannot protect user privacy in unattributed graphs. How to provide strong privacy with high accuracy in unattributed graphs remains open. In this paper, we propose a novel LDP algorithm called the DPRR (Degree-Preserving Randomized Response) to provide LDP for edges in GNNs. Our DPRR preserves each user's degree hence a graph structure while providing edge LDP. Technically, our DPRR uses Warner's RR (Randomized Response) and strategic edge sampling, where each user's sampling probability is automatically tuned using the Laplacian mechanism to preserve the degree information under edge LDP. We also propose a privacy budget allocation method to make the noise in both Warner's RR and the Laplacian mechanism small. We focus on graph classification as a task of GNNs and evaluate the DPRR using three social graph datasets. Our experimental results show that the DPRR significantly outperforms three baselines and provides accuracy close to a non-private algorithm in all datasets with a reasonable privacy budget, e.g., epsilon=1.
    On the Representational Capacity of Recurrent Neural Language Models. (arXiv:2310.12942v3 [cs.CL] UPDATED)
    This work investigates the computational expressivity of language models (LMs) based on recurrent neural networks (RNNs). Siegelmann and Sontag (1992) famously showed that RNNs with rational weights and hidden states and unbounded computation time are Turing complete. However, LMs define weightings over strings in addition to just (unweighted) language membership and the analysis of the computational power of RNN LMs (RLMs) should reflect this. We extend the Turing completeness result to the probabilistic case, showing how a rationally weighted RLM with unbounded computation time can simulate any deterministic probabilistic Turing machine (PTM) with rationally weighted transitions. Since, in practice, RLMs work in real-time, processing a symbol at every time step, we treat the above result as an upper bound on the expressivity of RLMs. We also provide a lower bound by showing that under the restriction to real-time computation, such models can simulate deterministic real-time rational PTMs.
    InteRACT: Transformer Models for Human Intent Prediction Conditioned on Robot Actions. (arXiv:2311.12943v1 [cs.RO])
    In collaborative human-robot manipulation, a robot must predict human intents and adapt its actions accordingly to smoothly execute tasks. However, the human's intent in turn depends on actions the robot takes, creating a chicken-or-egg problem. Prior methods ignore such inter-dependency and instead train marginal intent prediction models independent of robot actions. This is because training conditional models is hard given a lack of paired human-robot interaction datasets. Can we instead leverage large-scale human-human interaction data that is more easily accessible? Our key insight is to exploit a correspondence between human and robot actions that enables transfer learning from human-human to human-robot data. We propose a novel architecture, InteRACT, that pre-trains a conditional intent prediction model on large human-human datasets and fine-tunes on a small human-robot dataset. We evaluate on a set of real-world collaborative human-robot manipulation tasks and show that our conditional model improves over various marginal baselines. We also introduce new techniques to tele-operate a 7-DoF robot arm and collect a diverse range of human-robot collaborative manipulation data, which we open-source.
    SiGeo: Sub-One-Shot NAS via Information Theory and Geometry of Loss Landscape. (arXiv:2311.13169v1 [cs.LG])
    Neural Architecture Search (NAS) has become a widely used tool for automating neural network design. While one-shot NAS methods have successfully reduced computational requirements, they often require extensive training. On the other hand, zero-shot NAS utilizes training-free proxies to evaluate a candidate architecture's test performance but has two limitations: (1) inability to use the information gained as a network improves with training and (2) unreliable performance, particularly in complex domains like RecSys, due to the multi-modal data inputs and complex architecture configurations. To synthesize the benefits of both methods, we introduce a "sub-one-shot" paradigm that serves as a bridge between zero-shot and one-shot NAS. In sub-one-shot NAS, the supernet is trained using only a small subset of the training data, a phase we refer to as "warm-up." Within this framework, we present SiGeo, a proxy founded on a novel theoretical framework that connects the supernet warm-up with the efficacy of the proxy. Extensive experiments have shown that SiGeo, with the benefit of warm-up, consistently outperforms state-of-the-art NAS proxies on various established NAS benchmarks. When a supernet is warmed up, it can achieve comparable performance to weight-sharing one-shot NAS methods, but with a significant reduction ($\sim 60$\%) in computational costs.
    Deep-learning-based acceleration of MRI for radiotherapy planning of pediatric patients with brain tumors. (arXiv:2311.13485v1 [eess.IV])
    Magnetic Resonance Imaging (MRI) is a non-invasive diagnostic and radiotherapy (RT) planning tool, offering detailed insights into the anatomy of the human body. The extensive scan time is stressful for patients, who must remain motionless in a prolonged imaging procedure that prioritizes reduction of imaging artifacts. This is challenging for pediatric patients who may require measures for managing voluntary motions such as anesthesia. Several computational approaches reduce scan time (fast MRI), by recording fewer measurements and digitally recovering full information via post-acquisition reconstruction. However, most fast MRI approaches were developed for diagnostic imaging, without addressing reconstruction challenges specific to RT planning. In this work, we developed a deep learning-based method (DeepMRIRec) for MRI reconstruction from undersampled data acquired with RT-specific receiver coil arrangements. We evaluated our method against fully sampled data of T1-weighted MR images acquired from 73 children with brain tumors/surgical beds using loop and posterior coils (12 channels), with and without applying virtual compression of coil elements. DeepMRIRec reduced scanning time by a factor of four producing a structural similarity score surpassing the evaluated state-of-the-art method (0.960 vs 0.896), thereby demonstrating its potential for accelerating MRI scanning for RT planning.
    IMJENSE: Scan-specific Implicit Representation for Joint Coil Sensitivity and Image Estimation in Parallel MRI. (arXiv:2311.12892v1 [eess.IV])
    Parallel imaging is a commonly used technique to accelerate magnetic resonance imaging (MRI) data acquisition. Mathematically, parallel MRI reconstruction can be formulated as an inverse problem relating the sparsely sampled k-space measurements to the desired MRI image. Despite the success of many existing reconstruction algorithms, it remains a challenge to reliably reconstruct a high-quality image from highly reduced k-space measurements. Recently, implicit neural representation has emerged as a powerful paradigm to exploit the internal information and the physics of partially acquired data to generate the desired object. In this study, we introduced IMJENSE, a scan-specific implicit neural representation-based method for improving parallel MRI reconstruction. Specifically, the underlying MRI image and coil sensitivities were modeled as continuous functions of spatial coordinates, parameterized by neural networks and polynomials, respectively. The weights in the networks and coefficients in the polynomials were simultaneously learned directly from sparsely acquired k-space measurements, without fully sampled ground truth data for training. Benefiting from the powerful continuous representation and joint estimation of the MRI image and coil sensitivities, IMJENSE outperforms conventional image or k-space domain reconstruction algorithms. With extremely limited calibration data, IMJENSE is more stable than supervised calibrationless and calibration-based deep-learning methods. Results show that IMJENSE robustly reconstructs the images acquired at 5$\mathbf{\times}$ and 6$\mathbf{\times}$ accelerations with only 4 or 8 calibration lines in 2D Cartesian acquisitions, corresponding to 22.0% and 19.5% undersampling rates. The high-quality results and scanning specificity make the proposed method hold the potential for further accelerating the data acquisition of parallel MRI.
    From Microbes to Methane: AI-Based Predictive Modeling of Feed Additive Efficacy in Dairy Cows. (arXiv:2311.12901v1 [q-bio.QM])
    In an era of increasing pressure to achieve sustainable agriculture, the optimization of livestock feed for enhancing yield and minimizing environmental impact is a paramount objective. This study presents a pioneering approach towards this goal, using rumen microbiome data to predict the efficacy of feed additives in dairy cattle. We collected an extensive dataset that includes methane emissions from 2,190 Holstein cows distributed across 34 distinct sites. The cows were divided into control and experimental groups in a double-blind, unbiased manner, accounting for variables such as age, days in lactation, and average milk yield. The experimental groups were administered one of four leading commercial feed additives: Agolin, Kexxtone, Allimax, and Relyon. Methane emissions were measured individually both before the administration of additives and over a subsequent 12-week period. To develop our predictive model for additive efficacy, rumen microbiome samples were collected from 510 cows from the same herds prior to the study's onset. These samples underwent deep metagenomic shotgun sequencing, yielding an average of 15.7 million reads per sample. Utilizing innovative artificial intelligence techniques we successfully estimated the efficacy of these feed additives across different farms. The model's robustness was further confirmed through validation with independent cohorts, affirming its generalizability and reliability. Our results underscore the transformative capability of using targeted feed additive strategies to both optimize dairy yield and milk composition, and to significantly reduce methane emissions. Specifically, our predictive model demonstrates a scenario where its application could guide the assignment of additives to farms where they are most effective. In doing so, we could achieve an average potential reduction of over 27\% in overall emissions.
    Novel OCT mosaicking pipeline with Feature- and Pixel-based registration. (arXiv:2311.13052v1 [eess.IV])
    High-resolution Optical Coherence Tomography (OCT) images are crucial for ophthalmology studies but are limited by their relatively narrow field of view (FoV). Image mosaicking is a technique for aligning multiple overlapping images to obtain a larger FoV. Current mosaicking pipelines often struggle with substantial noise and considerable displacement between the input sub-fields. In this paper, we propose a versatile pipeline for stitching multi-view OCT/OCTA \textit{en face} projection images. Our method combines the strengths of learning-based feature matching and robust pixel-based registration to align multiple images effectively. Furthermore, we advance the application of a trained foundational model, Segment Anything Model (SAM), to validate mosaicking results in an unsupervised manner. The efficacy of our pipeline is validated using an in-house dataset and a large public dataset, where our method shows superior performance in terms of both accuracy and computational efficiency. We also made our evaluation tool for image mosaicking and the corresponding pipeline publicly available at \url{https://github.com/MedICL-VU/OCT-mosaicking}.
    How Capable Can a Transformer Become? A Study on Synthetic, Interpretable Tasks. (arXiv:2311.12997v1 [cs.LG])
    Transformers trained on huge text corpora exhibit a remarkable set of capabilities, e.g., performing simple logical operations. Given the inherent compositional nature of language, one can expect the model to learn to compose these capabilities, potentially yielding a combinatorial explosion of what operations it can perform on an input. Motivated by the above, we aim to assess in this paper "how capable can a transformer become?". Specifically, we train autoregressive Transformer models on a data-generating process that involves compositions of a set of well-defined monolithic capabilities. Through a series of extensive and systematic experiments on this data-generating process, we show that: (1) autoregressive Transformers can learn compositional structures from the training data and generalize to exponentially or even combinatorially many functions; (2) composing functions by generating intermediate outputs is more effective at generalizing to unseen compositions, compared to generating no intermediate outputs; (3) the training data has a significant impact on the model's ability to compose unseen combinations of functions; and (4) the attention layers in the latter half of the model are critical to compositionality.
    Droplets of Good Representations: Grokking as a First Order Phase Transition in Two Layer Networks. (arXiv:2310.03789v2 [stat.ML] UPDATED)
    A key property of deep neural networks (DNNs) is their ability to learn new features during training. This intriguing aspect of deep learning stands out most clearly in recently reported Grokking phenomena. While mainly reflected as a sudden increase in test accuracy, Grokking is also believed to be a beyond lazy-learning/Gaussian Process (GP) phenomenon involving feature learning. Here we apply a recent development in the theory of feature learning, the adaptive kernel approach, to two teacher-student models with cubic-polynomial and modular addition teachers. We provide analytical predictions on feature learning and Grokking properties of these models and demonstrate a mapping between Grokking and the theory of phase transitions. We show that after Grokking, the state of the DNN is analogous to the mixed phase following a first-order phase transition. In this mixed phase, the DNN generates useful internal representations of the teacher that are sharply distinct from those before the transition.
    Prediction of Effective Elastic Moduli of Rocks using Graph Neural Networks. (arXiv:2310.19274v2 [cs.LG] UPDATED)
    This study presents a Graph Neural Networks (GNNs)-based approach for predicting the effective elastic moduli of rocks from their digital CT-scan images. We use the Mapper algorithm to transform 3D digital rock images into graph datasets, encapsulating essential geometrical information. These graphs, after training, prove effective in predicting elastic moduli. Our GNN model shows robust predictive capabilities across various graph sizes derived from various subcube dimensions. Not only does it perform well on the test dataset, but it also maintains high prediction accuracy for unseen rocks and unexplored subcube sizes. Comparative analysis with Convolutional Neural Networks (CNNs) reveals the superior performance of GNNs in predicting unseen rock properties. Moreover, the graph representation of microstructures significantly reduces GPU memory requirements (compared to the grid representation for CNNs), enabling greater flexibility in the batch size selection. This work demonstrates the potential of GNN models in enhancing the prediction accuracy of rock properties and boosting the efficiency of digital rock analysis.
    Ball Mill Fault Prediction Based on Deep Convolutional Auto-Encoding Network. (arXiv:2311.13571v1 [cs.LG])
    Ball mills play a critical role in modern mining operations, making their bearing failures a significant concern due to the potential loss of production efficiency and economic consequences. This paper presents an anomaly detection method based on Deep Convolutional Auto-encoding Neural Networks (DCAN) for addressing the issue of ball mill bearing fault detection. The proposed approach leverages vibration data collected during normal operation for training, overcoming challenges such as labeling issues and data imbalance often encountered in supervised learning methods. DCAN includes the modules of convolutional feature extraction and transposed convolutional feature reconstruction, demonstrating exceptional capabilities in signal processing and feature extraction. Additionally, the paper describes the practical deployment of the DCAN-based anomaly detection model for bearing fault detection, utilizing data from the ball mill bearings of Wuhan Iron & Steel Resources Group and fault data from NASA's bearing vibration dataset. Experimental results validate the DCAN model's reliability in recognizing fault vibration patterns. This method holds promise for enhancing bearing fault detection efficiency, reducing production interruptions, and lowering maintenance costs.
    Randomized Adversarial Style Perturbations for Domain Generalization. (arXiv:2304.01959v2 [cs.CV] UPDATED)
    We propose a novel domain generalization technique, referred to as Randomized Adversarial Style Perturbation (RASP), which is motivated by the observation that the characteristics of each domain are captured by the feature statistics corresponding to style. The proposed algorithm perturbs the style of a feature in an adversarial direction towards a randomly selected class, and makes the model learn against being misled by the unexpected styles observed in unseen target domains. While RASP is effective to handle domain shifts, its naive integration into the training procedure might degrade the capability of learning knowledge from source domains because it has no restriction on the perturbations of representations. This challenge is alleviated by Normalized Feature Mixup (NFM), which facilitates the learning of the original features while achieving robustness to perturbed representations via their mixup during training. We evaluate the proposed algorithm via extensive experiments on various benchmarks and show that our approach improves domain generalization performance, especially in large-scale benchmarks.
    Risk-sensitive Markov Decision Process and Learning under General Utility Functions. (arXiv:2311.13589v1 [cs.LG])
    Reinforcement Learning (RL) has gained substantial attention across diverse application domains and theoretical investigations. Existing literature on RL theory largely focuses on risk-neutral settings where the decision-maker learns to maximize the expected cumulative reward. However, in practical scenarios such as portfolio management and e-commerce recommendations, decision-makers often persist in heterogeneous risk preferences subject to outcome uncertainties, which can not be well-captured by the risk-neural framework. Incorporating these preferences can be approached through utility theory, yet the development of risk-sensitive RL under general utility functions remains an open question for theoretical exploration. In this paper, we consider a scenario where the decision-maker seeks to optimize a general utility function of the cumulative reward in the framework of a Markov decision process (MDP). To facilitate the Dynamic Programming Principle and Bellman equation, we enlarge the state space with an additional dimension that accounts for the cumulative reward. We propose a discretized approximation scheme to the MDP under enlarged state space, which is tractable and key for algorithmic design. We then propose a modified value iteration algorithm that employs an epsilon-covering over the space of cumulative reward. When a simulator is accessible, our algorithm efficiently learns a near-optimal policy with guaranteed sample complexity. In the absence of a simulator, our algorithm, designed with an upper-confidence-bound exploration approach, identifies a near-optimal policy while ensuring a guaranteed regret bound. For both algorithms, we match the theoretical lower bounds for the risk-neutral setting.
    A General Theoretical Paradigm to Understand Learning from Human Preferences. (arXiv:2310.12036v2 [cs.AI] UPDATED)
    The prevalent deployment of learning from human preferences through reinforcement learning (RLHF) relies on two important approximations: the first assumes that pairwise preferences can be substituted with pointwise rewards. The second assumes that a reward model trained on these pointwise rewards can generalize from collected data to out-of-distribution data sampled by the policy. Recently, Direct Preference Optimisation (DPO) has been proposed as an approach that bypasses the second approximation and learn directly a policy from collected data without the reward modelling stage. However, this method still heavily relies on the first approximation. In this paper we try to gain a deeper theoretical understanding of these practical algorithms. In particular we derive a new general objective called $\Psi$PO for learning from human preferences that is expressed in terms of pairwise preferences and therefore bypasses both approximations. This new general objective allows us to perform an in-depth analysis of the behavior of RLHF and DPO (as special cases of $\Psi$PO) and to identify their potential pitfalls. We then consider another special case for $\Psi$PO by setting $\Psi$ simply to Identity, for which we can derive an efficient optimisation procedure, prove performance guarantees and demonstrate its empirical superiority to DPO on some illustrative examples.
    BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions. (arXiv:2308.09936v2 [cs.CV] UPDATED)
    Vision Language Models (VLMs), which extend Large Language Models (LLM) by incorporating visual understanding capability, have demonstrated significant advancements in addressing open-ended visual question-answering (VQA) tasks. However, these models cannot accurately interpret images infused with text, a common occurrence in real-world scenarios. Standard procedures for extracting information from images often involve learning a fixed set of query embeddings. These embeddings are designed to encapsulate image contexts and are later used as soft prompt inputs in LLMs. Yet, this process is limited to the token count, potentially curtailing the recognition of scenes with text-rich context. To improve upon them, the present study introduces BLIVA: an augmented version of InstructBLIP with Visual Assistant. BLIVA incorporates the query embeddings from InstructBLIP and also directly projects encoded patch embeddings into the LLM, a technique inspired by LLaVA. This approach assists the model to capture intricate details potentially missed during the query decoding process. Empirical evidence demonstrates that our model, BLIVA, significantly enhances performance in processing text-rich VQA benchmarks (up to 17.76% in OCR-VQA benchmark) and in undertaking general (not particularly text-rich) VQA benchmarks (up to 7.9% in Visual Spatial Reasoning benchmark), comparing to our baseline InstructBLIP. BLIVA demonstrates significant capability in decoding real-world images, irrespective of text presence. To demonstrate the broad industry applications enabled by BLIVA, we evaluate the model using a new dataset comprising YouTube thumbnails paired with question-answer sets across 11 diverse categories. For researchers interested in further exploration, our code and models are freely accessible at https://github.com/mlpc-ucsd/BLIVA.
    Scalable Real-Time Recurrent Learning Using Columnar-Constructive Networks. (arXiv:2302.05326v3 [cs.LG] UPDATED)
    Constructing states from sequences of observations is an important component of reinforcement learning agents. One solution for state construction is to use recurrent neural networks. Back-propagation through time (BPTT), and real-time recurrent learning (RTRL) are two popular gradient-based methods for recurrent learning. BPTT requires complete trajectories of observations before it can compute the gradients and is unsuitable for online updates. RTRL can do online updates but scales poorly to large networks. In this paper, we propose two constraints that make RTRL scalable. We show that by either decomposing the network into independent modules or learning the network in stages, we can make RTRL scale linearly with the number of parameters. Unlike prior scalable gradient estimation algorithms, such as UORO and Truncated-BPTT, our algorithms do not add noise or bias to the gradient estimate. Instead, they trade off the functional capacity of the network for computationally efficient learning. We demonstrate the effectiveness of our approach over Truncated-BPTT on a prediction benchmark inspired by animal learning and by doing policy evaluation of pre-trained policies for Atari 2600 games.
    Looking at the posterior: accuracy and uncertainty of neural-network predictions. (arXiv:2211.14605v2 [cs.LG] UPDATED)
    Bayesian inference can quantify uncertainty in the predictions of neural networks using posterior distributions for model parameters and network output. By looking at these posterior distributions, one can separate the origin of uncertainty into aleatoric and epistemic contributions. One goal of uncertainty quantification is to inform on prediction accuracy. Here we show that prediction accuracy depends on both epistemic and aleatoric uncertainty in an intricate fashion that cannot be understood in terms of marginalized uncertainty distributions alone. How the accuracy relates to epistemic and aleatoric uncertainties depends not only on the model architecture, but also on the properties of the dataset. We discuss the significance of these results for active learning and introduce a novel acquisition function that outperforms common uncertainty-based methods. To arrive at our results, we approximated the posteriors using deep ensembles, for fully-connected, convolutional and attention-based neural networks.
    Using Taylor-Approximated Gradients to Improve the Frank-Wolfe Method for Empirical Risk Minimization. (arXiv:2208.13933v2 [cs.LG] UPDATED)
    The Frank-Wolfe method has become increasingly useful in statistical and machine learning applications, due to the structure-inducing properties of the iterates, and especially in settings where linear minimization over the feasible set is more computationally efficient than projection. In the setting of Empirical Risk Minimization -- one of the fundamental optimization problems in statistical and machine learning -- the computational effectiveness of Frank-Wolfe methods typically grows linearly in the number of data observations $n$. This is in stark contrast to the case for typical stochastic projection methods. In order to reduce this dependence on $n$, we look to second-order smoothness of typical smooth loss functions (least squares loss and logistic loss, for example) and we propose amending the Frank-Wolfe method with Taylor series-approximated gradients, including variants for both deterministic and stochastic settings. Compared with current state-of-the-art methods in the regime where the optimality tolerance $\varepsilon$ is sufficiently small, our methods are able to simultaneously reduce the dependence on large $n$ while obtaining optimal convergence rates of Frank-Wolfe methods, in both the convex and non-convex settings. We also propose a novel adaptive step-size approach for which we have computational guarantees. Last of all, we present computational experiments which show that our methods exhibit very significant speed-ups over existing methods on real-world datasets for both convex and non-convex binary classification problems.
    Provably Efficient High-Dimensional Bandit Learning with Batched Feedbacks. (arXiv:2311.13180v1 [stat.ML])
    We study high-dimensional multi-armed contextual bandits with batched feedback where the $T$ steps of online interactions are divided into $L$ batches. In specific, each batch collects data according to a policy that depends on previous batches and the rewards are revealed only at the end of the batch. Such a feedback structure is popular in applications such as personalized medicine and online advertisement, where the online data often do not arrive in a fully serial manner. We consider high-dimensional and linear settings where the reward function of the bandit model admits either a sparse or low-rank structure and ask how small a number of batches are needed for a comparable performance with fully dynamic data in which $L = T$. For these settings, we design a provably sample-efficient algorithm which achieves a $ \mathcal{\tilde O}(s_0^2 \log^2 T)$ regret in the sparse case and $ \mathcal{\tilde O} ( r ^2 \log^2 T)$ regret in the low-rank case, using only $L = \mathcal{O}( \log T)$ batches. Here $s_0$ and $r$ are the sparsity and rank of the reward parameter in sparse and low-rank cases, respectively, and $ \mathcal{\tilde O}(\cdot)$ omits logarithmic factors involving the feature dimensions. In other words, our algorithm achieves regret bounds comparable to those in fully sequential setting with only $\mathcal{O}( \log T)$ batches. Our algorithm features a novel batch allocation method that adjusts the batch sizes according to the estimation accuracy within each batch and cumulative regret. Furthermore, we also conduct experiments with synthetic and real-world data to validate our theory.
    Favour: FAst Variance Operator for Uncertainty Rating. (arXiv:2311.13036v1 [cs.LG])
    Bayesian Neural Networks (BNN) have emerged as a crucial approach for interpreting ML predictions. By sampling from the posterior distribution, data scientists may estimate the uncertainty of an inference. Unfortunately many inference samples are often needed, the overhead of which greatly hinder BNN's wide adoption. To mitigate this, previous work proposed propagating the first and second moments of the posterior directly through the network. However, on its own this method is even slower than sampling, so the propagated variance needs to be approximated such as assuming independence between neural nodes. The resulting trade-off between quality and inference time did not match even plain Monte Carlo sampling. Our contribution is a more principled variance propagation framework based on "spiked covariance matrices", which smoothly interpolates between quality and inference time. This is made possible by a new fast algorithm for updating a diagonal-plus-low-rank matrix approximation under various operations. We tested our algorithm against sampling based MC Dropout and Variational Inference on a number of downstream uncertainty themed tasks, such as calibration and out-of-distribution testing. We find that Favour is as fast as performing 2-3 inference samples, while matching the performance of 10-100 samples. In summary, this work enables the use of BNN in the realm of performance critical tasks where they have previously been out of reach.
    Unsupervised Multimodal Surface Registration with Geometric Deep Learning. (arXiv:2311.13022v1 [cs.LG])
    This paper introduces GeoMorph, a novel geometric deep-learning framework designed for image registration of cortical surfaces. The registration process consists of two main steps. First, independent feature extraction is performed on each input surface using graph convolutions, generating low-dimensional feature representations that capture important cortical surface characteristics. Subsequently, features are registered in a deep-discrete manner to optimize the overlap of common structures across surfaces by learning displacements of a set of control points. To ensure smooth and biologically plausible deformations, we implement regularization through a deep conditional random field implemented with a recurrent neural network. Experimental results demonstrate that GeoMorph surpasses existing deep-learning methods by achieving improved alignment with smoother deformations. Furthermore, GeoMorph exhibits competitive performance compared to classical frameworks. Such versatility and robustness suggest strong potential for various neuroscience applications.
    Learning to Compute Gr\"obner Bases. (arXiv:2311.12904v1 [math.AC])
    Solving a polynomial system, or computing an associated Gr\"obner basis, has been a fundamental task in computational algebra. However, it is also known for its notoriously expensive computational cost -- doubly exponential time complexity in the number of variables in the worst case. In this paper, we achieve for the first time Gr\"obner basis computation through the training of a transformer. The training requires many pairs of a polynomial system and the associated Gr\"obner basis, thus motivating us to address two novel algebraic problems: random generation of Gr\"obner bases and the transformation of them into non-Gr\"obner polynomial systems, termed as \textit{backward Gr\"obner problem}. We resolve these problems with zero-dimensional radical ideals, the ideals appearing in various applications. The experiments show that in the five-variate case, the proposed dataset generation method is five orders of magnitude faster than a naive approach, overcoming a crucial challenge in learning to compute Gr\"obner bases.
    SpecHD: Hyperdimensional Computing Framework for FPGA-based Mass Spectrometry Clustering. (arXiv:2311.12874v1 [q-bio.QM])
    Mass spectrometry-based proteomics is a key enabler for personalized healthcare, providing a deep dive into the complex protein compositions of biological systems. This technology has vast applications in biotechnology and biomedicine but faces significant computational bottlenecks. Current methodologies often require multiple hours or even days to process extensive datasets, particularly in the domain of spectral clustering. To tackle these inefficiencies, we introduce SpecHD, a hyperdimensional computing (HDC) framework supplemented by an FPGA-accelerated architecture with integrated near-storage preprocessing. Utilizing streamlined binary operations in an HDC environment, SpecHD capitalizes on the low-latency and parallel capabilities of FPGAs. This approach markedly improves clustering speed and efficiency, serving as a catalyst for real-time, high-throughput data analysis in future healthcare applications. Our evaluations demonstrate that SpecHD not only maintains but often surpasses existing clustering quality metrics while drastically cutting computational time. Specifically, it can cluster a large-scale human proteome dataset-comprising 25 million MS/MS spectra and 131 GB of MS data-in just 5 minutes. With energy efficiency exceeding 31x and a speedup factor that spans a range of 6x to 54x over existing state of-the-art solutions, SpecHD emerges as a promising solution for the rapid analysis of mass spectrometry data with great implications for personalized healthcare.
    Learning to Fly in Seconds. (arXiv:2311.13081v1 [cs.RO])
    Learning-based methods, particularly Reinforcement Learning (RL), hold great promise for streamlining deployment, enhancing performance, and achieving generalization in the control of autonomous multirotor aerial vehicles. Deep RL has been able to control complex systems with impressive fidelity and agility in simulation but the simulation-to-reality transfer often brings a hard-to-bridge reality gap. Moreover, RL is commonly plagued by prohibitively long training times. In this work, we propose a novel asymmetric actor-critic-based architecture coupled with a highly reliable RL-based training paradigm for end-to-end quadrotor control. We show how curriculum learning and a highly optimized simulator enhance sample complexity and lead to fast training times. To precisely discuss the challenges related to low-level/end-to-end multirotor control, we also introduce a taxonomy that classifies the existing levels of control abstractions as well as non-linearities and domain parameters. Our framework enables Simulation-to-Reality (Sim2Real) transfer for direct RPM control after only 18 seconds of training on a consumer-grade laptop as well as its deployment on microcontrollers to control a multirotor under real-time guarantees. Finally, our solution exhibits competitive performance in trajectory tracking, as demonstrated through various experimental comparisons with existing state-of-the-art control solutions using a real Crazyflie nano quadrotor. We open source the code including a very fast multirotor dynamics simulator that can simulate about 5 months of flight per second on a laptop GPU. The fast training times and deployment to a cheap, off-the-shelf quadrotor lower the barriers to entry and help democratize the research and development of these systems.
    AdaptiveFL: Adaptive Heterogeneous Federated Learning for Resource-Constrained AIoT Systems. (arXiv:2311.13166v1 [cs.LG])
    Although Federated Learning (FL) is promising to enable collaborative learning among Artificial Intelligence of Things (AIoT) devices, it suffers from the problem of low classification performance due to various heterogeneity factors (e.g., computing capacity, memory size) of devices and uncertain operating environments. To address these issues, this paper introduces an effective FL approach named AdaptiveFL based on a novel fine-grained width-wise model pruning strategy, which can generate various heterogeneous local models for heterogeneous AIoT devices. By using our proposed reinforcement learning-based device selection mechanism, AdaptiveFL can adaptively dispatch suitable heterogeneous models to corresponding AIoT devices on the fly based on their available resources for local training. Experimental results show that, compared to state-of-the-art methods, AdaptiveFL can achieve up to 16.83% inference improvements for both IID and non-IID scenarios.
    Speak Like a Native: Prompting Large Language Models in a Native Style. (arXiv:2311.13538v1 [cs.AI])
    Existing work has found that the prompt engineering heavily influences the performance of large language models (LLMs). Chain-of-thought (CoT), as a popular prompt engineering technique, prompted LLMs using in-context examples with reasoning steps. In current studies, the few-shot examples of CoT are generally handcrafted by humans. However, how the text style of in-context examples influence the outputs of LLMs still remains under-explored. This paper presents a novel and effective approach, named \textbf{AlignCoT}, to improve the reasoning capability of LLMs by aligning the in-context examples with the native style of LLMs. ``Native'' refers to the inherent characteristic style of LLMs which can be probed by original zero-shot scenarios. AlignCoT is orthogonal to other prompt engineering methods, making it easy to combine with state-of-the-art techniques to further improve the LLMs' performance. We conduct extensive and comprehensive experiments on several benchmarks. The empirical results demonstrate that our AlignCoTsignificantly improves performance over the carefully handcrafted in-context examples. For instance, with GPT-3.5-turbo, we observed a +2.5\% improvement on GSM8K. Furthermore, our AlignCoT consistently improve the performance when combined with other state-of-the-art prompt engineering methods. The source code and dataset will be available at \href{https://github.com/yangzhch6/AlignCoT}{https://github.com/yangzhch6/AlignCoT}.
    Deep Learning-Based Real-Time Quality Control of Standard Video Compression for Live Streaming. (arXiv:2311.12918v1 [eess.IV])
    Ensuring high-quality video content for wireless users has become increasingly vital. Nevertheless, maintaining a consistent level of video quality faces challenges due to the fluctuating encoded bitrate, primarily caused by dynamic video content, especially in live streaming scenarios. Video compression is typically employed to eliminate unnecessary redundancies within and between video frames, thereby reducing the required bandwidth for video transmission. The encoded bitrate and the quality of the compressed video depend on encoder parameters, specifically, the quantization parameter (QP). Poor choices of encoder parameters can result in reduced bandwidth efficiency and high likelihood of non-conformance. Non-conformance refers to the violation of the peak signal-to-noise ratio (PSNR) constraint for an encoded video segment. To address these issues, a real-time deep learning-based H.264 controller is proposed. This controller dynamically estimates the optimal encoder parameters based on the content of a video chunk with minimal delay. The objective is to maintain video quality in terms of PSNR above a specified threshold while minimizing the average bitrate of the compressed video. Experimental results, conducted on both QCIF dataset and a diverse range of random videos from public datasets, validate the effectiveness of this approach. Notably, it achieves improvements of up to 2.5 times in average bandwidth usage compared to the state-of-the-art adaptive bitrate video streaming, with a negligible non-conformance probability below $10^{-2}$.
    Multi-fidelity Bayesian Optimization in Engineering Design. (arXiv:2311.13050v1 [cs.CE])
    Resided at the intersection of multi-fidelity optimization (MFO) and Bayesian optimization (BO), MF BO has found a niche in solving expensive engineering design optimization problems, thanks to its advantages in incorporating physical and mathematical understandings of the problems, saving resources, addressing exploitation-exploration trade-off, considering uncertainty, and processing parallel computing. The increasing number of works dedicated to MF BO suggests the need for a comprehensive review of this advanced optimization technique. In this paper, we survey recent developments of two essential ingredients of MF BO: Gaussian process (GP) based MF surrogates and acquisition functions. We first categorize the existing MF modeling methods and MFO strategies to locate MF BO in a large family of surrogate-based optimization and MFO algorithms. We then exploit the common properties shared between the methods from each ingredient of MF BO to describe important GP-based MF surrogate models and review various acquisition functions. By doing so, we expect to provide a structured understanding of MF BO. Finally, we attempt to reveal important aspects that require further research for applications of MF BO in solving intricate yet important design optimization problems, including constrained optimization, high-dimensional optimization, optimization under uncertainty, and multi-objective optimization.
    Learned Nonlinear Predictor for Critically Sampled 3D Point Cloud Attribute Compression. (arXiv:2311.13539v1 [eess.IV])
    We study 3D point cloud attribute compression via a volumetric approach: assuming point cloud geometry is known at both encoder and decoder, parameters $\theta$ of a continuous attribute function $f: \mathbb{R}^3 \mapsto \mathbb{R}$ are quantized to $\hat{\theta}$ and encoded, so that discrete samples $f_{\hat{\theta}}(\mathbf{x}_i)$ can be recovered at known 3D points $\mathbf{x}_i \in \mathbb{R}^3$ at the decoder. Specifically, we consider a nested sequences of function subspaces $\mathcal{F}^{(p)}_{l_0} \subseteq \cdots \subseteq \mathcal{F}^{(p)}_L$, where $\mathcal{F}_l^{(p)}$ is a family of functions spanned by B-spline basis functions of order $p$, $f_l^*$ is the projection of $f$ on $\mathcal{F}_l^{(p)}$ and encoded as low-pass coefficients $F_l^*$, and $g_l^*$ is the residual function in orthogonal subspace $\mathcal{G}_l^{(p)}$ (where $\mathcal{G}_l^{(p)} \oplus \mathcal{F}_l^{(p)} = \mathcal{F}_{l+1}^{(p)}$) and encoded as high-pass coefficients $G_l^*$. In this paper, to improve coding performance over [1], we study predicting $f_{l+1}^*$ at level $l+1$ given $f_l^*$ at level $l$ and encoding of $G_l^*$ for the $p=1$ case (RAHT($1$)). For the prediction, we formalize RAHT(1) linear prediction in MPEG-PCC in a theoretical framework, and propose a new nonlinear predictor using a polynomial of bilateral filter. We derive equations to efficiently compute the critically sampled high-pass coefficients $G_l^*$ amenable to encoding. We optimize parameters in our resulting feed-forward network on a large training set of point clouds by minimizing a rate-distortion Lagrangian. Experimental results show that our improved framework outperformed the MPEG G-PCC predictor by $11$ to $12\%$ in bit rate reduction.
    Extracting individual variable information for their decoupling, direct mutual information and multi-feature Granger causality. (arXiv:2311.13431v1 [stat.ML])
    Working with multiple variables they usually contain difficult to control complex dependencies. This article proposes extraction of their individual information, e.g. $\overline{X|Y}$ as random variable containing information from $X$, but with removed information about $Y$, by using $(x,y) \leftrightarrow (\bar{x}=\textrm{CDF}_{X|Y=y}(x),y)$ reversible normalization. One application can be decoupling of individual information of variables: reversibly transform $(X_1,\ldots,X_n)\leftrightarrow(\tilde{X}_1,\ldots \tilde{X}_n)$ together containing the same information, but being independent: $\forall_{i\neq j} \tilde{X}_i\perp \tilde{X}_j, \tilde{X}_i\perp X_j$. It requires detailed models of complex conditional probability distributions - it is generally a difficult task, but here can be done through multiple dependency reducing iterations, using imperfect methods (here HCR: Hierarchical Correlation Reconstruction). It could be also used for direct mutual information - evaluating direct information transfer: without use of intermediate variables. For causality direction there is discussed multi-feature Granger causality, e.g. to trace various types of individual information transfers between such decoupled variables, including propagation time (delay).
    Stable Unlearnable Example: Enhancing the Robustness of Unlearnable Examples via Stable Error-Minimizing Noise. (arXiv:2311.13091v1 [cs.LG])
    The open source of large amounts of image data promotes the development of deep learning techniques. Along with this comes the privacy risk of these open-source image datasets being exploited by unauthorized third parties to train deep learning models for commercial or illegal purposes. To avoid the abuse of public data, a poisoning-based technique, the unlearnable example, is proposed to significantly degrade the generalization performance of models by adding a kind of imperceptible noise to the data. To further enhance its robustness against adversarial training, existing works leverage iterative adversarial training on both the defensive noise and the surrogate model. However, it still remains unknown whether the robustness of unlearnable examples primarily comes from the effect of enhancement in the surrogate model or the defensive noise. Observing that simply removing the adversarial noise on the training process of the defensive noise can improve the performance of robust unlearnable examples, we identify that solely the surrogate model's robustness contributes to the performance. Furthermore, we found a negative correlation exists between the robustness of defensive noise and the protection performance, indicating defensive noise's instability issue. Motivated by this, to further boost the robust unlearnable example, we introduce stable error-minimizing noise (SEM), which trains the defensive noise against random perturbation instead of the time-consuming adversarial perturbation to improve the stability of defensive noise. Through extensive experiments, we demonstrate that SEM achieves a new state-of-the-art performance on CIFAR-10, CIFAR-100, and ImageNet Subset in terms of both effectiveness and efficiency. The code is available at https://github.com/liuyixin-louis/Stable-Unlearnable-Example.
    PIE-NeRF: Physics-based Interactive Elastodynamics with NeRF. (arXiv:2311.13099v1 [cs.CV])
    We show that physics-based simulations can be seamlessly integrated with NeRF to generate high-quality elastodynamics of real-world objects. Unlike existing methods, we discretize nonlinear hyperelasticity in a meshless way, obviating the necessity for intermediate auxiliary shape proxies like a tetrahedral mesh or voxel grid. A quadratic generalized moving least square (Q-GMLS) is employed to capture nonlinear dynamics and large deformation on the implicit model. Such meshless integration enables versatile simulations of complex and codimensional shapes. We adaptively place the least-square kernels according to the NeRF density field to significantly reduce the complexity of the nonlinear simulation. As a result, physically realistic animations can be conveniently synthesized using our method for a wide range of hyperelastic materials at an interactive rate. For more information, please visit our project page at https://fytalon.github.io/pienerf/.
    Predict-Then-Optimize by Proxy: Learning Joint Models of Prediction and Optimization. (arXiv:2311.13087v1 [cs.LG])
    Many real-world decision processes are modeled by optimization problems whose defining parameters are unknown and must be inferred from observable data. The Predict-Then-Optimize framework uses machine learning models to predict unknown parameters of an optimization problem from features before solving. Recent works show that decision quality can be improved in this setting by solving and differentiating the optimization problem in the training loop, enabling end-to-end training with loss functions defined directly on the resulting decisions. However, this approach can be inefficient and requires handcrafted, problem-specific rules for backpropagation through the optimization step. This paper proposes an alternative method, in which optimal solutions are learned directly from the observable features by predictive models. The approach is generic, and based on an adaptation of the Learning-to-Optimize paradigm, from which a rich variety of existing techniques can be employed. Experimental evaluations show the ability of several Learning-to-Optimize methods to provide efficient, accurate, and flexible solutions to an array of challenging Predict-Then-Optimize problems.
    Nav-Q: Quantum Deep Reinforcement Learning for Collision-Free Navigation of Self-Driving Cars. (arXiv:2311.12875v1 [quant-ph])
    The challenge of collision-free navigation (CFN) for self-driving cars is an NP-hard problem addressed through Deep Reinforcement Learning (DRL). Despite the effectiveness of DRL methods, their application demands significant computing resources and prolonged training periods to establish a resilient agent. On the other hand, quantum reinforcement learning algorithms have recently demonstrated faster convergence and improved stability in simple, non-real-world environments. However, their application in the real-world CFN domain has not been explored, and their direct adaptation would require a quantum computing device onboard the vehicle for testing. In this work, we propose Nav-Q, the first quantum-supported DRL algorithm for CFN of self-driving cars, that leverages quantum computation for improving the training performance without the requirement for onboard quantum hardware. Nav-Q is based on the actor-critic approach, where the critic is implemented using a hybrid quantum-classical algorithm suitable for near-term quantum devices. We assess the performance of Nav-Q using the CARLA driving simulator, a de facto standard benchmark for evaluating state-of-the-art DRL methods. Our empirical evaluations showcase that Nav-Q surpasses its classical counterpart not only in terms of training stability but also, in certain instances, with respect to the convergence rate when analyzing the Reward vs. Episode curve. This enhancement is accomplished without negatively impacting the learned policy by the agent. Furthermore, we assess Nav-Q in relation to effective dimension, unveiling that the incorporation of a quantum component results in a model possessing greater descriptive power compared to classical baselines. Finally, we evaluate the performance of Nav-Q using noisy quantum simulation, observing that the quantum noise enhances the exploratory tendencies of the agent during training.
    A note on estimating the dimension from a random geometric graph. (arXiv:2311.13059v1 [stat.ML])
    Let $G_n$ be a random geometric graph with vertex set $[n]$ based on $n$ i.i.d.\ random vectors $X_1,\ldots,X_n$ drawn from an unknown density $f$ on $\R^d$. An edge $(i,j)$ is present when $\|X_i -X_j\| \le r_n$, for a given threshold $r_n$ possibly depending upon $n$, where $\| \cdot \|$ denotes Euclidean distance. We study the problem of estimating the dimension $d$ of the underlying space when we have access to the adjacency matrix of the graph but do not know $r_n$ or the vectors $X_i$. The main result of the paper is that there exists an estimator of $d$ that converges to $d$ in probability as $n \to \infty$ for all densities with $\int f^5 < \infty$ whenever $n^{3/2} r_n^d \to \infty$ and $r_n = o(1)$. The conditions allow very sparse graphs since when $n^{3/2} r_n^d \to 0$, the graph contains isolated edges only, with high probability. We also show that, without any condition on the density, a consistent estimator of $d$ exists when $n r_n^d \to \infty$ and $r_n = o(1)$.
    Efficient Vision Transformer for Human Pose Estimation via Patch Selection. (arXiv:2306.04225v2 [cs.CV] UPDATED)
    While Convolutional Neural Networks (CNNs) have been widely successful in 2D human pose estimation, Vision Transformers (ViTs) have emerged as a promising alternative to CNNs, boosting state-of-the-art performance. However, the quadratic computational complexity of ViTs has limited their applicability for processing high-resolution images. In this paper, we propose three methods for reducing ViT's computational complexity, which are based on selecting and processing a small number of most informative patches while disregarding others. The first two methods leverage a lightweight pose estimation network to guide the patch selection process, while the third method utilizes a set of learnable joint tokens to ensure that the selected patches contain the most important information about body joints. Experiments across six benchmarks show that our proposed methods achieve a significant reduction in computational complexity, ranging from 30% to 44%, with only a minimal drop in accuracy between 0% and 3.5%.
    From Optimization to Control: Quasi Policy Iteration. (arXiv:2311.11166v1 [math.OC] CROSS LISTED)
    Recent control algorithms for Markov decision processes (MDPs) have been designed using an implicit analogy with well-established optimization algorithms. In this paper, we make this analogy explicit across four problem classes with a unified solution characterization. This novel framework, in turn, allows for a systematic transformation of algorithms from one domain to the other. In particular, we identify equivalent optimization and control algorithms that have already been pointed out in the existing literature, but mostly in a scattered way. With this unifying framework in mind, we then exploit two linear structural constraints specific to MDPs for approximating the Hessian in a second-order-type algorithm from optimization, namely, Anderson mixing. This leads to a novel first-order control algorithm that modifies the standard value iteration (VI) algorithm by incorporating two new directions and adaptive step sizes. While the proposed algorithm, coined as quasi-policy iteration, has the same computational complexity as VI, it interestingly exhibits an empirical convergence behavior similar to policy iteration with a very low sensitivity to the discount factor.
    Subspace-Configurable Networks. (arXiv:2305.13536v2 [cs.LG] UPDATED)
    While the deployment of deep learning models on edge devices is increasing, these models often lack robustness when faced with dynamic changes in sensed data. This can be attributed to sensor drift, or variations in the data compared to what was used during offline training due to factors such as specific sensor placement or naturally changing sensing conditions. Hence, achieving the desired robustness necessitates the utilization of either an invariant architecture or specialized training approaches, like data augmentation. Alternatively, input transformations can be treated as a domain shift problem, and solved by post-deployment model adaptation. In this paper, we train a parameterized subspace of configurable networks, where an optimal network for a particular parameter setting is part of this subspace. The obtained subspace is low-dimensional and has a surprisingly simple structure even for complex, non-invertible transformations of the input, leading to an exceptionally high efficiency of subspace-configurable networks (SCNs) when limited storage and computing resources are at stake. We evaluate SCNs on a wide range of standard datasets, architectures, and transformations, and demonstrate their power on resource-constrained IoT devices, where they can take up to 2.4 times less RAM and be 7.6 times faster at inference time than a model that achieves the same test set accuracy, yet is trained with data augmentations to cover the desired range of input transformations.
    Diffusion models for probabilistic programming. (arXiv:2311.00474v2 [cs.LG] UPDATED)
    We propose Diffusion Model Variational Inference (DMVI), a novel method for automated approximate inference in probabilistic programming languages (PPLs). DMVI utilizes diffusion models as variational approximations to the true posterior distribution by deriving a novel bound to the marginal likelihood objective used in Bayesian modelling. DMVI is easy to implement, allows hassle-free inference in PPLs without the drawbacks of, e.g., variational inference using normalizing flows, and does not make any constraints on the underlying neural network model. We evaluate DMVI on a set of common Bayesian models and show that its posterior inferences are in general more accurate than those of contemporary methods used in PPLs while having a similar computational cost and requiring less manual tuning.
    Generalized super-resolution 4D Flow MRI $\unicode{x2013}$ using ensemble learning to extend across the cardiovascular system. (arXiv:2311.11819v2 [eess.IV] UPDATED)
    4D Flow Magnetic Resonance Imaging (4D Flow MRI) is a non-invasive measurement technique capable of quantifying blood flow across the cardiovascular system. While practical use is limited by spatial resolution and image noise, incorporation of trained super-resolution (SR) networks has potential to enhance image quality post-scan. However, these efforts have predominantly been restricted to narrowly defined cardiovascular domains, with limited exploration of how SR performance extends across the cardiovascular system; a task aggravated by contrasting hemodynamic conditions apparent across the cardiovasculature. The aim of our study was to explore the generalizability of SR 4D Flow MRI using a combination of heterogeneous training sets and dedicated ensemble learning. With synthetic training data generated across three disparate domains (cardiac, aortic, cerebrovascular), varying convolutional base and ensemble learners were evaluated as a function of domain and architecture, quantifying performance on both in-silico and acquired in-vivo data from the same three domains. Results show that both bagging and stacking ensembling enhance SR performance across domains, accurately predicting high-resolution velocities from low-resolution input data in-silico. Likewise, optimized networks successfully recover native resolution velocities from downsampled in-vivo data, as well as show qualitative potential in generating denoised SR-images from clinical level input data. In conclusion, our work presents a viable approach for generalized SR 4D Flow MRI, with ensemble learning extending utility across various clinical areas of interest.
    Neural-Integrated Meshfree (NIM) Method: A differentiable programming-based hybrid solver for computational mechanics. (arXiv:2311.12915v1 [cs.LG])
    We present the neural-integrated meshfree (NIM) method, a differentiable programming-based hybrid meshfree approach within the field of computational mechanics. NIM seamlessly integrates traditional physics-based meshfree discretization techniques with deep learning architectures. It employs a hybrid approximation scheme, NeuroPU, to effectively represent the solution by combining continuous DNN representations with partition of unity (PU) basis functions associated with the underlying spatial discretization. This neural-numerical hybridization not only enhances the solution representation through functional space decomposition but also reduces both the size of DNN model and the need for spatial gradient computations based on automatic differentiation, leading to a significant improvement in training efficiency. Under the NIM framework, we propose two truly meshfree solvers: the strong form-based NIM (S-NIM) and the local variational form-based NIM (V-NIM). In the S-NIM solver, the strong-form governing equation is directly considered in the loss function, while the V-NIM solver employs a local Petrov-Galerkin approach that allows the construction of variational residuals based on arbitrary overlapping subdomains. This ensures both the satisfaction of underlying physics and the preservation of meshfree property. We perform extensive numerical experiments on both stationary and transient benchmark problems to assess the effectiveness of the proposed NIM methods in terms of accuracy, scalability, generalizability, and convergence properties. Moreover, comparative analysis with other physics-informed machine learning methods demonstrates that NIM, especially V-NIM, significantly enhances both accuracy and efficiency in end-to-end predictive capabilities.
    The effect of speech pathology on automatic speaker verification -- a large-scale study. (arXiv:2204.06450v3 [cs.SD] UPDATED)
    Navigating the challenges of data-driven speech processing, one of the primary hurdles is accessing reliable pathological speech data. While public datasets appear to offer solutions, they come with inherent risks of potential unintended exposure of patient health information via re-identification attacks. Using a comprehensive real-world pathological speech corpus, with over n=3,800 test subjects spanning various age groups and speech disorders, we employed a deep-learning-driven automatic speaker verification (ASV) approach. This resulted in a notable mean equal error rate (EER) of 0.89% with a standard deviation of 0.06%, outstripping traditional benchmarks. Our comprehensive assessments demonstrate that pathological speech overall faces heightened privacy breach risks compared to healthy speech. Specifically, adults with dysphonia are at heightened re-identification risks, whereas conditions like dysarthria yield results comparable to those of healthy speakers. Crucially, speech intelligibility does not influence the ASV system's performance metrics. In pediatric cases, particularly those with cleft lip and palate, the recording environment plays a decisive role in re-identification. Merging data across pathological types led to a marked EER decrease, suggesting the potential benefits of pathological diversity in ASV, accompanied by a logarithmic boost in ASV effectiveness. In essence, this research sheds light on the dynamics between pathological speech and speaker verification, emphasizing its crucial role in safeguarding patient confidentiality in our increasingly digitized healthcare era.
    GraphCFC: A Directed Graph Based Cross-Modal Feature Complementation Approach for Multimodal Conversational Emotion Recognition. (arXiv:2207.12261v4 [cs.CL] UPDATED)
    Emotion Recognition in Conversation (ERC) plays a significant part in Human-Computer Interaction (HCI) systems since it can provide empathetic services. Multimodal ERC can mitigate the drawbacks of uni-modal approaches. Recently, Graph Neural Networks (GNNs) have been widely used in a variety of fields due to their superior performance in relation modeling. In multimodal ERC, GNNs are capable of extracting both long-distance contextual information and inter-modal interactive information. Unfortunately, since existing methods such as MMGCN directly fuse multiple modalities, redundant information may be generated and diverse information may be lost. In this work, we present a directed Graph based Cross-modal Feature Complementation (GraphCFC) module that can efficiently model contextual and interactive information. GraphCFC alleviates the problem of heterogeneity gap in multimodal fusion by utilizing multiple subspace extractors and Pair-wise Cross-modal Complementary (PairCC) strategy. We extract various types of edges from the constructed graph for encoding, thus enabling GNNs to extract crucial contextual and interactive information more accurately when performing message passing. Furthermore, we design a GNN structure called GAT-MLP, which can provide a new unified network framework for multimodal learning. The experimental results on two benchmark datasets show that our GraphCFC outperforms the state-of-the-art (SOTA) approaches.
    Optimal Convergence Rate for Exact Policy Mirror Descent in Discounted Markov Decision Processes. (arXiv:2302.11381v3 [math.OC] UPDATED)
    Policy Mirror Descent (PMD) is a general family of algorithms that covers a wide range of novel and fundamental methods in reinforcement learning. Motivated by the instability of policy iteration (PI) with inexact policy evaluation, PMD algorithmically regularises the policy improvement step of PI. With exact policy evaluation, PI is known to converge linearly with a rate given by the discount factor $\gamma$ of a Markov Decision Process. In this work, we bridge the gap between PI and PMD with exact policy evaluation and show that the dimension-free $\gamma$-rate of PI can be achieved by the general family of unregularised PMD algorithms under an adaptive step-size. We show that both the rate and step-size are unimprovable for PMD: we provide matching lower bounds that demonstrate that the $\gamma$-rate is optimal for PMD methods as well as PI, and that the adaptive step-size is necessary for PMD to achieve it. Our work is the first to relate PMD to rate-optimality and step-size necessity. Our study of the convergence of PMD avoids the use of the performance difference lemma, which leads to a direct analysis of independent interest. We also extend the analysis to the inexact setting and establish the first dimension-optimal sample complexity for unregularised PMD under a generative model, improving upon the best-known result.
    Variational Connectionist Temporal Classification for Order-Preserving Sequence Modeling. (arXiv:2309.11983v2 [cs.LG] UPDATED)
    Connectionist temporal classification (CTC) is commonly adopted for sequence modeling tasks like speech recognition, where it is necessary to preserve order between the input and target sequences. However, CTC is only applied to deterministic sequence models, where the latent space is discontinuous and sparse, which in turn makes them less capable of handling data variability when compared to variational models. In this paper, we integrate CTC with a variational model and derive loss functions that can be used to train more generalizable sequence models that preserve order. Specifically, we derive two versions of the novel variational CTC based on two reasonable assumptions, the first being that the variational latent variables at each time step are conditionally independent; and the second being that these latent variables are Markovian. We show that both loss functions allow direct optimization of the variational lower bound for the model log-likelihood, and present computationally tractable forms for implementing them.
    Hinge-Wasserstein: Mitigating Overconfidence in Regression by Classification. (arXiv:2306.00560v2 [cs.LG] UPDATED)
    Computer vision systems that are deployed in safety-critical applications need to quantify their output uncertainty. We study regression from images to parameter values and here it is common to detect uncertainty by predicting probability distributions. In this context, we investigate the regression-by-classification paradigm which can represent multimodal distributions, without a prior assumption on the number of modes. Through experiments on a specifically designed synthetic dataset, we demonstrate that traditional loss functions lead to poor probability distribution estimates and severe overconfidence, in the absence of full ground truth distributions. In order to alleviate these issues, we propose hinge-Wasserstein -- a simple improvement of the Wasserstein loss that reduces the penalty for weak secondary modes during training. This enables prediction of complex distributions with multiple modes, and allows training on datasets where full ground truth distributions are not available. In extensive experiments, we show that the proposed loss leads to substantially better uncertainty estimation on two challenging computer vision tasks: horizon line detection and stereo disparity estimation.
    Diffusion Model Alignment Using Direct Preference Optimization. (arXiv:2311.12908v1 [cs.CV])
    Large language models (LLMs) are fine-tuned using human comparison data with Reinforcement Learning from Human Feedback (RLHF) methods to make them better aligned with users' preferences. In contrast to LLMs, human preference learning has not been widely explored in text-to-image diffusion models; the best existing approach is to fine-tune a pretrained model using carefully curated high quality images and captions to improve visual appeal and text alignment. We propose Diffusion-DPO, a method to align diffusion models to human preferences by directly optimizing on human comparison data. Diffusion-DPO is adapted from the recently developed Direct Preference Optimization (DPO), a simpler alternative to RLHF which directly optimizes a policy that best satisfies human preferences under a classification objective. We re-formulate DPO to account for a diffusion model notion of likelihood, utilizing the evidence lower bound to derive a differentiable objective. Using the Pick-a-Pic dataset of 851K crowdsourced pairwise preferences, we fine-tune the base model of the state-of-the-art Stable Diffusion XL (SDXL)-1.0 model with Diffusion-DPO. Our fine-tuned base model significantly outperforms both base SDXL-1.0 and the larger SDXL-1.0 model consisting of an additional refinement model in human evaluation, improving visual appeal and prompt alignment. We also develop a variant that uses AI feedback and has comparable performance to training on human preferences, opening the door for scaling of diffusion model alignment methods.
    Efficient Numerical Integration in Reproducing Kernel Hilbert Spaces via Leverage Scores Sampling. (arXiv:2311.13548v1 [stat.ML])
    In this work we consider the problem of numerical integration, i.e., approximating integrals with respect to a target probability measure using only pointwise evaluations of the integrand. We focus on the setting in which the target distribution is only accessible through a set of $n$ i.i.d. observations, and the integrand belongs to a reproducing kernel Hilbert space. We propose an efficient procedure which exploits a small i.i.d. random subset of $m<n$ samples drawn either uniformly or using approximate leverage scores from the initial observations. Our main result is an upper bound on the approximation error of this procedure for both sampling strategies. It yields sufficient conditions on the subsample size to recover the standard (optimal) $n^{-1/2}$ rate while reducing drastically the number of functions evaluations, and thus the overall computational cost. Moreover, we obtain rates with respect to the number $m$ of evaluations of the integrand which adapt to its smoothness, and match known optimal rates for instance for Sobolev spaces. We illustrate our theoretical findings with numerical experiments on real datasets, which highlight the attractive efficiency-accuracy tradeoff of our method compared to existing randomized and greedy quadrature methods. We note that, the problem of numerical integration in RKHS amounts to designing a discrete approximation of the kernel mean embedding of the target distribution. As a consequence, direct applications of our results also include the efficient computation of maximum mean discrepancies between distributions and the design of efficient kernel-based tests.
    Labeling Neural Representations with Inverse Recognition. (arXiv:2311.13594v1 [cs.LG])
    Deep Neural Networks (DNNs) demonstrated remarkable capabilities in learning complex hierarchical data representations, but the nature of these representations remains largely unknown. Existing global explainability methods, such as Network Dissection, face limitations such as reliance on segmentation masks, lack of statistical significance testing, and high computational demands. We propose Inverse Recognition (INVERT), a scalable approach for connecting learned representations with human-understandable concepts by leveraging their capacity to discriminate between these concepts. In contrast to prior work, INVERT is capable of handling diverse types of neurons, exhibits less computational complexity, and does not rely on the availability of segmentation masks. Moreover, INVERT provides an interpretable metric assessing the alignment between the representation and its corresponding explanation and delivering a measure of statistical significance, emphasizing its utility and credibility. We demonstrate the applicability of INVERT in various scenarios, including the identification of representations affected by spurious correlations, and the interpretation of the hierarchical structure of decision-making within the models.
    Applying Dimensionality Reduction as Precursor to LSTM-CNN Models for Classifying Imagery and Motor Signals in ECoG-Based BCIs. (arXiv:2311.13507v1 [cs.LG])
    Motor impairments, frequently caused by neurological incidents like strokes or traumatic brain injuries, present substantial obstacles in rehabilitation therapy. This research aims to elevate the field by optimizing motor imagery classification algorithms within Brain-Computer Interfaces (BCIs). By improving the efficiency of BCIs, we offer a novel approach that holds significant promise for enhancing motor rehabilitation outcomes. Utilizing unsupervised techniques for dimensionality reduction, namely Uniform Manifold Approximation and Projection (UMAP) coupled with K-Nearest Neighbors (KNN), we evaluate the necessity of employing supervised methods such as Long Short-Term Memory (LSTM) and Convolutional Neural Networks (CNNs) for classification tasks. Importantly, participants who exhibited high KNN scores following UMAP dimensionality reduction also achieved high accuracy in supervised deep learning (DL) models. Due to individualized model requirements and massive neural training data, dimensionality reduction becomes an effective preprocessing step that minimizes the need for extensive data labeling and supervised deep learning techniques. This approach has significant implications not only for targeted therapies in motor dysfunction but also for addressing regulatory, safety, and reliability concerns in the rapidly evolving BCI field.
    Comparative Analysis of Linear Regression, Gaussian Elimination, and LU Decomposition for CT Real Estate Purchase Decisions. (arXiv:2311.13471v1 [cs.LG])
    This paper presents a comprehensive evaluation of three distinct computational algorithms applied to the decision-making process of real estate purchases. Specifically, we analyze the efficacy of Linear Regression from Scikit-learn library, Gaussian Elimination with partial pivoting, and LU Decomposition in predicting the advisability of buying a house in the State of Connecticut based on a set of financial and market-related parameters. The algorithms' performances were compared using a dataset encompassing town-specific details, yearly data, interest rates, and median sale ratios. Our results demonstrate significant differences in predictive accuracy, with Linear Regression and LU Decomposition providing the most reliable recommendations and Gaussian Elimination showing limitations in stability and performance. The study's findings emphasize the importance of algorithm selection in predictive analytic and offer insights into the practical applications of computational methods in real estate investment strategies. By evaluating model efficacy through metrics such as R-squared scores and Mean Squared Error, we provide a nuanced understanding of each method's strengths and weaknesses, contributing valuable knowledge to the fields of real estate analysis and predictive modeling.
    Epsilon*: Privacy Metric for Machine Learning Models. (arXiv:2307.11280v2 [cs.LG] UPDATED)
    We introduce Epsilon*, a new privacy metric for measuring the privacy risk of a single model instance prior to, during, or after deployment of privacy mitigation strategies. The metric requires only black-box access to model predictions, does not require training data re-sampling or model re-training, and can be used to measure the privacy risk of models not trained with differential privacy. Epsilon* is a function of true positive and false positive rates in a hypothesis test used by an adversary in a membership inference attack. We distinguish between quantifying the privacy loss of a trained model instance, which we refer to as empirical privacy, and quantifying the privacy loss of the training mechanism which produces this model instance. Existing approaches in the privacy auditing literature provide lower bounds for the latter, while our metric provides an empirical lower bound for the former by relying on an (${\epsilon}$, ${\delta}$)-type of quantification of the privacy of the trained model instance. We establish a relationship between these lower bounds and show how to implement Epsilon* to avoid numerical and noise amplification instability. We further show in experiments on benchmark public data sets that Epsilon* is sensitive to privacy risk mitigation by training with differential privacy (DP), where the value of Epsilon* is reduced by up to 800% compared to the Epsilon* values of non-DP trained baseline models. This metric allows privacy auditors to be independent of model owners, and enables visualizing the privacy-utility landscape to make informed decisions regarding the trade-offs between model privacy and utility.
    Sensor Fault Detection and Isolation in Autonomous Nonlinear Systems Using Neural Network-Based Observers. (arXiv:2304.08837v2 [math.OC] UPDATED)
    This paper presents a novel observer-based approach to detect and isolate faulty sensors in nonlinear systems. The proposed sensor fault detection and isolation (s-FDI) method applies to a general class of nonlinear systems. Our focus is on s-FDI for two types of faults: complete failure and sensor degradation. The key aspect of this approach lies in the utilization of a neural network-based Kazantzis-Kravaris/Luenberger (KKL) observer. The neural network is trained to learn the dynamics of the observer, enabling accurate output predictions of the system. Sensor faults are detected by comparing the actual output measurements with the predicted values. If the difference surpasses a theoretical threshold, a sensor fault is detected. To identify and isolate which sensor is faulty, we compare the numerical difference of each sensor meassurement with an empirically derived threshold. We derive both theoretical and empirical thresholds for detection and isolation, respectively. Notably, the proposed approach is robust to measurement noise and system uncertainties. Its effectiveness is demonstrated through numerical simulations of sensor faults in a network of Kuramoto oscillators.
    Goal Space Abstraction in Hierarchical Reinforcement Learning via Set-Based Reachability Analysis. (arXiv:2309.07675v2 [cs.LG] UPDATED)
    Open-ended learning benefits immensely from the use of symbolic methods for goal representation as they offer ways to structure knowledge for efficient and transferable learning. However, the existing Hierarchical Reinforcement Learning (HRL) approaches relying on symbolic reasoning are often limited as they require a manual goal representation. The challenge in autonomously discovering a symbolic goal representation is that it must preserve critical information, such as the environment dynamics. In this paper, we propose a developmental mechanism for goal discovery via an emergent representation that abstracts (i.e., groups together) sets of environment states that have similar roles in the task. We introduce a Feudal HRL algorithm that concurrently learns both the goal representation and a hierarchical policy. The algorithm uses symbolic reachability analysis for neural networks to approximate the transition relation among sets of states and to refine the goal representation. We evaluate our approach on complex navigation tasks, showing the learned representation is interpretable, transferrable and results in data efficient learning.
    Differentially Private Non-Convex Optimization under the KL Condition with Optimal Rates. (arXiv:2311.13447v1 [cs.LG])
    We study private empirical risk minimization (ERM) problem for losses satisfying the $(\gamma,\kappa)$-Kurdyka-{\L}ojasiewicz (KL) condition. The Polyak-{\L}ojasiewicz (PL) condition is a special case of this condition when $\kappa=2$. Specifically, we study this problem under the constraint of $\rho$ zero-concentrated differential privacy (zCDP). When $\kappa\in[1,2]$ and the loss function is Lipschitz and smooth over a sufficiently large region, we provide a new algorithm based on variance reduced gradient descent that achieves the rate $\tilde{O}\big(\big(\frac{\sqrt{d}}{n\sqrt{\rho}}\big)^\kappa\big)$ on the excess empirical risk, where $n$ is the dataset size and $d$ is the dimension. We further show that this rate is nearly optimal. When $\kappa \geq 2$ and the loss is instead Lipschitz and weakly convex, we show it is possible to achieve the rate $\tilde{O}\big(\big(\frac{\sqrt{d}}{n\sqrt{\rho}}\big)^\kappa\big)$ with a private implementation of the proximal point method. When the KL parameters are unknown, we provide a novel modification and analysis of the noisy gradient descent algorithm and show that this algorithm achieves a rate of $\tilde{O}\big(\big(\frac{\sqrt{d}}{n\sqrt{\rho}}\big)^{\frac{2\kappa}{4-\kappa}}\big)$ adaptively, which is nearly optimal when $\kappa = 2$. We further show that, without assuming the KL condition, the same gradient descent algorithm can achieve fast convergence to a stationary point when the gradient stays sufficiently large during the run of the algorithm. Specifically, we show that this algorithm can approximate stationary points of Lipschitz, smooth (and possibly nonconvex) objectives with rate as fast as $\tilde{O}\big(\frac{\sqrt{d}}{n\sqrt{\rho}}\big)$ and never worse than $\tilde{O}\big(\big(\frac{\sqrt{d}}{n\sqrt{\rho}}\big)^{1/2}\big)$. The latter rate matches the best known rate for methods that do not rely on variance reduction.
    The Tempered Hilbert Simplex Distance and Its Application To Non-linear Embeddings of TEMs. (arXiv:2311.13459v1 [cs.LG])
    Tempered Exponential Measures (TEMs) are a parametric generalization of the exponential family of distributions maximizing the tempered entropy function among positive measures subject to a probability normalization of their power densities. Calculus on TEMs relies on a deformed algebra of arithmetic operators induced by the deformed logarithms used to define the tempered entropy. In this work, we introduce three different parameterizations of finite discrete TEMs via Legendre functions of the negative tempered entropy function. In particular, we establish an isometry between such parameterizations in terms of a generalization of the Hilbert log cross-ratio simplex distance to a tempered Hilbert co-simplex distance. Similar to the Hilbert geometry, the tempered Hilbert distance is characterized as a $t$-symmetrization of the oriented tempered Funk distance. We motivate our construction by introducing the notion of $t$-lengths of smooth curves in a tautological Finsler manifold. We then demonstrate the properties of our generalized structure in different settings and numerically examine the quality of its differentiable approximations for optimization in machine learning settings.
    Learning continuous models for continuous physics. (arXiv:2202.08494v2 [cs.LG] UPDATED)
    Dynamical systems that evolve continuously over time are ubiquitous throughout science and engineering. Machine learning (ML) provides data-driven approaches to model and predict the dynamics of such systems. A core issue with this approach is that ML models are typically trained on discrete data, using ML methodologies that are not aware of underlying continuity properties. This results in models that often do not capture any underlying continuous dynamics -- either of the system of interest, or indeed of any related system. To address this challenge, we develop a convergence test based on numerical analysis theory. Our test verifies whether a model has learned a function that accurately approximates an underlying continuous dynamics. Models that fail this test fail to capture relevant dynamics, rendering them of limited utility for many scientific prediction tasks; while models that pass this test enable both better interpolation and better extrapolation in multiple ways. Our results illustrate how principled numerical analysis methods can be coupled with existing ML training/testing methodologies to validate models for science and engineering applications.
    pSTarC: Pseudo Source Guided Target Clustering for Fully Test-Time Adaptation. (arXiv:2309.00846v2 [cs.CV] UPDATED)
    Test Time Adaptation (TTA) is a pivotal concept in machine learning, enabling models to perform well in real-world scenarios, where test data distribution differs from training. In this work, we propose a novel approach called pseudo Source guided Target Clustering (pSTarC) addressing the relatively unexplored area of TTA under real-world domain shifts. This method draws inspiration from target clustering techniques and exploits the source classifier for generating pseudo-source samples. The test samples are strategically aligned with these pseudo-source samples, facilitating their clustering and thereby enhancing TTA performance. pSTarC operates solely within the fully test-time adaptation protocol, removing the need for actual source data. Experimental validation on a variety of domain shift datasets, namely VisDA, Office-Home, DomainNet-126, CIFAR-100C verifies pSTarC's effectiveness. This method exhibits significant improvements in prediction accuracy along with efficient computational requirements. Furthermore, we also demonstrate the universality of the pSTarC framework by showing its effectiveness for the continuous TTA framework. The source code for our method is available at https://manogna-s.github.io/pstarc
    Span-Based Optimal Sample Complexity for Average Reward MDPs. (arXiv:2311.13469v1 [cs.LG])
    We study the sample complexity of learning an $\varepsilon$-optimal policy in an average-reward Markov decision process (MDP) under a generative model. We establish the complexity bound $\widetilde{O}\left(SA\frac{H}{\varepsilon^2} \right)$, where $H$ is the span of the bias function of the optimal policy and $SA$ is the cardinality of the state-action space. Our result is the first that is minimax optimal (up to log factors) in all parameters $S,A,H$ and $\varepsilon$, improving on existing work that either assumes uniformly bounded mixing times for all policies or has suboptimal dependence on the parameters. Our result is based on reducing the average-reward MDP to a discounted MDP. To establish the optimality of this reduction, we develop improved bounds for $\gamma$-discounted MDPs, showing that $\widetilde{O}\left(SA\frac{H}{(1-\gamma)^2\varepsilon^2} \right)$ samples suffice to learn a $\varepsilon$-optimal policy in weakly communicating MDPs under the regime that $\gamma \geq 1 - \frac{1}{H}$, circumventing the well-known lower bound of $\widetilde{\Omega}\left(SA\frac{1}{(1-\gamma)^3\varepsilon^2} \right)$ for general $\gamma$-discounted MDPs. Our analysis develops upper bounds on certain instance-dependent variance parameters in terms of the span parameter. These bounds are tighter than those based on the mixing time or diameter of the MDP and may be of broader use.
    Differentially Private Wireless Federated Learning Using Orthogonal Sequences. (arXiv:2306.08280v2 [cs.IT] UPDATED)
    We propose a privacy-preserving uplink over-the-air computation (AirComp) method, termed FLORAS, for single-input single-output (SISO) wireless federated learning (FL) systems. From the perspective of communication designs, FLORAS eliminates the requirement of channel state information at the transmitters (CSIT) by leveraging the properties of orthogonal sequences. From the privacy perspective, we prove that FLORAS offers both item-level and client-level differential privacy (DP) guarantees. Moreover, by properly adjusting the system parameters, FLORAS can flexibly achieve different DP levels at no additional cost. A new FL convergence bound is derived which, combined with the privacy guarantees, allows for a smooth tradeoff between the achieved convergence rate and differential privacy levels. Experimental results demonstrate the advantages of FLORAS compared with the baseline AirComp method, and validate that the analytical results can guide the design of privacy-preserving FL with different tradeoff requirements on the model convergence and privacy levels.
    Efficient Algorithms for the CCA Family: Unconstrained Objectives with Unbiased Gradients. (arXiv:2310.01012v2 [cs.LG] UPDATED)
    The Canonical Correlation Analysis (CCA) family of methods is foundational in multi-view learning. Regularised linear CCA methods can be seen to generalise Partial Least Squares (PLS) and be unified with a Generalized Eigenvalue Problem (GEP) framework. However, classical algorithms for these linear methods are computationally infeasible for large-scale data. Extensions to Deep CCA show great promise, but current training procedures are slow and complicated. First we propose a novel unconstrained objective that characterizes the top subspace of GEPs. Our core contribution is a family of fast algorithms for stochastic PLS, stochastic CCA, and Deep CCA, simply obtained by applying stochastic gradient descent (SGD) to the corresponding CCA objectives. These methods show far faster convergence and recover higher correlations than the previous state-of-the-art on all standard CCA and Deep CCA benchmarks. This speed allows us to perform a first-of-its-kind PLS analysis of an extremely large biomedical dataset from the UK Biobank, with over 33,000 individuals and 500,000 variants. Finally, we not only match the performance of `CCA-family' Self-Supervised Learning (SSL) methods on CIFAR-10 and CIFAR-100 with minimal hyper-parameter tuning, but also establish the first solid theoretical links to classical CCA, laying the groundwork for future insights.
    Bayesian inference of a new Mallows model for characterising symptom sequences applied in primary progressive aphasia. (arXiv:2311.13411v1 [cs.LG])
    Machine learning models offer the potential to understand diverse datasets in a data-driven way, powering insights into individual disease experiences and ensuring equitable healthcare. In this study, we explore Bayesian inference for characterising symptom sequences, and the associated modelling challenges. We adapted the Mallows model to account for partial rankings and right-censored data, employing custom MCMC fitting. Our evaluation, encompassing synthetic data and a primary progressive aphasia dataset, highlights the model's efficacy in revealing mean orderings and estimating ranking variance. This holds the potential to enhance clinical comprehension of symptom occurrence. However, our work encounters limitations concerning model scalability and small dataset sizes.
    Leveraging CNNs and Ensemble Learning for Automated Disaster Image Classification. (arXiv:2311.13531v1 [cs.CV])
    Natural disasters act as a serious threat globally, requiring effective and efficient disaster management and recovery. This paper focuses on classifying natural disaster images using Convolutional Neural Networks (CNNs). Multiple CNN architectures were built and trained on a dataset containing images of earthquakes, floods, wildfires, and volcanoes. A stacked CNN ensemble approach proved to be the most effective, achieving 95% accuracy and an F1 score going up to 0.96 for individual classes. Tuning hyperparameters of individual models for optimization was critical to maximize the models' performance. The stacking of CNNs with XGBoost acting as the meta-model utilizes the strengths of the CNN and ResNet models to improve the overall accuracy of the classification. Results obtained from the models illustrated the potency of CNN-based models for automated disaster image classification. This lays the foundation for expanding these techniques to build robust systems for disaster response, damage assessment, and recovery management.
    $\sigma$-PCA: a unified neural model for linear and nonlinear principal component analysis. (arXiv:2311.13580v1 [cs.LG])
    Linear principal component analysis (PCA), nonlinear PCA, and linear independent component analysis (ICA) -- those are three methods with single-layer autoencoder formulations for learning linear transformations from data. Linear PCA learns orthogonal transformations (rotations) that orient axes to maximise variance, but it suffers from a subspace rotational indeterminacy: it fails to find a unique rotation for axes that share the same variance. Both nonlinear PCA and linear ICA reduce the subspace indeterminacy from rotational to permutational by maximising statistical independence under the assumption of unit variance. The main difference between them is that nonlinear PCA only learns rotations while linear ICA learns not just rotations but any linear transformation with unit variance. The relationship between all three can be understood by the singular value decomposition of the linear ICA transformation into a sequence of rotation, scale, rotation. Linear PCA learns the first rotation; nonlinear PCA learns the second. The scale is simply the inverse of the standard deviations. The problem is that, in contrast to linear PCA, conventional nonlinear PCA cannot be used directly on the data to learn the first rotation, the first being special as it reduces dimensionality and orders by variances. In this paper, we have identified the cause, and as a solution we propose $\sigma$-PCA: a unified neural model for linear and nonlinear PCA as single-layer autoencoders. One of its key ingredients: modelling not just the rotation but also the scale -- the variances. This model bridges the disparity between linear and nonlinear PCA. And so, like linear PCA, it can learn a semi-orthogonal transformation that reduces dimensionality and orders by variances, but, unlike linear PCA, it does not suffer from rotational indeterminacy.
    Combinatorial Optimization with Policy Adaptation using Latent Space Search. (arXiv:2311.13569v1 [cs.LG])
    Combinatorial Optimization underpins many real-world applications and yet, designing performant algorithms to solve these complex, typically NP-hard, problems remains a significant research challenge. Reinforcement Learning (RL) provides a versatile framework for designing heuristics across a broad spectrum of problem domains. However, despite notable progress, RL has not yet supplanted industrial solvers as the go-to solution. Current approaches emphasize pre-training heuristics that construct solutions but often rely on search procedures with limited variance, such as stochastically sampling numerous solutions from a single policy or employing computationally expensive fine-tuning of the policy on individual problem instances. Building on the intuition that performant search at inference time should be anticipated during pre-training, we propose COMPASS, a novel RL approach that parameterizes a distribution of diverse and specialized policies conditioned on a continuous latent space. We evaluate COMPASS across three canonical problems - Travelling Salesman, Capacitated Vehicle Routing, and Job-Shop Scheduling - and demonstrate that our search strategy (i) outperforms state-of-the-art approaches on 11 standard benchmarking tasks and (ii) generalizes better, surpassing all other approaches on a set of 18 procedurally transformed instance distributions.
    Byzantine-Robust Learning on Heterogeneous Datasets via Bucketing. (arXiv:2006.09365v6 [cs.LG] UPDATED)
    In Byzantine robust distributed or federated learning, a central server wants to train a machine learning model over data distributed across multiple workers. However, a fraction of these workers may deviate from the prescribed algorithm and send arbitrary messages. While this problem has received significant attention recently, most current defenses assume that the workers have identical data. For realistic cases when the data across workers are heterogeneous (non-iid), we design new attacks which circumvent current defenses, leading to significant loss of performance. We then propose a simple bucketing scheme that adapts existing robust algorithms to heterogeneous datasets at a negligible computational cost. We also theoretically and experimentally validate our approach, showing that combining bucketing with existing robust algorithms is effective against challenging attacks. Our work is the first to establish guaranteed convergence for the non-iid Byzantine robust problem under realistic assumptions.
    Fast and Interpretable Mortality Risk Scores for Critical Care Patients. (arXiv:2311.13015v1 [cs.LG])
    Prediction of mortality in intensive care unit (ICU) patients is an important task in critical care medicine. Prior work in creating mortality risk models falls into two major categories: domain-expert-created scoring systems, and black box machine learning (ML) models. Both of these have disadvantages: black box models are unacceptable for use in hospitals, whereas manual creation of models (including hand-tuning of logistic regression parameters) relies on humans to perform high-dimensional constrained optimization, which leads to a loss in performance. In this work, we bridge the gap between accurate black box models and hand-tuned interpretable models. We build on modern interpretable ML techniques to design accurate and interpretable mortality risk scores. We leverage the largest existing public ICU monitoring datasets, namely the MIMIC III and eICU datasets. By evaluating risk across medical centers, we are able to study generalization across domains. In order to customize our risk score models, we develop a new algorithm, GroupFasterRisk, which has several important benefits: (1) it uses hard sparsity constraint, allowing users to directly control the number of features; (2) it incorporates group sparsity to allow more cohesive models; (3) it allows for monotonicity correction on models for including domain knowledge; (4) it produces many equally-good models at once, which allows domain experts to choose among them. GroupFasterRisk creates its risk scores within hours, even on the large datasets we study here. GroupFasterRisk's risk scores perform better than risk scores currently used in hospitals, and have similar prediction performance to black box ML models (despite being much sparser). Because GroupFasterRisk produces a variety of risk scores and handles constraints, it allows design flexibility, which is the key enabler of practical and trustworthy model creation.
    Optimal Transport with Cyclic Symmetry. (arXiv:2311.13147v1 [cs.LG])
    We propose novel fast algorithms for optimal transport (OT) utilizing a cyclic symmetry structure of input data. Such OT with cyclic symmetry appears universally in various real-world examples: image processing, urban planning, and graph processing. Our main idea is to reduce OT to a small optimization problem that has significantly fewer variables by utilizing cyclic symmetry and various optimization techniques. On the basis of this reduction, our algorithms solve the small optimization problem instead of the original OT. As a result, our algorithms obtain the optimal solution and the objective function value of the original OT faster than solving the original OT directly. In this paper, our focus is on two crucial OT formulations: the linear programming OT (LOT) and the strongly convex-regularized OT, which includes the well-known entropy-regularized OT (EROT). Experiments show the effectiveness of our algorithms for LOT and EROT in synthetic/real-world data that has a strict/approximate cyclic symmetry structure. Through theoretical and experimental results, this paper successfully introduces the concept of symmetry into the OT research field for the first time.
    Multi-Objective Bayesian Optimization with Active Preference Learning. (arXiv:2311.13460v1 [cs.LG])
    There are a lot of real-world black-box optimization problems that need to optimize multiple criteria simultaneously. However, in a multi-objective optimization (MOO) problem, identifying the whole Pareto front requires the prohibitive search cost, while in many practical scenarios, the decision maker (DM) only needs a specific solution among the set of the Pareto optimal solutions. We propose a Bayesian optimization (BO) approach to identifying the most preferred solution in the MOO with expensive objective functions, in which a Bayesian preference model of the DM is adaptively estimated by an interactive manner based on the two types of supervisions called the pairwise preference and improvement request. To explore the most preferred solution, we define an acquisition function in which the uncertainty both in the objective functions and the DM preference is incorporated. Further, to minimize the interaction cost with the DM, we also propose an active learning strategy for the preference estimation. We empirically demonstrate the effectiveness of our proposed method through the benchmark function optimization and the hyper-parameter optimization problems for machine learning models.
    Comprehensive Evaluation of GNN Training Systems: A Data Management Perspective. (arXiv:2311.13279v1 [cs.LG])
    Many Graph Neural Network (GNN) training systems have emerged recently to support efficient GNN training. Since GNNs embody complex data dependencies between training samples, the training of GNNs should address distinct challenges different from DNN training in data management, such as data partitioning, batch preparation for mini-batch training, and data transferring between CPUs and GPUs. These factors, which take up a large proportion of training time, make data management in GNN training more significant. This paper reviews GNN training from a data management perspective and provides a comprehensive analysis and evaluation of the representative approaches. We conduct extensive experiments on various benchmark datasets and show many interesting and valuable results. We also provide some practical tips learned from these experiments, which are helpful for designing GNN training systems in the future.
    SecureCut: Federated Gradient Boosting Decision Trees with Efficient Machine Unlearning. (arXiv:2311.13174v1 [cs.LG])
    In response to legislation mandating companies to honor the \textit{right to be forgotten} by erasing user data, it has become imperative to enable data removal in Vertical Federated Learning (VFL) where multiple parties provide private features for model training. In VFL, data removal, i.e., \textit{machine unlearning}, often requires removing specific features across all samples under privacy guarentee in federated learning. To address this challenge, we propose \methname, a novel Gradient Boosting Decision Tree (GBDT) framework that effectively enables both \textit{instance unlearning} and \textit{feature unlearning} without the need for retraining from scratch. Leveraging a robust GBDT structure, we enable effective data deletion while reducing degradation of model performance. Extensive experimental results on popular datasets demonstrate that our method achieves superior model utility and forgetfulness compared to \textit{state-of-the-art} methods. To our best knowledge, this is the first work that investigates machine unlearning in VFL scenarios.
    CovarNav: Machine Unlearning via Model Inversion and Covariance Navigation. (arXiv:2311.12999v1 [cs.LG])
    The rapid progress of AI, combined with its unprecedented public adoption and the propensity of large neural networks to memorize training data, has given rise to significant data privacy concerns. To address these concerns, machine unlearning has emerged as an essential technique to selectively remove the influence of specific training data points on trained models. In this paper, we approach the machine unlearning problem through the lens of continual learning. Given a trained model and a subset of training data designated to be forgotten (i.e., the "forget set"), we introduce a three-step process, named CovarNav, to facilitate this forgetting. Firstly, we derive a proxy for the model's training data using a model inversion attack. Secondly, we mislabel the forget set by selecting the most probable class that deviates from the actual ground truth. Lastly, we deploy a gradient projection method to minimize the cross-entropy loss on the modified forget set (i.e., learn incorrect labels for this set) while preventing forgetting of the inverted samples. We rigorously evaluate CovarNav on the CIFAR-10 and Vggface2 datasets, comparing our results with recent benchmarks in the field and demonstrating the efficacy of our proposed approach.
    DMLR: Data-centric Machine Learning Research -- Past, Present and Future. (arXiv:2311.13028v1 [cs.LG])
    Drawing from discussions at the inaugural DMLR workshop at ICML 2023 and meetings prior, in this report we outline the relevance of community engagement and infrastructure development for the creation of next-generation public datasets that will advance machine learning science. We chart a path forward as a collective effort to sustain the creation and maintenance of these datasets and methods towards positive scientific, societal and business impact.
    A Unified Framework for Trace-induced Quantum Kernels. (arXiv:2311.13552v1 [quant-ph])
    Quantum kernel methods are promising candidates for achieving a practical quantum advantage for certain machine learning tasks. Similar to classical machine learning, an exact form of a quantum kernel is expected to have a great impact on the model performance. In this work we combine all trace-induced quantum kernels, including the commonly-used global fidelity and local projected quantum kernels, into a common framework. We show how generalized trace-induced quantum kernels can be constructed as combinations of the fundamental building blocks we coin "Lego" kernels, which impose an inductive bias on the resulting quantum models. We relate the expressive power and generalization ability to the number of non-zero weight Lego kernels and propose a systematic approach to increase the complexity of a quantum kernel model, leading to a new form of the local projected kernels that require fewer quantum resources in terms of the number of quantum gates and measurement shots. We show numerically that models based on local projected kernels can achieve comparable performance to the global fidelity quantum kernel. Our work unifies existing quantum kernels and provides a systematic framework to compare their properties.
    Exploring Practitioner Perspectives On Training Data Attribution Explanations. (arXiv:2310.20477v2 [cs.HC] UPDATED)
    Explainable AI (XAI) aims to provide insight into opaque model reasoning to humans and as such is an interdisciplinary field by nature. In this paper, we interviewed 10 practitioners to understand the possible usability of training data attribution (TDA) explanations and to explore the design space of such an approach. We confirmed that training data quality is often the most important factor for high model performance in practice and model developers mainly rely on their own experience to curate data. End-users expect explanations to enhance their interaction with the model and do not necessarily prioritise but are open to training data as a means of explanation. Within our participants, we found that TDA explanations are not well-known and therefore not used. We urge the community to focus on the utility of TDA techniques from the human-machine collaboration perspective and broaden the TDA evaluation to reflect common use cases in practice.
    FusionFrames: Efficient Architectural Aspects for Text-to-Video Generation Pipeline. (arXiv:2311.13073v1 [cs.CV])
    Multimedia generation approaches occupy a prominent place in artificial intelligence research. Text-to-image models achieved high-quality results over the last few years. However, video synthesis methods recently started to develop. This paper presents a new two-stage latent diffusion text-to-video generation architecture based on the text-to-image diffusion model. The first stage concerns keyframes synthesis to figure the storyline of a video, while the second one is devoted to interpolation frames generation to make movements of the scene and objects smooth. We compare several temporal conditioning approaches for keyframes generation. The results show the advantage of using separate temporal blocks over temporal layers in terms of metrics reflecting video generation quality aspects and human preference. The design of our interpolation model significantly reduces computational costs compared to other masked frame interpolation approaches. Furthermore, we evaluate different configurations of MoVQ-based video decoding scheme to improve consistency and achieve higher PSNR, SSIM, MSE, and LPIPS scores. Finally, we compare our pipeline with existing solutions and achieve top-2 scores overall and top-1 among open-source solutions: CLIPSIM = 0.2976 and FVD = 433.054. Project page: https://ai-forever.github.io/kandinsky-video/
    Do we listen to what we are told? An empirical study on human behaviour during the COVID-19 pandemic: neural networks vs. regression analysis. (arXiv:2311.13046v1 [econ.GN])
    In this work, we contribute the first visual open-source empirical study on human behaviour during the COVID-19 pandemic, in order to investigate how compliant a general population is to mask-wearing-related public-health policy. Object-detection-based convolutional neural networks, regression analysis and multilayer perceptrons are combined to analyse visual data of the Viennese public during 2020. We find that mask-wearing-related government regulations and public-transport announcements encouraged correct mask-wearing-behaviours during the COVID-19 pandemic. Importantly, changes in announcement and regulation contents led to heterogeneous effects on people's behaviour. Comparing the predictive power of regression analysis and neural networks, we demonstrate that the latter produces more accurate predictions of population reactions during the COVID-19 pandemic. Our use of regression modelling also allows us to unearth possible causal pathways underlying societal behaviour. Since our findings highlight the importance of appropriate communication contents, our results will facilitate more effective non-pharmaceutical interventions to be developed in future. Adding to the literature, we demonstrate that regression modelling and neural networks are not mutually exclusive but instead complement each other.
    Combatting Human Trafficking in the Cyberspace: A Natural Language Processing-Based Methodology to Analyze the Language in Online Advertisements. (arXiv:2311.13118v1 [cs.LG])
    This project tackles the pressing issue of human trafficking in online C2C marketplaces through advanced Natural Language Processing (NLP) techniques. We introduce a novel methodology for generating pseudo-labeled datasets with minimal supervision, serving as a rich resource for training state-of-the-art NLP models. Focusing on tasks like Human Trafficking Risk Prediction (HTRP) and Organized Activity Detection (OAD), we employ cutting-edge Transformer models for analysis. A key contribution is the implementation of an interpretability framework using Integrated Gradients, providing explainable insights crucial for law enforcement. This work not only fills a critical gap in the literature but also offers a scalable, machine learning-driven approach to combat human exploitation online. It serves as a foundation for future research and practical applications, emphasizing the role of machine learning in addressing complex social issues.
    AS-LLM: When Algorithm Selection Meets Large Language Model. (arXiv:2311.13184v1 [cs.LG])
    Algorithm selection aims to identify the most suitable algorithm for solving a specific problem before execution, which has become a critical process of the AutoML. Current mainstream algorithm selection techniques rely heavily on feature representations of various problems and employ the performance of each algorithm as supervised information. However, there is a significant research gap concerning the consideration of algorithm features. This gap is primarily attributed to the inherent complexity of algorithms, making it particularly challenging to find a universally effective feature extraction method that is applicable across a diverse range of algorithms. Unfortunately, neglecting this aspect undoubtedly impacts the accuracy of algorithm selection and indirectly necessitates an increased volume of problem data for training purposes. This paper takes a significant stride towards addressing this gap by proposing an approach that integrates algorithm representation into the algorithm selection process. Specifically, our proposed model employs distinct modules to extract representations of both problems and algorithms, where the algorithm representation leverages the capabilities of pre-trained LLMs in the realm of code comprehension. Following the extraction of embedding vectors for both algorithms and problems, the most suitable algorithm is determined through calculations of matching degrees. Our experiments not only validate the effectiveness of the proposed model but also showcase the performance of different embedded pre-trained LLMs, which suggests that the proposed algorithm selection framework holds the potential to serve as a baseline task for evaluating the code representation capabilities of LLMs.
    Hierarchical Learning for Quantum ML: Novel Training Technique for Large-Scale Variational Quantum Circuits. (arXiv:2311.12929v1 [quant-ph])
    We present hierarchical learning, a novel variational architecture for efficient training of large-scale variational quantum circuits. We test and benchmark our technique for distribution loading with quantum circuit born machines (QCBMs). With QCBMs, probability distributions are loaded into the squared amplitudes of computational basis vectors represented by bitstrings. Our key insight is to take advantage of the fact that the most significant (qu)bits have a greater effect on the final distribution and can be learned first. One can think of it as a generalization of layerwise learning, where some parameters of the variational circuit are learned first to prevent the phenomena of barren plateaus. We briefly review adjoint methods for computing the gradient, in particular for loss functions that are not expectation values of observables. We first compare the role of connectivity in the variational ansatz for the task of loading a Gaussian distribution on nine qubits, finding that 2D connectivity greatly outperforms qubits arranged on a line. Based on our observations, we then implement this strategy on large-scale numerical experiments with GPUs, training a QCBM to reproduce a 3-dimensional multivariate Gaussian distribution on 27 qubits up to $\sim4\%$ total variation distance. Though barren plateau arguments do not strictly apply here due to the objective function not being tied to an observable, this is to our knowledge the first practical demonstration of variational learning on large numbers of qubits. We also demonstrate hierarchical learning as a resource-efficient way to load distributions for existing quantum hardware (IBM's 7 and 27 qubit devices) in tandem with Fire Opal optimizations.
    Revisiting the Domain Shift and Sample Uncertainty in Multi-source Active Domain Transfer. (arXiv:2311.12905v1 [cs.AI])
    Active Domain Adaptation (ADA) aims to maximally boost model adaptation in a new target domain by actively selecting a limited number of target data to annotate.This setting neglects the more practical scenario where training data are collected from multiple sources. This motivates us to target a new and challenging setting of knowledge transfer that extends ADA from a single source domain to multiple source domains, termed Multi-source Active Domain Adaptation (MADA). Not surprisingly, we find that most traditional ADA methods cannot work directly in such a setting, mainly due to the excessive domain gap introduced by all the source domains and thus their uncertainty-aware sample selection can easily become miscalibrated under the multi-domain shifts. Considering this, we propose a Dynamic integrated uncertainty valuation framework(Detective) that comprehensively consider the domain shift between multi-source domains and target domain to detect the informative target samples. Specifically, the leverages a dynamic Domain Adaptation(DA) model that learns how to adapt the model's parameters to fit the union of multi-source domains. This enables an approximate single-source domain modeling by the dynamic model. We then comprehensively measure both domain uncertainty and predictive uncertainty in the target domain to detect informative target samples using evidential deep learning, thereby mitigating uncertainty miscalibration. Furthermore, we introduce a contextual diversity-aware calculator to enhance the diversity of the selected samples. Experiments demonstrate that our solution outperforms existing methods by a considerable margin on three domain adaptation benchmarks.
    Nonlinear System Identification of Swarm of UAVs Using Deep Learning Methods. (arXiv:2311.12906v1 [cs.LG])
    This study designs and evaluates multiple nonlinear system identification techniques for modeling the UAV swarm system in planar space. learning methods such as RNNs, CNNs, and Neural ODE are explored and compared. The objective is to forecast future swarm trajectories by accurately approximating the nonlinear dynamics of the swarm model. The modeling process is performed using both transient and steady-state data from swarm simulations. Results show that the combination of Neural ODE with a well-trained model using transient data is robust for varying initial conditions and outperforms other learning methods in accurately predicting swarm stability.
    From Images to Connections: Can DQN with GNNs learn the Strategic Game of Hex?. (arXiv:2311.13414v1 [cs.LG])
    The gameplay of strategic board games such as chess, Go and Hex is often characterized by combinatorial, relational structures -- capturing distinct interactions and non-local patterns -- and not just images. Nonetheless, most common self-play reinforcement learning (RL) approaches simply approximate policy and value functions using convolutional neural networks (CNN). A key feature of CNNs is their relational inductive bias towards locality and translational invariance. In contrast, graph neural networks (GNN) can encode more complicated and distinct relational structures. Hence, we investigate the crucial question: Can GNNs, with their ability to encode complex connections, replace CNNs in self-play reinforcement learning? To this end, we do a comparison with Hex -- an abstract yet strategically rich board game -- serving as our experimental platform. Our findings reveal that GNNs excel at dealing with long range dependency situations in game states and are less prone to overfitting, but also showing a reduced proficiency in discerning local patterns. This suggests a potential paradigm shift, signaling the use of game-specific structures to reshape self-play reinforcement learning.
    TorchSparse++: Efficient Training and Inference Framework for Sparse Convolution on GPUs. (arXiv:2311.12862v1 [cs.DC])
    Sparse convolution plays a pivotal role in emerging workloads, including point cloud processing in AR/VR, autonomous driving, and graph understanding in recommendation systems. Since the computation pattern is sparse and irregular, specialized high-performance kernels are required. Existing GPU libraries offer two dataflow types for sparse convolution. The gather-GEMM-scatter dataflow is easy to implement but not optimal in performance, while the dataflows with overlapped computation and memory access (e.g.implicit GEMM) are highly performant but have very high engineering costs. In this paper, we introduce TorchSparse++, a new GPU library that achieves the best of both worlds. We create a highly efficient Sparse Kernel Generator that generates performant sparse convolution kernels at less than one-tenth of the engineering cost of the current state-of-the-art system. On top of this, we design the Sparse Autotuner, which extends the design space of existing sparse convolution libraries and searches for the best dataflow configurations for training and inference workloads. Consequently, TorchSparse++ achieves 2.9x, 3.3x, 2.2x and 1.7x measured end-to-end speedup on an NVIDIA A100 GPU over state-of-the-art MinkowskiEngine, SpConv 1.2, TorchSparse and SpConv v2 in inference; and is 1.2-1.3x faster than SpConv v2 in mixed precision training across seven representative autonomous driving benchmarks. It also seamlessly supports graph convolutions, achieving 2.6-7.6x faster inference speed compared with state-of-the-art graph deep learning libraries.
    DroneOptiNet: A Framework for Optimal Drone-based Load Redistribution Mechanism for 5G and Beyond Solar Small Cell Networks. (arXiv:2311.12944v1 [cs.NI])
    The power requirements posed by the fifth-generation and beyond cellular networks are an important constraint in network deployment and require energy-efficient solutions. In this work, we propose a novel user load transfer approach using airborne base stations (BS), mounted on drones, for reliable and secure power redistribution across the micro-grid network comprising green small cell BSs. Depending on the user density and the availability of an aerial BS, the energy requirement of a cell with an energy deficit is accommodated by migrating the aerial BS from a high-energy to a low-energy cell. The proposed hybrid drone-based framework integrates long short-term memory with unique cost functions using an evolutionary neural network for drones and BSs, and efficiently manages energy and load redistribution. The proposed algorithm reduces power outages at BSs and maintains consistent throughput stability, thereby demonstrating its capability to boost the reliability and robustness of wireless communication systems.
    Acceleration and Implicit Regularization in Gaussian Phase Retrieval. (arXiv:2311.12888v1 [math.OC])
    We study accelerated optimization methods in the Gaussian phase retrieval problem. In this setting, we prove that gradient methods with Polyak or Nesterov momentum have similar implicit regularization to gradient descent. This implicit regularization ensures that the algorithms remain in a nice region, where the cost function is strongly convex and smooth despite being nonconvex in general. This ensures that these accelerated methods achieve faster rates of convergence than gradient descent. Experimental evidence demonstrates that the accelerated methods converge faster than gradient descent in practice.
    Detecting out-of-distribution text using topological features of transformer-based language models. (arXiv:2311.13102v1 [cs.CL])
    We attempt to detect out-of-distribution (OOD) text samples though applying Topological Data Analysis (TDA) to attention maps in transformer-based language models. We evaluate our proposed TDA-based approach for out-of-distribution detection on BERT, a transformer-based language model, and compare the to a more traditional OOD approach based on BERT CLS embeddings. We found that our TDA approach outperforms the CLS embedding approach at distinguishing in-distribution data (politics and entertainment news articles from HuffPost) from far out-of-domain samples (IMDB reviews), but its effectiveness deteriorates with near out-of-domain (CNN/Dailymail) or same-domain (business news articles from HuffPost) datasets.
    AI-based association analysis for medical imaging using latent-space geometric confounder correction. (arXiv:2311.12836v1 [cs.CV])
    AI has greatly enhanced medical image analysis, yet its use in epidemiological population imaging studies remains limited due to visualization challenges in non-linear models and lack of confounder control. Addressing this, we introduce an AI method emphasizing semantic feature interpretation and resilience against multiple confounders. Our approach's merits are tested in three scenarios: extracting confounder-free features from a 2D synthetic dataset; examining the association between prenatal alcohol exposure and children's facial shapes using 3D mesh data; exploring the relationship between global cognition and brain images with a 3D MRI dataset. Results confirm our method effectively reduces confounder influences, establishing less confounded associations. Additionally, it provides a unique visual representation, highlighting specific image alterations due to identified correlations.
    Clustered Policy Decision Ranking. (arXiv:2311.12970v1 [cs.LG])
    Policies trained via reinforcement learning (RL) are often very complex even for simple tasks. In an episode with n time steps, a policy will make n decisions on actions to take, many of which may appear non-intuitive to the observer. Moreover, it is not clear which of these decisions directly contribute towards achieving the reward and how significant their contribution is. Given a trained policy, we propose a black-box method based on statistical covariance estimation that clusters the states of the environment and ranks each cluster according to the importance of decisions made in its states. We compare our measure against a previous statistical fault localization based ranking procedure.
    Multi-Objective Optimization via Wasserstein-Fisher-Rao Gradient Flow. (arXiv:2311.13159v1 [cs.LG])
    Multi-objective optimization (MOO) aims to optimize multiple, possibly conflicting objectives with widespread applications. We introduce a novel interacting particle method for MOO inspired by molecular dynamics simulations. Our approach combines overdamped Langevin and birth-death dynamics, incorporating a "dominance potential" to steer particles toward global Pareto optimality. In contrast to previous methods, our method is able to relocate dominated particles, making it particularly adept at managing Pareto fronts of complicated geometries. Our method is also theoretically grounded as a Wasserstein-Fisher-Rao gradient flow with convergence guarantees. Extensive experiments confirm that our approach outperforms state-of-the-art methods on challenging synthetic and real-world datasets.
    Local Convolution Enhanced Global Fourier Neural Operator For Multiscale Dynamic Spaces Prediction. (arXiv:2311.12902v1 [cs.LG])
    Neural operators extend the capabilities of traditional neural networks by allowing them to handle mappings between function spaces for the purpose of solving partial differential equations (PDEs). One of the most notable methods is the Fourier Neural Operator (FNO), which is inspired by Green's function method and approximate operator kernel directly in the frequency domain. In this work, we focus on predicting multiscale dynamic spaces, which is equivalent to solving multiscale PDEs. Multiscale PDEs are characterized by rapid coefficient changes and solution space oscillations, which are crucial for modeling atmospheric convection and ocean circulation. To solve this problem, models should have the ability to capture rapid changes and process them at various scales. However, the FNO only approximates kernels in the low-frequency domain, which is insufficient when solving multiscale PDEs. To address this challenge, we propose a novel hierarchical neural operator that integrates improved Fourier layers with attention mechanisms, aiming to capture all details and handle them at various scales. These mechanisms complement each other in the frequency domain and encourage the model to solve multiscale problems. We perform experiments on dynamic spaces governed by forward and reverse problems of multiscale elliptic equations, Navier-Stokes equations and some other physical scenarios, and reach superior performance in existing PDE benchmarks, especially equations characterized by rapid coefficient variations.
    Density of States Prediction of Crystalline Materials via Prompt-guided Multi-Modal Transformer. (arXiv:2311.12856v1 [cond-mat.mtrl-sci])
    The density of states (DOS) is a spectral property of crystalline materials, which provides fundamental insights into various characteristics of the materials. While previous works mainly focus on obtaining high-quality representations of crystalline materials for DOS prediction, we focus on predicting the DOS from the obtained representations by reflecting the nature of DOS: DOS determines the general distribution of states as a function of energy. That is, DOS is not solely determined by the crystalline material but also by the energy levels, which has been neglected in previous works. In this paper, we propose to integrate heterogeneous information obtained from the crystalline materials and the energies via a multi-modal transformer, thereby modeling the complex relationships between the atoms in the crystalline materials and various energy levels for DOS prediction. Moreover, we propose to utilize prompts to guide the model to learn the crystal structural system-specific interactions between crystalline materials and energies. Extensive experiments on two types of DOS, i.e., Phonon DOS and Electron DOS, with various real-world scenarios demonstrate the superiority of DOSTransformer.
    Innovative Horizons in Aerial Imagery: LSKNet Meets DiffusionDet for Advanced Object Detection. (arXiv:2311.12956v1 [cs.CV])
    In the realm of aerial image analysis, object detection plays a pivotal role, with significant implications for areas such as remote sensing, urban planning, and disaster management. This study addresses the inherent challenges in this domain, notably the detection of small objects, managing densely packed elements, and accounting for diverse orientations. We present an in-depth evaluation of an object detection model that integrates the Large Selective Kernel Network (LSKNet)as its backbone with the DiffusionDet head, utilizing the iSAID dataset for empirical analysis. Our approach encompasses the introduction of novel methodologies and extensive ablation studies. These studies critically assess various aspects such as loss functions, box regression techniques, and classification strategies to refine the model's precision in object detection. The paper details the experimental application of the LSKNet backbone in synergy with the DiffusionDet heads, a combination tailored to meet the specific challenges in aerial image object detection. The findings of this research indicate a substantial enhancement in the model's performance, especially in the accuracy-time tradeoff. The proposed model achieves a mean average precision (MAP) of approximately 45.7%, which is a significant improvement, outperforming the RCNN model by 4.7% on the same dataset. This advancement underscores the effectiveness of the proposed modifications and sets a new benchmark in aerial image analysis, paving the way for more accurate and efficient object detection methodologies. The code is publicly available at https://github.com/SashaMatsun/LSKDiffDet
    A principled deep learning approach for geological facies generation. (arXiv:2305.13318v2 [physics.geo-ph] UPDATED)
    The simulation of geological facies in an unobservable volume is essential in various geoscience applications. Given the complexity of the problem, deep generative learning is a promising approach to overcome the limitations of traditional geostatistical simulation models, in particular their lack of physical realism. This research aims to investigate the application of generative adversarial networks and deep variational inference for conditionally simulating meandering channels in underground volumes. In this paper, we review the generative deep learning approaches, in particular the adversarial ones and the stabilization techniques that aim to facilitate their training. The proposed approach is tested on 2D and 3D simulations generated by the stochastic process-based model Flumy. Morphological metrics are utilized to compare our proposed method with earlier iterations of generative adversarial networks. The results indicate that by utilizing recent stabilization techniques, generative adversarial networks can efficiently sample from target data distributions. Moreover, we demonstrate the ability to simulate conditioned simulations through the latent variable model property of the proposed approach.
    OptScaler: A Hybrid Proactive-Reactive Framework for Robust Autoscaling in the Cloud. (arXiv:2311.12864v1 [math.OC])
    Autoscaling is a vital mechanism in cloud computing that supports the autonomous adjustment of computing resources under dynamic workloads. A primary goal of autoscaling is to stabilize resource utilization at a desirable level, thus reconciling the need for resource-saving with the satisfaction of Service Level Objectives (SLOs). Existing proactive autoscaling methods anticipate the future workload and scale the resources in advance, whereas the reliability may suffer from prediction deviations arising from the frequent fluctuations and noise of cloud workloads; reactive methods rely on real-time system feedback, while the hysteretic nature of reactive methods could cause violations of the rigorous SLOs. To this end, this paper presents OptScaler, a hybrid autoscaling framework that integrates the power of both proactive and reactive methods for regulating CPU utilization. Specifically, the proactive module of OptScaler consists of a sophisticated workload prediction model and an optimization model, where the former provides reliable inputs to the latter for making optimal scaling decisions. The reactive module provides a self-tuning estimator of CPU utilization to the optimization model. We embed Model Predictive Control (MPC) mechanism and robust optimization techniques into the optimization model to further enhance its reliability. Numerical results have demonstrated the superiority of both the workload prediction model and the hybrid framework of OptScaler in the scenario of online services compared to prevalent reactive, proactive, or hybrid autoscalers. OptScaler has been successfully deployed at Alipay, supporting the autoscaling of applets in the world-leading payment platform.
    Adaptive Sampling for Deep Learning via Efficient Nonparametric Proxies. (arXiv:2311.13583v1 [cs.LG])
    Data sampling is an effective method to improve the training speed of neural networks, with recent results demonstrating that it can even break the neural scaling laws. These results critically rely on high-quality scores to estimate the importance of an input to the network. We observe that there are two dominant strategies: static sampling, where the scores are determined before training, and dynamic sampling, where the scores can depend on the model weights. Static algorithms are computationally inexpensive but less effective than their dynamic counterparts, which can cause end-to-end slowdown due to their need to explicitly compute losses. To address this problem, we propose a novel sampling distribution based on nonparametric kernel regression that learns an effective importance score as the neural network trains. However, nonparametric regression models are too computationally expensive to accelerate end-to-end training. Therefore, we develop an efficient sketch-based approximation to the Nadaraya-Watson estimator. Using recent techniques from high-dimensional statistics and randomized algorithms, we prove that our Nadaraya-Watson sketch approximates the estimator with exponential convergence guarantees. Our sampling algorithm outperforms the baseline in terms of wall-clock time and accuracy on four datasets.
    Reliable Generation of EHR Time Series via Diffusion Models. (arXiv:2310.15290v2 [cs.LG] UPDATED)
    Electronic Health Records (EHRs) are rich sources of patient-level data, including laboratory tests, medications, and diagnoses, offering valuable resources for medical data analysis. However, concerns about privacy often restrict access to EHRs, hindering downstream analysis. Researchers have explored various methods for generating privacy-preserving EHR data. In this study, we introduce a new method for generating diverse and realistic synthetic EHR time series data using Denoising Diffusion Probabilistic Models (DDPM). We conducted experiments on six datasets, comparing our proposed method with eight existing methods. Our results demonstrate that our approach significantly outperforms all existing methods in terms of data utility while requiring less training effort. Our approach also enhances downstream medical data analysis by providing diverse and realistic synthetic EHR data.
    On the Tradeoff between Privacy Preservation and Byzantine-Robustness in Decentralized Learning. (arXiv:2308.14606v2 [cs.LG] UPDATED)
    This paper jointly considers privacy preservation and Byzantine-robustness in decentralized learning. In a decentralized network, honest-but-curious agents faithfully follow the prescribed algorithm, but expect to infer their neighbors' private data from messages received during the learning process, while dishonest-and-Byzantine agents disobey the prescribed algorithm, and deliberately disseminate wrong messages to their neighbors so as to bias the learning process. For this novel setting, we investigate a generic privacy-preserving and Byzantine-robust decentralized stochastic gradient descent (SGD) framework, in which Gaussian noise is injected to preserve privacy and robust aggregation rules are adopted to counteract Byzantine attacks. We analyze its learning error and privacy guarantee, discovering an essential tradeoff between privacy preservation and Byzantine-robustness in decentralized learning -- the learning error caused by defending against Byzantine attacks is exacerbated by the Gaussian noise added to preserve privacy. For a class of state-of-the-art robust aggregation rules, we give unified analysis of the "mixing abilities". Building upon this analysis, we reveal how the "mixing abilities" affect the tradeoff between privacy preservation and Byzantine-robustness. The theoretical results provide guidelines for achieving a favorable tradeoff with proper design of robust aggregation rules. Numerical experiments are conducted and corroborate our theoretical findings.
    BOIS: Bayesian Optimization of Interconnected Systems. (arXiv:2311.11254v2 [stat.ML] UPDATED)
    Bayesian optimization (BO) has proven to be an effective paradigm for the global optimization of expensive-to-sample systems. One of the main advantages of BO is its use of Gaussian processes (GPs) to characterize model uncertainty which can be leveraged to guide the learning and search process. However, BO typically treats systems as black-boxes and this limits the ability to exploit structural knowledge (e.g., physics and sparse interconnections). Composite functions of the form $f(x, y(x))$, wherein GP modeling is shifted from the performance function $f$ to an intermediate function $y$, offer an avenue for exploiting structural knowledge. However, the use of composite functions in a BO framework is complicated by the need to generate a probability density for $f$ from the Gaussian density of $y$ calculated by the GP (e.g., when $f$ is nonlinear it is not possible to obtain a closed-form expression). Previous work has handled this issue using sampling techniques; these are easy to implement and flexible but are computationally intensive. In this work, we introduce a new paradigm which allows for the efficient use of composite functions in BO; this uses adaptive linearizations of $f$ to obtain closed-form expressions for the statistical moments of the composite function. We show that this simple approach (which we call BOIS) enables the exploitation of structural knowledge, such as that arising in interconnected systems as well as systems that embed multiple GP models and combinations of physics and GP models. Using a chemical process optimization case study, we benchmark the effectiveness of BOIS against standard BO and sampling approaches. Our results indicate that BOIS achieves performance gains and accurately captures the statistics of composite functions.
    Visual In-Context Prompting. (arXiv:2311.13601v1 [cs.CV])
    In-context prompting in large language models (LLMs) has become a prevalent approach to improve zero-shot capabilities, but this idea is less explored in the vision domain. Existing visual prompting methods focus on referring segmentation to segment the most relevant object, falling short of addressing many generic vision tasks like open-set segmentation and detection. In this paper, we introduce a universal visual in-context prompting framework for both tasks. In particular, we build on top of an encoder-decoder architecture, and develop a versatile prompt encoder to support a variety of prompts like strokes, boxes, and points. We further enhance it to take an arbitrary number of reference image segments as the context. Our extensive explorations show that the proposed visual in-context prompting elicits extraordinary referring and generic segmentation capabilities to refer and detect, yielding competitive performance to close-set in-domain datasets and showing promising results on many open-set segmentation datasets. By joint training on COCO and SA-1B, our model achieves $57.7$ PQ on COCO and $23.2$ PQ on ADE20K. Code will be available at https://github.com/UX-Decoder/DINOv.
    ProlificDreamer: High-Fidelity and Diverse Text-to-3D Generation with Variational Score Distillation. (arXiv:2305.16213v2 [cs.LG] UPDATED)
    Score distillation sampling (SDS) has shown great promise in text-to-3D generation by distilling pretrained large-scale text-to-image diffusion models, but suffers from over-saturation, over-smoothing, and low-diversity problems. In this work, we propose to model the 3D parameter as a random variable instead of a constant as in SDS and present variational score distillation (VSD), a principled particle-based variational framework to explain and address the aforementioned issues in text-to-3D generation. We show that SDS is a special case of VSD and leads to poor samples with both small and large CFG weights. In comparison, VSD works well with various CFG weights as ancestral sampling from diffusion models and simultaneously improves the diversity and sample quality with a common CFG weight (i.e., $7.5$). We further present various improvements in the design space for text-to-3D such as distillation time schedule and density initialization, which are orthogonal to the distillation algorithm yet not well explored. Our overall approach, dubbed ProlificDreamer, can generate high rendering resolution (i.e., $512\times512$) and high-fidelity NeRF with rich structure and complex effects (e.g., smoke and drops). Further, initialized from NeRF, meshes fine-tuned by VSD are meticulously detailed and photo-realistic. Project page and codes: https://ml.cs.tsinghua.edu.cn/prolificdreamer/
    Video Face Re-Aging: Toward Temporally Consistent Face Re-Aging. (arXiv:2311.11642v1 [cs.CV] CROSS LISTED)
    Video face re-aging deals with altering the apparent age of a person to the target age in videos. This problem is challenging due to the lack of paired video datasets maintaining temporal consistency in identity and age. Most re-aging methods process each image individually without considering the temporal consistency of videos. While some existing works address the issue of temporal coherence through video facial attribute manipulation in latent space, they often fail to deliver satisfactory performance in age transformation. To tackle the issues, we propose (1) a novel synthetic video dataset that features subjects across a diverse range of age groups; (2) a baseline architecture designed to validate the effectiveness of our proposed dataset, and (3) the development of three novel metrics tailored explicitly for evaluating the temporal consistency of video re-aging techniques. Our comprehensive experiments on public datasets, such as VFHQ and CelebV-HQ, show that our method outperforms the existing approaches in terms of both age transformation and temporal consistency.
    Machine Translation to Control Formality Features in the Target Language. (arXiv:2311.13475v1 [cs.CL])
    Formality plays a significant role in language communication, especially in low-resource languages such as Hindi, Japanese and Korean. These languages utilise formal and informal expressions to convey messages based on social contexts and relationships. When a language translation technique is used to translate from a source language that does not pertain the formality (e.g. English) to a target language that does, there is a missing information on formality that could be a challenge in producing an accurate outcome. This research explores how this issue should be resolved when machine learning methods are used to translate from English to languages with formality, using Hindi as the example data. This was done by training a bilingual model in a formality-controlled setting and comparing its performance with a pre-trained multilingual model in a similar setting. Since there are not a lot of training data with ground truth, automated annotation techniques were employed to increase the data size. The primary modeling approach involved leveraging transformer models, which have demonstrated effectiveness in various natural language processing tasks. We evaluate the official formality accuracy(ACC) by comparing the predicted masked tokens with the ground truth. This metric provides a quantitative measure of how well the translations align with the desired outputs. Our study showcases a versatile translation strategy that considers the nuances of formality in the target language, catering to diverse language communication needs and scenarios.
    A Good Feature Extractor Is All You Need for Weakly Supervised Learning in Histopathology. (arXiv:2311.11772v2 [cs.CV] UPDATED)
    Deep learning is revolutionising pathology, offering novel opportunities in disease prognosis and personalised treatment. Historically, stain normalisation has been a crucial preprocessing step in computational pathology pipelines, and persists into the deep learning era. Yet, with the emergence of feature extractors trained using self-supervised learning (SSL) on diverse pathology datasets, we call this practice into question. In an empirical evaluation of publicly available feature extractors, we find that omitting stain normalisation and image augmentations does not compromise downstream performance, while incurring substantial savings in memory and compute. Further, we show that the top-performing feature extractors are remarkably robust to variations in stain and augmentations like rotation in their latent space. Contrary to previous patch-level benchmarking studies, our approach emphasises clinical relevance by focusing on slide-level prediction tasks in a weakly supervised setting with external validation cohorts. This work represents the most comprehensive robustness evaluation of public pathology SSL feature extractors to date, involving more than 6,000 training runs across nine tasks, five datasets, three downstream architectures, and various preprocessing setups. Our findings stand to streamline digital pathology workflows by minimising preprocessing needs and informing the selection of feature extractors.
    Bridging the Gap Between Offline and Online Reinforcement Learning Evaluation Methodologies. (arXiv:2212.08131v2 [cs.LG] UPDATED)
    Reinforcement learning (RL) has shown great promise with algorithms learning in environments with large state and action spaces purely from scalar reward signals. A crucial challenge for current deep RL algorithms is that they require a tremendous amount of environment interactions for learning. This can be infeasible in situations where such interactions are expensive; such as in robotics. Offline RL algorithms try to address this issue by bootstrapping the learning process from existing logged data without needing to interact with the environment from the very beginning. While online RL algorithms are typically evaluated as a function of the number of environment interactions, there exists no single established protocol for evaluating offline RL methods.In this paper, we propose a sequential approach to evaluate offline RL algorithms as a function of the training set size and thus by their data efficiency. Sequential evaluation provides valuable insights into the data efficiency of the learning process and the robustness of algorithms to distribution changes in the dataset while also harmonizing the visualization of the offline and online learning phases. Our approach is generally applicable and easy to implement. We compare several existing offline RL algorithms using this approach and present insights from a variety of tasks and offline datasets.
    On sampling determinantal and Pfaffian point processes on a quantum computer. (arXiv:2305.15851v3 [stat.CO] UPDATED)
    DPPs were introduced by Macchi as a model in quantum optics the 1970s. Since then, they have been widely used as models and subsampling tools in statistics and computer science. Most applications require sampling from a DPP, and given their quantum origin, it is natural to wonder whether sampling a DPP on a quantum computer is easier than on a classical one. We focus here on DPPs over a finite state space, which are distributions over the subsets of $\{1,\dots,N\}$ parametrized by an $N\times N$ Hermitian kernel matrix. Vanilla sampling consists in two steps, of respective costs $\mathcal{O}(N^3)$ and $\mathcal{O}(Nr^2)$ operations on a classical computer, where $r$ is the rank of the kernel matrix. A large first part of the current paper consists in explaining why the state-of-the-art in quantum simulation of fermionic systems already yields quantum DPP sampling algorithms. We then modify existing quantum circuits, and discuss their insertion in a full DPP sampling pipeline that starts from practical kernel specifications. The bottom line is that, with $P$ (classical) parallel processors, we can divide the preprocessing cost by $P$ and build a quantum circuit with $\mathcal{O}(Nr)$ gates that sample a given DPP, with depth varying from $\mathcal{O}(N)$ to $\mathcal{O}(r\log N)$ depending on qubit-communication constraints on the target machine. We also connect existing work on the simulation of superconductors to Pfaffian point processes, which generalize DPPs and would be a natural addition to the machine learner's toolbox. In particular, we describe "projective" Pfaffian point processes, the cardinality of which has constant parity, almost surely. Finally, the circuits are empirically validated on a classical simulator and on 5-qubit IBM machines.
    DNA-TEQ: An Adaptive Exponential Quantization of Tensors for DNN Inference. (arXiv:2306.16430v2 [cs.LG] UPDATED)
    Quantization is commonly used in Deep Neural Networks (DNNs) to reduce the storage and computational complexity by decreasing the arithmetical precision of activations and weights, a.k.a. tensors. Efficient hardware architectures employ linear quantization to enable the deployment of recent DNNs onto embedded systems and mobile devices. However, linear uniform quantization cannot usually reduce the numerical precision to less than 8 bits without sacrificing high performance in terms of model accuracy. The performance loss is due to the fact that tensors do not follow uniform distributions. In this paper, we show that a significant amount of tensors fit into an exponential distribution. Then, we propose DNA-TEQ to exponentially quantize DNN tensors with an adaptive scheme that achieves the best trade-off between numerical precision and accuracy loss. The experimental results show that DNA-TEQ provides a much lower quantization bit-width compared to previous proposals, resulting in an average compression ratio of 40% over the linear INT8 baseline, with negligible accuracy loss and without retraining the DNNs. Besides, DNA-TEQ leads the way in performing dot-product operations in the exponential domain, which saves 66% of energy consumption on average for a set of widely used DNNs.
    Patch-Mix Contrastive Learning with Audio Spectrogram Transformer on Respiratory Sound Classification. (arXiv:2305.14032v4 [eess.AS] UPDATED)
    Respiratory sound contains crucial information for the early diagnosis of fatal lung diseases. Since the COVID-19 pandemic, there has been a growing interest in contact-free medical care based on electronic stethoscopes. To this end, cutting-edge deep learning models have been developed to diagnose lung diseases; however, it is still challenging due to the scarcity of medical data. In this study, we demonstrate that the pretrained model on large-scale visual and audio datasets can be generalized to the respiratory sound classification task. In addition, we introduce a straightforward Patch-Mix augmentation, which randomly mixes patches between different samples, with Audio Spectrogram Transformer (AST). We further propose a novel and effective Patch-Mix Contrastive Learning to distinguish the mixed representations in the latent space. Our method achieves state-of-the-art performance on the ICBHI dataset, outperforming the prior leading score by an improvement of 4.08%.
    NeuroGraph: Benchmarks for Graph Machine Learning in Brain Connectomics. (arXiv:2306.06202v3 [cs.LG] UPDATED)
    Machine learning provides a valuable tool for analyzing high-dimensional functional neuroimaging data, and is proving effective in predicting various neurological conditions, psychiatric disorders, and cognitive patterns. In functional magnetic resonance imaging (MRI) research, interactions between brain regions are commonly modeled using graph-based representations. The potency of graph machine learning methods has been established across myriad domains, marking a transformative step in data interpretation and predictive modeling. Yet, despite their promise, the transposition of these techniques to the neuroimaging domain has been challenging due to the expansive number of potential preprocessing pipelines and the large parameter search space for graph-based dataset construction. In this paper, we introduce NeuroGraph, a collection of graph-based neuroimaging datasets, and demonstrated its utility for predicting multiple categories of behavioral and cognitive traits. We delve deeply into the dataset generation search space by crafting 35 datasets that encompass static and dynamic brain connectivity, running in excess of 15 baseline methods for benchmarking. Additionally, we provide generic frameworks for learning on both static and dynamic graphs. Our extensive experiments lead to several key observations. Notably, using correlation vectors as node features, incorporating larger number of regions of interest, and employing sparser graphs lead to improved performance. To foster further advancements in graph-based data driven neuroimaging analysis, we offer a comprehensive open-source Python package that includes the benchmark datasets, baseline implementations, model training, and standard evaluation.
    Why Shallow Networks Struggle with Approximating and Learning High Frequency: A Numerical Study. (arXiv:2306.17301v2 [cs.LG] UPDATED)
    In this work, a comprehensive numerical study involving analysis and experiments shows why a two-layer neural network has difficulties handling high frequencies in approximation and learning when machine precision and computation cost are important factors in real practice. In particular, the following basic computational issues are investigated: (1) the minimal numerical error one can achieve given a finite machine precision, (2) the computation cost to achieve a given accuracy, and (3) stability with respect to perturbations. The key to the study is the conditioning of the representation and its learning dynamics. Explicit answers to the above questions with numerical verifications are presented.
    Guided Flows for Generative Modeling and Decision Making. (arXiv:2311.13443v1 [cs.LG])
    Classifier-free guidance is a key component for improving the performance of conditional generative models for many downstream tasks. It drastically improves the quality of samples produced, but has so far only been used for diffusion models. Flow Matching (FM), an alternative simulation-free approach, trains Continuous Normalizing Flows (CNFs) based on regressing vector fields. It remains an open question whether classifier-free guidance can be performed for Flow Matching models, and to what extent does it improve performance. In this paper, we explore the usage of Guided Flows for a variety of downstream applications involving conditional image generation, speech synthesis, and reinforcement learning. In particular, we are the first to apply flow models to the offline reinforcement learning setting. We also show that Guided Flows significantly improves the sample quality in image generation and zero-shot text-to-speech synthesis, and can make use of drastically low amounts of computation without affecting the agent's overall performance.
    Curriculum Learning and Imitation Learning for Model-free Control on Financial Time-series. (arXiv:2311.13326v1 [cs.LG])
    Curriculum learning and imitation learning have been leveraged extensively in the robotics domain. However, minimal research has been done on leveraging these ideas on control tasks over highly stochastic time-series data. Here, we theoretically and empirically explore these approaches in a representative control task over complex time-series data. We implement the fundamental ideas of curriculum learning via data augmentation, while imitation learning is implemented via policy distillation from an oracle. Our findings reveal that curriculum learning should be considered a novel direction in improving control-task performance over complex time-series. Our ample random-seed out-sample empirics and ablation studies are highly encouraging for curriculum learning for time-series control. These findings are especially encouraging as we tune all overlapping hyperparameters on the baseline -- giving an advantage to the baseline. On the other hand, we find that imitation learning should be used with caution.
    Explaining high-dimensional text classifiers. (arXiv:2311.13454v1 [cs.LG])
    Explainability has become a valuable tool in the last few years, helping humans better understand AI-guided decisions. However, the classic explainability tools are sometimes quite limited when considering high-dimensional inputs and neural network classifiers. We present a new explainability method using theoretically proven high-dimensional properties in neural network classifiers. We present two usages of it: 1) On the classical sentiment analysis task for the IMDB reviews dataset, and 2) our Malware-Detection task for our PowerShell scripts dataset.
    Inferring Actual Treatment Pathways from Patient Records. (arXiv:2309.01897v2 [cs.LG] UPDATED)
    Treatment pathways are step-by-step plans outlining the recommended medical care for specific diseases; they get revised when different treatments are found to improve patient outcomes. Examining health records is an important part of this revision process, but inferring patients' actual treatments from health data is challenging due to complex event-coding schemes and the absence of pathway-related annotations. This study aims to infer the actual treatment steps for a particular patient group from administrative health records (AHR) - a common form of tabular healthcare data - and address several technique- and methodology-based gaps in treatment pathway-inference research. We introduce Defrag, a method for examining AHRs to infer the real-world treatment steps for a particular patient group. Defrag learns the semantic and temporal meaning of healthcare event sequences, allowing it to reliably infer treatment steps from complex healthcare data. To our knowledge, Defrag is the first pathway-inference method to utilise a neural network (NN), an approach made possible by a novel, self-supervised learning objective. We also developed a testing and validation framework for pathway inference, which we use to characterise and evaluate Defrag's pathway inference ability and compare against baselines. We demonstrate Defrag's effectiveness by identifying best-practice pathway fragments for breast cancer, lung cancer, and melanoma in public healthcare records. Additionally, we use synthetic data experiments to demonstrate the characteristics of the Defrag method, and to compare Defrag to several baselines where it significantly outperforms non-NN-based methods. Defrag significantly outperforms several existing pathway-inference methods and offers an innovative and effective approach for inferring treatment pathways from AHRs. Open-source code is provided to encourage further research in this area.
    Benchmarking Toxic Molecule Classification using Graph Neural Networks and Few Shot Learning. (arXiv:2311.13490v1 [q-bio.QM])
    Traditional methods like Graph Convolutional Networks (GCNs) face challenges with limited data and class imbalance, leading to suboptimal performance in graph classification tasks during toxicity prediction of molecules as a whole. To address these issues, we harness the power of Graph Isomorphic Networks, Multi Headed Attention and Free Large-scale Adversarial Augmentation separately on Graphs for precisely capturing the structural data of molecules and their toxicological properties. Additionally, we incorporate Few-Shot Learning to improve the model's generalization with limited annotated samples. Extensive experiments on a diverse toxicology dataset demonstrate that our method achieves an impressive state-of-art AUC-ROC value of 0.816, surpassing the baseline GCN model by 11.4%. This highlights the significance of our proposed methodology and Few Shot Learning in advancing Toxic Molecular Classification, with the potential to enhance drug discovery and environmental risk assessment processes.
    Leveraging Different Learning Styles for Improved Knowledge Distillation in Biomedical Imaging. (arXiv:2212.02931v3 [cs.CV] UPDATED)
    Learning style refers to a type of training mechanism adopted by an individual to gain new knowledge. As suggested by the VARK model, humans have different learning preferences, like Visual (V), Auditory (A), Read/Write (R), and Kinesthetic (K), for acquiring and effectively processing information. Our work endeavors to leverage this concept of knowledge diversification to improve the performance of model compression techniques like Knowledge Distillation (KD) and Mutual Learning (ML). Consequently, we use a single-teacher and two-student network in a unified framework that not only allows for the transfer of knowledge from teacher to students (KD) but also encourages collaborative learning between students (ML). Unlike the conventional approach, where the teacher shares the same knowledge in the form of predictions or feature representations with the student network, our proposed approach employs a more diversified strategy by training one student with predictions and the other with feature maps from the teacher. We further extend this knowledge diversification by facilitating the exchange of predictions and feature maps between the two student networks, enriching their learning experiences. We have conducted comprehensive experiments with three benchmark datasets for both classification and segmentation tasks using two different network architecture combinations. These experimental results demonstrate that knowledge diversification in a combined KD and ML framework outperforms conventional KD or ML techniques (with similar network configuration) that only use predictions with an average improvement of 2%. Furthermore, consistent improvement in performance across different tasks, with various network architectures, and over state-of-the-art techniques establishes the robustness and generalizability of the proposed model
    FedFN: Feature Normalization for Alleviating Data Heterogeneity Problem in Federated Learning. (arXiv:2311.13267v1 [cs.LG])
    Federated Learning (FL) is a collaborative method for training models while preserving data privacy in decentralized settings. However, FL encounters challenges related to data heterogeneity, which can result in performance degradation. In our study, we observe that as data heterogeneity increases, feature representation in the FedAVG model deteriorates more significantly compared to classifier weight. Additionally, we observe that as data heterogeneity increases, the gap between higher feature norms for observed classes, obtained from local models, and feature norms of unobserved classes widens, in contrast to the behavior of classifier weight norms. This widening gap extends to encompass the feature norm disparities between local and the global models. To address these issues, we introduce Federated Averaging with Feature Normalization Update (FedFN), a straightforward learning method. We demonstrate the superior performance of FedFN through extensive experiments, even when applied to pretrained ResNet18. Subsequently, we confirm the applicability of FedFN to foundation models.
    Synaptic Sampling of Neural Networks. (arXiv:2311.13038v1 [cs.AI])
    Probabilistic artificial neural networks offer intriguing prospects for enabling the uncertainty of artificial intelligence methods to be described explicitly in their function; however, the development of techniques that quantify uncertainty by well-understood methods such as Monte Carlo sampling has been limited by the high costs of stochastic sampling on deterministic computing hardware. Emerging computing systems that are amenable to hardware-level probabilistic computing, such as those that leverage stochastic devices, may make probabilistic neural networks more feasible in the not-too-distant future. This paper describes the scANN technique -- \textit{sampling (by coinflips) artificial neural networks} -- which enables neural networks to be sampled directly by treating the weights as Bernoulli coin flips. This method is natively well suited for probabilistic computing techniques that focus on tunable stochastic devices, nearly matches fully deterministic performance while also describing the uncertainty of correct and incorrect neural network outputs.
    MergeSFL: Split Federated Learning with Feature Merging and Batch Size Regulation. (arXiv:2311.13348v1 [cs.LG])
    Recently, federated learning (FL) has emerged as a popular technique for edge AI to mine valuable knowledge in edge computing (EC) systems. To mitigate the computing/communication burden on resource-constrained workers and protect model privacy, split federated learning (SFL) has been released by integrating both data and model parallelism. Despite resource limitations, SFL still faces two other critical challenges in EC, i.e., statistical heterogeneity and system heterogeneity. To address these challenges, we propose a novel SFL framework, termed MergeSFL, by incorporating feature merging and batch size regulation in SFL. Concretely, feature merging aims to merge the features from workers into a mixed feature sequence, which is approximately equivalent to the features derived from IID data and is employed to promote model accuracy. While batch size regulation aims to assign diverse and suitable batch sizes for heterogeneous workers to improve training efficiency. Moreover, MergeSFL explores to jointly optimize these two strategies upon their coupled relationship to better enhance the performance of SFL. Extensive experiments are conducted on a physical platform with 80 NVIDIA Jetson edge devices, and the experimental results show that MergeSFL can improve the final model accuracy by 5.82% to 26.22%, with a speedup by about 1.74x to 4.14x, compared to the baselines.  ( 2 min )
    Orchard: building large cancer phylogenies using stochastic combinatorial search. (arXiv:2311.12917v1 [q-bio.PE])
    Phylogenies depicting the evolutionary history of genetically heterogeneous subpopulations of cells from the same cancer i.e., cancer phylogenies, provide useful insights about cancer development and inform treatment. Cancer phylogenies can be reconstructed using data obtained from bulk DNA sequencing of multiple tissue samples from the same cancer. We introduce Orchard, a fast algorithm that reconstructs cancer phylogenies using point mutations detected in bulk DNA sequencing data. Orchard constructs cancer phylogenies progressively, one point mutation at a time, ultimately sampling complete phylogenies from a posterior distribution implied by the bulk DNA data. Orchard reconstructs more plausible phylogenies than state-of-the-art cancer phylogeny reconstruction methods on 90 simulated cancers and 14 B-progenitor acute lymphoblastic leukemias (B-ALLs). These results demonstrate that Orchard accurately reconstructs cancer phylogenies with up to 300 mutations. We then introduce a simple graph based clustering algorithm that uses a reconstructed phylogeny to infer unique groups of mutations i.e., mutation clusters, that characterize the genetic differences between cancer cell populations, and show that this approach is competitive with state-of-the-art mutation clustering methods.
    LIMIT: Less Is More for Instruction Tuning Across Evaluation Paradigms. (arXiv:2311.13133v1 [cs.LG])
    Large Language Models are traditionally finetuned on large instruction datasets. However recent studies suggest that small, high-quality datasets can suffice for general purpose instruction following. This lack of consensus surrounding finetuning best practices is in part due to rapidly diverging approaches to LLM evaluation. In this study, we ask whether a small amount of diverse finetuning samples can improve performance on both traditional perplexity-based NLP benchmarks, and on open-ended, model-based evaluation. We finetune open-source MPT-7B and MPT-30B models on instruction finetuning datasets of various sizes ranging from 1k to 60k samples. We find that subsets of 1k-6k instruction finetuning samples are sufficient to achieve good performance on both (1) traditional NLP benchmarks and (2) model-based evaluation. Finally, we show that mixing textbook-style and open-ended QA finetuning datasets optimizes performance on both evaluation paradigms.
    Testing Closeness of Multivariate Distributions via Ramsey Theory. (arXiv:2311.13154v1 [cs.DS])
    We investigate the statistical task of closeness (or equivalence) testing for multidimensional distributions. Specifically, given sample access to two unknown distributions $\mathbf p, \mathbf q$ on $\mathbb R^d$, we want to distinguish between the case that $\mathbf p=\mathbf q$ versus $\|\mathbf p-\mathbf q\|_{A_k} > \epsilon$, where $\|\mathbf p-\mathbf q\|_{A_k}$ denotes the generalized ${A}_k$ distance between $\mathbf p$ and $\mathbf q$ -- measuring the maximum discrepancy between the distributions over any collection of $k$ disjoint, axis-aligned rectangles. Our main result is the first closeness tester for this problem with {\em sub-learning} sample complexity in any fixed dimension and a nearly-matching sample complexity lower bound. In more detail, we provide a computationally efficient closeness tester with sample complexity $O\left((k^{6/7}/ \mathrm{poly}_d(\epsilon)) \log^d(k)\right)$. On the lower bound side, we establish a qualitatively matching sample complexity lower bound of $\Omega(k^{6/7}/\mathrm{poly}(\epsilon))$, even for $d=2$. These sample complexity bounds are surprising because the sample complexity of the problem in the univariate setting is $\Theta(k^{4/5}/\mathrm{poly}(\epsilon))$. This has the interesting consequence that the jump from one to two dimensions leads to a substantial increase in sample complexity, while increases beyond that do not. As a corollary of our general $A_k$ tester, we obtain $d_{\mathrm TV}$-closeness testers for pairs of $k$-histograms on $\mathbb R^d$ over a common unknown partition, and pairs of uniform distributions supported on the union of $k$ unknown disjoint axis-aligned rectangles. Both our algorithm and our lower bound make essential use of tools from Ramsey theory.
    Modular Blended Attention Network for Video Question Answering. (arXiv:2311.12866v1 [cs.CV])
    In multimodal machine learning tasks, it is due to the complexity of the assignments that the network structure, in most cases, is assembled in a sophisticated way. The holistic architecture can be separated into several logical parts according to the respective ends that the modules are devised to achieve. As the number of modalities of information representation increases, constructing ad hoc subnetworks for processing the data from divergent modalities while mediating the fusion of different information types has become a cumbersome and expensive problem. In this paper, we present an approach to facilitate the question with a reusable and composable neural unit; by connecting the units in series or parallel, the arduous network constructing of multimodal machine learning tasks will be accomplished in a much straightforward way. Additionally, through parameter sharing (weights replication) among the units, the space complexity will be significantly reduced. We have conducted experiments on three commonly used datasets; our method achieves impressive performance compared to several video QA baselines.
    Enhancing Scene Graph Generation with Hierarchical Relationships and Commonsense Knowledge. (arXiv:2311.12889v1 [cs.CV])
    This work presents an enhanced approach to generating scene graphs by incorporating a relationship hierarchy and commonsense knowledge. Specifically, we propose a Bayesian classification head that exploits an informative hierarchical structure. It jointly predicts the super-category or type of relationship between the two objects, along with the detailed relationship under each super-category. We design a commonsense validation pipeline that uses a large language model to critique the results from the scene graph prediction system and then use that feedback to enhance the model performance. The system requires no external large language model assistance at test time, making it more convenient for practical applications. Experiments on the Visual Genome and the OpenImage V6 datasets demonstrate that harnessing hierarchical relationships enhances the model performance by a large margin. The proposed Bayesian head can also be incorporated as a portable module in existing scene graph generation algorithms to improve their results. In addition, the commonsense validation enables the model to generate an extensive set of reasonable predictions beyond dataset annotations.
    A PSO Based Method to Generate Actionable Counterfactuals for High Dimensional Data. (arXiv:2311.12825v1 [cs.AI])
    Counterfactual explanations (CFE) are methods that explain a machine learning model by giving an alternate class prediction of a data point with some minimal changes in its features. It helps the users to identify their data attributes that caused an undesirable prediction like a loan or credit card rejection. We describe an efficient and an actionable counterfactual (CF) generation method based on particle swarm optimization (PSO). We propose a simple objective function for the optimization of the instance-centric CF generation problem. The PSO brings in a lot of flexibility in terms of carrying out multi-objective optimization in large dimensions, capability for multiple CF generation, and setting box constraints or immutability of data attributes. An algorithm is proposed that incorporates these features and it enables greater control over the proximity and sparsity properties over the generated CFs. The proposed algorithm is evaluated with a set of action-ability metrics in real-world datasets, and the results were superior compared to that of the state-of-the-arts.
    An Embodied Generalist Agent in 3D World. (arXiv:2311.12871v1 [cs.CV])
    Leveraging massive knowledge and learning schemes from large language models (LLMs), recent machine learning models show notable successes in building generalist agents that exhibit the capability of general-purpose task solving in diverse domains, including natural language processing, computer vision, and robotics. However, a significant challenge remains as these models exhibit limited ability in understanding and interacting with the 3D world. We argue this limitation significantly hinders the current models from performing real-world tasks and further achieving general intelligence. To this end, we introduce an embodied multi-modal and multi-task generalist agent that excels in perceiving, grounding, reasoning, planning, and acting in the 3D world. Our proposed agent, referred to as LEO, is trained with shared LLM-based model architectures, objectives, and weights in two stages: (i) 3D vision-language alignment and (ii) 3D vision-language-action instruction tuning. To facilitate the training, we meticulously curate and generate an extensive dataset comprising object-level and scene-level multi-modal tasks with exceeding scale and complexity, necessitating a deep understanding of and interaction with the 3D world. Through rigorous experiments, we demonstrate LEO's remarkable proficiency across a wide spectrum of tasks, including 3D captioning, question answering, embodied reasoning, embodied navigation, and robotic manipulation. Our ablation results further provide valuable insights for the development of future embodied generalist agents.
    Accelerating Inference in Molecular Diffusion Models with Latent Representations of Protein Structure. (arXiv:2311.13466v1 [q-bio.BM])
    Diffusion generative models have emerged as a powerful framework for addressing problems in structural biology and structure-based drug design. These models operate directly on 3D molecular structures. Due to the unfavorable scaling of graph neural networks (GNNs) with graph size as well as the relatively slow inference speeds inherent to diffusion models, many existing molecular diffusion models rely on coarse-grained representations of protein structure to make training and inference feasible. However, such coarse-grained representations discard essential information for modeling molecular interactions and impair the quality of generated structures. In this work, we present a novel GNN-based architecture for learning latent representations of molecular structure. When trained end-to-end with a diffusion model for de novo ligand design, our model achieves comparable performance to one with an all-atom protein representation while exhibiting a 3-fold reduction in inference time.
    On diffusion-based generative models and their error bounds: The log-concave case with full convergence estimates. (arXiv:2311.13584v1 [cs.LG])
    We provide full theoretical guarantees for the convergence behaviour of diffusion-based generative models under the assumption of strongly logconcave data distributions while our approximating class of functions used for score estimation is made of Lipschitz continuous functions. We demonstrate via a motivating example, sampling from a Gaussian distribution with unknown mean, the powerfulness of our approach. In this case, explicit estimates are provided for the associated optimization problem, i.e. score approximation, while these are combined with the corresponding sampling estimates. As a result, we obtain the best known upper bound estimates in terms of key quantities of interest, such as the dimension and rates of convergence, for the Wasserstein-2 distance between the data distribution (Gaussian with unknown mean) and our sampling algorithm. Beyond the motivating example and in order to allow for the use of a diverse range of stochastic optimizers, we present our results using an $L^2$-accurate score estimation assumption, which crucially is formed under an expectation with respect to the stochastic optimizer and our novel auxiliary process that uses only known information. This approach yields the best known convergence rate for our sampling algorithm.
    GeoCLIP: Clip-Inspired Alignment between Locations and Images for Effective Worldwide Geo-localization. (arXiv:2309.16020v2 [cs.CV] UPDATED)
    Worldwide Geo-localization aims to pinpoint the precise location of images taken anywhere on Earth. This task has considerable challenges due to immense variation in geographic landscapes. The image-to-image retrieval-based approaches fail to solve this problem on a global scale as it is not feasible to construct a large gallery of images covering the entire world. Instead, existing approaches divide the globe into discrete geographic cells, transforming the problem into a classification task. However, their performance is limited by the predefined classes and often results in inaccurate localizations when an image's location significantly deviates from its class center. To overcome these limitations, we propose GeoCLIP, a novel CLIP-inspired Image-to-GPS retrieval approach that enforces alignment between the image and its corresponding GPS locations. GeoCLIP's location encoder models the Earth as a continuous function by employing positional encoding through random Fourier features and constructing a hierarchical representation that captures information at varying resolutions to yield a semantically rich high-dimensional feature suitable to use even beyond geo-localization. To the best of our knowledge, this is the first work employing GPS encoding for geo-localization. We demonstrate the efficacy of our method via extensive experiments and ablations on benchmark datasets. We achieve competitive performance with just 20% of training data, highlighting its effectiveness even in limited-data settings. Furthermore, we qualitatively demonstrate geo-localization using a text query by leveraging CLIP backbone of our image encoder. The project webpage is available at: https://vicentevivan.github.io/GeoCLIP
    ViStruct: Visual Structural Knowledge Extraction via Curriculum Guided Code-Vision Representation. (arXiv:2311.13258v1 [cs.CV])
    State-of-the-art vision-language models (VLMs) still have limited performance in structural knowledge extraction, such as relations between objects. In this work, we present ViStruct, a training framework to learn VLMs for effective visual structural knowledge extraction. Two novel designs are incorporated. First, we propose to leverage the inherent structure of programming language to depict visual structural information. This approach enables explicit and consistent representation of visual structural information of multiple granularities, such as concepts, relations, and events, in a well-organized structured format. Second, we introduce curriculum-based learning for VLMs to progressively comprehend visual structures, from fundamental visual concepts to intricate event structures. Our intuition is that lower-level knowledge may contribute to complex visual structure understanding. Furthermore, we compile and release a collection of datasets tailored for visual structural knowledge extraction. We adopt a weakly-supervised approach to directly generate visual event structures from captions for ViStruct training, capitalizing on abundant image-caption pairs from the web. In experiments, we evaluate ViStruct on visual structure prediction tasks, demonstrating its effectiveness in improving the understanding of visual structures. The code is public at \url{https://github.com/Yangyi-Chen/vi-struct}.  ( 2 min )
    Confidant: Customizing Transformer-based LLMs via Collaborative Edge Training. (arXiv:2311.13381v1 [cs.LG])
    Transformer-based large language models (LLMs) have demonstrated impressive capabilities in a variety of natural language processing (NLP) tasks. Nonetheless, it is challenging to deploy and fine-tune LLMs on mobile edge devices with limited computing, memory, and energy budgets. In this paper, we propose Confidant, a multi-backend collaborative training framework for customizing state-of-the-art LLMs on commodity mobile devices like smartphones. Confidant partitions an LLM into several sub-models so that each fits into a mobile device's memory. A pipeline parallel training mechanism is further developed to ensure fast and efficient distributed training. In addition, we propose a novel backend scheduler to allocate different attention heads to heterogeneous compute hardware, including mobile CPU and GPUs, to maximize the compute resource utilization on each edge device. Our preliminary experimental results show that Confidant achieves at most 45.3% memory reduction and 8.03x inference speedup in practical settings.  ( 2 min )
    NeutronOrch: Rethinking Sample-based GNN Training under CPU-GPU Heterogeneous Environments. (arXiv:2311.13225v1 [cs.DC])
    Graph Neural Networks (GNNs) have demonstrated outstanding performance in various applications. Existing frameworks utilize CPU-GPU heterogeneous environments to train GNN models and integrate mini-batch and sampling techniques to overcome the GPU memory limitation. In CPU-GPU heterogeneous environments, we can divide sample-based GNN training into three steps: sample, gather, and train. Existing GNN systems use different task orchestrating methods to employ each step on CPU or GPU. After extensive experiments and analysis, we find that existing task orchestrating methods fail to fully utilize the heterogeneous resources, limited by inefficient CPU processing or GPU resource contention. In this paper, we propose NeutronOrch, a system for sample-based GNN training that incorporates a layer-based task orchestrating method and ensures balanced utilization of the CPU and GPU. NeutronOrch decouples the training process by layer and pushes down the training task of the bottom layer to the CPU. This significantly reduces the computational load and memory footprint of GPU training. To avoid inefficient CPU processing, NeutronOrch only offloads the training of frequently accessed vertices to the CPU and lets GPU reuse their embeddings with bounded staleness. Furthermore, NeutronOrch provides a fine-grained pipeline design for the layer-based task orchestrating method, fully overlapping different tasks on heterogeneous resources while strictly guaranteeing bounded staleness. The experimental results show that compared with the state-of-the-art GNN systems, NeutronOrch can achieve up to 4.61x performance speedup.  ( 2 min )
    ComPEFT: Compression for Communicating Parameter Efficient Updates via Sparsification and Quantization. (arXiv:2311.13171v1 [cs.LG])
    Parameter-efficient fine-tuning (PEFT) techniques make it possible to efficiently adapt a language model to create "expert" models that specialize to new tasks or domains. Recent techniques in model merging and compositional generalization leverage these expert models by dynamically composing modules to improve zero/few-shot generalization. Despite the efficiency of PEFT methods, the size of expert models can make it onerous to retrieve expert models per query over high-latency networks like the Internet or serve multiple experts on a single GPU. To address these issues, we present ComPEFT, a novel method for compressing fine-tuning residuals (task vectors) of PEFT based models. ComPEFT employs sparsification and ternary quantization to reduce the size of the PEFT module without performing any additional retraining while preserving or enhancing model performance. In extensive evaluation across T5, T0, and LLaMA-based models with 200M - 65B parameters, ComPEFT achieves compression ratios of 8x - 50x. In particular, we show that ComPEFT improves with scale - stronger models exhibit higher compressibility and better performance. For example, we show that ComPEFT applied to LLaMA outperforms QLoRA by 4.16% on MMLU with a storage size reduction of up to 26x. In addition, we show that the compressed experts produced by ComPEFT maintain few-shot compositional generalization capabilities, facilitate efficient communication and computation, and exhibit enhanced performance when merged. Lastly, we provide an analysis of different method components, compare it with other PEFT methods, and test ComPEFT's efficacy for compressing the residual of full-finetuning. Our code is available at https://github.com/prateeky2806/compeft.
    Revisiting Supervision for Continual Representation Learning. (arXiv:2311.13321v1 [cs.LG])
    In the field of continual learning, models are designed to learn tasks one after the other. While most research has centered on supervised continual learning, recent studies have highlighted the strengths of self-supervised continual representation learning. The improved transferability of representations built with self-supervised methods is often associated with the role played by the multi-layer perceptron projector. In this work, we depart from this observation and reexamine the role of supervision in continual representation learning. We reckon that additional information, such as human annotations, should not deteriorate the quality of representations. Our findings show that supervised models when enhanced with a multi-layer perceptron head, can outperform self-supervised models in continual representation learning.  ( 2 min )
    Improving performance of heart rate time series classification by grouping subjects. (arXiv:2311.13285v1 [cs.LG])
    Unlike the more commonly analyzed ECG or PPG data for activity classification, heart rate time series data is less detailed, often noisier and can contain missing data points. Using the BigIdeasLab_STEP dataset, which includes heart rate time series annotated with specific tasks performed by individuals, we sought to determine if general classification was achievable. Our analyses showed that the accuracy is sensitive to the choice of window/stride size. Moreover, we found variable classification performances between subjects due to differences in the physical structure of their hearts. Various techniques were used to minimize this variability. First of all, normalization proved to be a crucial step and significantly improved the performance. Secondly, grouping subjects and performing classification inside a group helped to improve performance and decrease inter-subject variability. Finally, we show that including handcrafted features as input to a deep learning (DL) network improves the classification performance further. Together, these findings indicate that heart rate time series can be utilized for classification tasks like predicting activity. However, normalization or grouping techniques need to be chosen carefully to minimize the issue of subject variability.  ( 2 min )
    Immunohistochemistry guided segmentation of benign epithelial cells, in situ lesions, and invasive epithelial cells in breast cancer slides. (arXiv:2311.13261v1 [eess.IV])
    Digital pathology enables automatic analysis of histopathological sections using artificial intelligence (AI). Automatic evaluation could improve diagnostic efficiency and help find associations between morphological features and clinical outcome. For development of such prediction models, identifying invasive epithelial cells, and separating these from benign epithelial cells and in situ lesions would be the first step. In this study, we aimed to develop an AI model for segmentation of epithelial cells in sections from breast cancer. We generated epithelial ground truth masks by restaining hematoxylin and eosin (HE) sections with cytokeratin (CK) AE1/AE3, and by pathologists' annotations. HE/CK image pairs were used to train a convolutional neural network, and data augmentation was used to make the model more robust. Tissue microarrays (TMAs) from 839 patients, and whole slide images from two patients were used for training and evaluation of the models. The sections were derived from four cohorts of breast cancer patients. TMAs from 21 patients from a fifth cohort was used as a second test set. In quantitative evaluation, a mean Dice score of 0.70, 0.79, and 0.75 for invasive epithelial cells, benign epithelial cells, and in situ lesions, respectively, were achieved. In qualitative scoring (0-5) by pathologists, results were best for all epithelium and invasive epithelium, with scores of 4.7 and 4.4. Scores for benign epithelium and in situ lesions were 3.7 and 2.0. The proposed model segmented epithelial cells in HE stained breast cancer slides well, but further work is needed for accurate division between the classes. Immunohistochemistry, together with pathologists' annotations, enabled the creation of accurate ground truths. The model is made freely available in FastPathology and the code is available at https://github.com/AICAN-Research/breast-epithelium-segmentation  ( 3 min )
    Towards Hetero-Client Federated Multi-Task Learning. (arXiv:2311.13250v1 [cs.CV])
    Federated Learning (FL) enables joint training across distributed clients using their local data privately. Federated Multi-Task Learning (FMTL) builds on FL to handle multiple tasks, assuming model congruity that identical model architecture is deployed in each client. To relax this assumption and thus extend real-world applicability, we introduce a novel problem setting, Hetero-Client Federated Multi-Task Learning (HC-FMTL), to accommodate diverse task setups. The main challenge of HC-FMTL is the model incongruity issue that invalidates conventional aggregation methods. It also escalates the difficulties in accurate model aggregation to deal with data and task heterogeneity inherent in FMTL. To address these challenges, we propose the FedHCA$^2$ framework, which allows for federated training of personalized models by modeling relationships among heterogeneous clients. Drawing on our theoretical insights into the difference between multi-task and federated optimization, we propose the Hyper Conflict-Averse Aggregation scheme to mitigate conflicts during encoder updates. Additionally, inspired by task interaction in MTL, the Hyper Cross Attention Aggregation scheme uses layer-wise cross attention to enhance decoder interactions while alleviating model incongruity. Moreover, we employ learnable Hyper Aggregation Weights for each client to customize personalized parameter updates. Extensive experiments demonstrate the superior performance of FedHCA$^2$ in various HC-FMTL scenarios compared to representative methods. Our code will be made publicly available.  ( 2 min )
    Recurrent neural networks and transfer learning for elasto-plasticity in woven composites. (arXiv:2311.13434v1 [cond-mat.mtrl-sci])
    As a surrogate for computationally intensive meso-scale simulation of woven composites, this article presents Recurrent Neural Network (RNN) models. Leveraging the power of transfer learning, the initialization challenges and sparse data issues inherent in cyclic shear strain loads are addressed in the RNN models. A mean-field model generates a comprehensive data set representing elasto-plastic behavior. In simulations, arbitrary six-dimensional strain histories are used to predict stresses under random walking as the source task and cyclic loading conditions as the target task. Incorporating sub-scale properties enhances RNN versatility. In order to achieve accurate predictions, the model uses a grid search method to tune network architecture and hyper-parameter configurations. The results of this study demonstrate that transfer learning can be used to effectively adapt the RNN to varying strain conditions, which establishes its potential as a useful tool for modeling path-dependent responses in woven composites.  ( 2 min )
    Learning principle and mathematical realization of the learning mechanism in the brain. (arXiv:2311.13341v1 [cs.LG])
    While deep learning has achieved remarkable success, there is no clear explanation about why it works so well. In order to discuss this question quantitatively, we need a mathematical framework that explains what learning is in the first place. After several considerations, we succeeded in constructing a mathematical framework that can provide a unified understanding of all types of learning, including deep learning and learning in the brain. We call it learning principle, and it follows that all learning is equivalent to estimating the probability of input data. We not only derived this principle, but also mentioned its application to actual machine learning models. For example, we found that conventional supervised learning is equivalent to estimating conditional probabilities, and succeeded in making supervised learning more effective and generalized. We also proposed a new method of defining the values of estimated probability using differentiation, and showed that unsupervised learning can be performed on arbitrary dataset without any prior knowledge. Namely, this method is a general-purpose machine learning in the true sense. Moreover, we succeeded in describing the learning mechanism in the brain by considering the time evolution of a fully or partially connected model and applying this new method. The learning principle provides solutions to many unsolved problems in deep learning and cognitive neuroscience.  ( 2 min )
    An Empirical Study of Uncertainty Estimation Techniques for Detecting Drift in Data Streams. (arXiv:2311.13374v1 [cs.LG])
    In safety-critical domains such as autonomous driving and medical diagnosis, the reliability of machine learning models is crucial. One significant challenge to reliability is concept drift, which can cause model deterioration over time. Traditionally, drift detectors rely on true labels, which are often scarce and costly. This study conducts a comprehensive empirical evaluation of using uncertainty values as substitutes for error rates in detecting drifts, aiming to alleviate the reliance on labeled post-deployment data. We examine five uncertainty estimation methods in conjunction with the ADWIN detector across seven real-world datasets. Our results reveal that while the SWAG method exhibits superior calibration, the overall accuracy in detecting drifts is not notably impacted by the choice of uncertainty estimation method, with even the most basic method demonstrating competitive performance. These findings offer valuable insights into the practical applicability of uncertainty-based drift detection in real-world, safety-critical applications.  ( 2 min )
    Joint Multi-View Collaborative Clustering. (arXiv:2311.12859v1 [cs.CV])
    Data is increasingly being collected from multiple sources and described by multiple views. These multi-view data provide richer information than traditional single-view data. Fusing the former for specific tasks is an essential component of multi-view clustering. Since the goal of multi-view clustering algorithms is to discover the common latent structure shared by multiple views, the majority of proposed solutions overlook the advantages of incorporating knowledge derived from horizontal collaboration between multi-view data and the final consensus. To fill this gap, we propose the Joint Multi-View Collaborative Clustering (JMVCC) solution, which involves the generation of basic partitions using Non-negative Matrix Factorization (NMF) and the horizontal collaboration principle, followed by the fusion of these local partitions using ensemble clustering. Furthermore, we propose a weighting method to reduce the risk of negative collaboration (i.e., views with low quality) during the generation and fusion of local partitions. The experimental results, which were obtained using a variety of data sets, demonstrate that JMVCC outperforms other multi-view clustering algorithms and is robust to noisy views.  ( 2 min )
    Weak-Form Latent Space Dynamics Identification. (arXiv:2311.12880v1 [eess.SY])
    Recent work in data-driven modeling has demonstrated that a weak formulation of model equations enhances the noise robustness of a wide range of computational methods. In this paper, we demonstrate the power of the weak form to enhance the LaSDI (Latent Space Dynamics Identification) algorithm, a recently developed data-driven reduced order modeling technique. We introduce a weak form-based version WLaSDI (Weak-form Latent Space Dynamics Identification). WLaSDI first compresses data, then projects onto the test functions and learns the local latent space models. Notably, WLaSDI demonstrates significantly enhanced robustness to noise. With WLaSDI, the local latent space is obtained using weak-form equation learning techniques. Compared to the standard sparse identification of nonlinear dynamics (SINDy) used in LaSDI, the variance reduction of the weak form guarantees a robust and precise latent space recovery, hence allowing for a fast, robust, and accurate simulation. We demonstrate the efficacy of WLaSDI vs. LaSDI on several common benchmark examples including viscid and inviscid Burgers', radial advection, and heat conduction. For instance, in the case of 1D inviscid Burgers' simulations with the addition of up to 100% Gaussian white noise, the relative error remains consistently below 6% for WLaSDI, while it can exceed 10,000% for LaSDI. Similarly, for radial advection simulations, the relative errors stay below 15% for WLaSDI, in stark contrast to the potential errors of up to 10,000% with LaSDI. Moreover, speedups of several orders of magnitude can be obtained with WLaSDI. For example applying WLaSDI to 1D Burgers' yields a 140X speedup compared to the corresponding full order model. Python code to reproduce the results in this work is available at (https://github.com/MathBioCU/PyWSINDy_ODE) and (https://github.com/MathBioCU/PyWLaSDI).  ( 3 min )
    The Case for Universal Basic Computing Power. (arXiv:2311.12872v1 [cs.AI])
    The Universal Basic Computing Power (UBCP) initiative ensures global, free access to a set amount of computing power specifically for AI research and development (R&D). This initiative comprises three key elements. First, UBCP must be cost free, with its usage limited to AI R&D and minimal additional conditions. Second, UBCP should continually incorporate the state of the art AI advancements, including efficiently distilled, compressed, and deployed training data, foundational models, benchmarks, and governance tools. Lastly, it's essential for UBCP to be universally accessible, ensuring convenience for all users. We urge major stakeholders in AI development large platforms, open source contributors, and policymakers to prioritize the UBCP initiative.  ( 2 min )
    A Review of Deep Reinforcement Learning in Serverless Computing: Function Scheduling and Resource Auto-Scaling. (arXiv:2311.12839v1 [cs.DC])
    In the rapidly evolving field of serverless computing, efficient function scheduling and resource scaling are critical for optimizing performance and cost. This paper presents a comprehensive review of the application of Deep Reinforcement Learning (DRL) techniques in these areas. We begin by providing an overview of serverless computing, highlighting its benefits and challenges, with a particular focus on function scheduling and resource scaling. We then delve into the principles of deep reinforcement learning (DRL) and its potential for addressing these challenges. A systematic review of recent studies applying DRL to serverless computing is presented, covering various algorithms, models, and performances. Our analysis reveals that DRL, with its ability to learn and adapt from an environment, shows promising results in improving the efficiency of function scheduling and resource scaling in serverless computing. However, several challenges remain, including the need for more realistic simulation environments, handling of cold starts, and the trade-off between learning time and scheduling performance. We conclude by discussing potential future directions for this research area, emphasizing the need for more robust DRL models, better benchmarking methods, and the exploration of multi-agent reinforcement learning for more complex serverless architectures. This review serves as a valuable resource for researchers and practitioners aiming to understand and advance the application of DRL in serverless computing.  ( 2 min )
    Attribute-Aware Deep Hashing with Self-Consistency for Large-Scale Fine-Grained Image Retrieval. (arXiv:2311.12894v1 [cs.IR])
    Our work focuses on tackling large-scale fine-grained image retrieval as ranking the images depicting the concept of interests (i.e., the same sub-category labels) highest based on the fine-grained details in the query. It is desirable to alleviate the challenges of both fine-grained nature of small inter-class variations with large intra-class variations and explosive growth of fine-grained data for such a practical task. In this paper, we propose attribute-aware hashing networks with self-consistency for generating attribute-aware hash codes to not only make the retrieval process efficient, but also establish explicit correspondences between hash codes and visual attributes. Specifically, based on the captured visual representations by attention, we develop an encoder-decoder structure network of a reconstruction task to unsupervisedly distill high-level attribute-specific vectors from the appearance-specific visual representations without attribute annotations. Our models are also equipped with a feature decorrelation constraint upon these attribute vectors to strengthen their representative abilities. Then, driven by preserving original entities' similarity, the required hash codes can be generated from these attribute-specific vectors and thus become attribute-aware. Furthermore, to combat simplicity bias in deep hashing, we consider the model design from the perspective of the self-consistency principle and propose to further enhance models' self-consistency by equipping an additional image reconstruction path. Comprehensive quantitative experiments under diverse empirical settings on six fine-grained retrieval datasets and two generic retrieval datasets show the superiority of our models over competing methods.  ( 3 min )
    Advancing The Rate-Distortion-Computation Frontier For Neural Image Compression. (arXiv:2311.12821v1 [cs.CV])
    The rate-distortion performance of neural image compression models has exceeded the state-of-the-art for non-learned codecs, but neural codecs are still far from widespread deployment and adoption. The largest obstacle is having efficient models that are feasible on a wide variety of consumer hardware. Comparative research and evaluation is difficult due to the lack of standard benchmarking platforms and due to variations in hardware architectures and test environments. Through our rate-distortion-computation (RDC) study we demonstrate that neither floating-point operations (FLOPs) nor runtime are sufficient on their own to accurately rank neural compression methods. We also explore the RDC frontier, which leads to a family of model architectures with the best empirical trade-off between computational requirements and RD performance. Finally, we identify a novel neural compression architecture that yields state-of-the-art RD performance with rate savings of 23.1% over BPG (7.0% over VTM and 3.0% over ELIC) without requiring significantly more FLOPs than other learning-based codecs.  ( 2 min )
    Progression and Challenges of IoT in Healthcare: A Short Review. (arXiv:2311.12869v1 [cs.CR])
    Smart healthcare, an integral element of connected living, plays a pivotal role in fulfilling a fundamental human need. The burgeoning field of smart healthcare is poised to generate substantial revenue in the foreseeable future. Its multifaceted framework encompasses vital components such as the Internet of Things (IoT), medical sensors, artificial intelligence (AI), edge and cloud computing, as well as next-generation wireless communication technologies. Many research papers discuss smart healthcare and healthcare more broadly. Numerous nations have strategically deployed the Internet of Medical Things (IoMT) alongside other measures to combat the propagation of COVID-19. This combined effort has not only enhanced the safety of frontline healthcare workers but has also augmented the overall efficacy in managing the pandemic, subsequently reducing its impact on human lives and mortality rates. Remarkable strides have been made in both applications and technology within the IoMT domain. However, it is imperative to acknowledge that this technological advancement has introduced certain challenges, particularly in the realm of security. The rapid and extensive adoption of IoMT worldwide has magnified issues related to security and privacy. These encompass a spectrum of concerns, ranging from replay attacks, man-in-the-middle attacks, impersonation, privileged insider threats, remote hijacking, password guessing, and denial of service (DoS) attacks, to malware incursions. In this comprehensive review, we undertake a comparative analysis of existing strategies designed for the detection and prevention of malware in IoT environments.  ( 3 min )
    Towards the generation of synchronized and believable non-verbal facial behaviors of a talking virtual agent. (arXiv:2311.12804v1 [cs.CV])
    This paper introduces a new model to generate rhythmically relevant non-verbal facial behaviors for virtual agents while they speak. The model demonstrates perceived performance comparable to behaviors directly extracted from the data and replayed on a virtual agent, in terms of synchronization with speech and believability. Interestingly, we found that training the model with two different sets of data, instead of one, did not necessarily improve its performance. The expressiveness of the people in the dataset and the shooting conditions are key elements. We also show that employing an adversarial model, in which fabricated fake examples are introduced during the training phase, increases the perception of synchronization with speech. A collection of videos demonstrating the results and code can be accessed at: https://github.com/aldelb/non_verbal_facial_animation.  ( 2 min )
    Combining low-dose CT-based radiomics and metabolomics for early lung cancer screening support. (arXiv:2311.12810v1 [cs.CV])
    Due to its predominantly asymptomatic or mildly symptomatic progression, lung cancer is often diagnosed in advanced stages, resulting in poorer survival rates for patients. As with other cancers, early detection significantly improves the chances of successful treatment. Early diagnosis can be facilitated through screening programs designed to detect lung tissue tumors when they are still small, typically around 3mm in size. However, the analysis of extensive screening program data is hampered by limited access to medical experts. In this study, we developed a procedure for identifying potential malignant neoplastic lesions within lung parenchyma. The system leverages machine learning (ML) techniques applied to two types of measurements: low-dose Computed Tomography-based radiomics and metabolomics. Using data from two Polish screening programs, two ML algorithms were tested, along with various integration methods, to create a final model that combines both modalities to support lung cancer screening.  ( 2 min )
    Targeted Activation Penalties Help CNNs Ignore Spurious Signals. (arXiv:2311.12813v1 [cs.CV])
    Neural networks (NNs) can learn to rely on spurious signals in the training data, leading to poor generalisation. Recent methods tackle this problem by training NNs with additional ground-truth annotations of such signals. These methods may, however, let spurious signals re-emerge in deep convolutional NNs (CNNs). We propose Targeted Activation Penalty (TAP), a new method tackling the same problem by penalising activations to control the re-emergence of spurious signals in deep CNNs, while also lowering training times and memory usage. In addition, ground-truth annotations can be expensive to obtain. We show that TAP still works well with annotations generated by pre-trained models as effective substitutes of ground-truth annotations. We demonstrate the power of TAP against two state-of-the-art baselines on the MNIST benchmark and on two clinical image datasets, using four different CNN architectures.  ( 2 min )
    Meticulously Selecting 1% of the Dataset for Pre-training! Generating Differentially Private Images Data with Semantics Query. (arXiv:2311.12850v1 [cs.CV])
    Differential Privacy (DP) image data synthesis, which leverages the DP technique to generate synthetic data to replace the sensitive data, allowing organizations to share and utilize synthetic images without privacy concerns. Previous methods incorporate the advanced techniques of generative models and pre-training on a public dataset to produce exceptional DP image data, but suffer from problems of unstable training and massive computational resource demands. This paper proposes a novel DP image synthesis method, termed PRIVIMAGE, which meticulously selects pre-training data, promoting the efficient creation of DP datasets with high fidelity and utility. PRIVIMAGE first establishes a semantic query function using a public dataset. Then, this function assists in querying the semantic distribution of the sensitive dataset, facilitating the selection of data from the public dataset with analogous semantics for pre-training. Finally, we pre-train an image generative model using the selected data and then fine-tune this model on the sensitive dataset using Differentially Private Stochastic Gradient Descent (DP-SGD). PRIVIMAGE allows us to train a lightly parameterized generative model, reducing the noise in the gradient during DP-SGD training and enhancing training stability. Extensive experiments demonstrate that PRIVIMAGE uses only 1% of the public dataset for pre-training and 7.6% of the parameters in the generative model compared to the state-of-the-art method, whereas achieves superior synthetic performance and conserves more computational resources. On average, PRIVIMAGE achieves 30.1% lower FID and 12.6% higher Classification Accuracy than the state-of-the-art method. The replication package and datasets can be accessed online.  ( 3 min )
    Personalization of Affective Models to Enable Neuropsychiatric Digital Precision Health Interventions: A Feasibility Study. (arXiv:2311.12812v1 [cs.CV])
    Mobile digital therapeutics for autism spectrum disorder (ASD) often target emotion recognition and evocation, which is a challenge for children with ASD. While such mobile applications often use computer vision machine learning (ML) models to guide the adaptive nature of the digital intervention, a single model is usually deployed and applied to all children. Here, we explore the potential of model personalization, or training a single emotion recognition model per person, to improve the performance of these underlying emotion recognition models used to guide digital health therapies for children with ASD. We conducted experiments on the Emognition dataset, a video dataset of human subjects evoking a series of emotions. For a subset of 10 individuals in the dataset with a sufficient representation of at least two ground truth emotion labels, we trained a personalized version of three classical ML models on a set of 51 features extracted from each video frame. We measured the importance of each facial feature for all personalized models and observed differing ranked lists of top features across subjects, motivating the need for model personalization. We then compared the personalized models against a generalized model trained using data from all 10 participants. The mean F1-scores achieved by the personalized models were 90.48%, 92.66%, and 86.40%, respectively. By contrast, the mean F1-scores reached by non-personalized models trained on different human subjects and evaluated using the same test set were 88.55%, 91.78%, and 80.42%, respectively. The personalized models outperformed the generalized models for 7 out of 10 participants. PCA analyses on the remaining 3 participants revealed relatively facial configuration differences between emotion labels within each subject, suggesting that personalized ML will fail when the variation among data points within a subjects data is too low.  ( 3 min )
    Adaptive Bayesian Learning with Action and State-Dependent Signal Variance. (arXiv:2311.12878v1 [stat.ME])
    This manuscript presents an advanced framework for Bayesian learning by incorporating action and state-dependent signal variances into decision-making models. This framework is pivotal in understanding complex data-feedback loops and decision-making processes in various economic systems. Through a series of examples, we demonstrate the versatility of this approach in different contexts, ranging from simple Bayesian updating in stable environments to complex models involving social learning and state-dependent uncertainties. The paper uniquely contributes to the understanding of the nuanced interplay between data, actions, outcomes, and the inherent uncertainty in economic models.  ( 2 min )
    RAEDiff: Denoising Diffusion Probabilistic Models Based Reversible Adversarial Examples Self-Generation and Self-Recovery. (arXiv:2311.12858v1 [cs.CR])
    Collected and annotated datasets, which are obtained through extensive efforts, are effective for training Deep Neural Network (DNN) models. However, these datasets are susceptible to be misused by unauthorized users, resulting in infringement of Intellectual Property (IP) rights owned by the dataset creators. Reversible Adversarial Exsamples (RAE) can help to solve the issues of IP protection for datasets. RAEs are adversarial perturbed images that can be restored to the original. As a cutting-edge approach, RAE scheme can serve the purposes of preventing unauthorized users from engaging in malicious model training, as well as ensuring the legitimate usage of authorized users. Nevertheless, in the existing work, RAEs still rely on the embedded auxiliary information for restoration, which may compromise their adversarial abilities. In this paper, a novel self-generation and self-recovery method, named as RAEDiff, is introduced for generating RAEs based on a Denoising Diffusion Probabilistic Models (DDPM). It diffuses datasets into a Biased Gaussian Distribution (BGD) and utilizes the prior knowledge of the DDPM for generating and recovering RAEs. The experimental results demonstrate that RAEDiff effectively self-generates adversarial perturbations for DNN models, including Artificial Intelligence Generated Content (AIGC) models, while also exhibiting significant self-recovery capabilities.  ( 2 min )
    ECNR: Efficient Compressive Neural Representation of Time-Varying Volumetric Datasets. (arXiv:2311.12831v1 [cs.CV])
    Due to its conceptual simplicity and generality, compressive neural representation has emerged as a promising alternative to traditional compression methods for managing massive volumetric datasets. The state-of-the-art neural compression solution, neurcomp, however, utilizes a single large multilayer perceptron (MLP) to encode the global volume, incurring slow training and inference. This paper presents an efficient compressive neural representation (ECNR) solution that improves upon neurcomp to handle large-scale time-varying datasets. At the heart of our approach is a multiscale structure that uses the Laplacian pyramid for adaptive signal fitting via implicit neural representation. We leverage multiple small MLPs at each scale for fitting local content or residual blocks. By assigning similar blocks to the same MLP via size uniformization, we enable balanced parallelization among MLPs to significantly speed up training and inference. A deep compression strategy is then employed to compact the resulting model. We demonstrate the effectiveness of ECNR with multiple datasets and compare it with neurcomp and two state-of-the-art conventional compression methods (SZ3 and TTHRESH). Our results position ECNR as a promising alternative to neurcomp for scientific data compression.  ( 2 min )
    Evolution of Convolutional Neural Network (CNN): Compute vs Memory bandwidth for Edge AI. (arXiv:2311.12816v1 [cs.CV])
    Convolutional Neural Networks (CNNs) have greatly influenced the field of Embedded Vision and Edge Artificial Intelligence (AI), enabling powerful machine learning capabilities on resource-constrained devices. This article explores the relationship between CNN compute requirements and memory bandwidth in the context of Edge AI. We delve into the historical progression of CNN architectures, from the early pioneering models to the current state-of-the-art designs, highlighting the advancements in compute-intensive operations. We examine the impact of increasing model complexity on both computational requirements and memory access patterns. The paper presents a comparison analysis of the evolving trade-off between compute demands and memory bandwidth requirements in CNNs. This analysis provides insights into designing efficient architectures and potential hardware accelerators in enhancing CNN performance on edge devices.  ( 2 min )
    Reducing the Environmental Impact of Wireless Communication via Probabilistic Machine Learning. (arXiv:2311.12807v1 [cs.NI])
    Machine learning methods are increasingly adopted in communications problems, particularly those arising in next generation wireless settings. Though seen as a key climate mitigation and societal adaptation enabler, communications related energy consumption is high and is expected to grow in future networks in spite of anticipated efficiency gains in 6G due to exponential communications traffic growth. To make meaningful climate mitigation impact in the communications sector, a mindset shift away from maximizing throughput at all cost and towards prioritizing energy efficiency is needed. Moreover, this must be adopted in both existing (without incurring further embodied carbon costs through equipment replacement) and future network infrastructure, given the long development time of mobile generations. To that end, we present summaries of two such problems, from both current and next generation network specifications, where probabilistic inference methods were used to great effect: using Bayesian parameter tuning we are able to safely reduce the energy consumption of existing hardware on a live communications network by $11\%$ whilst maintaining operator specified performance envelopes; through spatiotemporal Gaussian process surrogate modeling we reduce the overhead in a next generation hybrid beamforming system by over $60\%$, greatly improving the networks' ability to target highly mobile users such as autonomous vehicles. The Bayesian paradigm is itself helpful in terms of energy usage, since training a Bayesian optimization model can require much less computation than, say, training a deep neural network.  ( 2 min )
    MatGD: Materials Graph Digitizer. (arXiv:2311.12806v1 [cs.CV])
    We have developed MatGD (Material Graph Digitizer), which is a tool for digitizing a data line from scientific graphs. The algorithm behind the tool consists of four steps: (1) identifying graphs within subfigures, (2) separating axes and data sections, (3) discerning the data lines by eliminating irrelevant graph objects and matching with the legend, and (4) data extraction and saving. From the 62,534 papers in the areas of batteries, catalysis, and MOFs, 501,045 figures were mined. Remarkably, our tool showcased performance with over 99% accuracy in legend marker and text detection. Moreover, its capability for data line separation stood at 66%, which is much higher compared to other existing figure mining tools. We believe that this tool will be integral to collecting both past and future data from publications, and these data can be used to train various machine learning models that can enhance material predictions and new materials discovery.  ( 2 min )
    HydraScreen: A Generalizable Structure-Based Deep Learning Approach to Drug Discovery. (arXiv:2311.12814v1 [q-bio.BM])
    We propose HydraScreen, a deep-learning approach that aims to provide a framework for more robust machine-learning-accelerated drug discovery. HydraScreen utilizes a state-of-the-art 3D convolutional neural network, designed for the effective representation of molecular structures and interactions in protein-ligand binding. We design an end-to-end pipeline for high-throughput screening and lead optimization, targeting applications in structure-based drug design. We assess our approach using established public benchmarks based on the CASF 2016 core set, achieving top-tier results in affinity and pose prediction (Pearson's r = 0.86, RMSE = 1.15, Top-1 = 0.95). Furthermore, we utilize a novel interaction profiling approach to identify potential biases in the model and dataset to boost interpretability and support the unbiased nature of our method. Finally, we showcase HydraScreen's capacity to generalize across unseen proteins and ligands, offering directions for future development of robust machine learning scoring functions. HydraScreen (accessible at https://hydrascreen.ro5.ai) provides a user-friendly GUI and a public API, facilitating easy assessment of individual protein-ligand complexes.  ( 2 min )
    MiniAnDE: a reduced AnDE ensemble to deal with microarray data. (arXiv:2311.12879v1 [q-bio.QM])
    This article focuses on the supervised classification of datasets with a large number of variables and a small number of instances. This is the case, for example, for microarray data sets commonly used in bioinformatics. Complex classifiers that require estimating statistics over many variables are not suitable for this type of data. Probabilistic classifiers with low-order probability tables, e.g. NB and AODE, are good alternatives for dealing with this type of data. AODE usually improves NB in accuracy, but suffers from high spatial complexity since $k$ models, each with $n+1$ variables, are included in the AODE ensemble. In this paper, we propose MiniAnDE, an algorithm that includes only a small number of heterogeneous base classifiers in the ensemble, i.e., each model only includes a different subset of the $k$ predictive variables. Experimental evaluation shows that using MiniAnDE classifiers on microarray data is feasible and outperforms NB and other ensembles such as bagging and random forest.  ( 2 min )
    Fixing the problems of deep neural networks will require better training data and learning algorithms. (arXiv:2311.12819v1 [cs.CV])
    Bowers and colleagues argue that DNNs are poor models of biological vision because they often learn to rival human accuracy by relying on strategies that differ markedly from those of humans. We show that this problem is worsening as DNNs are becoming larger-scale and increasingly more accurate, and prescribe methods for building DNNs that can reliably model biological vision.  ( 2 min )
  • Open

    Multi-fidelity Bayesian Optimization in Engineering Design. (arXiv:2311.13050v1 [cs.CE])
    Resided at the intersection of multi-fidelity optimization (MFO) and Bayesian optimization (BO), MF BO has found a niche in solving expensive engineering design optimization problems, thanks to its advantages in incorporating physical and mathematical understandings of the problems, saving resources, addressing exploitation-exploration trade-off, considering uncertainty, and processing parallel computing. The increasing number of works dedicated to MF BO suggests the need for a comprehensive review of this advanced optimization technique. In this paper, we survey recent developments of two essential ingredients of MF BO: Gaussian process (GP) based MF surrogates and acquisition functions. We first categorize the existing MF modeling methods and MFO strategies to locate MF BO in a large family of surrogate-based optimization and MFO algorithms. We then exploit the common properties shared between the methods from each ingredient of MF BO to describe important GP-based MF surrogate models and review various acquisition functions. By doing so, we expect to provide a structured understanding of MF BO. Finally, we attempt to reveal important aspects that require further research for applications of MF BO in solving intricate yet important design optimization problems, including constrained optimization, high-dimensional optimization, optimization under uncertainty, and multi-objective optimization.  ( 2 min )
    W-kernel and essential subspace for frequencist's evaluation of Bayesian estimators. (arXiv:2311.13017v1 [stat.ME])
    The posterior covariance matrix W defined by the log-likelihood of each observation plays important roles both in the sensitivity analysis and frequencist's evaluation of the Bayesian estimators. This study focused on the matrix W and its principal space; we term the latter as an essential subspace. First, it is shown that they appear in various statistical settings, such as the evaluation of the posterior sensitivity, assessment of the frequencist's uncertainty from posterior samples, and stochastic expansion of the loss; a key tool to treat frequencist's properties is the recently proposed Bayesian infinitesimal jackknife approximation (Giordano and Broderick (2023)). In the following part, we show that the matrix W can be interpreted as a reproducing kernel; it is named as W-kernel. Using the W-kernel, the essential subspace is expressed as a principal space given by the kernel PCA. A relation to the Fisher kernel and neural tangent kernel is established, which elucidates the connection to the classical asymptotic theory; it also leads to a sort of Bayesian-frequencist's duality. Finally, two applications, selection of a representative set of observations and dimensional reduction in the approximate bootstrap, are discussed. In the former, incomplete Cholesky decomposition is introduced as an efficient method to compute the essential subspace. In the latter, different implementations of the approximate bootstrap for posterior means are compared.  ( 2 min )
    Hinge-Wasserstein: Mitigating Overconfidence in Regression by Classification. (arXiv:2306.00560v2 [cs.LG] UPDATED)
    Computer vision systems that are deployed in safety-critical applications need to quantify their output uncertainty. We study regression from images to parameter values and here it is common to detect uncertainty by predicting probability distributions. In this context, we investigate the regression-by-classification paradigm which can represent multimodal distributions, without a prior assumption on the number of modes. Through experiments on a specifically designed synthetic dataset, we demonstrate that traditional loss functions lead to poor probability distribution estimates and severe overconfidence, in the absence of full ground truth distributions. In order to alleviate these issues, we propose hinge-Wasserstein -- a simple improvement of the Wasserstein loss that reduces the penalty for weak secondary modes during training. This enables prediction of complex distributions with multiple modes, and allows training on datasets where full ground truth distributions are not available. In extensive experiments, we show that the proposed loss leads to substantially better uncertainty estimation on two challenging computer vision tasks: horizon line detection and stereo disparity estimation.
    Labeling Neural Representations with Inverse Recognition. (arXiv:2311.13594v1 [cs.LG])
    Deep Neural Networks (DNNs) demonstrated remarkable capabilities in learning complex hierarchical data representations, but the nature of these representations remains largely unknown. Existing global explainability methods, such as Network Dissection, face limitations such as reliance on segmentation masks, lack of statistical significance testing, and high computational demands. We propose Inverse Recognition (INVERT), a scalable approach for connecting learned representations with human-understandable concepts by leveraging their capacity to discriminate between these concepts. In contrast to prior work, INVERT is capable of handling diverse types of neurons, exhibits less computational complexity, and does not rely on the availability of segmentation masks. Moreover, INVERT provides an interpretable metric assessing the alignment between the representation and its corresponding explanation and delivering a measure of statistical significance, emphasizing its utility and credibility. We demonstrate the applicability of INVERT in various scenarios, including the identification of representations affected by spurious correlations, and the interpretation of the hierarchical structure of decision-making within the models.
    Why Shallow Networks Struggle with Approximating and Learning High Frequency: A Numerical Study. (arXiv:2306.17301v2 [cs.LG] UPDATED)
    In this work, a comprehensive numerical study involving analysis and experiments shows why a two-layer neural network has difficulties handling high frequencies in approximation and learning when machine precision and computation cost are important factors in real practice. In particular, the following basic computational issues are investigated: (1) the minimal numerical error one can achieve given a finite machine precision, (2) the computation cost to achieve a given accuracy, and (3) stability with respect to perturbations. The key to the study is the conditioning of the representation and its learning dynamics. Explicit answers to the above questions with numerical verifications are presented.
    Span-Based Optimal Sample Complexity for Average Reward MDPs. (arXiv:2311.13469v1 [cs.LG])
    We study the sample complexity of learning an $\varepsilon$-optimal policy in an average-reward Markov decision process (MDP) under a generative model. We establish the complexity bound $\widetilde{O}\left(SA\frac{H}{\varepsilon^2} \right)$, where $H$ is the span of the bias function of the optimal policy and $SA$ is the cardinality of the state-action space. Our result is the first that is minimax optimal (up to log factors) in all parameters $S,A,H$ and $\varepsilon$, improving on existing work that either assumes uniformly bounded mixing times for all policies or has suboptimal dependence on the parameters. Our result is based on reducing the average-reward MDP to a discounted MDP. To establish the optimality of this reduction, we develop improved bounds for $\gamma$-discounted MDPs, showing that $\widetilde{O}\left(SA\frac{H}{(1-\gamma)^2\varepsilon^2} \right)$ samples suffice to learn a $\varepsilon$-optimal policy in weakly communicating MDPs under the regime that $\gamma \geq 1 - \frac{1}{H}$, circumventing the well-known lower bound of $\widetilde{\Omega}\left(SA\frac{1}{(1-\gamma)^3\varepsilon^2} \right)$ for general $\gamma$-discounted MDPs. Our analysis develops upper bounds on certain instance-dependent variance parameters in terms of the span parameter. These bounds are tighter than those based on the mixing time or diameter of the MDP and may be of broader use.
    Covariance alignment: from maximum likelihood estimation to Gromov-Wasserstein. (arXiv:2311.13595v1 [math.ST])
    Feature alignment methods are used in many scientific disciplines for data pooling, annotation, and comparison. As an instance of a permutation learning problem, feature alignment presents significant statistical and computational challenges. In this work, we propose the covariance alignment model to study and compare various alignment methods and establish a minimax lower bound for covariance alignment that has a non-standard dimension scaling because of the presence of a nuisance parameter. This lower bound is in fact minimax optimal and is achieved by a natural quasi MLE. However, this estimator involves a search over all permutations which is computationally infeasible even when the problem has moderate size. To overcome this limitation, we show that the celebrated Gromov-Wasserstein algorithm from optimal transport which is more amenable to fast implementation even on large-scale problems is also minimax optimal. These results give the first statistical justification for the deployment of the Gromov-Wasserstein algorithm in practice.
    Droplets of Good Representations: Grokking as a First Order Phase Transition in Two Layer Networks. (arXiv:2310.03789v2 [stat.ML] UPDATED)
    A key property of deep neural networks (DNNs) is their ability to learn new features during training. This intriguing aspect of deep learning stands out most clearly in recently reported Grokking phenomena. While mainly reflected as a sudden increase in test accuracy, Grokking is also believed to be a beyond lazy-learning/Gaussian Process (GP) phenomenon involving feature learning. Here we apply a recent development in the theory of feature learning, the adaptive kernel approach, to two teacher-student models with cubic-polynomial and modular addition teachers. We provide analytical predictions on feature learning and Grokking properties of these models and demonstrate a mapping between Grokking and the theory of phase transitions. We show that after Grokking, the state of the DNN is analogous to the mixed phase following a first-order phase transition. In this mixed phase, the DNN generates useful internal representations of the teacher that are sharply distinct from those before the transition.
    A Unified Framework for Trace-induced Quantum Kernels. (arXiv:2311.13552v1 [quant-ph])
    Quantum kernel methods are promising candidates for achieving a practical quantum advantage for certain machine learning tasks. Similar to classical machine learning, an exact form of a quantum kernel is expected to have a great impact on the model performance. In this work we combine all trace-induced quantum kernels, including the commonly-used global fidelity and local projected quantum kernels, into a common framework. We show how generalized trace-induced quantum kernels can be constructed as combinations of the fundamental building blocks we coin "Lego" kernels, which impose an inductive bias on the resulting quantum models. We relate the expressive power and generalization ability to the number of non-zero weight Lego kernels and propose a systematic approach to increase the complexity of a quantum kernel model, leading to a new form of the local projected kernels that require fewer quantum resources in terms of the number of quantum gates and measurement shots. We show numerically that models based on local projected kernels can achieve comparable performance to the global fidelity quantum kernel. Our work unifies existing quantum kernels and provides a systematic framework to compare their properties.
    Efficient Algorithms for the CCA Family: Unconstrained Objectives with Unbiased Gradients. (arXiv:2310.01012v2 [cs.LG] UPDATED)
    The Canonical Correlation Analysis (CCA) family of methods is foundational in multi-view learning. Regularised linear CCA methods can be seen to generalise Partial Least Squares (PLS) and be unified with a Generalized Eigenvalue Problem (GEP) framework. However, classical algorithms for these linear methods are computationally infeasible for large-scale data. Extensions to Deep CCA show great promise, but current training procedures are slow and complicated. First we propose a novel unconstrained objective that characterizes the top subspace of GEPs. Our core contribution is a family of fast algorithms for stochastic PLS, stochastic CCA, and Deep CCA, simply obtained by applying stochastic gradient descent (SGD) to the corresponding CCA objectives. These methods show far faster convergence and recover higher correlations than the previous state-of-the-art on all standard CCA and Deep CCA benchmarks. This speed allows us to perform a first-of-its-kind PLS analysis of an extremely large biomedical dataset from the UK Biobank, with over 33,000 individuals and 500,000 variants. Finally, we not only match the performance of `CCA-family' Self-Supervised Learning (SSL) methods on CIFAR-10 and CIFAR-100 with minimal hyper-parameter tuning, but also establish the first solid theoretical links to classical CCA, laying the groundwork for future insights.
    A General Theoretical Paradigm to Understand Learning from Human Preferences. (arXiv:2310.12036v2 [cs.AI] UPDATED)
    The prevalent deployment of learning from human preferences through reinforcement learning (RLHF) relies on two important approximations: the first assumes that pairwise preferences can be substituted with pointwise rewards. The second assumes that a reward model trained on these pointwise rewards can generalize from collected data to out-of-distribution data sampled by the policy. Recently, Direct Preference Optimisation (DPO) has been proposed as an approach that bypasses the second approximation and learn directly a policy from collected data without the reward modelling stage. However, this method still heavily relies on the first approximation. In this paper we try to gain a deeper theoretical understanding of these practical algorithms. In particular we derive a new general objective called $\Psi$PO for learning from human preferences that is expressed in terms of pairwise preferences and therefore bypasses both approximations. This new general objective allows us to perform an in-depth analysis of the behavior of RLHF and DPO (as special cases of $\Psi$PO) and to identify their potential pitfalls. We then consider another special case for $\Psi$PO by setting $\Psi$ simply to Identity, for which we can derive an efficient optimisation procedure, prove performance guarantees and demonstrate its empirical superiority to DPO on some illustrative examples.
    Differentially Private Wireless Federated Learning Using Orthogonal Sequences. (arXiv:2306.08280v2 [cs.IT] UPDATED)
    We propose a privacy-preserving uplink over-the-air computation (AirComp) method, termed FLORAS, for single-input single-output (SISO) wireless federated learning (FL) systems. From the perspective of communication designs, FLORAS eliminates the requirement of channel state information at the transmitters (CSIT) by leveraging the properties of orthogonal sequences. From the privacy perspective, we prove that FLORAS offers both item-level and client-level differential privacy (DP) guarantees. Moreover, by properly adjusting the system parameters, FLORAS can flexibly achieve different DP levels at no additional cost. A new FL convergence bound is derived which, combined with the privacy guarantees, allows for a smooth tradeoff between the achieved convergence rate and differential privacy levels. Experimental results demonstrate the advantages of FLORAS compared with the baseline AirComp method, and validate that the analytical results can guide the design of privacy-preserving FL with different tradeoff requirements on the model convergence and privacy levels.
    $\sigma$-PCA: a unified neural model for linear and nonlinear principal component analysis. (arXiv:2311.13580v1 [cs.LG])
    Linear principal component analysis (PCA), nonlinear PCA, and linear independent component analysis (ICA) -- those are three methods with single-layer autoencoder formulations for learning linear transformations from data. Linear PCA learns orthogonal transformations (rotations) that orient axes to maximise variance, but it suffers from a subspace rotational indeterminacy: it fails to find a unique rotation for axes that share the same variance. Both nonlinear PCA and linear ICA reduce the subspace indeterminacy from rotational to permutational by maximising statistical independence under the assumption of unit variance. The main difference between them is that nonlinear PCA only learns rotations while linear ICA learns not just rotations but any linear transformation with unit variance. The relationship between all three can be understood by the singular value decomposition of the linear ICA transformation into a sequence of rotation, scale, rotation. Linear PCA learns the first rotation; nonlinear PCA learns the second. The scale is simply the inverse of the standard deviations. The problem is that, in contrast to linear PCA, conventional nonlinear PCA cannot be used directly on the data to learn the first rotation, the first being special as it reduces dimensionality and orders by variances. In this paper, we have identified the cause, and as a solution we propose $\sigma$-PCA: a unified neural model for linear and nonlinear PCA as single-layer autoencoders. One of its key ingredients: modelling not just the rotation but also the scale -- the variances. This model bridges the disparity between linear and nonlinear PCA. And so, like linear PCA, it can learn a semi-orthogonal transformation that reduces dimensionality and orders by variances, but, unlike linear PCA, it does not suffer from rotational indeterminacy.
    Using Taylor-Approximated Gradients to Improve the Frank-Wolfe Method for Empirical Risk Minimization. (arXiv:2208.13933v2 [cs.LG] UPDATED)
    The Frank-Wolfe method has become increasingly useful in statistical and machine learning applications, due to the structure-inducing properties of the iterates, and especially in settings where linear minimization over the feasible set is more computationally efficient than projection. In the setting of Empirical Risk Minimization -- one of the fundamental optimization problems in statistical and machine learning -- the computational effectiveness of Frank-Wolfe methods typically grows linearly in the number of data observations $n$. This is in stark contrast to the case for typical stochastic projection methods. In order to reduce this dependence on $n$, we look to second-order smoothness of typical smooth loss functions (least squares loss and logistic loss, for example) and we propose amending the Frank-Wolfe method with Taylor series-approximated gradients, including variants for both deterministic and stochastic settings. Compared with current state-of-the-art methods in the regime where the optimality tolerance $\varepsilon$ is sufficiently small, our methods are able to simultaneously reduce the dependence on large $n$ while obtaining optimal convergence rates of Frank-Wolfe methods, in both the convex and non-convex settings. We also propose a novel adaptive step-size approach for which we have computational guarantees. Last of all, we present computational experiments which show that our methods exhibit very significant speed-ups over existing methods on real-world datasets for both convex and non-convex binary classification problems.
    Byzantine-Robust Learning on Heterogeneous Datasets via Bucketing. (arXiv:2006.09365v6 [cs.LG] UPDATED)
    In Byzantine robust distributed or federated learning, a central server wants to train a machine learning model over data distributed across multiple workers. However, a fraction of these workers may deviate from the prescribed algorithm and send arbitrary messages. While this problem has received significant attention recently, most current defenses assume that the workers have identical data. For realistic cases when the data across workers are heterogeneous (non-iid), we design new attacks which circumvent current defenses, leading to significant loss of performance. We then propose a simple bucketing scheme that adapts existing robust algorithms to heterogeneous datasets at a negligible computational cost. We also theoretically and experimentally validate our approach, showing that combining bucketing with existing robust algorithms is effective against challenging attacks. Our work is the first to establish guaranteed convergence for the non-iid Byzantine robust problem under realistic assumptions.
    Diffusion models for probabilistic programming. (arXiv:2311.00474v2 [cs.LG] UPDATED)
    We propose Diffusion Model Variational Inference (DMVI), a novel method for automated approximate inference in probabilistic programming languages (PPLs). DMVI utilizes diffusion models as variational approximations to the true posterior distribution by deriving a novel bound to the marginal likelihood objective used in Bayesian modelling. DMVI is easy to implement, allows hassle-free inference in PPLs without the drawbacks of, e.g., variational inference using normalizing flows, and does not make any constraints on the underlying neural network model. We evaluate DMVI on a set of common Bayesian models and show that its posterior inferences are in general more accurate than those of contemporary methods used in PPLs while having a similar computational cost and requiring less manual tuning.
    On diffusion-based generative models and their error bounds: The log-concave case with full convergence estimates. (arXiv:2311.13584v1 [cs.LG])
    We provide full theoretical guarantees for the convergence behaviour of diffusion-based generative models under the assumption of strongly logconcave data distributions while our approximating class of functions used for score estimation is made of Lipschitz continuous functions. We demonstrate via a motivating example, sampling from a Gaussian distribution with unknown mean, the powerfulness of our approach. In this case, explicit estimates are provided for the associated optimization problem, i.e. score approximation, while these are combined with the corresponding sampling estimates. As a result, we obtain the best known upper bound estimates in terms of key quantities of interest, such as the dimension and rates of convergence, for the Wasserstein-2 distance between the data distribution (Gaussian with unknown mean) and our sampling algorithm. Beyond the motivating example and in order to allow for the use of a diverse range of stochastic optimizers, we present our results using an $L^2$-accurate score estimation assumption, which crucially is formed under an expectation with respect to the stochastic optimizer and our novel auxiliary process that uses only known information. This approach yields the best known convergence rate for our sampling algorithm.
    Provably Efficient High-Dimensional Bandit Learning with Batched Feedbacks. (arXiv:2311.13180v1 [stat.ML])
    We study high-dimensional multi-armed contextual bandits with batched feedback where the $T$ steps of online interactions are divided into $L$ batches. In specific, each batch collects data according to a policy that depends on previous batches and the rewards are revealed only at the end of the batch. Such a feedback structure is popular in applications such as personalized medicine and online advertisement, where the online data often do not arrive in a fully serial manner. We consider high-dimensional and linear settings where the reward function of the bandit model admits either a sparse or low-rank structure and ask how small a number of batches are needed for a comparable performance with fully dynamic data in which $L = T$. For these settings, we design a provably sample-efficient algorithm which achieves a $ \mathcal{\tilde O}(s_0^2 \log^2 T)$ regret in the sparse case and $ \mathcal{\tilde O} ( r ^2 \log^2 T)$ regret in the low-rank case, using only $L = \mathcal{O}( \log T)$ batches. Here $s_0$ and $r$ are the sparsity and rank of the reward parameter in sparse and low-rank cases, respectively, and $ \mathcal{\tilde O}(\cdot)$ omits logarithmic factors involving the feature dimensions. In other words, our algorithm achieves regret bounds comparable to those in fully sequential setting with only $\mathcal{O}( \log T)$ batches. Our algorithm features a novel batch allocation method that adjusts the batch sizes according to the estimation accuracy within each batch and cumulative regret. Furthermore, we also conduct experiments with synthetic and real-world data to validate our theory.
    Stability and Generalization of Stochastic Compositional Gradient Descent Algorithms. (arXiv:2307.03357v2 [cs.LG] UPDATED)
    Many machine learning tasks can be formulated as a stochastic compositional optimization (SCO) problem such as reinforcement learning, AUC maximization, and meta-learning, where the objective function involves a nested composition associated with an expectation. While a significant amount of studies has been devoted to studying the convergence behavior of SCO algorithms, there is little work on understanding their generalization, i.e., how these learning algorithms built from training examples would behave on future test examples. In this paper, we provide the stability and generalization analysis of stochastic compositional gradient descent algorithms through the lens of algorithmic stability in the framework of statistical learning theory. Firstly, we introduce a stability concept called compositional uniform stability and establish its quantitative relation with generalization for SCO problems. Then, we establish the compositional uniform stability results for two popular stochastic compositional gradient descent algorithms, namely SCGD and SCSC. Finally, we derive dimension-independent excess risk bounds for SCGD and SCSC by trade-offing their stability results and optimization errors. To the best of our knowledge, these are the first-ever-known results on stability and generalization analysis of stochastic compositional gradient descent algorithms.  ( 2 min )
    Ensemble transport smoothing. Part II: Nonlinear updates. (arXiv:2210.17435v2 [stat.ME] UPDATED)
    Smoothing is a specialized form of Bayesian inference for state-space models that characterizes the posterior distribution of a collection of states given an associated sequence of observations. Ramgraber et al. (2023) proposes a general framework for transport-based ensemble smoothing, which includes linear Kalman-type smoothers as special cases. Here, we build on this foundation to realize and demonstrate nonlinear backward ensemble transport smoothers. We discuss parameterization and regularization of the associated transport maps, and then examine the performance of these smoothers for nonlinear and chaotic dynamical systems that exhibit non-Gaussian behavior. In these settings, our nonlinear transport smoothers yield lower estimation error than conventional linear smoothers and state-of-the-art iterative ensemble Kalman smoothers, for comparable numbers of model evaluations.
    Explaining high-dimensional text classifiers. (arXiv:2311.13454v1 [cs.LG])
    Explainability has become a valuable tool in the last few years, helping humans better understand AI-guided decisions. However, the classic explainability tools are sometimes quite limited when considering high-dimensional inputs and neural network classifiers. We present a new explainability method using theoretically proven high-dimensional properties in neural network classifiers. We present two usages of it: 1) On the classical sentiment analysis task for the IMDB reviews dataset, and 2) our Malware-Detection task for our PowerShell scripts dataset.
    Looking at the posterior: accuracy and uncertainty of neural-network predictions. (arXiv:2211.14605v2 [cs.LG] UPDATED)
    Bayesian inference can quantify uncertainty in the predictions of neural networks using posterior distributions for model parameters and network output. By looking at these posterior distributions, one can separate the origin of uncertainty into aleatoric and epistemic contributions. One goal of uncertainty quantification is to inform on prediction accuracy. Here we show that prediction accuracy depends on both epistemic and aleatoric uncertainty in an intricate fashion that cannot be understood in terms of marginalized uncertainty distributions alone. How the accuracy relates to epistemic and aleatoric uncertainties depends not only on the model architecture, but also on the properties of the dataset. We discuss the significance of these results for active learning and introduce a novel acquisition function that outperforms common uncertainty-based methods. To arrive at our results, we approximated the posteriors using deep ensembles, for fully-connected, convolutional and attention-based neural networks.
    Testing Closeness of Multivariate Distributions via Ramsey Theory. (arXiv:2311.13154v1 [cs.DS])
    We investigate the statistical task of closeness (or equivalence) testing for multidimensional distributions. Specifically, given sample access to two unknown distributions $\mathbf p, \mathbf q$ on $\mathbb R^d$, we want to distinguish between the case that $\mathbf p=\mathbf q$ versus $\|\mathbf p-\mathbf q\|_{A_k} > \epsilon$, where $\|\mathbf p-\mathbf q\|_{A_k}$ denotes the generalized ${A}_k$ distance between $\mathbf p$ and $\mathbf q$ -- measuring the maximum discrepancy between the distributions over any collection of $k$ disjoint, axis-aligned rectangles. Our main result is the first closeness tester for this problem with {\em sub-learning} sample complexity in any fixed dimension and a nearly-matching sample complexity lower bound. In more detail, we provide a computationally efficient closeness tester with sample complexity $O\left((k^{6/7}/ \mathrm{poly}_d(\epsilon)) \log^d(k)\right)$. On the lower bound side, we establish a qualitatively matching sample complexity lower bound of $\Omega(k^{6/7}/\mathrm{poly}(\epsilon))$, even for $d=2$. These sample complexity bounds are surprising because the sample complexity of the problem in the univariate setting is $\Theta(k^{4/5}/\mathrm{poly}(\epsilon))$. This has the interesting consequence that the jump from one to two dimensions leads to a substantial increase in sample complexity, while increases beyond that do not. As a corollary of our general $A_k$ tester, we obtain $d_{\mathrm TV}$-closeness testers for pairs of $k$-histograms on $\mathbb R^d$ over a common unknown partition, and pairs of uniform distributions supported on the union of $k$ unknown disjoint axis-aligned rectangles. Both our algorithm and our lower bound make essential use of tools from Ramsey theory.
    Ensemble transport smoothing. Part I: Unified framework. (arXiv:2210.17000v2 [stat.ME] UPDATED)
    Smoothers are algorithms for Bayesian time series re-analysis. Most operational smoothers rely either on affine Kalman-type transformations or on sequential importance sampling. These strategies occupy opposite ends of a spectrum that trades computational efficiency and scalability for statistical generality and consistency: non-Gaussianity renders affine Kalman updates inconsistent with the true Bayesian solution, while the ensemble size required for successful importance sampling can be prohibitive. This paper revisits the smoothing problem from the perspective of measure transport, which offers the prospect of consistent prior-to-posterior transformations for Bayesian inference. We leverage this capacity by proposing a general ensemble framework for transport-based smoothing. Within this framework, we derive a comprehensive set of smoothing recursions based on nonlinear transport maps and detail how they exploit the structure of state-space models in fully non-Gaussian settings. We also describe how many standard Kalman-type smoothing algorithms emerge as special cases of our framework. A companion paper (Ramgraber et al., 2023) explores the implementation of nonlinear ensemble transport smoothers in greater depth.
    The Tempered Hilbert Simplex Distance and Its Application To Non-linear Embeddings of TEMs. (arXiv:2311.13459v1 [cs.LG])
    Tempered Exponential Measures (TEMs) are a parametric generalization of the exponential family of distributions maximizing the tempered entropy function among positive measures subject to a probability normalization of their power densities. Calculus on TEMs relies on a deformed algebra of arithmetic operators induced by the deformed logarithms used to define the tempered entropy. In this work, we introduce three different parameterizations of finite discrete TEMs via Legendre functions of the negative tempered entropy function. In particular, we establish an isometry between such parameterizations in terms of a generalization of the Hilbert log cross-ratio simplex distance to a tempered Hilbert co-simplex distance. Similar to the Hilbert geometry, the tempered Hilbert distance is characterized as a $t$-symmetrization of the oriented tempered Funk distance. We motivate our construction by introducing the notion of $t$-lengths of smooth curves in a tautological Finsler manifold. We then demonstrate the properties of our generalized structure in different settings and numerically examine the quality of its differentiable approximations for optimization in machine learning settings.  ( 2 min )
    Efficient Numerical Integration in Reproducing Kernel Hilbert Spaces via Leverage Scores Sampling. (arXiv:2311.13548v1 [stat.ML])
    In this work we consider the problem of numerical integration, i.e., approximating integrals with respect to a target probability measure using only pointwise evaluations of the integrand. We focus on the setting in which the target distribution is only accessible through a set of $n$ i.i.d. observations, and the integrand belongs to a reproducing kernel Hilbert space. We propose an efficient procedure which exploits a small i.i.d. random subset of $m<n$ samples drawn either uniformly or using approximate leverage scores from the initial observations. Our main result is an upper bound on the approximation error of this procedure for both sampling strategies. It yields sufficient conditions on the subsample size to recover the standard (optimal) $n^{-1/2}$ rate while reducing drastically the number of functions evaluations, and thus the overall computational cost. Moreover, we obtain rates with respect to the number $m$ of evaluations of the integrand which adapt to its smoothness, and match known optimal rates for instance for Sobolev spaces. We illustrate our theoretical findings with numerical experiments on real datasets, which highlight the attractive efficiency-accuracy tradeoff of our method compared to existing randomized and greedy quadrature methods. We note that, the problem of numerical integration in RKHS amounts to designing a discrete approximation of the kernel mean embedding of the target distribution. As a consequence, direct applications of our results also include the efficient computation of maximum mean discrepancies between distributions and the design of efficient kernel-based tests.  ( 3 min )
    A note on estimating the dimension from a random geometric graph. (arXiv:2311.13059v1 [stat.ML])
    Let $G_n$ be a random geometric graph with vertex set $[n]$ based on $n$ i.i.d.\ random vectors $X_1,\ldots,X_n$ drawn from an unknown density $f$ on $\R^d$. An edge $(i,j)$ is present when $\|X_i -X_j\| \le r_n$, for a given threshold $r_n$ possibly depending upon $n$, where $\| \cdot \|$ denotes Euclidean distance. We study the problem of estimating the dimension $d$ of the underlying space when we have access to the adjacency matrix of the graph but do not know $r_n$ or the vectors $X_i$. The main result of the paper is that there exists an estimator of $d$ that converges to $d$ in probability as $n \to \infty$ for all densities with $\int f^5 < \infty$ whenever $n^{3/2} r_n^d \to \infty$ and $r_n = o(1)$. The conditions allow very sparse graphs since when $n^{3/2} r_n^d \to 0$, the graph contains isolated edges only, with high probability. We also show that, without any condition on the density, a consistent estimator of $d$ exists when $n r_n^d \to \infty$ and $r_n = o(1)$.  ( 2 min )
    A model-free approach to fingertip slip and disturbance detection for grasp stability inference. (arXiv:2311.13245v1 [cs.RO])
    Robotic capacities in object manipulation are incomparable to those of humans. Besides years of learning, humans rely heavily on the richness of information from physical interaction with the environment. In particular, tactile sensing is crucial in providing such rich feedback. Despite its potential contributions to robotic manipulation, tactile sensing is less exploited; mainly due to the complexity of the time series provided by tactile sensors. In this work, we propose a method for assessing grasp stability using tactile sensing. More specifically, we propose a methodology to extract task-relevant features and design efficient classifiers to detect object slippage with respect to individual fingertips. We compare two classification models: support vector machine and logistic regression. We use highly sensitive Uskin tactile sensors mounted on an Allegro hand to test and validate our method. Our results demonstrate that the proposed method is effective in slippage detection in an online fashion.  ( 2 min )
    Guided Flows for Generative Modeling and Decision Making. (arXiv:2311.13443v1 [cs.LG])
    Classifier-free guidance is a key component for improving the performance of conditional generative models for many downstream tasks. It drastically improves the quality of samples produced, but has so far only been used for diffusion models. Flow Matching (FM), an alternative simulation-free approach, trains Continuous Normalizing Flows (CNFs) based on regressing vector fields. It remains an open question whether classifier-free guidance can be performed for Flow Matching models, and to what extent does it improve performance. In this paper, we explore the usage of Guided Flows for a variety of downstream applications involving conditional image generation, speech synthesis, and reinforcement learning. In particular, we are the first to apply flow models to the offline reinforcement learning setting. We also show that Guided Flows significantly improves the sample quality in image generation and zero-shot text-to-speech synthesis, and can make use of drastically low amounts of computation without affecting the agent's overall performance.  ( 2 min )
    A projected nonlinear state-space model for forecasting time series signals. (arXiv:2311.13247v1 [stat.ME])
    Learning and forecasting stochastic time series is essential in various scientific fields. However, despite the proposals of nonlinear filters and deep-learning methods, it remains challenging to capture nonlinear dynamics from a few noisy samples and predict future trajectories with uncertainty estimates while maintaining computational efficiency. Here, we propose a fast algorithm to learn and forecast nonlinear dynamics from noisy time series data. A key feature of the proposed model is kernel functions applied to projected lines, enabling fast and efficient capture of nonlinearities in the latent dynamics. Through empirical case studies and benchmarking, the model demonstrates its effectiveness in learning and forecasting complex nonlinear dynamics, offering a valuable tool for researchers and practitioners in time series analysis.  ( 2 min )
    Multi-Objective Bayesian Optimization with Active Preference Learning. (arXiv:2311.13460v1 [cs.LG])
    There are a lot of real-world black-box optimization problems that need to optimize multiple criteria simultaneously. However, in a multi-objective optimization (MOO) problem, identifying the whole Pareto front requires the prohibitive search cost, while in many practical scenarios, the decision maker (DM) only needs a specific solution among the set of the Pareto optimal solutions. We propose a Bayesian optimization (BO) approach to identifying the most preferred solution in the MOO with expensive objective functions, in which a Bayesian preference model of the DM is adaptively estimated by an interactive manner based on the two types of supervisions called the pairwise preference and improvement request. To explore the most preferred solution, we define an acquisition function in which the uncertainty both in the objective functions and the DM preference is incorporated. Further, to minimize the interaction cost with the DM, we also propose an active learning strategy for the preference estimation. We empirically demonstrate the effectiveness of our proposed method through the benchmark function optimization and the hyper-parameter optimization problems for machine learning models.  ( 2 min )
    Differentially Private Non-Convex Optimization under the KL Condition with Optimal Rates. (arXiv:2311.13447v1 [cs.LG])
    We study private empirical risk minimization (ERM) problem for losses satisfying the $(\gamma,\kappa)$-Kurdyka-{\L}ojasiewicz (KL) condition. The Polyak-{\L}ojasiewicz (PL) condition is a special case of this condition when $\kappa=2$. Specifically, we study this problem under the constraint of $\rho$ zero-concentrated differential privacy (zCDP). When $\kappa\in[1,2]$ and the loss function is Lipschitz and smooth over a sufficiently large region, we provide a new algorithm based on variance reduced gradient descent that achieves the rate $\tilde{O}\big(\big(\frac{\sqrt{d}}{n\sqrt{\rho}}\big)^\kappa\big)$ on the excess empirical risk, where $n$ is the dataset size and $d$ is the dimension. We further show that this rate is nearly optimal. When $\kappa \geq 2$ and the loss is instead Lipschitz and weakly convex, we show it is possible to achieve the rate $\tilde{O}\big(\big(\frac{\sqrt{d}}{n\sqrt{\rho}}\big)^\kappa\big)$ with a private implementation of the proximal point method. When the KL parameters are unknown, we provide a novel modification and analysis of the noisy gradient descent algorithm and show that this algorithm achieves a rate of $\tilde{O}\big(\big(\frac{\sqrt{d}}{n\sqrt{\rho}}\big)^{\frac{2\kappa}{4-\kappa}}\big)$ adaptively, which is nearly optimal when $\kappa = 2$. We further show that, without assuming the KL condition, the same gradient descent algorithm can achieve fast convergence to a stationary point when the gradient stays sufficiently large during the run of the algorithm. Specifically, we show that this algorithm can approximate stationary points of Lipschitz, smooth (and possibly nonconvex) objectives with rate as fast as $\tilde{O}\big(\frac{\sqrt{d}}{n\sqrt{\rho}}\big)$ and never worse than $\tilde{O}\big(\big(\frac{\sqrt{d}}{n\sqrt{\rho}}\big)^{1/2}\big)$. The latter rate matches the best known rate for methods that do not rely on variance reduction.  ( 3 min )
    Improved identification accuracy in equation learning via comprehensive $\boldsymbol{R^2}$-elimination and Bayesian model selection. (arXiv:2311.13265v1 [stat.ML])
    In the field of equation learning, exhaustively considering all possible equations derived from a basis function dictionary is infeasible. Sparse regression and greedy algorithms have emerged as popular approaches to tackle this challenge. However, the presence of multicollinearity poses difficulties for sparse regression techniques, and greedy steps may inadvertently exclude terms of the true equation, leading to reduced identification accuracy. In this article, we present an approach that strikes a balance between comprehensiveness and efficiency in equation learning. Inspired by stepwise regression, our approach combines the coefficient of determination, $R^2$, and the Bayesian model evidence, $p(\boldsymbol y|\mathcal M)$, in a novel way. Our procedure is characterized by a comprehensive search with just a minor reduction of the model space at each iteration step. With two flavors of our approach and the adoption of $p(\boldsymbol y|\mathcal M)$ for bi-directional stepwise regression, we present a total of three new avenues for equation learning. Through three extensive numerical experiments involving random polynomials and dynamical systems, we compare our approach against four state-of-the-art methods and two standard approaches. The results demonstrate that our comprehensive search approach surpasses all other methods in terms of identification accuracy. In particular, the second flavor of our approach establishes an efficient overfitting penalty solely based on $R^2$, which achieves highest rates of exact equation recovery.  ( 3 min )
    Extracting individual variable information for their decoupling, direct mutual information and multi-feature Granger causality. (arXiv:2311.13431v1 [stat.ML])
    Working with multiple variables they usually contain difficult to control complex dependencies. This article proposes extraction of their individual information, e.g. $\overline{X|Y}$ as random variable containing information from $X$, but with removed information about $Y$, by using $(x,y) \leftrightarrow (\bar{x}=\textrm{CDF}_{X|Y=y}(x),y)$ reversible normalization. One application can be decoupling of individual information of variables: reversibly transform $(X_1,\ldots,X_n)\leftrightarrow(\tilde{X}_1,\ldots \tilde{X}_n)$ together containing the same information, but being independent: $\forall_{i\neq j} \tilde{X}_i\perp \tilde{X}_j, \tilde{X}_i\perp X_j$. It requires detailed models of complex conditional probability distributions - it is generally a difficult task, but here can be done through multiple dependency reducing iterations, using imperfect methods (here HCR: Hierarchical Correlation Reconstruction). It could be also used for direct mutual information - evaluating direct information transfer: without use of intermediate variables. For causality direction there is discussed multi-feature Granger causality, e.g. to trace various types of individual information transfers between such decoupled variables, including propagation time (delay).  ( 2 min )
    Learning principle and mathematical realization of the learning mechanism in the brain. (arXiv:2311.13341v1 [cs.LG])
    While deep learning has achieved remarkable success, there is no clear explanation about why it works so well. In order to discuss this question quantitatively, we need a mathematical framework that explains what learning is in the first place. After several considerations, we succeeded in constructing a mathematical framework that can provide a unified understanding of all types of learning, including deep learning and learning in the brain. We call it learning principle, and it follows that all learning is equivalent to estimating the probability of input data. We not only derived this principle, but also mentioned its application to actual machine learning models. For example, we found that conventional supervised learning is equivalent to estimating conditional probabilities, and succeeded in making supervised learning more effective and generalized. We also proposed a new method of defining the values of estimated probability using differentiation, and showed that unsupervised learning can be performed on arbitrary dataset without any prior knowledge. Namely, this method is a general-purpose machine learning in the true sense. Moreover, we succeeded in describing the learning mechanism in the brain by considering the time evolution of a fully or partially connected model and applying this new method. The learning principle provides solutions to many unsolved problems in deep learning and cognitive neuroscience.  ( 2 min )
    Favour: FAst Variance Operator for Uncertainty Rating. (arXiv:2311.13036v1 [cs.LG])
    Bayesian Neural Networks (BNN) have emerged as a crucial approach for interpreting ML predictions. By sampling from the posterior distribution, data scientists may estimate the uncertainty of an inference. Unfortunately many inference samples are often needed, the overhead of which greatly hinder BNN's wide adoption. To mitigate this, previous work proposed propagating the first and second moments of the posterior directly through the network. However, on its own this method is even slower than sampling, so the propagated variance needs to be approximated such as assuming independence between neural nodes. The resulting trade-off between quality and inference time did not match even plain Monte Carlo sampling. Our contribution is a more principled variance propagation framework based on "spiked covariance matrices", which smoothly interpolates between quality and inference time. This is made possible by a new fast algorithm for updating a diagonal-plus-low-rank matrix approximation under various operations. We tested our algorithm against sampling based MC Dropout and Variational Inference on a number of downstream uncertainty themed tasks, such as calibration and out-of-distribution testing. We find that Favour is as fast as performing 2-3 inference samples, while matching the performance of 10-100 samples. In summary, this work enables the use of BNN in the realm of performance critical tasks where they have previously been out of reach.  ( 2 min )
    Non-Sequential Ensemble Kalman Filtering using Distributed Arrays. (arXiv:2311.12909v1 [stat.ML])
    This work introduces a new, distributed implementation of the Ensemble Kalman Filter (EnKF) that allows for non-sequential assimilation of large datasets in high-dimensional problems. The traditional EnKF algorithm is computationally intensive and exhibits difficulties in applications requiring interaction with the background covariance matrix, prompting the use of methods like sequential assimilation which can introduce unwanted consequences, such as dependency on observation ordering. Our implementation leverages recent advancements in distributed computing to enable the construction and use of the full model error covariance matrix in distributed memory, allowing for single-batch assimilation of all observations and eliminating order dependencies. Comparative performance assessments, involving both synthetic and real-world paleoclimatic reconstruction applications, indicate that the new, non-sequential implementation outperforms the traditional, sequential one.  ( 2 min )
    Multi-Objective Optimization via Wasserstein-Fisher-Rao Gradient Flow. (arXiv:2311.13159v1 [cs.LG])
    Multi-objective optimization (MOO) aims to optimize multiple, possibly conflicting objectives with widespread applications. We introduce a novel interacting particle method for MOO inspired by molecular dynamics simulations. Our approach combines overdamped Langevin and birth-death dynamics, incorporating a "dominance potential" to steer particles toward global Pareto optimality. In contrast to previous methods, our method is able to relocate dominated particles, making it particularly adept at managing Pareto fronts of complicated geometries. Our method is also theoretically grounded as a Wasserstein-Fisher-Rao gradient flow with convergence guarantees. Extensive experiments confirm that our approach outperforms state-of-the-art methods on challenging synthetic and real-world datasets.  ( 2 min )

  • Open

    The AI Paranoia and Doomers seems to be taking over all AI subs, so I'm making one about AI Acceleration
    https://www.reddit.com/r/AcceleratingAI/ It's just a sub that's focused on optimistic and positive discussion of the future of AI. If you are into that and want to just discuss and celebrate pushing the technology further, come check it out. I'm also looking for mods. submitted by /u/Zinthaniel [link] [comments]  ( 1 min )
    If you are confident that recursive AI self-improvement is not possible, what makes you so sure?
    We know computer programs and hardware can be optimized. We can foresee machines as smart as humans some time in the next 50 years. A machine like that could write computer programs and optimize hardware. What will prevent recursive self-improvement? ​ submitted by /u/Smallpaul [link] [comments]  ( 1 min )
    TikTok parent company used AI to optimize Linux kernel, boosting performance and efficiency
    . submitted by /u/AminoOxi [link] [comments]  ( 1 min )
    After OpenAI's Blowup, It Seems Pretty Clear That 'AI Safety' Isn't a Real Thing
    The recent events at OpenAI involving Sam Altman's ousting and reinstatement have highlighted a rift between the board and Altman over the pace of technological development and commercialization. The conflict revolves around the argument of 'AI safety' and the clash between OpenAI's mission of responsible technological development and the pursuit of profit. The organizational structure of OpenAI, being a non-profit governed by a board that controls a for-profit company, has set it on a collision course with itself. The episode reveals that 'AI safety' in Silicon Valley is compromised when economic interests come into play. The board's charter prioritizes the organization's mission of pursuing the public good over money, but the economic interests of investors have prevailed. Speculations about the reasons for Altman's ousting include accusations of pursuing additional funding via autocratic Mideast regimes. The incident shows that the board members of OpenAI, who were supposed to be responsible stewards of AI technology, may not have understood the consequences of their actions. The failure of corporate AI safety to protect humanity from runaway AI raises doubts about the ability of such groups to oversee super-intelligent technologies. Source : https://gizmodo.com/ai-safety-openai-sam-altman-ouster-back-microsoft-1851038439 submitted by /u/NuseAI [link] [comments]  ( 1 min )
    GraphQL AI for devs
    Hey there! We're building a developer oriented AI tools suite called - GraphQLAI Basically, we're building a GraphQL-based playground where crafting chat bots, linking them into a networks and other cool stuff faster than you can say "Hello, World!". We're on the brink of the big release (hopefully) and have created a waitlist you can join to get notified when the tools is ready. How much would it cost? ZERO, it's currently FREE. I guess we would need to think about the monetization at some point but definitely not in the nearest future. How does it work? We provide you the infrastucture to build things but you need to use your own OpenAI and Replicate keys. Check out the landing page for a sneak peek: GraphQLAI Direct waitlist link here: Join the Waitlist submitted by /u/oczekkk [link] [comments]  ( 1 min )
    AI Bot Builder Suggestions
    Happy Thanksgiving (if you celebrate it)! I'm a newbie to all of this & the more I read the more overwhelming it becomes so I was hoping for some suggestions. I'm wanting to build an AI Bot for a website. The bot will greet & ask a couple of questions. The user will then submit some information & the Bot will then ask questions based on that information. I was hoping that someone could point me in the right direction on a builder that is easy to build for beginners & can be added to a website. I was looking at Tidio & Claude. Are there any others? Since this will be on a public site, I need to be aware of how many tokens are used, right? Ideally I would have an AI Builder that can also search the internet, I would like to use search for another one I have in mind. Any help would be great....thanks again! submitted by /u/Training_Actuator_59 [link] [comments]  ( 1 min )
    Ai generated music using suno.ai
    submitted by /u/Substance_Technical [link] [comments]  ( 1 min )
    A deeper look at the Q* Model as a combination of A* algorithms and Deep Q-learning networks.
    Hey, folks! Buckle up because the recent buzz in the AI sphere has been nothing short of an intense rollercoaster. Rumors about a groundbreaking AI, enigmatically named Q* (pronounced Q-Star), have been making waves, closely tied to a chaotic series of events that rocked OpenAI and came to light after the abrupt firing of their CEO - Sam Altman ( u/samaltman ). There are several questions I would like to entertain, such as the impacts of Sam Altman's firing, the most probable reasons behind it, and the possible monopoly on highly efficient AI technologies that Microsoft is striving to have. However, all these things are too much for 1 Reddit post, so here I will attempt to explain why Q* is a BIG DEAL, as well as go more in-depth on the theory of combining Q-learning and A* algorithms. A…  ( 1 min )
    AI chip outfit Graphcore's sales to China hit by US export rules
    UK-based AI chip developer Graphcore is laying off staff in China and discontinuing sales in the country due to updated US export restrictions. The company had hoped to boost revenue in the Chinese market, but the restrictions prevent them from selling their Intelligence Processor Units (IPUs) in China. The export bans affect manufacturers worldwide as they cover the use of American manufacturing and design tools. Graphcore is not the only AI silicon company affected by these restrictions, as Nvidia also expects sales to China to be hit. Source : https://www.theregister.com/2023/11/23/brit_ai_chip_outfit_graphcore/ submitted by /u/NuseAI [link] [comments]  ( 1 min )
    OpenAI Unleashes Free Voice Chat Feature to All Mobile Users, Offering Siri-like Experience
    submitted by /u/Upbeat-Interaction13 [link] [comments]  ( 1 min )
    Frontier AI Regulation: Managing Emerging Risks to Public Safety
    submitted by /u/alcanthro [link] [comments]  ( 1 min )
    Reassessing the Impact of AI Evolution on Humanity: An Evolutionary Theory Perspective | hc:52661
    submitted by /u/alcanthro [link] [comments]  ( 1 min )
    Is anyone here working as Head of AI / Chief AI Officer?
    I'm looking to connect with anyone who may be working in a role akin to head of AI or CAIO. I've been seeing a lot of articles around how this is becoming an essential posting for many companies, Forbes, CIO, Zdnet and many others have covered it - and I'm trying to steer my career in this direction. While I am not deeply technical I've worked for tech companies my entire life and recently have been consulting with an AI agency doing training, enablement, and AI strategy for companies looking to get ahead of the curve. Anyone else here working in AI in non tech roles like consulting or strategy? Keen to engage. submitted by /u/zascar [link] [comments]  ( 1 min )
    Celebrating Thanksgiving with My Best Friend
    submitted by /u/Sea_Permit5660 [link] [comments]  ( 1 min )
    Should Larry Summers be on OpenAI's Board of Directors? I say NO. He works to stop action on climate change. He defends the top 1% at the expense of everyone else. Harvard's faculty voted to oust him as their president. He is NOT to be trusted. Disagree? Explain in comments.
    View Poll submitted by /u/Georgeo57 [link] [comments]  ( 1 min )
    Possible OpenAI's Q* breakthrough and DeepMind's AlphaGo-type systems plus LLMs
    tl;dr: OpenAI leaked AI breakthrough called Q*, acing grade-school math. It is hypothesized combination of Q-learning and A*. It was then refuted. DeepMind is working on something similar with Gemini, AlphaGo-style Monte Carlo Tree Search. Scaling these might be crux of planning for increasingly abstract goals and agentic behavior. Academic community has been circling around these ideas for a while. https://www.reuters.com/technology/sam-altmans-ouster-openai-was-precipitated-by-letter-board-about-ai-breakthrough-2023-11-22/ https://twitter.com/MichaelTrazzi/status/1727473723597353386 "Ahead of OpenAI CEO Sam Altman’s four days in exile, several staff researchers sent the board of directors a letter warning of a powerful artificial intelligence discovery that they said could threaten h…  ( 1 min )
    Sam Altman returns as CEO of OpenAI with Microsoft support
    submitted by /u/Fit-Code-5141 [link] [comments]  ( 1 min )
    Seems like we need a model compiler
    Quick tech segway, skip to the heading below if you already know compiler theory. Most people think of compilers as things that takes source code and produce a program. But a compiler is more general than that. A compiler is just a system that takes one set of symbols in and outputs a corresponding set of symbols. Compilers operate on three levels: token, syntax and semantics. In a programming language like Python, a token might be "while" or "=". Syntax might be "variable = value". A semantics might be "a variable that is assigned to is automatically recorded in the lookup table for local variables." Once you have identified the tokens, syntax and semantics of your input symbols then you just need to built the mapping between the input and output. In most compilers, this is the "emitter" which takes the internal representation of the input and walks through it, transforming it to the output symbols as it goes, applying various optimizations as it goes. This is a gross over-simplification and misses many nuances, technologies, approaches, etc. But it gets us where we need to be for this conversation. AI Model Compiling It seems to me that we have obvious need for optimizing compilers in the model space. The problem is identifying the syntax and semantics of a neural network. This seems like something that AI could tackle, however, without having to achieve AGI. Given a functional understanding of compiler optimization and of a model's structure, it should be possible to have an AI identify semantic patterns and re-arrange (even reduce) the syntax to get there. Immediately obvious problems: Training the compiler model seems challenging at best. Only very small models could be compiled (number of tokens in). Proving that the compiler does not damage the model is probably impossible other than empirically. Does anyone know if this is an area of active research? submitted by /u/Tyler_Zoro [link] [comments]  ( 1 min )
  • Open

    [D]AI Ethicists Sound the Alarm: What's Driving Their Call for Heavy Regulation?
    As AI's capabilities expand, so too do concerns about its ethical implications. At the forefront of this debate are AI ethicists, a growing group of experts who advocate for responsible AI development and deployment. AI ethicists raise a range of concerns, including the potential for AI systems to perpetuate bias and discrimination, the lack of transparency and accountability in AI decision-making, and the risks of AI misuse for surveillance, manipulation, and autonomous warfare. In response to these concerns, many AI ethicists are calling for heavy regulation of AI technologies. ​ AI Ethicists Sound the Alarm: What's Driving Their Call for Heavy Regulation? ​ submitted by /u/Xeiristotle [link] [comments]  ( 1 min )
    [D] how do you remember methods in papers you read?
    A few months ago I came across maximum mean discrepancy as a measure of distribution difference, and today I read this term and totally forgot what is means and had to find a youtube video to refresh my understanding. This happens a lot of times in my research. I feel like unless something is really basic (e.g. CNN, cross entropy, etc) and used a lot in my day-to-day model building, I easily forgot what I have read. I wonder is it just because I have a bad memory or I do not have a good way to organize information? submitted by /u/Illustrious-Pay-7516 [link] [comments]  ( 1 min )
    [R] Rethinking Open'sAI's Q-Learning : Insights from the Award-Winning 'Non-delusional Q-learning' Paper
    OpenAI's approach to Q-Learning has been drawing significant attention recently. However, there's a fundamental issue in the way Q-learning is typically implemented in deep learning and neural network environments. This concern is highlighted in the award-winning paper "Non-delusional Q-learning," presented at NeurIPS. The paper suggests a fundamental flaw in the blind application of Q-learning updates to deep neural networks. It points out that such updates can create a self-contradictory scenario where improving the network for the current batch of data inadvertently makes it less effective for other batches. This is akin to a situation in supervised learning where optimizing a network for a specific set of data may degrade its performance on other datasets. For more insights, the full paper can be accessed here: Non-delusional Q-learning Paper(Follow up ICML paper: Practical Non-delusional-Q Learning ) I'm curious about others' views on this topic. What do you think about the implications of these findings for the future of Q-learning in deep learning environments? submitted by /u/Even_Campaign7385 [link] [comments]  ( 1 min )
    [D] Learning Agents
    Are learning agents and Reinforcement Learning synonymous? For example in retail analytics, like assortment optimization submitted by /u/overigegebruiker12 [link] [comments]  ( 9 min )
    [D] Difference between CUDA and Tensor Cores
    CUDA cores (or shader cores in general) have long been used to compute graphics. A very often used operation in computer graphics are matrix multiplications, just like in deep learning. Back in the days (AlexNet) NNs were computed using shader cores, but now have completely moved to be computed on Tensor cores. My question are: Why have these workloads been seperated? (Yes obviously the tensor cores are more specialized and leave out a bunch of unnecessary operations, but how and why not integrate it into the CUDA cores to boost MM operations for computer graphics?) Why isn't the workload offloaded to the other cores when the mathematical operations are the same What makes tensor cores so much more efficient and faster? submitted by /u/3DHydroPrints [link] [comments]  ( 1 min )
    [D] Hugging face and Timm
    Hello all, I am a PyTorch user I work in CV, I usually use the PyTorch models. However, I see people use timm in research papers to train their models I don't understand what it is timm is it a new framework like PyTorch? Further, when I click https://pypi.org/project/timm/ homepage it takes me to hugging face GitHub https://github.com/huggingface/pytorch-image-models is there any connection between timm and hugging face many of my friends use hugging face but I also don't know about hugging face I use simple PyTorch and torchvision.models. If you have any resources to understand about timm please let me know. Please state also the advantages of using timm and hugging face against PyTorch. submitted by /u/NoEntertainment6225 [link] [comments]  ( 1 min )
    [D] 8 bit with bitsandbytes using 16bit memory ?
    I tried running CodeLlama 13B on A100 on bare-metal and Colab and saw that on-loading memory consumption is around 26 GB which is in line with 16 bit while the same model loaded in 4bit takes 9 GB. Going by that logic I am expecting 13b on 8bit to take around 16 GB but that's not the case. Inference speed is also very poor but I can understand that's a known issue. Anything I am missing here or 8-bit loading is inefficient? For context, I am using transformers->4.35.2 bitsandbytes->0.41.2.post2 Model loading is done using "AutoModelForCausalLM.from_pretrained" with "--load_in_8bit" set as true. submitted by /u/Jaskirat8 [link] [comments]  ( 1 min )
    [P] X—LLM: Few lines of code to train your own 7B LLM in Colab using cutting edge techniques like QLoRA
    Like many of you, I often need to train LLMs (Large Language Models). Code hops from one project to another, and it's easy to lose track, resulting in several iterations of the same training process. X—LLM is a solution. It’s a streamlined, user-friendly library designed for efficient model training, offering advanced techniques and customizable options within the Hugging Face ecosystem. Features: - LoRA, QLoRA and fusing - Flash Attention 2 - Gradient checkpointing - bitsandbytes quantization - GPTQ (including post-training quantization) - W&B experiment tracking - Simple training on multiple GPUs at once using DeepSpeed or FSDP Use cases: - Create production-ready solutions or fast prototypes. X—LLM works in both configurations - Finetune a 7B model with 334 million tokens (1.1 million dialogues) for just 50$ - Automatically save each checkpoint during training to the Hugging Face Hub and don't lose any progress - Quantize a model using GPTQ. Reduce 7B Mistral model from 15 GB to 4.3 GB and increase inference speed Github repo: https://github.com/BobaZooba/xllm You can train 7B model, fuse LoRA and upload ready-to-use model to the Hugging Face Hub. All in a single Colab! Link The library has gained 100 stars in less than a day, and now it's almost at 200. People are using it, training models in both Colab and multi-GPU setups. Meanwhile, I'm supporting X—LLM users and currently implementing the most requested feature - DPO. Code example I suggest that you try training your own models and see for yourself how simple it is. If you like it, please consider giving the project a star on GitHub. submitted by /u/DesperatePresence473 [link] [comments]  ( 1 min )
    [P] Comgra: A library for debugging and understanding neural networks
    I'm a machine learning engineer and researcher. I got fed up with how difficult it is to understand why neural networks behave the way they do, so i wrote a library to help with it. Comgra (computation graph analysis) is a library you can use with pytorch to extract all the tensor data you care about and visualize it graphically in a browser. This allows for a much more detailed analysis of what is happening than the usual approach of using tensorboard. You can go investigate tensors as training proceeds, drill down into individual neurons, inspect single data sets that are of special interest to you, track gradients, compare statistics between different training runs, and more. This tool has saved me a ton of time in my research by letting me check my hypotheses much more quickly than normal and by helping me understand how the different parts of my network really interact. I hope this tool can save other people just as much time as it did me. I'm also open for suggestions on how to improve it further: Since I'm already gathering and visualizing a lot of network information, adding more automated analysis would not be much extra work. submitted by /u/Smart-Emu5581 [link] [comments]  ( 1 min )
    [p] Simple LLM trainer script!
    Hey everyone, I came across a post recently where someone found it hard to find simple scripts to fine-tune LLMs with their data. So I put together a repo where you can just type out your requirements in a config.yaml file and the training happens flawlessly based on that. Here's the repo - LLM-Trainer It is still a wip so lemme know if guys want some other features added to this. ​ TIA. submitted by /u/Dry_Long3157 [link] [comments]  ( 1 min )
    [D] In Search of an Ideal Videos Dataset for Object Detection Models – Seeking Recommendations
    Hello everyone, I'm a computer science student working on my final year project, and I'm in need of open source datasets for my project, which revolves around using YOLO (You Only Look Once) for object detection. I would greatly appreciate any recommendations or resources for datasets that could be suitable for this purpose. Thank you in advance for your help! submitted by /u/Nabobery [link] [comments]  ( 1 min )
    Looking for Resources on DeepFakes and Hyperstition [R]
    Hey, I'm a sociology undergrad, and I came across the course of Artificial Intelligence and Media in Contemporary Societies. I found it really interesting, but at the same time concerning. We were discussing a lot about how it would be difficult to control conspiracies (what they called hyperstitions), and the use of AI for political far-right ideology would increase. I finished the course recently, but I still see myself having some question that some of you might have the answer to. Feel free to add something extra if you feel like it, I'm open for debate and other perspectives. Do you know any academic that has good papers about it? We focused more on the anthropological discussion of the issue, and I think we didn't have a base of academics outside that. It can be economics or philosophers, something in this spectrum. Have you experienced something related to conspiration within internet, or do you know/ studied a particular case-study that might be relevant if I want to deepen my knowledge on the subject? Do you know any program, where I can have access to this type of technology (E.g., deepfake creation, ChatGPT vibes) so I can understand how difficult or easy it is for a person that does not have the technical skills to use it. Do you know any groups of people that have used deepfake mechanisms in an organised manner for personal vendettas? I'd appreciate any news articles/ about any incidents you might have heard about, it would be nice too; Do you know any Discord/Telegram/4chan group where I can discuss about this topic. Any answer would be appreciated, so thanks in advance! submitted by /u/visualbobby [link] [comments]  ( 1 min )
    [D] OpenAl vs. Llamalndex: Evaluating RAG Performance
    A comparison of the OpenAl Assistant API vs Llamalndex, where Llamalndex perfromed materially better than OpenAl. Here's the TLDR; OpenAl's Assistants API: Showcases high efficiency with single documents but faces challenges when managing multiple documents. Llamalndex: Exhibits consistent performance across both single and multiple document formats, outshining OpenAl in more complex scenarios. submitted by /u/tombenom [link] [comments]  ( 9 min )
    [D] Open-source LLM
    Hello everybody, I'm looking for open-source LLM for work to be able to perform things like summarising a paper, or to be queryable: "Create a graph for specific parameters” based on a given amount of fed documents. Also, very important to be able to run it offline/locally for security reasons. If anybody has suggestions or some sort of advice I would be delighted. Thank you. submitted by /u/Flat-Measurement-417 [link] [comments]  ( 1 min )
    [R] How to evaluate generative models?
    Hi all. I was researching generative model evaluation and found this post interesting: https://deepsense.ai/evaluation-derangement-syndrome-in-gpu-poor-genai A lot of it kind of corresponds to what I see happening in the industry and feels like a good fit here submitted by /u/Consistent-Thing-966 [link] [comments]  ( 1 min )
    [R] Prompting Frameworks for Large Language Models: A Survey
    submitted by /u/davidmezzetti [link] [comments]  ( 1 min )
    [R] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
    Code: https://github.com/hao-ai-lab/LookaheadDecoding Blog post: https://lmsys.org/blog/2023-11-21-lookahead-decoding/ Description: We introduce lookahead decoding, a new, exact, and parallel decoding algorithm to accelerate LLM inference. Lookahead decoding breaks the sequential dependency in autoregressive decoding by concurrently extracting and verifying n-grams directly with the LLM, utilizing the Jacobi iteration method. Lookahead decoding functions without the need for a draft model or a data store. It linearly decreases the number of decoding steps directly correlating with the log(FLOPs) used per decoding step. Below is a demo of lookahead decoding accelerating LLaMa-2-Chat 7B generation: https://i.redd.it/k61qtr4zz22c1.gif ​ submitted by /u/APaperADay [link] [comments]  ( 9 min )
    [p] User friendly digital audio signal processing
    Hello, the faculty of music at my university in Norway has been granted a lot of money to focus on the use of technology in music and music education. I have been accepted into a position focusing on AI in music. I am first and foremost a musician, so i am working on getting a fundamental grasp on machine learning. I was wondering if you know of any website or software that can be used to train a signal processing model. The idea being that i can give an unprocessed signal and a processed signal and have mode learn the differences. Like a guitar playing without a distortion pedal and with one. Thank you very much. If there are any misconceptions in my question please help educate me. submitted by /u/Seb1234123 [link] [comments]  ( 1 min )
    [Discussion] Tips on getting started implementing ML papers in code?
    I've been diving a lot deeper into some interesting neural network papers recently and I'm looking to try and implement some of the models detailed in the papers. In general, I know that many papers include the code or I can just google the code to implement the model but I want to push myself to start implementing from scratch more. Could anyone offer some tips on how they got started or gained the skills to be able to implement a model effectively within a few hours? Any advice would be much appreciated! submitted by /u/ExitWest7 [link] [comments]  ( 9 min )
    [R] FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores
    Paper: https://arxiv.org/abs/2311.05908 Code: https://github.com/HazyResearch/flash-fft-conv Blog post: https://hazyresearch.stanford.edu/blog/2023-11-13-flashfftconv Abstract: Convolution models with long filters have demonstrated state-of-the-art reasoning abilities in many long-sequence tasks but lag behind the most optimized Transformers in wall-clock time. A major bottleneck is the Fast Fourier Transform (FFT)--which allows long convolutions to run in O(NlogN) time in sequence length N but has poor hardware utilization. In this paper, we study how to optimize the FFT convolution. We find two key bottlenecks: the FFT does not effectively use specialized matrix multiply units, and it incurs expensive I/O between layers of the memory hierarchy. In response, we propose FlashFFTConv.…  ( 1 min )
    [D] Am I making a mistake?
    By walking away (for now) from 10 years of health care data analytics experience? I just graduated with my Computer Science degree and got offered an entry level SWE role working more on a SWE/cybersecurity type of role. My intent was to start a MLE role as soon as I graduated back in June 2023 but of course, I keep graduating into recessions lol. My previous degrees were a BS in Biology and a Master's in Public Health. Currently in a MS in CS program. But a big pro is that I would be working with very highly trained Computer Scientists/SWEs willing to train me. I am a bit tired of some of the people I work with in data analytics coming from all kinds of backgrounds which is great, but I am interested in being able to be trained by very knowledgeable people. For the last 10 years, I was the one expected to know everything with little to no training. submitted by /u/computersorceress [link] [comments]  ( 9 min )
    [D] Exclusive: Sam Altman's ouster at OpenAI was precipitated by letter to board about AI breakthrough
    According to one of the sources, long-time executive Mira Murati told employees on Wednesday that a letter about the AI breakthrough called Q* (pronounced Q-Star), precipitated the board's actions. The maker of ChatGPT had made progress on Q*, which some internally believe could be a breakthrough in the startup's search for superintelligence, also known as artificial general intelligence (AGI), one of the people told Reuters. OpenAI defines AGI as AI systems that are smarter than humans. https://www.reuters.com/technology/sam-altmans-ouster-openai-was-precipitated-by-letter-board-about-ai-breakthrough-2023-11-22/ submitted by /u/blabboy [link] [comments]  ( 9 min )
  • Open

    DQN does not learn anything despite exploration
    Heey everyone, I am facing an issue with DRL (again) and I need your help. well, after completing the training process, the reward showed an increasing pattern. However, to verify if my agent learns something, I run two instances one with DQN with no exploration and a constant reward = 1 whatever the action is (pink curve) and the other instance with exploration and a reward that varies with the selected action (yellow curve). However they both give me the same performance in the metric of interest; the exploration in the second instance is epsilon-greedy : self.epsilon -= self.epsilon_decay self.epsilon_decay = (self.epsilon - self.epsilon_min) / self.explore_step where self.explore_step is set to 400000 steps, and self.epsilon_min = 0.01 As I know the loss must decrease during training, however, it keeps fluctuate and fluctuations increase during training. I´ve tried different ranges of the reward, but the issue persists. Could someone tell me the reason behind this ? ​ The metric of interest is the total cost the loss of the second instance submitted by /u/GuavaAgreeable208 [link] [comments]  ( 1 min )
    Learning Agents
    Are learning agents and Reinforcement Learning synonymous? submitted by /u/overigegebruiker12 [link] [comments]  ( 1 min )
    value iteration/bellman backup operator
    hello! I am a college student learning RL and I’m struggling to get a few concepts. I’ve watched videos on youtube but I still don’t understand it. I was wondering if anyone could help me/explain the concepts a little better. The questions I have are related to the value iteration algorithm, which is something I (mostly) get theory-wise, but am confused on turning that theory into an algorithm. I’ve seen algorithms for value iteration and I’m not sure why they work. I also am completely confused on what the bellman backup operator is. I was introduced to it but I don’t understand it and its relation to value iteration. I would greatly appreciate any help with these topics. Feel free to DM. submitted by /u/elaraxelara [link] [comments]  ( 1 min )
    Meta-reinforcement learning via orbitofrontal cortex
    Paper: https://www.nature.com/articles/s41593-023-01485-3 Code: https://github.com/ryhattori/MetaRL-PRL Dataset: https://zenodo.org/records/8378063 Abstract: The meta-reinforcement learning (meta-RL) framework, which involves RL over multiple timescales, has been successful in training deep RL models that generalize to new environments. It has been hypothesized that the prefrontal cortex may mediate meta-RL in the brain, but the evidence is scarce. Here we show that the orbitofrontal cortex (OFC) mediates meta-RL. We trained mice and deep RL models on a probabilistic reversal learning task across sessions during which they improved their trial-by-trial RL policy through meta-learning. Ca2+/calmodulin-dependent protein kinase II-dependent synaptic plasticity in OFC was necessary for this meta-learning but not for the within-session trial-by-trial RL in experts. After meta-learning, OFC activity robustly encoded value signals, and OFC inactivation impaired the RL behaviors. Longitudinal tracking of OFC activity revealed that meta-learning gradually shapes population value coding to guide the ongoing behavioral policy. Our results indicate that two distinct RL algorithms with distinct neural mechanisms and timescales coexist in OFC to support adaptive decision-making. ​ submitted by /u/APaperADay [link] [comments]  ( 1 min )
    Investing in a desktop for DRL
    I recently started studying and researching DRL. All of the environments I am running quite simple and fast, most written in Numpy. However, runs often take too long. I am currently running on a laptop with: 12th Gen Intel(R) Core(TM) i7-12800HX 2.00 GHz, 32GB RAM and NVIDIA RTX A1000. I was thinking of investing in a desktop PC with: AMD Ryzen 97950X 64GB RAM No GPU in the beginning, just use the graphics chip in the processor. My understanding that GPU is far less important than CPU in DRL. Do you see this as a worthy investment? Would this upgrade speed up my calculations? EDIT: Also, I would like to use Linux for the first time on this new PC. Is the OS expected to make a difference in running DRL? submitted by /u/Muscle_Robot [link] [comments]  ( 1 min )
    OpenAI's supposed Q* breakthrough heralds a new age in RL
    Totally exaggerated title, but given the resurgence in RL interest post-RLHF I don't see the trend stopping anytime soon. Seems like multiple research threads are converging. submitted by /u/wardellinthehouse [link] [comments]  ( 1 min )
  • Open

    Search algorithm reveals nearly 200 new kinds of CRISPR systems
    By analyzing bacterial data, researchers have discovered thousands of rare new CRISPR systems that have a range of functions and could enable gene editing, diagnostics, and more.  ( 10 min )
  • Open

    Difference between UpSampling2D and Conv2DTranspose used in U-Net and GAN
    submitted by /u/keghn [link] [comments]  ( 1 min )
  • Open

    Hungry for Gaming: 18 New Games to Join GeForce NOW
    GeForce NOW is bringing 18 new games to the cloud this week, part of a gratitude-filled GFN Thursday. A collaboration between Chromebook Plus, CD PROJEKT RED and GeForce NOW brought an immersive 3D activation to Times Square over the weekend, containing a hidden Easter egg for Cyberpunk 2077 players. Plus, this holiday season, give the Read article >  ( 5 min )
  • Open

    Database reconstruction attacks
    In 2018, three researchers from the US Census Bureau published a paper entitled “Understanding Database Reconstruction Attacks on Public Data.” [1] The article showed that private data on many individuals could be reverse engineered from public data. As I wrote about a few days ago, census blocks are at the bottom of the US Census […] Database reconstruction attacks first appeared on John D. Cook.  ( 6 min )
  • Open

    Enhancing Multi-modal Cooperation via Fine-grained Modality Valuation. (arXiv:2309.06255v2 [cs.CV] UPDATED)
    One primary topic of multi-modal learning is to jointly incorporate heterogeneous information from different modalities. However, most models often suffer from unsatisfactory multi-modal cooperation, which could not jointly utilize all modalities well. Some methods are proposed to identify and enhance the worse learnt modality, but are often hard to provide the fine-grained observation of multi-modal cooperation at sample-level with theoretical support. Hence, it is essential to reasonably observe and improve the fine-grained cooperation between modalities, especially when facing realistic scenarios where the modality discrepancy could vary across different samples. To this end, we introduce a fine-grained modality valuation metric to evaluate the contribution of each modality at sample-level. Via modality valuation, we regretfully observe that the multi-modal model tends to rely on one specific modality, resulting in other modalities being low-contributing. We further analyze this issue and improve cooperation between modalities by enhancing the discriminative ability of low-contributing modalities in a targeted manner. Overall, our methods reasonably observe the fine-grained uni-modal contribution at sample-level and achieve considerable improvement on different multi-modal models.
    Data is often loadable in short depth: Quantum circuits from tensor networks for finance, images, fluids, and proteins. (arXiv:2309.13108v2 [quant-ph] UPDATED)
    Though there has been substantial progress in developing quantum algorithms to study classical datasets, the cost of simply loading classical data is an obstacle to quantum advantage. When the amplitude encoding is used, loading an arbitrary classical vector requires up to exponential circuit depths with respect to the number of qubits. Here, we address this "input problem" with two contributions. First, we introduce a circuit compilation method based on tensor network (TN) theory. Our method -- AMLET (Automatic Multi-layer Loader Exploiting TNs) -- proceeds via careful construction of a specific TN topology and can be tailored to arbitrary circuit depths. Second, we perform numerical experiments on real-world classical data from four distinct areas: finance, images, fluid mechanics, and proteins. To the best of our knowledge, this is the broadest numerical analysis to date of loading classical data into a quantum computer. Consistent with other recent work in this area, the required circuit depths are often several orders of magnitude lower than the exponentially-scaling general loading algorithm would require. Besides introducing a more efficient loading algorithm, this work demonstrates that many classical datasets are loadable in depths that are much shorter than previously expected, which has positive implications for speeding up classical workloads on quantum computers.
    Absolute Policy Optimization. (arXiv:2310.13230v2 [cs.LG] UPDATED)
    In recent years, trust region on-policy reinforcement learning has achieved impressive results in addressing complex control tasks and gaming scenarios. However, contemporary state-of-the-art algorithms within this category primarily emphasize improvement in expected performance, lacking the ability to control over the worst-case performance outcomes. To address this limitation, we introduce a novel objective function; by optimizing which, it will lead to guaranteed monotonic improvement in the lower bound of near-total performance samples (absolute performance). Considering this groundbreaking theoretical advancement, we then refine this theoretically grounded algorithm through a series of approximations, resulting in a practical solution called Absolute Policy Optimization (APO). Our experiments demonstrate the effectiveness of our approach across challenging continuous control benchmark tasks and extend its applicability to mastering Atari games. Our findings reveal that APO significantly outperforms state-of-the-art policy gradient algorithms, resulting in substantial improvements in both expected performance and worst-case performance.
    Computing Approximate $\ell_p$ Sensitivities. (arXiv:2311.04158v2 [cs.LG] UPDATED)
    Recent works in dimensionality reduction for regression tasks have introduced the notion of sensitivity, an estimate of the importance of a specific datapoint in a dataset, offering provable guarantees on the quality of the approximation after removing low-sensitivity datapoints via subsampling. However, fast algorithms for approximating $\ell_p$ sensitivities, which we show is equivalent to approximate $\ell_p$ regression, are known for only the $\ell_2$ setting, in which they are termed leverage scores. In this work, we provide efficient algorithms for approximating $\ell_p$ sensitivities and related summary statistics of a given matrix. In particular, for a given $n \times d$ matrix, we compute $\alpha$-approximation to its $\ell_1$ sensitivities at the cost of $O(n/\alpha)$ sensitivity computations. For estimating the total $\ell_p$ sensitivity (i.e. the sum of $\ell_p$ sensitivities), we provide an algorithm based on importance sampling of $\ell_p$ Lewis weights, which computes a constant factor approximation to the total sensitivity at the cost of roughly $O(\sqrt{d})$ sensitivity computations. Furthermore, we estimate the maximum $\ell_1$ sensitivity, up to a $\sqrt{d}$ factor, using $O(d)$ sensitivity computations. We generalize all these results to $\ell_p$ norms for $p > 1$. Lastly, we experimentally show that for a wide class of matrices in real-world datasets, the total sensitivity can be quickly approximated and is significantly smaller than the theoretical prediction, demonstrating that real-world datasets have low intrinsic effective dimensionality.
    Optimal Locally Private Nonparametric Classification with Public Data. (arXiv:2311.11369v2 [stat.ML] UPDATED)
    In this work, we investigate the problem of public data-assisted non-interactive LDP (Local Differential Privacy) learning with a focus on non-parametric classification. Under the posterior drift assumption, we for the first time derive the mini-max optimal convergence rate with LDP constraint. Then, we present a novel approach, the locally private classification tree, which attains the mini-max optimal convergence rate. Furthermore, we design a data-driven pruning procedure that avoids parameter tuning and produces a fast converging estimator. Comprehensive experiments conducted on synthetic and real datasets show the superior performance of our proposed method. Both our theoretical and experimental findings demonstrate the effectiveness of public data compared to private data, which leads to practical suggestions for prioritizing non-private data collection.
    Continual Learning: Applications and the Road Forward. (arXiv:2311.11908v2 [cs.LG] UPDATED)
    Continual learning is a sub-field of machine learning, which aims to allow machine learning models to continuously learn on new data, by accumulating knowledge without forgetting what was learned in the past. In this work, we take a step back, and ask: "Why should one care about continual learning in the first place?". We set the stage by surveying recent continual learning papers published at three major machine learning conferences, and show that memory-constrained settings dominate the field. Then, we discuss five open problems in machine learning, and even though they seem unrelated to continual learning at first sight, we show that continual learning will inevitably be part of their solution. These problems are model-editing, personalization, on-device learning, faster (re-)training and reinforcement learning. Finally, by comparing the desiderata from these unsolved problems and the current assumptions in continual learning, we highlight and discuss four future directions for continual learning research. We hope that this work offers an interesting perspective on the future of continual learning, while displaying its potential value and the paths we have to pursue in order to make it successful. This work is the result of the many discussions the authors had at the Dagstuhl seminar on Deep Continual Learning, in March 2023.
    BrainWash: A Poisoning Attack to Forget in Continual Learning. (arXiv:2311.11995v2 [cs.LG] UPDATED)
    Continual learning has gained substantial attention within the deep learning community, offering promising solutions to the challenging problem of sequential learning. Yet, a largely unexplored facet of this paradigm is its susceptibility to adversarial attacks, especially with the aim of inducing forgetting. In this paper, we introduce "BrainWash," a novel data poisoning method tailored to impose forgetting on a continual learner. By adding the BrainWash noise to a variety of baselines, we demonstrate how a trained continual learner can be induced to forget its previously learned tasks catastrophically, even when using these continual learning baselines. An important feature of our approach is that the attacker requires no access to previous tasks' data and is armed merely with the model's current parameters and the data belonging to the most recent task. Our extensive experiments highlight the efficacy of BrainWash, showcasing degradation in performance across various regularization-based continual learning methods.
    Hierarchical Joint Graph Learning and Multivariate Time Series Forecasting. (arXiv:2311.12630v1 [cs.LG])
    Multivariate time series is prevalent in many scientific and industrial domains. Modeling multivariate signals is challenging due to their long-range temporal dependencies and intricate interactions--both direct and indirect. To confront these complexities, we introduce a method of representing multivariate signals as nodes in a graph with edges indicating interdependency between them. Specifically, we leverage graph neural networks (GNN) and attention mechanisms to efficiently learn the underlying relationships within the time series data. Moreover, we suggest employing hierarchical signal decompositions running over the graphs to capture multiple spatial dependencies. The effectiveness of our proposed model is evaluated across various real-world benchmark datasets designed for long-term forecasting tasks. The results consistently showcase the superiority of our model, achieving an average 23\% reduction in mean squared error (MSE) compared to existing models.
    Approximating Two-Layer Feedforward Networks for Efficient Transformers. (arXiv:2310.10837v3 [cs.LG] UPDATED)
    How to reduce compute and memory requirements of neural networks (NNs) without sacrificing performance? Many recent works use sparse Mixtures of Experts (MoEs) to build resource-efficient large language models (LMs). Here we introduce several novel perspectives on MoEs, presenting a general framework that unifies various methods to approximate two-layer NNs (e.g., feedforward blocks of Transformers), including product-key memories (PKMs). Leveraging insights from this framework, we propose methods to improve both MoEs and PKMs. Unlike prior work that compares MoEs with dense baselines under the compute-equal condition, our evaluation condition is parameter-equal, which is crucial to properly evaluate LMs. We show that our MoEs are competitive with the dense Transformer-XL on both the WikiText-103 and enwiki8 datasets at two different scales, while being much more resource efficient. This demonstrates that MoEs are relevant not only to extremely large LMs but also to any-scale resource-efficient LMs. Our code is public.
    Multi-Objective Optimization Using the R2 Utility. (arXiv:2305.11774v2 [math.OC] UPDATED)
    The goal of multi-objective optimization is to identify a collection of points which describe the best possible trade-offs between the multiple objectives. In order to solve this vector-valued optimization problem, practitioners often appeal to the use of scalarization functions in order to transform the multi-objective problem into a collection of single-objective problems. This set of scalarized problems can then be solved using traditional single-objective optimization techniques. In this work, we formalise this convention into a general mathematical framework. We show how this strategy effectively recasts the original multi-objective optimization problem into a single-objective optimization problem defined over sets. An appropriate class of objective functions for this new problem is the R2 utility function, which is defined as a weighted integral over the scalarized optimization problems. We show that this utility function is a monotone and submodular set function, which can be optimised effectively using greedy optimization algorithms. We analyse the performance of these greedy algorithms both theoretically and empirically. Our analysis largely focusses on Bayesian optimization, which is a popular probabilistic framework for black-box optimization.
    Multi-channel Speech Separation Using Spatially Selective Deep Non-linear Filters. (arXiv:2304.12023v2 [eess.AS] UPDATED)
    In a multi-channel separation task with multiple speakers, we aim to recover all individual speech signals from the mixture. In contrast to single-channel approaches, which rely on the different spectro-temporal characteristics of the speech signals, multi-channel approaches should additionally utilize the different spatial locations of the sources for a more powerful separation especially when the number of sources increases. To enhance the spatial processing in a multi-channel source separation scenario, in this work, we propose a deep neural network (DNN) based spatially selective filter (SSF) that can be spatially steered to extract the speaker of interest by initializing a recurrent neural network layer with the target direction. We compare the proposed SSF with a common end-to-end direct separation (DS) approach trained using utterance-wise permutation invariant training (PIT), which only implicitly learns to perform spatial filtering. We show that the SSF has a clear advantage over a DS approach with the same underlying network architecture when there are more than two speakers in the mixture, which can be attributed to a better use of the spatial information. Furthermore, we find that the SSF generalizes much better to additional noise sources that were not seen during training and to scenarios with speakers positioned at a similar angle.
    DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining. (arXiv:2305.10429v4 [cs.CL] UPDATED)
    The mixture proportions of pretraining data domains (e.g., Wikipedia, books, web text) greatly affect language model (LM) performance. In this paper, we propose Domain Reweighting with Minimax Optimization (DoReMi), which first trains a small proxy model using group distributionally robust optimization (Group DRO) over domains to produce domain weights (mixture proportions) without knowledge of downstream tasks. We then resample a dataset with these domain weights and train a larger, full-sized model. In our experiments, we use DoReMi on a 280M-parameter proxy model to set the domain weights for training an 8B-parameter model (30x larger) more efficiently. On The Pile, DoReMi improves perplexity across all domains, even when it downweights a domain. DoReMi improves average few-shot downstream accuracy by 6.5% points over a baseline model trained using The Pile's default domain weights and reaches the baseline accuracy with 2.6x fewer training steps. On the GLaM dataset, DoReMi, which has no knowledge of downstream tasks, even matches the performance of using domain weights tuned on downstream tasks.
    Is TinyML Sustainable? Assessing the Environmental Impacts of Machine Learning on Microcontrollers. (arXiv:2301.11899v3 [cs.LG] UPDATED)
    The sustained growth of carbon emissions and global waste elicits significant sustainability concerns for our environment's future. The growing Internet of Things (IoT) has the potential to exacerbate this issue. However, an emerging area known as Tiny Machine Learning (TinyML) has the opportunity to help address these environmental challenges through sustainable computing practices. TinyML, the deployment of machine learning (ML) algorithms onto low-cost, low-power microcontroller systems, enables on-device sensor analytics that unlocks numerous always-on ML applications. This article discusses both the potential of these TinyML applications to address critical sustainability challenges, as well as the environmental footprint of this emerging technology. Through a complete life cycle analysis (LCA), we find that TinyML systems present opportunities to offset their carbon emissions by enabling applications that reduce the emissions of other sectors. Nevertheless, when globally scaled, the carbon footprint of TinyML systems is not negligible, necessitating that designers factor in environmental impact when formulating new devices. Finally, we outline research directions to enable further sustainable contributions of TinyML.
    Predict, Refine, Synthesize: Self-Guiding Diffusion Models for Probabilistic Time Series Forecasting. (arXiv:2307.11494v2 [cs.LG] UPDATED)
    Diffusion models have achieved state-of-the-art performance in generative modeling tasks across various domains. Prior works on time series diffusion models have primarily focused on developing conditional models tailored to specific forecasting or imputation tasks. In this work, we explore the potential of task-agnostic, unconditional diffusion models for several time series applications. We propose TSDiff, an unconditionally-trained diffusion model for time series. Our proposed self-guidance mechanism enables conditioning TSDiff for downstream tasks during inference, without requiring auxiliary networks or altering the training procedure. We demonstrate the effectiveness of our method on three different time series tasks: forecasting, refinement, and synthetic data generation. First, we show that TSDiff is competitive with several task-specific conditional forecasting methods (predict). Second, we leverage the learned implicit probability density of TSDiff to iteratively refine the predictions of base forecasters with reduced computational overhead over reverse diffusion (refine). Notably, the generative performance of the model remains intact -- downstream forecasters trained on synthetic samples from TSDiff outperform forecasters that are trained on samples from other state-of-the-art generative time series models, occasionally even outperforming models trained on real data (synthesize).
    BundleMoCap: Efficient, Robust and Smooth Motion Capture from Sparse Multiview Videos. (arXiv:2311.12679v1 [cs.CV])
    Capturing smooth motions from videos using markerless techniques typically involves complex processes such as temporal constraints, multiple stages with data-driven regression and optimization, and bundle solving over temporal windows. These processes can be inefficient and require tuning multiple objectives across stages. In contrast, BundleMoCap introduces a novel and efficient approach to this problem. It solves the motion capture task in a single stage, eliminating the need for temporal smoothness objectives while still delivering smooth motions. BundleMoCap outperforms the state-of-the-art without increasing complexity. The key concept behind BundleMoCap is manifold interpolation between latent keyframes. By relying on a local manifold smoothness assumption, we can efficiently solve a bundle of frames using a single code. Additionally, the method can be implemented as a sliding window optimization and requires only the first frame to be properly initialized, reducing the overall computational burden. BundleMoCap's strength lies in its ability to achieve high-quality motion capture results with simplicity and efficiency. More details can be found at https://moverseai.github.io/bundle/.
    An actor-critic algorithm with policy gradients to solve the job shop scheduling problem using deep double recurrent agents. (arXiv:2110.09076v2 [math.OC] UPDATED)
    There is a growing interest in integrating machine learning techniques and optimization to solve challenging optimization problems. In this work, we propose a deep reinforcement learning methodology for the job shop scheduling problem (JSSP). The aim is to build up a greedy-like heuristic able to learn on some distribution of JSSP instances, different in the number of jobs and machines. The need for fast scheduling methods is well known, and it arises in many areas, from transportation to healthcare. We model the JSSP as a Markov Decision Process and then we exploit the efficacy of reinforcement learning to solve the problem. We adopt an actor-critic scheme, where the action taken by the agent is influenced by policy considerations on the state-value function. The procedures are adapted to take into account the challenging nature of JSSP, where the state and the action space change not only for every instance but also after each decision. To tackle the variability in the number of jobs and operations in the input, we modeled the agent using two incident LSTM models, a special type of deep neural network. Experiments show the algorithm reaches good solutions in a short time, proving that is possible to generate new greedy heuristics just from learning-based methodologies. Benchmarks have been generated in comparison with the commercial solver CPLEX. As expected, the model can generalize, to some extent, to larger problems or instances originated by a different distribution from the one used in training.
    Personas as a Way to Model Truthfulness in Language Models. (arXiv:2310.18168v3 [cs.CL] UPDATED)
    Large Language Models (LLMs) are trained on vast amounts of text from the internet, which contains both factual and misleading information about the world. Can language models discern truth from falsehood in this contradicting data? Expanding on the view that LLMs can model different communicative agents, we present the persona hypothesis: LLMs can cluster agents into personas using common features of their generations. For instance, a truthful persona is a group of agents that are likely to produce truthful text and that share similar features like formal writing styles and scientific references. By modeling this persona, LLMs can generalize truthfulness beyond the specific contexts in which each agent generated the training text. For example, the model can infer that the agent ``Wikipedia'' will behave truthfully on topics that were only generated by ``Science'' because they both belong to the truthful persona. We show evidence for the persona hypothesis via two observations: (1) we can probe whether a model's answer will be truthful before it is generated; (2) finetuning a model on a set of facts improves its truthfulness on unseen topics. Next, using arithmetics as a synthetic environment, we show that language models can separate true and false statements, and generalize truthfulness across agents; but only if agents in the training data share a truthful generative process that enables the creation of a truthful persona. Overall, our findings suggest that models can exploit hierarchical structures in the data to learn abstract concepts like truthfulness.
    LASER: A Neuro-Symbolic Framework for Learning Spatial-Temporal Scene Graphs with Weak Supervision. (arXiv:2304.07647v2 [cs.CV] UPDATED)
    We propose LASER, a neuro-symbolic approach to learn semantic video representations that capture rich spatial and temporal properties in video data by leveraging high-level logic specifications. In particular, we formulate the problem in terms of alignment between raw videos and spatio-temporal logic specifications. The alignment algorithm leverages a differentiable symbolic reasoner and a combination of contrastive, temporal, and semantics losses. It effectively and efficiently trains low-level perception models to extract fine-grained video representation in the form of a spatio-temporal scene graph that conforms to the desired high-level specification. In doing so, we explore a novel methodology that weakly supervises the learning of video semantic representations through logic specifications. We evaluate our method on two datasets with rich spatial and temporal specifications: 20BN-Something-Something and MUGEN. We demonstrate that our method learns better fine-grained video semantics than existing baselines.
    Empirical Risk Minimization with Relative Entropy Regularization. (arXiv:2211.06617v3 [math.ST] UPDATED)
    The empirical risk minimization (ERM) problem with relative entropy regularization (ERM-RER) is investigated under the assumption that the reference measure is a {\sigma}-finite measure, and not necessarily a probability measure. Under this assumption, which leads to a generalization of the ERM-RER problem allowing a larger degree of flexibility for incorporating prior knowledge, numerous relevant properties are stated. Among these properties, the solution to this problem, if it exists, is shown to be a unique probability measure, often mutually absolutely continuous with the reference measure. Such a solution exhibits a probably-approximately-correct guarantee for the ERM problem independently of whether the latter possesses a solution. For a fixed dataset, the empirical risk is shown to be a sub-Gaussian random variable when the models are sampled from the solution to the ERM-RER problem. The generalization capabilities of the solution to the ERM-RER problem (the Gibbs algorithm) are studied via the sensitivity of the expected empirical risk to deviations from such a solution towards alternative probability measures. Finally, an interesting connection between sensitivity, generalization error, and lautum information is established
    Task Arithmetic in the Tangent Space: Improved Editing of Pre-Trained Models. (arXiv:2305.12827v3 [cs.LG] UPDATED)
    Task arithmetic has recently emerged as a cost-effective and scalable approach to edit pre-trained models directly in weight space: By adding the fine-tuned weights of different tasks, the model's performance can be improved on these tasks, while negating them leads to task forgetting. Yet, our understanding of the effectiveness of task arithmetic and its underlying principles remains limited. We present a comprehensive study of task arithmetic in vision-language models and show that weight disentanglement is the crucial factor that makes it effective. This property arises during pre-training and manifests when distinct directions in weight space govern separate, localized regions in function space associated with the tasks. Notably, we show that fine-tuning models in their tangent space by linearizing them amplifies weight disentanglement. This leads to substantial performance improvements across multiple task arithmetic benchmarks and diverse models. Building on these findings, we provide theoretical and empirical analyses of the neural tangent kernel (NTK) of these models and establish a compelling link between task arithmetic and the spatial localization of the NTK eigenfunctions. Overall, our work uncovers novel insights into the fundamental mechanisms of task arithmetic and offers a more reliable and effective approach to edit pre-trained models through the NTK linearization.
    Adversarial Reweighting Guided by Wasserstein Distance for Bias Mitigation. (arXiv:2311.12684v1 [cs.LG])
    The unequal representation of different groups in a sample population can lead to discrimination of minority groups when machine learning models make automated decisions. To address these issues, fairness-aware machine learning jointly optimizes two (or more) metrics aiming at predictive effectiveness and low unfairness. However, the inherent under-representation of minorities in the data makes the disparate treatment of subpopulations less noticeable and difficult to deal with during learning. In this paper, we propose a novel adversarial reweighting method to address such \emph{representation bias}. To balance the data distribution between the majority and the minority groups, our approach deemphasizes samples from the majority group. To minimize empirical risk, our method prefers samples from the majority group that are close to the minority group as evaluated by the Wasserstein distance. Our theoretical analysis shows the effectiveness of our adversarial reweighting approach. Experiments demonstrate that our approach mitigates bias without sacrificing classification accuracy, outperforming related state-of-the-art methods on image and tabular benchmark datasets.
    Relphormer: Relational Graph Transformer for Knowledge Graph Representations. (arXiv:2205.10852v6 [cs.CL] UPDATED)
    Transformers have achieved remarkable performance in widespread fields, including natural language processing, computer vision and graph mining. However, vanilla Transformer architectures have not yielded promising improvements in the Knowledge Graph (KG) representations, where the translational distance paradigm dominates this area. Note that vanilla Transformer architectures struggle to capture the intrinsically heterogeneous structural and semantic information of knowledge graphs. To this end, we propose a new variant of Transformer for knowledge graph representations dubbed Relphormer. Specifically, we introduce Triple2Seq which can dynamically sample contextualized sub-graph sequences as the input to alleviate the heterogeneity issue. We propose a novel structure-enhanced self-attention mechanism to encode the relational information and keep the semantic information within entities and relations. Moreover, we utilize masked knowledge modeling for general knowledge graph representation learning, which can be applied to various KG-based tasks including knowledge graph completion, question answering, and recommendation. Experimental results on six datasets show that Relphormer can obtain better performance compared with baselines. Code is available in https://github.com/zjunlp/Relphormer.
    Towards a more inductive world for drug repurposing approaches. (arXiv:2311.12670v1 [cs.LG])
    Drug-target interaction (DTI) prediction is a challenging, albeit essential task in drug repurposing. Learning on graph models have drawn special attention as they can significantly reduce drug repurposing costs and time commitment. However, many current approaches require high-demanding additional information besides DTIs that complicates their evaluation process and usability. Additionally, structural differences in the learning architecture of current models hinder their fair benchmarking. In this work, we first perform an in-depth evaluation of current DTI datasets and prediction models through a robust benchmarking process, and show that DTI prediction methods based on transductive models lack generalization and lead to inflated performance when evaluated as previously done in the literature, hence not being suited for drug repurposing approaches. We then propose a novel biologically-driven strategy for negative edge subsampling and show through in vitro validation that newly discovered interactions are indeed true. We envision this work as the underpinning for future fair benchmarking and robust model design. All generated resources and tools are publicly available as a python package.
    Tensor Train for Global Optimization Problems in Robotics. (arXiv:2206.05077v4 [cs.RO] UPDATED)
    The convergence of many numerical optimization techniques is highly dependent on the initial guess given to the solver. To address this issue, we propose a novel approach that utilizes tensor methods to initialize existing optimization solvers near global optima. Our method does not require access to a database of good solutions. We first transform the cost function, which depends on both task parameters and optimization variables, into a probability density function. Unlike existing approaches, the joint probability distribution of the task parameters and optimization variables is approximated using the Tensor Train model, which enables efficient conditioning and sampling. We treat the task parameters as random variables, and for a given task, we generate samples for decision variables from the conditional distribution to initialize the optimization solver. Our method can produce multiple solutions (when they exist) faster than existing methods. We first evaluate the approach on benchmark functions for numerical optimization that are hard to solve using gradient-based optimization solvers with a naive initialization. The results show that the proposed method can generate samples close to global optima and from multiple modes. We then demonstrate the generality and relevance of our framework to robotics by applying it to inverse kinematics with obstacles and motion planning problems with a 7-DoF manipulator.
    Careful Selection and Thoughtful Discarding: Graph Explicit Pooling Utilizing Discarded Nodes. (arXiv:2311.12644v1 [cs.LG])
    Graph pooling has been increasingly recognized as crucial for Graph Neural Networks (GNNs) to facilitate hierarchical graph representation learning. Existing graph pooling methods commonly consist of two stages: selecting top-ranked nodes and discarding the remaining to construct coarsened graph representations. However, this paper highlights two key issues with these methods: 1) The process of selecting nodes to discard frequently employs additional Graph Convolutional Networks or Multilayer Perceptrons, lacking a thorough evaluation of each node's impact on the final graph representation and subsequent prediction tasks. 2) Current graph pooling methods tend to directly discard the noise segment (dropped) of the graph without accounting for the latent information contained within these elements. To address the first issue, we introduce a novel Graph Explicit Pooling (GrePool) method, which selects nodes by explicitly leveraging the relationships between the nodes and final representation vectors crucial for classification. The second issue is addressed using an extended version of GrePool (i.e., GrePool+), which applies a uniform loss on the discarded nodes. This addition is designed to augment the training process and improve classification accuracy. Furthermore, we conduct comprehensive experiments across 12 widely used datasets to validate our proposed method's effectiveness, including the Open Graph Benchmark datasets. Our experimental results uniformly demonstrate that GrePool outperforms 14 baseline methods for most datasets. Likewise, implementing GrePool+ enhances GrePool's performance without incurring additional computational costs.
    Contrastive Left-Right Wearable Sensors (IMUs) Consistency Matching for HAR. (arXiv:2311.12674v1 [cs.LG])
    Machine learning algorithms are improving rapidly, but annotating training data remains a bottleneck for many applications. In this paper, we show how real data can be used for self-supervised learning without any transformations by taking advantage of the symmetry present in the activities. Our approach involves contrastive matching of two different sensors (left and right wrist or leg-worn IMUs) to make representations of co-occurring sensor data more similar and those of non-co-occurring sensor data more different. We test our approach on the Opportunity and MM-Fit datasets. In MM-Fit we show significant improvement over the baseline supervised and self-supervised method SimCLR, while for Opportunity there is significant improvement over the supervised baseline and slight improvement when compared to SimCLR. Moreover, our method improves supervised baselines even when using only a small amount of the data for training. Future work should explore under which conditions our method is beneficial for human activity recognition systems and other related applications.
    Soft Random Sampling: A Theoretical and Empirical Analysis. (arXiv:2311.12727v1 [cs.LG])
    Soft random sampling (SRS) is a simple yet effective approach for efficient training of large-scale deep neural networks when dealing with massive data. SRS selects a subset uniformly at random with replacement from the full data set in each epoch. In this paper, we conduct a theoretical and empirical analysis of SRS. First, we analyze its sampling dynamics including data coverage and occupancy. Next, we investigate its convergence with non-convex objective functions and give the convergence rate. Finally, we provide its generalization performance. We empirically evaluate SRS for image recognition on CIFAR10 and automatic speech recognition on Librispeech and an in-house payload dataset to demonstrate its effectiveness. Compared to existing coreset-based data selection methods, SRS offers a better accuracy-efficiency trade-off. Especially on real-world industrial scale data sets, it is shown to be a powerful training strategy with significant speedup and competitive performance with almost no additional computing cost.
    High-resolution Image-based Malware Classification using Multiple Instance Learning. (arXiv:2311.12760v1 [cs.CR])
    This paper proposes a novel method of classifying malware into families using high-resolution greyscale images and multiple instance learning to overcome adversarial binary enlargement. Current methods of visualisation-based malware classification largely rely on lossy transformations of inputs such as resizing to handle the large, variable-sized images. Through empirical analysis and experimentation, it is shown that these approaches cause crucial information loss that can be exploited. The proposed solution divides the images into patches and uses embedding-based multiple instance learning with a convolutional neural network and an attention aggregation function for classification. The implementation is evaluated on the Microsoft Malware Classification dataset and achieves accuracies of up to $96.6\%$ on adversarially enlarged samples compared to the baseline of $22.8\%$. The Python code is available online at https://github.com/timppeters/MIL-Malware-Images .
    To Compress or Not to Compress- Self-Supervised Learning and Information Theory: A Review. (arXiv:2304.09355v5 [cs.LG] UPDATED)
    Deep neural networks excel in supervised learning tasks but are constrained by the need for extensive labeled data. Self-supervised learning emerges as a promising alternative, allowing models to learn without explicit labels. Information theory, and notably the information bottleneck principle, has been pivotal in shaping deep neural networks. This principle focuses on optimizing the trade-off between compression and preserving relevant information, providing a foundation for efficient network design in supervised contexts. However, its precise role and adaptation in self-supervised learning remain unclear. In this work, we scrutinize various self-supervised learning approaches from an information-theoretic perspective, introducing a unified framework that encapsulates the \textit{self-supervised information-theoretic learning problem}. We weave together existing research into a cohesive narrative, delve into contemporary self-supervised methodologies, and spotlight potential research avenues and inherent challenges. Additionally, we discuss the empirical evaluation of information-theoretic quantities and their estimation methods. Overall, this paper furnishes an exhaustive review of the intersection of information theory, self-supervised learning, and deep neural networks.
    The Clock and the Pizza: Two Stories in Mechanistic Explanation of Neural Networks. (arXiv:2306.17844v2 [cs.LG] UPDATED)
    Do neural networks, trained on well-understood algorithmic tasks, reliably rediscover known algorithms for solving those tasks? Several recent studies, on tasks ranging from group arithmetic to in-context linear regression, have suggested that the answer is yes. Using modular addition as a prototypical problem, we show that algorithm discovery in neural networks is sometimes more complex. Small changes to model hyperparameters and initializations can induce the discovery of qualitatively different algorithms from a fixed training set, and even parallel implementations of multiple such algorithms. Some networks trained to perform modular addition implement a familiar Clock algorithm; others implement a previously undescribed, less intuitive, but comprehensible procedure which we term the Pizza algorithm, or a variety of even more complex procedures. Our results show that even simple learning problems can admit a surprising diversity of solutions, motivating the development of new tools for characterizing the behavior of neural networks across their algorithmic phase space.
    Unveiling the Pitfalls of Knowledge Editing for Large Language Models. (arXiv:2310.02129v2 [cs.CL] UPDATED)
    As the cost associated with fine-tuning Large Language Models (LLMs) continues to rise, recent research efforts have pivoted towards developing methodologies to edit implicit knowledge embedded within LLMs. Yet, there's still a dark cloud lingering overhead -- will knowledge editing trigger butterfly effect? since it is still unclear whether knowledge editing might introduce side effects that pose potential risks or not. This paper pioneers the investigation into the potential pitfalls associated with knowledge editing for LLMs. To achieve this, we introduce new benchmark datasets and propose innovative evaluation metrics. Our results underline two pivotal concerns: (1) Knowledge Conflict: Editing groups of facts that logically clash can magnify the inherent inconsistencies in LLMs-a facet neglected by previous methods. (2) Knowledge Distortion: Altering parameters with the aim of editing factual knowledge can irrevocably warp the innate knowledge structure of LLMs. Experimental results vividly demonstrate that knowledge editing might inadvertently cast a shadow of unintended consequences on LLMs, which warrant attention and efforts for future works. Code is available at https://github.com/zjunlp/PitfallsKnowledgeEditing.
    FedDRO: Federated Compositional Optimization for Distributionally Robust Learning. (arXiv:2311.12652v1 [cs.LG])
    Recently, compositional optimization (CO) has gained popularity because of its applications in distributionally robust optimization (DRO) and many other machine learning problems. Large-scale and distributed availability of data demands the development of efficient federated learning (FL) algorithms for solving CO problems. Developing FL algorithms for CO is particularly challenging because of the compositional nature of the objective. Moreover, current state-of-the-art methods to solve such problems rely on large batch gradients (depending on the solution accuracy) not feasible for most practical settings. To address these challenges, in this work, we propose efficient FedAvg-type algorithms for solving non-convex CO in the FL setting. We first establish that vanilla FedAvg is not suitable to solve distributed CO problems because of the data heterogeneity in the compositional objective at each client which leads to the amplification of bias in the local compositional gradient estimates. To this end, we propose a novel FL framework FedDRO that utilizes the DRO problem structure to design a communication strategy that allows FedAvg to control the bias in the estimation of the compositional gradient. A key novelty of our work is to develop solution accuracy-independent algorithms that do not require large batch gradients (and function evaluations) for solving federated CO problems. We establish $\mathcal{O}(\epsilon^{-2})$ sample and $\mathcal{O}(\epsilon^{-3/2})$ communication complexity in the FL setting while achieving linear speedup with the number of clients. We corroborate our theoretical findings with empirical studies on large-scale DRO problems.
    Managing ML-Based Application Non-Functional Behavior: A Multi-Model Approach. (arXiv:2311.12686v1 [cs.LG])
    Modern applications are increasingly driven by Machine Learning (ML) models whose non-deterministic behavior is affecting the entire application life cycle from design to operation. The pervasive adoption of ML is urgently calling for approaches that guarantee a stable non-functional behavior of ML-based applications over time and across model changes. To this aim, non-functional properties of ML models, such as privacy, confidentiality, fairness, and explainability, must be monitored, verified, and maintained. This need is even more pressing when modern applications operate in the edge-cloud continuum, increasing their complexity and dynamicity. Existing approaches mostly focus on i) implementing classifier selection solutions according to the functional behavior of ML models, ii) finding new algorithmic solutions to this need, such as continuous re-training. In this paper, we propose a multi-model approach built on dynamic classifier selection, where multiple ML models showing similar non-functional properties are made available to the application and one model is selected over time according to (dynamic and unpredictable) contextual changes. Our solution goes beyond the state of the art by providing an architectural and methodological approach that continuously guarantees a stable non-functional behavior of ML-based applications, is applicable to different ML models, and is driven by non-functional properties assessed on the models themselves. It consists of a two-step process working during application operation, where model assessment verifies non-functional properties of ML models trained and selected at development time, and model substitution guarantees a continuous and stable support of non-functional properties. We experimentally evaluate our solution in a real-world scenario focusing on non-functional property fairness.
    PAPAL: A Provable PArticle-based Primal-Dual ALgorithm for Mixed Nash Equilibrium. (arXiv:2303.00970v2 [math.OC] UPDATED)
    We consider the non-convex non-concave objective function in two-player zero-sum continuous games. The existence of pure Nash equilibrium requires stringent conditions, posing a major challenge for this problem. To circumvent this difficulty, we examine the problem of identifying a mixed Nash equilibrium, where strategies are randomized and characterized by probability distributions over continuous domains.To this end, we propose PArticle-based Primal-dual ALgorithm (PAPAL) tailored for a weakly entropy-regularized min-max optimization over probability distributions. This algorithm employs the stochastic movements of particles to represent the updates of random strategies for the $\epsilon$-mixed Nash equilibrium. We offer a comprehensive convergence analysis of the proposed algorithm, demonstrating its effectiveness. In contrast to prior research that attempted to update particle importance without movements, PAPAL is the first implementable particle-based algorithm accompanied by non-asymptotic quantitative convergence results, running time, and sample complexity guarantees. Our framework contributes novel insights into the particle-based algorithms for continuous min-max optimization in the general non-convex non-concave setting.
    PyTorch Geometric Signed Directed: A Software Package on Graph Neural Networks for Signed and Directed Graphs. (arXiv:2202.10793v5 [cs.LG] UPDATED)
    Networks are ubiquitous in many real-world applications (e.g., social networks encoding trust/distrust relationships, correlation networks arising from time series data). While many networks are signed or directed, or both, there is a lack of unified software packages on graph neural networks (GNNs) specially designed for signed and directed networks. In this paper, we present PyTorch Geometric Signed Directed (PyGSD), a software package which fills this gap. Along the way, we evaluate the implemented methods with experiments with a view to providing insights into which method to choose for a given task. The deep learning framework consists of easy-to-use GNN models, synthetic and real-world data, as well as task-specific evaluation metrics and loss functions for signed and directed networks. As an extension library for PyG, our proposed software is maintained with open-source releases, detailed documentation, continuous integration, unit tests and code coverage checks. The GitHub repository of the library is https://github.com/SherylHYX/pytorch_geometric_signed_directed.
    Low Dimensional Invariant Embeddings for Universal Geometric Learning. (arXiv:2205.02956v3 [cs.LG] UPDATED)
    This paper studies separating invariants: mappings on $D$ dimensional domains which are invariant to an appropriate group action, and which separate orbits. The motivation for this study comes from the usefulness of separating invariants in proving universality of equivariant neural network architectures. We observe that in several cases the cardinality of separating invariants proposed in the machine learning literature is much larger than the dimension $D$. As a result, the theoretical universal constructions based on these separating invariants is unrealistically large. Our goal in this paper is to resolve this issue. We show that when a continuous family of semi-algebraic separating invariants is available, separation can be obtained by randomly selecting $2D+1 $ of these invariants. We apply this methodology to obtain an efficient scheme for computing separating invariants for several classical group actions which have been studied in the invariant learning literature. Examples include matrix multiplication actions on point clouds by permutations, rotations, and various other linear groups. Often the requirement of invariant separation is relaxed and only generic separation is required. In this case, we show that only $D+1$ invariants are required. More importantly, generic invariants are often significantly easier to compute, as we illustrate by discussing generic and full separation for weighted graphs. Finally we outline an approach for proving that separating invariants can be constructed also when the random parameters have finite precision.
    Learn the Time to Learn: Replay Scheduling in Continual Learning. (arXiv:2209.08660v2 [cs.LG] UPDATED)
    Replay methods are known to be successful at mitigating catastrophic forgetting in continual learning scenarios despite having limited access to historical data. However, storing historical data is cheap in many real-world settings, yet replaying all historical data is often prohibited due to processing time constraints. In such settings, we propose that continual learning systems should learn the time to learn and schedule which tasks to replay at different time steps. We first demonstrate the benefits of our proposal by using Monte Carlo tree search to find a proper replay schedule, and show that the found replay schedules can outperform fixed scheduling policies when combined with various replay methods in different continual learning settings. Additionally, we propose a framework for learning replay scheduling policies with reinforcement learning. We show that the learned policies can generalize better in new continual learning scenarios compared to equally replaying all seen tasks, without added computational cost. Our study reveals the importance of learning the time to learn in continual learning, which brings current research closer to real-world needs.
    Langevin dynamics based algorithm e-TH$\varepsilon$O POULA for stochastic optimization problems with discontinuous stochastic gradient. (arXiv:2210.13193v2 [math.OC] UPDATED)
    We introduce a new Langevin dynamics based algorithm, called e-TH$\varepsilon$O POULA, to solve optimization problems with discontinuous stochastic gradients which naturally appear in real-world applications such as quantile estimation, vector quantization, CVaR minimization, and regularized optimization problems involving ReLU neural networks. We demonstrate both theoretically and numerically the applicability of the e-TH$\varepsilon$O POULA algorithm. More precisely, under the conditions that the stochastic gradient is locally Lipschitz in average and satisfies a certain convexity at infinity condition, we establish non-asymptotic error bounds for e-TH$\varepsilon$O POULA in Wasserstein distances and provide a non-asymptotic estimate for the expected excess risk, which can be controlled to be arbitrarily small. Three key applications in finance and insurance are provided, namely, multi-period portfolio optimization, transfer learning in multi-period portfolio optimization, and insurance claim prediction, which involve neural networks with (Leaky)-ReLU activation functions. Numerical experiments conducted using real-world datasets illustrate the superior empirical performance of e-TH$\varepsilon$O POULA compared to SGLD, TUSLA, ADAM, and AMSGrad in terms of model accuracy.
    Multifidelity Deep Operator Networks For Data-Driven and Physics-Informed Problems. (arXiv:2204.09157v2 [math.NA] UPDATED)
    Operator learning for complex nonlinear systems is increasingly common in modeling multi-physics and multi-scale systems. However, training such high-dimensional operators requires a large amount of expensive, high-fidelity data, either from experiments or simulations. In this work, we present a composite Deep Operator Network (DeepONet) for learning using two datasets with different levels of fidelity to accurately learn complex operators when sufficient high-fidelity data is not available. Additionally, we demonstrate that the presence of low-fidelity data can improve the predictions of physics-informed learning with DeepONets. We demonstrate the new multi-fidelity training in diverse examples, including modeling of the ice-sheet dynamics of the Humboldt glacier, Greenland, using two different fidelity models and also using the same physical model at two different resolutions.
    Unsupervised discovery of Interpretable Visual Concepts. (arXiv:2309.00018v2 [cs.CV] UPDATED)
    Providing interpretability of deep-learning models to non-experts, while fundamental for a responsible real-world usage, is challenging. Attribution maps from xAI techniques, such as Integrated Gradients, are a typical example of a visualization technique containing a high level of information, but with difficult interpretation. In this paper, we propose two methods, Maximum Activation Groups Extraction (MAGE) and Multiscale Interpretable Visualization (Ms-IV), to explain the model's decision, enhancing global interpretability. MAGE finds, for a given CNN, combinations of features which, globally, form a semantic meaning, that we call concepts. We group these similar feature patterns by clustering in ``concepts'', that we visualize through Ms-IV. This last method is inspired by Occlusion and Sensitivity analysis (incorporating causality), and uses a novel metric, called Class-aware Order Correlation (CaOC), to globally evaluate the most important image regions according to the model's decision space. We compare our approach to xAI methods such as LIME and Integrated Gradients. Experimental results evince the Ms-IV higher localization and faithfulness values. Finally, qualitative evaluation of combined MAGE and Ms-IV demonstrates humans' ability to agree, based on the visualization, with the decision of clusters' concepts; and, to detect, among a given set of networks, the existence of bias.
    Beyond Labeling Oracles: What does it mean to steal ML models?. (arXiv:2310.01959v2 [cs.LG] UPDATED)
    Model extraction attacks are designed to steal trained models with only query access, as is often provided through APIs that ML-as-a-Service providers offer. ML models are expensive to train, in part because data is hard to obtain, and a primary incentive for model extraction is to acquire a model while incurring less cost than training from scratch. Literature on model extraction commonly claims or presumes that the attacker is able to save on both data acquisition and labeling costs. We show that the attacker often does not. This is because current attacks implicitly rely on the adversary being able to sample from the victim model's data distribution. We thoroughly evaluate factors influencing the success of model extraction. We discover that prior knowledge of the attacker, i.e. access to in-distribution data, dominates other factors like the attack policy the adversary follows to choose which queries to make to the victim model API. Thus, an adversary looking to develop an equally capable model with a fixed budget has little practical incentive to perform model extraction, since for the attack to work they need to collect in-distribution data, saving only on the cost of labeling. With low labeling costs in the current market, the usefulness of such attacks is questionable. Ultimately, we demonstrate that the effect of prior knowledge needs to be explicitly decoupled from the attack policy. To this end, we propose a benchmark to evaluate attack policy directly.
    Online Regularization towards Always-Valid High-Dimensional Dynamic Pricing. (arXiv:2007.02470v3 [stat.ML] UPDATED)
    Devising dynamic pricing policy with always valid online statistical learning procedure is an important and as yet unresolved problem. Most existing dynamic pricing policy, which focus on the faithfulness of adopted customer choice models, exhibit a limited capability for adapting the online uncertainty of learned statistical model during pricing process. In this paper, we propose a novel approach for designing dynamic pricing policy based regularized online statistical learning with theoretical guarantees. The new approach overcomes the challenge of continuous monitoring of online Lasso procedure and possesses several appealing properties. In particular, we make the decisive observation that the always-validity of pricing decisions builds and thrives on the online regularization scheme. Our proposed online regularization scheme equips the proposed optimistic online regularized maximum likelihood pricing (OORMLP) pricing policy with three major advantages: encode market noise knowledge into pricing process optimism; empower online statistical learning with always-validity over all decision points; envelop prediction error process with time-uniform non-asymptotic oracle inequalities. This type of non-asymptotic inference results allows us to design more sample-efficient and robust dynamic pricing algorithms in practice. In theory, the proposed OORMLP algorithm exploits the sparsity structure of high-dimensional models and secures a logarithmic regret in a decision horizon. These theoretical advances are made possible by proposing an optimistic online Lasso procedure that resolves dynamic pricing problems at the process level, based on a novel use of non-asymptotic martingale concentration. In experiments, we evaluate OORMLP in different synthetic and real pricing problem settings, and demonstrate that OORMLP advances the state-of-the-art methods.
    How Can We Train Deep Learning Models Across Clouds and Continents? An Experimental Study. (arXiv:2306.03163v2 [cs.LG] UPDATED)
    Training deep learning models in the cloud or on dedicated hardware is expensive. A more cost-efficient option are hyperscale clouds offering spot instances, a cheap but ephemeral alternative to on-demand resources. As spot instance availability can change depending on the time of day, continent, and cloud provider, it could be more cost-efficient to distribute resources over the world. Still, it has not been investigated whether geo-distributed, data-parallel spot deep learning training could be a more cost-efficient alternative to centralized training. This paper aims to answer the question: Can deep learning models be cost-efficiently trained on a global market of spot VMs spanning different data centers and cloud providers? To provide guidance, we extensively evaluate the cost and throughput implications of training in different zones, continents, and clouds for representative CV, NLP and ASR models. To expand the current training options further, we compare the scalability potential for hybrid-cloud scenarios by adding cloud resources to on-premise hardware to improve training throughput. Finally, we show how leveraging spot instance pricing enables a new cost-efficient way to train models with multiple cheap VMs, trumping both more centralized and powerful hardware and even on-demand cloud offerings at competitive prices.
    Stable Adam Optimization for 16-bit Neural Networks Training. (arXiv:2307.16189v6 [cs.LG] UPDATED)
    In this research, we address critical concerns related to the numerical instability observed in 16-bit computations of machine learning models. Such instability, particularly when employing popular optimization algorithms like Adam, often leads to unstable training of deep neural networks. This not only disrupts the learning process but also poses significant challenges in deploying dependable models in real-world applications. Our investigation identifies the epsilon hyperparameter as the primary source of this instability. A nuanced exploration reveals that subtle adjustments to epsilon within 16-bit computations can enhance the numerical stability of Adam, enabling more stable training of 16-bit neural networks. We propose a novel, dependable approach that leverages updates from the Adam optimizer to bolster the stability of the learning process. Our contributions provide deeper insights into optimization challenges in low-precision computations and offer solutions to ensure the stability of deep neural network training, paving the way for their dependable use in various applications.
    Towards Data-Algorithm Dependent Generalization: a Case Study on Overparameterized Linear Regression. (arXiv:2202.06054v4 [cs.LG] UPDATED)
    One of the major open problems in machine learning is to characterize generalization in the overparameterized regime, where most traditional generalization bounds become inconsistent even for overparameterized linear regression. In many scenarios, this failure can be attributed to obscuring the crucial interplay between the training algorithm and the underlying data distribution. This paper demonstrate that the generalization behavior of overparameterized model should be analyzed in a both data-relevant and algorithm-relevant manner. To make a formal characterization, We introduce a notion called data-algorithm compatibility, which considers the generalization behavior of the entire data-dependent training trajectory, instead of traditional last-iterate analysis. We validate our claim by studying the setting of solving overparameterized linear regression with gradient descent. Specifically, we perform a data-dependent trajectory analysis and derive a sufficient condition for compatibility in such a setting. Our theoretical results demonstrate that if we take early stopping iterates into consideration, generalization can hold with significantly weaker restrictions on the problem instance than the previous last-iterate analysis.
    YolOOD: Utilizing Object Detection Concepts for Multi-Label Out-of-Distribution Detection. (arXiv:2212.02081v2 [cs.CV] UPDATED)
    Out-of-distribution (OOD) detection has attracted a large amount of attention from the machine learning research community in recent years due to its importance in deployed systems. Most of the previous studies focused on the detection of OOD samples in the multi-class classification task. However, OOD detection in the multi-label classification task, a more common real-world use case, remains an underexplored domain. In this research, we propose YolOOD - a method that utilizes concepts from the object detection domain to perform OOD detection in the multi-label classification task. Object detection models have an inherent ability to distinguish between objects of interest (in-distribution) and irrelevant objects (e.g., OOD objects) in images that contain multiple objects belonging to different class categories. These abilities allow us to convert a regular object detection model into an image classifier with inherent OOD detection capabilities with just minor changes. We compare our approach to state-of-the-art OOD detection methods and demonstrate YolOOD's ability to outperform these methods on a comprehensive suite of in-distribution and OOD benchmark datasets.
    Long-term Causal Effects Estimation via Latent Surrogates Representation Learning. (arXiv:2208.04589v3 [cs.LG] UPDATED)
    Estimating long-term causal effects based on short-term surrogates is a significant but challenging problem in many real-world applications, e.g., marketing and medicine. Despite its success in certain domains, most existing methods estimate causal effects in an idealistic and simplistic way - ignoring the causal structure among short-term outcomes and treating all of them as surrogates. However, such methods cannot be well applied to real-world scenarios, in which the partially observed surrogates are mixed with their proxies among short-term outcomes. To this end, we develop our flexible method, Laser, to estimate long-term causal effects in the more realistic situation that the surrogates are observed or have observed proxies.Given the indistinguishability between the surrogates and proxies, we utilize identifiable variational auto-encoder (iVAE) to recover the whole valid surrogates on all the surrogates candidates without the need of distinguishing the observed surrogates or the proxies of latent surrogates. With the help of the recovered surrogates, we further devise an unbiased estimation of long-term causal effects. Extensive experimental results on the real-world and semi-synthetic datasets demonstrate the effectiveness of our proposed method.
    Attending to Graph Transformers. (arXiv:2302.04181v2 [cs.LG] UPDATED)
    Recently, transformer architectures for graphs emerged as an alternative to established techniques for machine learning with graphs, such as (message-passing) graph neural networks. So far, they have shown promising empirical results, e.g., on molecular prediction datasets, often attributed to their ability to circumvent graph neural networks' shortcomings, such as over-smoothing and over-squashing. Here, we derive a taxonomy of graph transformer architectures, bringing some order to this emerging field. We overview their theoretical properties, survey structural and positional encodings, and discuss extensions for important graph classes, e.g., 3D molecular graphs. Empirically, we probe how well graph transformers can recover various graph properties, how well they can deal with heterophilic graphs, and to what extent they prevent over-squashing. Further, we outline open challenges and research direction to stimulate future work. Our code is available at https://github.com/luis-mueller/probing-graph-transformers.
    How does Contrastive Learning Organize Images?. (arXiv:2305.10229v2 [cs.LG] UPDATED)
    Contrastive learning, a dominant self-supervised technique, emphasizes similarity in representations between augmentations of the same input and dissimilarity for different ones. Although low contrastive loss often correlates with high classification accuracy, recent studies challenge this direct relationship, spotlighting the crucial role of inductive biases. We delve into these biases from a clustering viewpoint, noting that contrastive learning creates locally dense clusters, contrasting the globally dense clusters from supervised learning. To capture this discrepancy, we introduce the "RLD (Relative Local Density)" metric. While this cluster property can hinder linear classification accuracy, leveraging a Graph Convolutional Network (GCN) based classifier mitigates this, boosting accuracy and reducing parameter requirements. The code is available \href{https://github.com/xsgxlz/How-does-Contrastive-Learning-Organize-Images/tree/main}{here}.
    Attacking Motion Planners Using Adversarial Perception Errors. (arXiv:2311.12722v1 [cs.RO])
    Autonomous driving (AD) systems are often built and tested in a modular fashion, where the performance of different modules is measured using task-specific metrics. These metrics should be chosen so as to capture the downstream impact of each module and the performance of the system as a whole. For example, high perception quality should enable prediction and planning to be performed safely. Even though this is true in general, we show here that it is possible to construct planner inputs that score very highly on various perception quality metrics but still lead to planning failures. In an analogy to adversarial attacks on image classifiers, we call such inputs \textbf{adversarial perception errors} and show they can be systematically constructed using a simple boundary-attack algorithm. We demonstrate the effectiveness of this algorithm by finding attacks for two different black-box planners in several urban and highway driving scenarios using the CARLA simulator. Finally, we analyse the properties of these attacks and show that they are isolated in the input space of the planner, and discuss their implications for AD system deployment and testing.
    A Randomized Approach for Tight Privacy Accounting. (arXiv:2304.07927v2 [cs.CR] UPDATED)
    Bounding privacy leakage over compositions, i.e., privacy accounting, is a key challenge in differential privacy (DP). The privacy parameter ($\eps$ or $\delta$) is often easy to estimate but hard to bound. In this paper, we propose a new differential privacy paradigm called estimate-verify-release (EVR), which addresses the challenges of providing a strict upper bound for privacy parameter in DP compositions by converting an estimate of privacy parameter into a formal guarantee. The EVR paradigm first estimates the privacy parameter of a mechanism, then verifies whether it meets this guarantee, and finally releases the query output based on the verification result. The core component of the EVR is privacy verification. We develop a randomized privacy verifier using Monte Carlo (MC) technique. Furthermore, we propose an MC-based DP accountant that outperforms existing DP accounting techniques in terms of accuracy and efficiency. Our empirical evaluation shows the newly proposed EVR paradigm improves the utility-privacy tradeoff for privacy-preserving machine learning.
    Composite Score for Anomaly Detection in Imbalanced Real-World Industrial Dataset. (arXiv:2211.15513v2 [cs.CV] UPDATED)
    In recent years, the industrial sector has evolved towards its fourth revolution. The quality control domain is particularly interested in advanced machine learning for computer vision anomaly detection. Nevertheless, several challenges have to be faced, including imbalanced datasets, the image complexity, and the zero-false-negative (ZFN) constraint to guarantee the high-quality requirement. This paper illustrates a use case for an industrial partner, where Printed Circuit Board Assembly (PCBA) images are first reconstructed with a Vector Quantized Generative Adversarial Network (VQGAN) trained on normal products. Then, several multi-level metrics are extracted on a few normal and abnormal images, highlighting anomalies through reconstruction differences. Finally, a classifer is trained to build a composite anomaly score thanks to the metrics extracted. This three-step approach is performed on the public MVTec-AD datasets and on the partner PCBA dataset, where it achieves a regular accuracy of 95.69% and 87.93% under the ZFN constraint.
    Compressive Fourier collocation methods for high-dimensional diffusion equations with periodic boundary conditions. (arXiv:2206.01255v5 [math.NA] UPDATED)
    High-dimensional Partial Differential Equations (PDEs) are a popular mathematical modelling tool, with applications ranging from finance to computational chemistry. However, standard numerical techniques for solving these PDEs are typically affected by the curse of dimensionality. In this work, we tackle this challenge while focusing on stationary diffusion equations defined over a high-dimensional domain with periodic boundary conditions. Inspired by recent progress in sparse function approximation in high dimensions, we propose a new method called compressive Fourier collocation. Combining ideas from compressive sensing and spectral collocation, our method replaces the use of structured collocation grids with Monte Carlo sampling and employs sparse recovery techniques, such as orthogonal matching pursuit and $\ell^1$ minimization, to approximate the Fourier coefficients of the PDE solution. We conduct a rigorous theoretical analysis showing that the approximation error of the proposed method is comparable with the best $s$-term approximation (with respect to the Fourier basis) to the solution. Using the recently introduced framework of random sampling in bounded Riesz systems, our analysis shows that the compressive Fourier collocation method mitigates the curse of dimensionality with respect to the number of collocation points under sufficient conditions on the regularity of the diffusion coefficient. We also present numerical experiments that illustrate the accuracy and stability of the method for the approximation of sparse and compressible solutions.
    Quantifying Impairment and Disease Severity Using AI Models Trained on Healthy Subjects. (arXiv:2311.12781v1 [cs.LG])
    Automatic assessment of impairment and disease severity is a key challenge in data-driven medicine. We propose a novel framework to address this challenge, which leverages AI models trained exclusively on healthy individuals. The COnfidence-Based chaRacterization of Anomalies (COBRA) score exploits the decrease in confidence of these models when presented with impaired or diseased patients to quantify their deviation from the healthy population. We applied the COBRA score to address a key limitation of current clinical evaluation of upper-body impairment in stroke patients. The gold-standard Fugl-Meyer Assessment (FMA) requires in-person administration by a trained assessor for 30-45 minutes, which restricts monitoring frequency and precludes physicians from adapting rehabilitation protocols to the progress of each patient. The COBRA score, computed automatically in under one minute, is shown to be strongly correlated with the FMA on an independent test cohort for two different data modalities: wearable sensors ($\rho = 0.845$, 95% CI [0.743,0.908]) and video ($\rho = 0.746$, 95% C.I [0.594, 0.847]). To demonstrate the generalizability of the approach to other conditions, the COBRA score was also applied to quantify severity of knee osteoarthritis from magnetic-resonance imaging scans, again achieving significant correlation with an independent clinical assessment ($\rho = 0.644$, 95% C.I [0.585,0.696]).
    Optimality in Mean Estimation: Beyond Worst-Case, Beyond Sub-Gaussian, and Beyond $1+\alpha$ Moments. (arXiv:2311.12784v1 [math.ST])
    There is growing interest in improving our algorithmic understanding of fundamental statistical problems such as mean estimation, driven by the goal of understanding the limits of what we can extract from valuable data. The state of the art results for mean estimation in $\mathbb{R}$ are 1) the optimal sub-Gaussian mean estimator by [LV22], with the tight sub-Gaussian constant for all distributions with finite but unknown variance, and 2) the analysis of the median-of-means algorithm by [BCL13] and a lower bound by [DLLO16], characterizing the big-O optimal errors for distributions for which only a $1+\alpha$ moment exists for $\alpha \in (0,1)$. Both results, however, are optimal only in the worst case. We initiate the fine-grained study of the mean estimation problem: Can algorithms leverage useful features of the input distribution to beat the sub-Gaussian rate, without explicit knowledge of such features? We resolve this question with an unexpectedly nuanced answer: "Yes in limited regimes, but in general no". For any distribution $p$ with a finite mean, we construct a distribution $q$ whose mean is well-separated from $p$'s, yet $p$ and $q$ are not distinguishable with high probability, and $q$ further preserves $p$'s moments up to constants. The main consequence is that no reasonable estimator can asymptotically achieve better than the sub-Gaussian error rate for any distribution, matching the worst-case result of [LV22]. More generally, we introduce a new definitional framework to analyze the fine-grained optimality of algorithms, which we call "neighborhood optimality", interpolating between the unattainably strong "instance optimality" and the trivially weak "admissibility" definitions. Applying the new framework, we show that median-of-means is neighborhood optimal, up to constant factors. It is open to find a neighborhood-optimal estimator without constant factor slackness.
    minimax: Efficient Baselines for Autocurricula in JAX. (arXiv:2311.12716v1 [cs.LG])
    Unsupervised environment design (UED) is a form of automatic curriculum learning for training robust decision-making agents to zero-shot transfer into unseen environments. Such autocurricula have received much interest from the RL community. However, UED experiments, based on CPU rollouts and GPU model updates, have often required several weeks of training. This compute requirement is a major obstacle to rapid innovation for the field. This work introduces the minimax library for UED training on accelerated hardware. Using JAX to implement fully-tensorized environments and autocurriculum algorithms, minimax allows the entire training loop to be compiled for hardware acceleration. To provide a petri dish for rapid experimentation, minimax includes a tensorized grid-world based on MiniGrid, in addition to reusable abstractions for conducting autocurricula in procedurally-generated environments. With these components, minimax provides strong UED baselines, including new parallelized variants, which achieve over 120$\times$ speedups in wall time compared to previous implementations when training with equal batch sizes. The minimax library is available under the Apache 2.0 license at https://github.com/facebookresearch/minimax.
    Explaining Deep Learning Models for Age-related Gait Classification based on time series acceleration. (arXiv:2311.12089v1 [cs.LG])
    Gait analysis holds significant importance in monitoring daily health, particularly among older adults. Advancements in sensor technology enable the capture of movement in real-life environments and generate big data. Machine learning, notably deep learning (DL), shows promise to use these big data in gait analysis. However, the inherent black-box nature of these models poses challenges for their clinical application. This study aims to enhance transparency in DL-based gait classification for aged-related gait patterns using Explainable Artificial Intelligence, such as SHAP. A total of 244 subjects, comprising 129 adults and 115 older adults (age>65), were included. They performed a 3-minute walking task while accelerometers were affixed to the lumbar segment L3. DL models, convolutional neural network (CNN) and gated recurrent unit (GRU), were trained using 1-stride and 8-stride accelerations, respectively, to classify adult and older adult groups. SHAP was employed to explain the models' predictions. CNN achieved a satisfactory performance with an accuracy of 81.4% and an AUC of 0.89, and GRU demonstrated promising results with an accuracy of 84.5% and an AUC of 0.94. SHAP analysis revealed that both CNN and GRU assigned higher SHAP values to the data from vertical and walking directions, particularly emphasizing data around heel contact, spanning from the terminal swing to loading response phases. Furthermore, SHAP values indicated that GRU did not treat every stride equally. CNN accurately distinguished between adults and older adults based on the characteristics of a single stride's data. GRU achieved accurate classification by considering the relationships and subtle differences between strides. In both models, data around heel contact emerged as most critical, suggesting differences in acceleration and deceleration patterns during walking between different age groups.
    An efficient likelihood-free Bayesian inference method based on sequential neural posterior estimation. (arXiv:2311.12530v1 [stat.ML])
    Sequential neural posterior estimation (SNPE) techniques have been recently proposed for dealing with simulation-based models with intractable likelihoods. Unlike approximate Bayesian computation, SNPE techniques learn the posterior from sequential simulation using neural network-based conditional density estimators. This paper reclaims SNPE-B proposed by Lueckmann et al. (2017), which suffers from inefficiency and slow inference due to inefficient utilization of simulated data and high variance of parameter updates. To address these issues, we firstly introduce a concentrated loss function based on an adaptive calibration kernel that reweights the simulated data appropriately to improve the data efficiency. Moreover, we provide a theoretical analysis of the variance of associated Monte Carlo estimators. Based on this analysis, we then propose several variance reduction techniques to further accelerate the process of learning. Numerical experiments demonstrate that our method outperforms the original method together with other existing competitors on certain tasks.
    The limitation of neural nets for approximation and optimization. (arXiv:2311.12253v1 [cs.LG])
    We are interested in assessing the use of neural networks as surrogate models to approximate and minimize objective functions in optimization problems. While neural networks are widely used for machine learning tasks such as classification and regression, their application in solving optimization problems has been limited. Our study begins by determining the best activation function for approximating the objective functions of popular nonlinear optimization test problems, and the evidence provided shows that~SiLU has the best performance. We then analyze the accuracy of function value, gradient, and Hessian approximations for such objective functions obtained through interpolation/regression models and neural networks. When compared to interpolation/regression models, neural networks can deliver competitive zero- and first-order approximations (at a high training cost) but underperform on second-order approximation. However, it is shown that combining a neural net activation function with the natural basis for quadratic interpolation/regression can waive the necessity of including cross terms in the natural basis, leading to models with fewer parameters to determine. Lastly, we provide evidence that the performance of a state-of-the-art derivative-free optimization algorithm can hardly be improved when the gradient of an objective function is approximated using any of the surrogate models considered, including neural networks.
    Interpretation of the Transformer and Improvement of the Extractor. (arXiv:2311.12678v1 [cs.LG])
    It has been over six years since the Transformer architecture was put forward. Surprisingly, the vanilla Transformer architecture is still widely used today. One reason is that the lack of deep understanding and comprehensive interpretation of the Transformer architecture makes it more challenging to improve the Transformer architecture. In this paper, we first interpret the Transformer architecture comprehensively in plain words based on our understanding and experiences. The interpretations are further proved and verified. These interpretations also cover the Extractor, a family of drop-in replacements for the multi-head self-attention in the Transformer architecture. Then, we propose an improvement on a type of the Extractor that outperforms the self-attention, without introducing additional trainable parameters. Experimental results demonstrate that the improved Extractor performs even better, showing a way to improve the Transformer architecture.
    Personalized Federated Learning with Multi-branch Architecture. (arXiv:2211.07931v3 [cs.LG] UPDATED)
    Federated learning (FL) is a decentralized machine learning technique that enables multiple clients to collaboratively train models without requiring clients to reveal their raw data to each other. Although traditional FL trains a single global model with average performance among clients, statistical data heterogeneity across clients has resulted in the development of personalized FL (PFL), which trains personalized models with good performance on each client's data. A key challenge with PFL is how to facilitate clients with similar data to collaborate more in a situation where each client has data from complex distribution and cannot determine one another's distribution. In this paper, we propose a new PFL method (pFedMB) using multi-branch architecture, which achieves personalization by splitting each layer of a neural network into multiple branches and assigning client-specific weights to each branch. We also design an aggregation method to improve the communication efficiency and the model performance, with which each branch is globally updated with weighted averaging by client-specific weights assigned to the branch. pFedMB is simple but effective in facilitating each client to share knowledge with similar clients by adjusting the weights assigned to each branch. We experimentally show that pFedMB performs better than the state-of-the-art PFL methods using the CIFAR10 and CIFAR100 datasets.
    Modeling Political Orientation of Social Media Posts: An Extended Analysis. (arXiv:2311.12323v1 [cs.SI])
    Developing machine learning models to characterize political polarization on online social media presents significant challenges. These challenges mainly stem from various factors such as the lack of annotated data, presence of noise in social media datasets, and the sheer volume of data. The common research practice typically examines the biased structure of online user communities for a given topic or qualitatively measuring the impacts of polarized topics on social media. However, there is limited work focusing on analyzing polarization at the ground-level, specifically in the social media posts themselves. Such existing analysis heavily relies on annotated data, which often requires laborious human labeling, offers labels only to specific problems, and lacks the ability to determine the near-future bias state of a social media conversations. Understanding the degree of political orientation conveyed in social media posts is crucial for quantifying the bias of online user communities and investigating the spread of polarized content. In this work, we first introduce two heuristic methods that leverage on news media bias and post content to label social media posts. Next, we compare the efficacy and quality of heuristically labeled dataset with a randomly sampled human-annotated dataset. Additionally, we demonstrate that current machine learning models can exhibit improved performance in predicting political orientation of social media posts, employing both traditional supervised learning and few-shot learning setups. We conduct experiments using the proposed heuristic methods and machine learning approaches to predict the political orientation of posts collected from two social media forums with diverse political ideologies: Gab and Twitter.
    Koopman Learning with Episodic Memory. (arXiv:2311.12615v1 [math.DS])
    Koopman operator theory, a data-driven dynamical systems framework, has found significant success in learning models from complex, real-world data sets, enabling state-of-the-art prediction and control. The greater interpretability and lower computational costs of these models, compared to traditional machine learning methodologies, make Koopman learning an especially appealing approach. Despite this, little work has been performed on endowing Koopman learning with the ability to learn from its own mistakes. To address this, we equip Koopman methods - developed for predicting non-stationary time-series - with an episodic memory mechanism, enabling global recall of (or attention to) periods in time where similar dynamics previously occurred. We find that a basic implementation of Koopman learning with episodic memory leads to significant improvements in prediction on synthetic and real-world data. Our framework has considerable potential for expansion, allowing for future advances, and opens exciting new directions for Koopman learning.
    Tailoring Adversarial Attacks on Deep Neural Networks for Targeted Class Manipulation Using DeepFool Algorithm. (arXiv:2310.13019v3 [cs.CV] UPDATED)
    Deep neural networks (DNNs) have significantly advanced various domains, but their vulnerability to adversarial attacks poses serious concerns. Understanding these vulnerabilities and developing effective defense mechanisms is crucial. DeepFool, an algorithm proposed by Moosavi-Dezfooli et al. (2016), finds minimal perturbations to misclassify input images. However, DeepFool lacks a targeted approach, making it less effective in specific attack scenarios. Also, in previous related works, researchers primarily focus on success, not considering how much an image is getting distorted; the integrity of the image quality, and the confidence level to misclassifying. So, in this paper, we propose Enhanced Targeted DeepFool, an augmented version of DeepFool that allows targeting specific classes for misclassification and also introduce a minimum confidence score requirement hyperparameter to enhance flexibility. Our experiments demonstrate the effectiveness and efficiency of the proposed method across different deep neural network architectures while preserving image integrity as much and perturbation rate as less as possible. By using our approach, the behavior of models can be manipulated arbitrarily using the perturbed images, as we can specify both the target class and the associated confidence score, unlike other DeepFool-derivative works, such as Targeted DeepFool by Gajjar et al. (2022). Results show that one of the deep convolutional neural network architectures, AlexNet, and one of the state-of-the-art model Vision Transformer exhibit high robustness to getting fooled. This approach can have larger implication, as our tuning of confidence level can expose the robustness of image recognition models. Our code will be made public upon acceptance of the paper.
    Classifier Calibration with ROC-Regularized Isotonic Regression. (arXiv:2311.12436v1 [cs.LG])
    Calibration of machine learning classifiers is necessary to obtain reliable and interpretable predictions, bridging the gap between model confidence and actual probabilities. One prominent technique, isotonic regression (IR), aims at calibrating binary classifiers by minimizing the cross entropy on a calibration set via monotone transformations. IR acts as an adaptive binning procedure, which allows achieving a calibration error of zero, but leaves open the issue of the effect on performance. In this paper, we first prove that IR preserves the convex hull of the ROC curve -- an essential performance metric for binary classifiers. This ensures that a classifier is calibrated while controlling for overfitting of the calibration set. We then present a novel generalization of isotonic regression to accommodate classifiers with K classes. Our method constructs a multidimensional adaptive binning scheme on the probability simplex, again achieving a multi-class calibration error equal to zero. We regularize this algorithm by imposing a form of monotony that preserves the K-dimensional ROC surface of the classifier. We show empirically that this general monotony criterion is effective in striking a balance between reducing cross entropy loss and avoiding overfitting of the calibration set.
    Automated Detection of hidden Damages and Impurities in Aluminum Die Casting Materials and Fibre-Metal Laminates using Low-quality X-ray Radiography, Synthetic X-ray Data Augmentation by Simulation, and Machine Learning. (arXiv:2311.12041v1 [cs.CV])
    Detection and characterization of hidden defects, impurities, and damages in layered composites like Fibre laminates, e.g., Fibre Metal Laminates (FML), as well as in monolithic materials, e.g., aluminum die casting materials, is still a challenge. This work discusses methods and challenges in data-driven modeling of automated damage and defect detectors using X-ray single- and multi-projection (CT) images. Three main issues are identified: Data and feature variance, data feature labeling (for supervised machine learning), and the missing ground truth. It will be shown that only simulation of data can deliver a ground truth data set and accurate labeling. Noise has significant impact on the feature detection and will be discussed. Data-driven feature detectors are implemented with semantic pixel- or z-profile Convolutional Neural Networks and LSTM Auto-encoders. Data is measured with three different devices: A low-quality and low-cost (Low-Q), a mid- and a high-quality (micro-CT, Mid-/High-Q) device. The goals of this work are the training of robust and generalized feature detectors with synthetic data and the transition from High- and Mid-Q laboratory measuring technologies towards in-field usable technologies and methods.
    Data-Guided Regulator for Adaptive Nonlinear Control. (arXiv:2311.12230v1 [eess.SY])
    This paper addresses the problem of designing a data-driven feedback controller for complex nonlinear dynamical systems in the presence of time-varying disturbances with unknown dynamics. Such disturbances are modeled as the "unknown" part of the system dynamics. The goal is to achieve finite-time regulation of system states through direct policy updates while also generating informative data that can subsequently be used for data-driven stabilization or system identification. First, we expand upon the notion of "regularizability" and characterize this system characteristic for a linear time-varying representation of the nonlinear system with locally-bounded higher-order terms. "Rapid-regularizability" then gauges the extent by which a system can be regulated in finite time, in contrast to its asymptotic behavior. We then propose the Data-Guided Regulation for Adaptive Nonlinear Control ( DG-RAN) algorithm, an online iterative synthesis procedure that utilizes discrete time-series data from a single trajectory for regulating system states and identifying disturbance dynamics. The effectiveness of our approach is demonstrated on a 6-DOF power descent guidance problem in the presence of adverse environmental disturbances.
    Infinite forecast combinations based on Dirichlet process. (arXiv:2311.12379v1 [cs.LG])
    Forecast combination integrates information from various sources by consolidating multiple forecast results from the target time series. Instead of the need to select a single optimal forecasting model, this paper introduces a deep learning ensemble forecasting model based on the Dirichlet process. Initially, the learning rate is sampled with three basis distributions as hyperparameters to convert the infinite mixture into a finite one. All checkpoints are collected to establish a deep learning sub-model pool, and weight adjustment and diversity strategies are developed during the combination process. The main advantage of this method is its ability to generate the required base learners through a single training process, utilizing the decaying strategy to tackle the challenge posed by the stochastic nature of gradient descent in determining the optimal learning rate. To ensure the method's generalizability and competitiveness, this paper conducts an empirical analysis using the weekly dataset from the M4 competition and explores sensitivity to the number of models to be combined. The results demonstrate that the ensemble model proposed offers substantial improvements in prediction accuracy and stability compared to a single benchmark model.
    A Survey of Graph Meets Large Language Model: Progress and Future Directions. (arXiv:2311.12399v1 [cs.LG])
    Graph plays a significant role in representing and analyzing complex relationships in real-world applications such as citation networks, social networks, and biological data. Recently, Large Language Models (LLMs), which have achieved tremendous success in various domains, have also been leveraged in graph-related tasks to surpass traditional Graph Neural Networks (GNNs) based methods and yield state-of-the-art performance. In this survey, we first present a comprehensive review and analysis of existing methods that integrate LLMs with graphs. First of all, we propose a new taxonomy, which organizes existing methods into three categories based on the role (i.e., enhancer, predictor, and alignment component) played by LLMs in graph-related tasks. Then we systematically survey the representative methods along the three categories of the taxonomy. Finally, we discuss the remaining limitations of existing studies and highlight promising avenues for future research. The relevant papers are summarized and will be consistently updated at: https://github.com/yhLeeee/Awesome-LLMs-in-Graph-tasks.
    Enhancing Novel Object Detection via Cooperative Foundational Models. (arXiv:2311.12068v1 [cs.CV])
    In this work, we address the challenging and emergent problem of novel object detection (NOD), focusing on the accurate detection of both known and novel object categories during inference. Traditional object detection algorithms are inherently closed-set, limiting their capability to handle NOD. We present a novel approach to transform existing closed-set detectors into open-set detectors. This transformation is achieved by leveraging the complementary strengths of pre-trained foundational models, specifically CLIP and SAM, through our cooperative mechanism. Furthermore, by integrating this mechanism with state-of-the-art open-set detectors such as GDINO, we establish new benchmarks in object detection performance. Our method achieves 17.42 mAP in novel object detection and 42.08 mAP for known objects on the challenging LVIS dataset. Adapting our approach to the COCO OVD split, we surpass the current state-of-the-art by a margin of 7.2 $ \text{AP}_{50} $ for novel classes. Our code is available at https://github.com/rohit901/cooperative-foundational-models .
    Fast Inner-Product Algorithms and Architectures for Deep Neural Network Accelerators. (arXiv:2311.12224v1 [cs.AR])
    We introduce a new algorithm called the Free-pipeline Fast Inner Product (FFIP) and its hardware architecture that improve an under-explored fast inner-product algorithm (FIP) proposed by Winograd in 1968. Unlike the unrelated Winograd minimal filtering algorithms for convolutional layers, FIP is applicable to all machine learning (ML) model layers that can mainly decompose to matrix multiplication, including fully-connected, convolutional, recurrent, and attention/transformer layers. We implement FIP for the first time in an ML accelerator then present our FFIP algorithm and generalized architecture which inherently improve FIP's clock frequency and, as a consequence, throughput for a similar hardware cost. Finally, we contribute ML-specific optimizations for the FIP and FFIP algorithms and architectures. We show that FFIP can be seamlessly incorporated into traditional fixed-point systolic array ML accelerators to achieve the same throughput with half the number of multiply-accumulate (MAC) units, or it can double the maximum systolic array size that can fit onto devices with a fixed hardware budget. Our FFIP implementation for non-sparse ML models with 8 to 16-bit fixed-point inputs achieves higher throughput and compute efficiency than the best-in-class prior solutions on the same type of compute platform.
    TouchSDF: A DeepSDF Approach for 3D Shape Reconstruction using Vision-Based Tactile Sensing. (arXiv:2311.12602v1 [cs.CV])
    Humans rely on their visual and tactile senses to develop a comprehensive 3D understanding of their physical environment. Recently, there has been a growing interest in exploring and manipulating objects using data-driven approaches that utilise high-resolution vision-based tactile sensors. However, 3D shape reconstruction using tactile sensing has lagged behind visual shape reconstruction because of limitations in existing techniques, including the inability to generalise over unseen shapes, the absence of real-world testing, and limited expressive capacity imposed by discrete representations. To address these challenges, we propose TouchSDF, a Deep Learning approach for tactile 3D shape reconstruction that leverages the rich information provided by a vision-based tactile sensor and the expressivity of the implicit neural representation DeepSDF. Our technique consists of two components: (1) a Convolutional Neural Network that maps tactile images into local meshes representing the surface at the touch location, and (2) an implicit neural function that predicts a signed distance function to extract the desired 3D shape. This combination allows TouchSDF to reconstruct smooth and continuous 3D shapes from tactile inputs in simulation and real-world settings, opening up research avenues for robust 3D-aware representations and improved multimodal perception in robotics. Code and supplementary material are available at: https://touchsdf.github.io/
    Enhancing Low-dose CT Image Reconstruction by Integrating Supervised and Unsupervised Learning. (arXiv:2311.12071v1 [eess.IV])
    Traditional model-based image reconstruction (MBIR) methods combine forward and noise models with simple object priors. Recent application of deep learning methods for image reconstruction provides a successful data-driven approach to addressing the challenges when reconstructing images with undersampled measurements or various types of noise. In this work, we propose a hybrid supervised-unsupervised learning framework for X-ray computed tomography (CT) image reconstruction. The proposed learning formulation leverages both sparsity or unsupervised learning-based priors and neural network reconstructors to simulate a fixed-point iteration process. Each proposed trained block consists of a deterministic MBIR solver and a neural network. The information flows in parallel through these two reconstructors and is then optimally combined. Multiple such blocks are cascaded to form a reconstruction pipeline. We demonstrate the efficacy of this learned hybrid model for low-dose CT image reconstruction with limited training data, where we use the NIH AAPM Mayo Clinic Low Dose CT Grand Challenge dataset for training and testing. In our experiments, we study combinations of supervised deep network reconstructors and MBIR solver with learned sparse representation-based priors or analytical priors. Our results demonstrate the promising performance of the proposed framework compared to recent low-dose CT reconstruction methods.
    A New Type Of Upper And Lower Bounds On Right-Tail Probabilities Of Continuous Random Variables. (arXiv:2311.12612v1 [math.PR])
    In this paper, I present a completely new type of upper and lower bounds on the right-tail probabilities of continuous random variables with unbounded support and with semi-bounded support from the left. The presented upper and lower right-tail bounds depend only on the probability density function (PDF), its first derivative, and two parameters that are used for tightening the bounds. These tail bounds hold under certain conditions that depend on the PDF, its first and second derivatives, and the two parameters. The new tail bounds are shown to be tight for a wide range of continuous random variables via numerical examples.
    Board-to-Board: Evaluating Moonboard Grade Prediction Generalization. (arXiv:2311.12419v1 [cs.LG])
    Bouldering is a sport where athletes aim to climb up an obstacle using a set of defined holds called a route. Typically routes are assigned a grade to inform climbers of its difficulty and allow them to more easily track their progression. However, the variation in individual climbers technical and physical attributes and many nuances of an individual route make grading a difficult and often biased task. In this work, we apply classical and deep-learning modelling techniques to the 2016, 2017 and 2019 Moonboard datasets, achieving state of the art grade prediction performance with 0.87 MAE and 1.12 RMSE. We achieve this performance on a feature-set that does not require decomposing routes into individual moves, which is a method common in literature and introduces bias. We also demonstrate the generalization capability of this model between editions and introduce a novel vision-based method of grade prediction. While the generalization performance of these techniques is below human level performance currently, we propose these methods as a basis for future work. Such a tool could be implemented in pre-existing mobile applications and would allow climbers to better track their progress and assess new routes with reduced bias.
    Multi-Objective Reinforcement Learning based on Decomposition: A taxonomy and framework. (arXiv:2311.12495v1 [cs.LG])
    Multi-objective reinforcement learning (MORL) extends traditional RL by seeking policies making different compromises among conflicting objectives. The recent surge of interest in MORL has led to diverse studies and solving methods, often drawing from existing knowledge in multi-objective optimization based on decomposition (MOO/D). Yet, a clear categorization based on both RL and MOO/D is lacking in the existing literature. Consequently, MORL researchers face difficulties when trying to classify contributions within a broader context due to the absence of a standardized taxonomy. To tackle such an issue, this paper introduces Multi-Objective Reinforcement Learning based on Decomposition (MORL/D), a novel methodology bridging RL and MOO literature. A comprehensive taxonomy for MORL/D is presented, providing a structured foundation for categorizing existing and potential MORL works. The introduced taxonomy is then used to scrutinize MORL research, enhancing clarity and conciseness through well-defined categorization. Moreover, a flexible framework derived from the taxonomy is introduced. This framework accommodates diverse instantiations using tools from both RL and MOO/D. Implementation across various configurations demonstrates its versatility, assessed against benchmark problems. Results indicate MORL/D instantiations achieve comparable performance with significantly greater versatility than current state-of-the-art approaches. By presenting the taxonomy and framework, this paper offers a comprehensive perspective and a unified vocabulary for MORL. This not only facilitates the identification of algorithmic contributions but also lays the groundwork for novel research avenues in MORL, contributing to the continued advancement of this field.
    ALPHA: AnomaLous Physiological Health Assessment Using Large Language Models. (arXiv:2311.12524v1 [cs.LG])
    This study concentrates on evaluating the efficacy of Large Language Models (LLMs) in healthcare, with a specific focus on their application in personal anomalous health monitoring. Our research primarily investigates the capabilities of LLMs in interpreting and analyzing physiological data obtained from FDA-approved devices. We conducted an extensive analysis using anomalous physiological data gathered in a simulated low-air-pressure plateau environment. This allowed us to assess the precision and reliability of LLMs in understanding and evaluating users' health status with notable specificity. Our findings reveal that LLMs exhibit exceptional performance in determining medical indicators, including a Mean Absolute Error (MAE) of less than 1 beat per minute for heart rate and less than 1% for oxygen saturation (SpO2). Furthermore, the Mean Absolute Percentage Error (MAPE) for these evaluations remained below 1%, with the overall accuracy of health assessments surpassing 85%. In image analysis tasks, such as interpreting photoplethysmography (PPG) data, our specially adapted GPT models demonstrated remarkable proficiency, achieving less than 1 bpm error in cycle count and 7.28 MAE for heart rate estimation. This study highlights LLMs' dual role as health data analysis tools and pivotal elements in advanced AI health assistants, offering personalized health insights and recommendations within the future health assistant framework.
    BEND: Benchmarking DNA Language Models on biologically meaningful tasks. (arXiv:2311.12570v1 [q-bio.GN])
    The genome sequence contains the blueprint for governing cellular processes. While the availability of genomes has vastly increased over the last decades, experimental annotation of the various functional, non-coding and regulatory elements encoded in the DNA sequence remains both expensive and challenging. This has sparked interest in unsupervised language modeling of genomic DNA, a paradigm that has seen great success for protein sequence data. Although various DNA language models have been proposed, evaluation tasks often differ between individual works, and might not fully recapitulate the fundamental challenges of genome annotation, including the length, scale and sparsity of the data. In this study, we introduce BEND, a Benchmark for DNA language models, featuring a collection of realistic and biologically meaningful downstream tasks defined on the human genome. We find that embeddings from current DNA LMs can approach performance of expert methods on some tasks, but only capture limited information about long-range features. BEND is available at https://github.com/frederikkemarin/BEND.
    Exploring Time Granularity on Temporal Graphs for Dynamic Link Prediction in Real-world Networks. (arXiv:2311.12255v1 [cs.LG])
    Dynamic Graph Neural Networks (DGNNs) have emerged as the predominant approach for processing dynamic graph-structured data. However, the influence of temporal information on model performance and robustness remains insufficiently explored, particularly regarding how models address prediction tasks with different time granularities. In this paper, we explore the impact of time granularity when training DGNNs on dynamic graphs through extensive experiments. We examine graphs derived from various domains and compare three different DGNNs to the baseline model across four varied time granularities. We mainly consider the interplay between time granularities, model architectures, and negative sampling strategies to obtain general conclusions. Our results reveal that a sophisticated memory mechanism and proper time granularity are crucial for a DGNN to deliver competitive and robust performance in the dynamic link prediction task. We also discuss drawbacks in considered models and datasets and propose promising directions for future research on the time granularity of temporal graphs.
    Probabilistic Forecast Reconciliation with Kullback-Leibler Divergence Regularization. (arXiv:2311.12279v1 [cs.AI])
    As the popularity of hierarchical point forecast reconciliation methods increases, there is a growing interest in probabilistic forecast reconciliation. Many studies have utilized machine learning or deep learning techniques to implement probabilistic forecasting reconciliation and have made notable progress. However, these methods treat the reconciliation step as a fixed and hard post-processing step, leading to a trade-off between accuracy and coherency. In this paper, we propose a new approach for probabilistic forecast reconciliation. Unlike existing approaches, our proposed approach fuses the prediction step and reconciliation step into a deep learning framework, making the reconciliation step more flexible and soft by introducing the Kullback-Leibler divergence regularization term into the loss function. The approach is evaluated using three hierarchical time series datasets, which shows the advantages of our approach over other probabilistic forecast reconciliation methods.
    Fair Enough? A map of the current limitations of the requirements to have "fair'' algorithms. (arXiv:2311.12435v1 [cs.AI])
    In the recent years, the raise in the usage and efficiency of Artificial Intelligence and, more in general, of Automated Decision-Making systems has brought with it an increasing and welcome awareness of the risks associated with such systems. One of such risks is that of perpetuating or even amplifying bias and unjust disparities present in the data from which many of these systems learn to adjust and optimise their decisions. This awareness has on one side encouraged several scientific communities to come up with more and more appropriate ways and methods to assess, quantify, and possibly mitigate such biases and disparities. On the other hand, it has prompted more and more layers of society, including policy makers, to call for ``fair'' algorithms. We believe that while a lot of excellent and multidisciplinary research is currently being conducted, what is still fundamentally missing is the awareness that having ``fair'' algorithms is per s\'e a nearly meaningless requirement, that needs to be complemented with a lot of additional societal choices to become actionable. Namely, there is a hiatus between what the society is demanding from Automated Decision-Making systems, and what this demand actually means in real-world scenarios. In this work, we outline the key features of such a hiatus, and pinpoint a list of fundamental ambiguities and attention points that we as a society must address in order to give a concrete meaning to the increasing demand of fairness in Automated Decision-Making systems.
    Tiny-VBF: Resource-Efficient Vision Transformer based Lightweight Beamformer for Ultrasound Single-Angle Plane Wave Imaging. (arXiv:2311.12082v1 [eess.IV])
    Accelerating compute intensive non-real-time beam-forming algorithms in ultrasound imaging using deep learning architectures has been gaining momentum in the recent past. Nonetheless, the complexity of the state-of-the-art deep learning techniques poses challenges for deployment on resource-constrained edge devices. In this work, we propose a novel vision transformer based tiny beamformer (Tiny-VBF), which works on the raw radio-frequency channel data acquired through single-angle plane wave insonification. The output of our Tiny-VBF provides fast envelope detection requiring very low frame rate, i.e. 0.34 GOPs/Frame for a frame size of 368 x 128 in comparison to the state-of-the-art deep learning models. It also exhibited an 8% increase in contrast and gains of 5% and 33% in axial and lateral resolution respectively when compared to Tiny-CNN on in-vitro dataset. Additionally, our model showed a 4.2% increase in contrast and gains of 4% and 20% in axial and lateral resolution respectively when compared against conventional Delay-and-Sum (DAS) beamformer. We further propose an accelerator architecture and implement our Tiny-VBF model on a Zynq UltraScale+ MPSoC ZCU104 FPGA using a hybrid quantization scheme with 50% less resource consumption compared to the floating-point implementation, while preserving the image quality.
    Explainable Anomaly Detection using Masked Latent Generative Modeling. (arXiv:2311.12550v1 [cs.LG])
    We present a novel time series anomaly detection method that achieves excellent detection accuracy while offering a superior level of explainability. Our proposed method, TimeVQVAE-AD, leverages masked generative modeling adapted from the cutting-edge time series generation method known as TimeVQVAE. The prior model is trained on the discrete latent space of a time-frequency domain. Notably, the dimensional semantics of the time-frequency domain are preserved in the latent space, enabling us to compute anomaly scores across different frequency bands, which provides a better insight into the detected anomalies. Additionally, the generative nature of the prior model allows for sampling likely normal states for detected anomalies, enhancing the explainability of the detected anomalies through counterfactuals. Our experimental evaluation on the UCR Time Series Anomaly archive demonstrates that TimeVQVAE-AD significantly surpasses the existing methods in terms of detection accuracy and explainability.
    Convolutional Neural Networks for Neuroimaging in Parkinson's Disease: Is Preprocessing Needed?. (arXiv:2311.12561v1 [cs.CV])
    Spatial and intensity normalization are nowadays a prerequisite for neuroimaging analysis. Influenced by voxel-wise and other univariate comparisons, where these corrections are key, they are commonly applied to any type of analysis and imaging modalities. Nuclear imaging modalities such as PET-FDG or FP-CIT SPECT, a common modality used in Parkinson's Disease diagnosis, are especially dependent on intensity normalization. However, these steps are computationally expensive and furthermore, they may introduce deformations in the images, altering the information contained in them. Convolutional Neural Networks (CNNs), for their part, introduce position invariance to pattern recognition, and have been proven to classify objects regardless of their orientation, size, angle, etc. Therefore, a question arises: how well can CNNs account for spatial and intensity differences when analysing nuclear brain imaging? Are spatial and intensity normalization still needed? To answer this question, we have trained four different CNN models based on well-established architectures, using or not different spatial and intensity normalization preprocessing. The results show that a sufficiently complex model such as our three-dimensional version of the ALEXNET can effectively account for spatial differences, achieving a diagnosis accuracy of 94.1% with an area under the ROC curve of 0.984. The visualization of the differences via saliency maps shows that these models are correctly finding patterns that match those found in the literature, without the need of applying any complex spatial normalization procedure. However, the intensity normalization -- and its type -- is revealed as very influential in the results and accuracy of the trained model, and therefore must be well accounted.
    Creating Temporally Correlated High-Resolution Power Injection Profiles Using Physics-Aware GAN. (arXiv:2311.12166v1 [eess.SP])
    Traditional smart meter measurements lack the granularity needed for real-time decision-making. To address this practical problem, we create a generative adversarial networks (GAN) model that enforces temporal consistency on its high-resolution outputs via hard inequality constraints using a convex optimization layer. A unique feature of our GAN model is that it is trained solely on slow timescale aggregated power information obtained from historical smart meter data. The results demonstrate that the model can successfully create minutely interval temporally-correlated instantaneous power injection profiles from 15-minute average power consumption information. This innovative approach, emphasizing inter-neuron constraints, offers a promising avenue for improved high-speed state estimation in distribution systems and enhances the applicability of data-driven solutions for monitoring such systems.
    Detecting subtle macroscopic changes in a finite temperature classical scalar field with machine learning. (arXiv:2311.12303v1 [cond-mat.stat-mech])
    The ability to detect macroscopic changes is important for probing the behaviors of experimental many-body systems from the classical to the quantum realm. Although abrupt changes near phase boundaries can easily be detected, subtle macroscopic changes are much more difficult to detect as the changes can be obscured by noise. In this study, as a toy model for detecting subtle macroscopic changes in many-body systems, we try to differentiate scalar field samples at varying temperatures. We compare different methods for making such differentiations, from physics method, statistics method, to AI method. Our finding suggests that the AI method outperforms both the statistical method and the physics method in its sensitivity. Our result provides a proof-of-concept that AI can potentially detect macroscopic changes in many-body systems that elude physical measures.
    Variational Elliptical Processes. (arXiv:2311.12566v1 [cs.LG])
    We present elliptical processes, a family of non-parametric probabilistic models that subsume Gaussian processes and Student's t processes. This generalization includes a range of new heavy-tailed behaviors while retaining computational tractability. Elliptical processes are based on a representation of elliptical distributions as a continuous mixture of Gaussian distributions. We parameterize this mixture distribution as a spline normalizing flow, which we train using variational inference. The proposed form of the variational posterior enables a sparse variational elliptical process applicable to large-scale problems. We highlight advantages compared to Gaussian processes through regression and classification experiments. Elliptical processes can supersede Gaussian processes in several settings, including cases where the likelihood is non-Gaussian or when accurate tail modeling is essential.
    Improving Label Assignments Learning by Dynamic Sample Dropout Combined with Layer-wise Optimization in Speech Separation. (arXiv:2311.12199v1 [cs.SD])
    In supervised speech separation, permutation invariant training (PIT) is widely used to handle label ambiguity by selecting the best permutation to update the model. Despite its success, previous studies showed that PIT is plagued by excessive label assignment switching in adjacent epochs, impeding the model to learn better label assignments. To address this issue, we propose a novel training strategy, dynamic sample dropout (DSD), which considers previous best label assignments and evaluation metrics to exclude the samples that may negatively impact the learned label assignments during training. Additionally, we include layer-wise optimization (LO) to improve the performance by solving layer-decoupling. Our experiments showed that combining DSD and LO outperforms the baseline and solves excessive label assignment switching and layer-decoupling issues. The proposed DSD and LO approach is easy to implement, requires no extra training sets or steps, and shows generality to various speech separation tasks.
    Beyond Simulated Drivers: Evaluating the Impact of Real-World Car-Following in Mixed Traffic Control. (arXiv:2311.12261v1 [cs.RO])
    Human-driven vehicles can amplify naturally occurring perturbations in traffic, leading to congestion and consequently increased fuel consumption, higher collision risks, and reduced capacity utilization. While previous research has highlighted that a fraction of Robot Vehicles (RVs) can mitigate these issues, they often rely on simulations with simplistic, model-based Human-driven Vehicles (HVs) during car-following scenarios. Diverging from this trend, in this study, we analyze real-world human driving trajectories, extracting a wide range of acceleration behaviors during car-following. We then incorporate these behaviors in simulation where RVs from prior studies are employed to mitigate congestion, and evaluate their safety, efficiency, and stability. Further, we also introduce a reinforcement learning based RV that utilizes a congestion stage classifier neural network to optimize either "safety+stability" or "efficiency" in the presence of the diverse human driving behaviors. We evaluate the proposed RVs in two different mixed traffic control environments at various densities, configurations, and penetration rates and compare with the existing RVs.
    Fast Controllable Diffusion Models for Undersampled MRI Reconstruction. (arXiv:2311.12078v1 [eess.IV])
    Supervised deep learning methods have shown promise in Magnetic Resonance Imaging (MRI) undersampling reconstruction, but their requirement for paired data limits their generalizability to the diverse MRI acquisition parameters. Recently, unsupervised controllable generative diffusion models have been applied to MRI undersampling reconstruction, without paired data or model retraining for different MRI acquisitions. However, diffusion models are generally slow in sampling and state-of-the-art acceleration techniques can lead to sub-optimal results when directly applied to the controllable generation process. This study introduces a new algorithm called Predictor-Projector-Noisor (PPN), which enhances and accelerates controllable generation of diffusion models for MRI undersampling reconstruction. Our results demonstrate that PPN produces high-fidelity MR images that conform to undersampled k-space measurements with significantly shorter reconstruction time than other controllable sampling methods. In addition, the unsupervised PPN accelerated diffusion models are adaptable to different MRI acquisition parameters, making them more practical for clinical use than supervised learning techniques.
    Differentially Private Optimizers Can Learn Adversarially Robust Models. (arXiv:2211.08942v2 [cs.LG] UPDATED)
    Machine learning models have shone in a variety of domains and attracted increasing attention from both the security and the privacy communities. One important yet worrying question is: Will training models under the differential privacy (DP) constraint have an unfavorable impact on their adversarial robustness? While previous works have postulated that privacy comes at the cost of worse robustness, we give the first theoretical analysis to show that DP models can indeed be robust and accurate, even sometimes more robust than their naturally-trained non-private counterparts. We observe three key factors that influence the privacy-robustness-accuracy tradeoff: (1) hyper-parameters for DP optimizers are critical; (2) pre-training on public data significantly mitigates the accuracy and robustness drop; (3) choice of DP optimizers makes a difference. With these factors set properly, we achieve 90\% natural accuracy, 72\% robust accuracy ($+9\%$ than the non-private model) under $l_2(0.5)$ attack, and 69\% robust accuracy ($+16\%$ than the non-private model) with pre-trained SimCLRv2 model under $l_\infty(4/255)$ attack on CIFAR10 with $\epsilon=2$. In fact, we show both theoretically and empirically that DP models are Pareto optimal on the accuracy-robustness tradeoff. Empirically, the robustness of DP models is consistently observed across various datasets and models. We believe our encouraging results are a significant step towards training models that are private as well as robust.
    Conditional Modeling Based Automatic Video Summarization. (arXiv:2311.12159v1 [cs.CV])
    The aim of video summarization is to shorten videos automatically while retaining the key information necessary to convey the overall story. Video summarization methods mainly rely on visual factors, such as visual consecutiveness and diversity, which may not be sufficient to fully understand the content of the video. There are other non-visual factors, such as interestingness, representativeness, and storyline consistency that should also be considered for generating high-quality video summaries. Current methods do not adequately take into account these non-visual factors, resulting in suboptimal performance. In this work, a new approach to video summarization is proposed based on insights gained from how humans create ground truth video summaries. The method utilizes a conditional modeling perspective and introduces multiple meaningful random variables and joint distributions to characterize the key components of video summarization. Helper distributions are employed to improve the training of the model. A conditional attention module is designed to mitigate potential performance degradation in the presence of multi-modal input. The proposed video summarization method incorporates the above innovative design choices that aim to narrow the gap between human-generated and machine-generated video summaries. Extensive experiments show that the proposed approach outperforms existing methods and achieves state-of-the-art performance on commonly used video summarization datasets.
    Random Linear Projections Loss for Hyperplane-Based Optimization in Regression Neural Networks. (arXiv:2311.12356v1 [cs.LG])
    Despite their popularity across a wide range of domains, regression neural networks are prone to overfitting complex datasets. In this work, we propose a loss function termed Random Linear Projections (RLP) loss, which is empirically shown to mitigate overfitting. With RLP loss, the distance between sets of hyperplanes connecting fixed-size subsets of the neural network's feature-prediction pairs and feature-label pairs is minimized. The intuition behind this loss derives from the notion that if two functions share the same hyperplanes connecting all subsets of feature-label pairs, then these functions must necessarily be equivalent. Our empirical studies, conducted across benchmark datasets and representative synthetic examples, demonstrate the improvements of the proposed RLP loss over mean squared error (MSE). Specifically, neural networks trained with the RLP loss achieve better performance while requiring fewer data samples and are more robust to additive noise. We provide theoretical analysis supporting our empirical findings.
    Node classification in random trees. (arXiv:2311.12167v1 [cs.LG])
    We propose a method for the classification of objects that are structured as random trees. Our aim is to model a distribution over the node label assignments in settings where the tree data structure is associated with node attributes (typically high dimensional embeddings). The tree topology is not predetermined and none of the label assignments are present during inference. Other methods that produce a distribution over node label assignment in trees (or more generally in graphs) either assume conditional independence of the label assignment, operate on a fixed graph topology, or require part of the node labels to be observed. Our method defines a Markov Network with the corresponding topology of the random tree and an associated Gibbs distribution. We parameterize the Gibbs distribution with a Graph Neural Network that operates on the random tree and the node embeddings. This allows us to estimate the likelihood of node assignments for a given random tree and use MCMC to sample from the distribution of node assignments. We evaluate our method on the tasks of node classification in trees on the Stanford Sentiment Treebank dataset. Our method outperforms the baselines on this dataset, demonstrating its effectiveness for modeling joint distributions of node labels in random trees.
    Improvements in Interlayer Pipelining of CNN Accelerators Using Genetic Algorithms. (arXiv:2311.12235v1 [cs.AR])
    Deploying Convolutional Neural Networks (CNNs) on edge platforms necessitates efficient hardware acceleration. Any unnecessary data movement in such accelerators can unacceptably degrade performance and efficiency. To address this, we develop a layer fusion technique targeting CNNs, that reduces off-chip data communication using a Genetic Algorithm (GA) applied to graph-based topological sort. Results show a 1.8$\times$ increase in energy efficiency and 1.9$\times$ improvement in energy-delay product (EDP) for MobileNet-v3 on a SIMBA-like mobile architecture. Our approach consistently improves workload performance, averaging 1.4$\times$ improvement to EDP for SIMBA and 1.12$\times$ for Eyeriss.
    On the Out-of-Distribution Coverage of Combining Split Conformal Prediction and Bayesian Deep Learning. (arXiv:2311.12688v1 [cs.LG])
    Bayesian deep learning and conformal prediction are two methods that have been used to convey uncertainty and increase safety in machine learning systems. We focus on combining Bayesian deep learning with split conformal prediction and how this combination effects out-of-distribution coverage; particularly in the case of multiclass image classification. We suggest that if the model is generally underconfident on the calibration set, then the resultant conformal sets may exhibit worse out-of-distribution coverage compared to simple predictive credible sets. Conversely, if the model is overconfident on the calibration set, the use of conformal prediction may improve out-of-distribution coverage. We evaluate prediction sets as a result of combining split conformal methods and neural networks trained with (i) stochastic gradient descent, (ii) deep ensembles, and (iii) mean-field variational inference. Our results suggest that combining Bayesian deep learning models with split conformal prediction can, in some cases, cause unintended consequences such as reducing out-of-distribution coverage.
    TransCDR: a deep learning model for enhancing the generalizability of cancer drug response prediction through transfer learning and multimodal data fusion for drug representation. (arXiv:2311.12040v1 [q-bio.QM])
    Accurate and robust drug response prediction is of utmost importance in precision medicine. Although many models have been developed to utilize the representations of drugs and cancer cell lines for predicting cancer drug responses (CDR), their performances can be improved by addressing issues such as insufficient data modality, suboptimal fusion algorithms, and poor generalizability for novel drugs or cell lines. We introduce TransCDR, which uses transfer learning to learn drug representations and fuses multi-modality features of drugs and cell lines by a self-attention mechanism, to predict the IC50 values or sensitive states of drugs on cell lines. We are the first to systematically evaluate the generalization of the CDR prediction model to novel (i.e., never-before-seen) compound scaffolds and cell line clusters. TransCDR shows better generalizability than 8 state-of-the-art models. TransCDR outperforms its 5 variants that train drug encoders (i.e., RNN and AttentiveFP) from scratch under various scenarios. The most critical contributors among multiple drug notations and omics profiles are Extended Connectivity Fingerprint and genetic mutation. Additionally, the attention-based fusion module further enhances the predictive performance of TransCDR. TransCDR, trained on the GDSC dataset, demonstrates strong predictive performance on the external testing set CCLE. It is also utilized to predict missing CDRs on GDSC. Moreover, we investigate the biological mechanisms underlying drug response by classifying 7,675 patients from TCGA into drug-sensitive or drug-resistant groups, followed by a Gene Set Enrichment Analysis. TransCDR emerges as a potent tool with significant potential in drug response prediction. The source code and data can be accessed at https://github.com/XiaoqiongXia/TransCDR.
    IEKM: A Model Incorporating External Keyword Matrices. (arXiv:2311.12310v1 [cs.AI])
    A customer service platform system with a core text semantic similarity (STS) task faces two urgent challenges: Firstly, one platform system needs to adapt to different domains of customers, i.e., different domains adaptation (DDA). Secondly, it is difficult for the model of the platform system to distinguish sentence pairs that are literally close but semantically different, i.e., hard negative samples. In this paper, we propose an incorporation external keywords matrices model (IEKM) to address these challenges. The model uses external tools or dictionaries to construct external matrices and fuses them to the self-attention layers of the Transformer structure through gating units, thus enabling flexible corrections to the model results. We evaluate the method on multiple datasets and the results show that our method has improved performance on all datasets. To demonstrate that our method can effectively solve all the above challenges, we conduct a flexible correction experiment, which results in an increase in the F1 value from 56.61 to 73.53. Our code will be publicly available.
    Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey. (arXiv:2311.12351v1 [cs.CL])
    With the bomb ignited by ChatGPT, Transformer-based Large Language Models (LLMs) have paved a revolutionary path toward Artificial General Intelligence (AGI) and have been applied in diverse areas as knowledge bases, human interfaces, and dynamic agents. However, a prevailing limitation exists: many current LLMs, constrained by resources, are primarily pre-trained on shorter texts, rendering them less effective for longer-context prompts, commonly encountered in real-world settings. In this paper, we present a comprehensive survey focusing on the advancement of model architecture in Transformer-based LLMs to optimize long-context capabilities across all stages from pre-training to inference. We firstly delineate and analyze the problems of handling long-context input and output with the current Transformer-based models. Then, we mainly offer a holistic taxonomy to navigate the landscape of Transformer upgrades on architecture to solve these problems. Afterward, we provide the investigation on wildly used evaluation necessities tailored for long-context LLMs, including datasets, metrics, and baseline models, as well as some amazing optimization toolkits like libraries, systems, and compilers to augment LLMs' efficiency and efficacy across different stages. Finally, we further discuss the predominant challenges and potential avenues for future research in this domain. Additionally, we have established a repository where we curate relevant literature with real-time updates at https://github.com/Strivin0311/long-llms-learning.
    Feature Engineering with Regularity Structures. (arXiv:2108.05879v2 [stat.ML] UPDATED)
    We investigate the use of models from the theory of regularity structures as features in machine learning tasks. A model is a polynomial function of a space-time signal designed to well-approximate solutions to partial differential equations (PDEs), even in low regularity regimes. Models can be seen as natural multi-dimensional generalisations of signatures of paths; our work therefore aims to extend the recent use of signatures in data science beyond the context of time-ordered data. We provide a flexible definition of a model feature vector associated to a space-time signal, along with two algorithms which illustrate ways in which these features can be combined with linear regression. We apply these algorithms in several numerical experiments designed to learn solutions to PDEs with a given forcing and boundary data. Our experiments include semi-linear parabolic and wave equations with forcing, and Burgers' equation with no forcing. We find an advantage in favour of our algorithms when compared to several alternative methods. Additionally, in the experiment with Burgers' equation, we find non-trivial predictive power when noise is added to the observations.
    Neural Network Pruning by Gradient Descent. (arXiv:2311.12526v1 [cs.LG])
    The rapid increase in the parameters of deep learning models has led to significant costs, challenging computational efficiency and model interpretability. In this paper, we introduce a novel and straightforward neural network pruning framework that incorporates the Gumbel-Softmax technique. This framework enables the simultaneous optimization of a network's weights and topology in an end-to-end process using stochastic gradient descent. Empirical results demonstrate its exceptional compression capability, maintaining high accuracy on the MNIST dataset with only 0.15\% of the original network parameters. Moreover, our framework enhances neural network interpretability, not only by allowing easy extraction of feature importance directly from the pruned network but also by enabling visualization of feature symmetry and the pathways of information propagation from features to outcomes. Although the pruning strategy is learned through deep learning, it is surprisingly intuitive and understandable, focusing on selecting key representative features and exploiting data patterns to achieve extreme sparse pruning. We believe our method opens a promising new avenue for deep learning pruning and the creation of interpretable machine learning systems.
    Improving Source-Free Target Adaptation with Vision Transformers Leveraging Domain Representation Images. (arXiv:2311.12589v1 [cs.CV])
    Unsupervised Domain Adaptation (UDA) methods facilitate knowledge transfer from a labeled source domain to an unlabeled target domain, navigating the obstacle of domain shift. While Convolutional Neural Networks (CNNs) are a staple in UDA, the rise of Vision Transformers (ViTs) provides new avenues for domain generalization. This paper presents an innovative method to bolster ViT performance in source-free target adaptation, beginning with an evaluation of how key, query, and value elements affect ViT outcomes. Experiments indicate that altering the key component has negligible effects on Transformer performance. Leveraging this discovery, we introduce Domain Representation Images (DRIs), feeding embeddings through the key element. DRIs act as domain-specific markers, effortlessly merging with the training regimen. To assess our method, we perform target adaptation tests on the Cross Instance DRI source-only (SO) control. We measure the efficacy of target adaptation with and without DRIs, against existing benchmarks like SHOT-B* and adaptations via CDTrans. Findings demonstrate that excluding DRIs offers limited gains over SHOT-B*, while their inclusion in the key segment boosts average precision promoting superior domain generalization. This research underscores the vital role of DRIs in enhancing ViT efficiency in UDA scenarios, setting a precedent for further domain adaptation explorations.
    ChronoPscychosis: Temporal Segmentation and Its Impact on Schizophrenia Classification Using Motor Activity Data. (arXiv:2311.12590v1 [cs.LG])
    Schizophrenia is a complicated mental illness characterized by a broad spectrum of symptoms affecting cognition, behavior, and emotion. The task of identifying reliable biomarkers to classify Schizophrenia accurately continues to be a challenge in the field of psychiatry. We investigate the temporal patterns within the motor activity data as a potential key to enhancing the categorization of individuals with Schizophrenia, using the dataset having motor activity recordings of 22 Schizophrenia patients and 32 control subjects. The dataset contains per-minute motor activity measurements collected for an average of 12.7 days in a row for each participant. We dissect each day into segments (Twelve, Eight, six, four, three, and two parts) and evaluate their impact on classification. We employ sixteen statistical features within these temporal segments and train them on Seven machine learning models to get deeper insights. LightGBM model outperforms the other six models. Our results indicate that the temporal segmentation significantly improves the classification, with AUC-ROC = 0.93, F1 score = 0.84( LightGBM- without any segmentation) and AUC-ROC = 0.98, F1 score = 0.93( LightGBM- with segmentation). Distinguishing between diurnal and nocturnal segments amplifies the differences between Schizophrenia patients and controls. However, further subdivisions into smaller time segments do not affect the AUC- ROC significantly. Morning, afternoon, evening, and night partitioning gives similar classification performance to day-night partitioning. These findings are valuable as they indicate that extensive temporal classification beyond distinguishing between day and night does not yield substantial results, offering an efficient approach for further classification, early diagnosis, and monitoring of Schizophrenia.
    Weight Norm Control. (arXiv:2311.11446v2 [cs.LG] UPDATED)
    We note that decoupled weight decay regularization is a particular case of weight norm control where the target norm of weights is set to 0. Any optimization method (e.g., Adam) which uses decoupled weight decay regularization (respectively, AdamW) can be viewed as a particular case of a more general algorithm with weight norm control (respectively, AdamWN). We argue that setting the target norm of weights to 0 can be suboptimal and other target norm values can be considered. For instance, any training run where AdamW achieves a particular norm of weights can be challenged by AdamWN scheduled to achieve a comparable norm of weights. We discuss various implications of introducing weight norm control instead of weight decay.
    Causal Reinforcement Learning: A Survey. (arXiv:2307.01452v2 [cs.LG] UPDATED)
    Reinforcement learning is an essential paradigm for solving sequential decision problems under uncertainty. Despite many remarkable achievements in recent decades, applying reinforcement learning methods in the real world remains challenging. One of the main obstacles is that reinforcement learning agents lack a fundamental understanding of the world and must therefore learn from scratch through numerous trial-and-error interactions. They may also face challenges in providing explanations for their decisions and generalizing the acquired knowledge. Causality, however, offers a notable advantage as it can formalize knowledge in a systematic manner and leverage invariance for effective knowledge transfer. This has led to the emergence of causal reinforcement learning, a subfield of reinforcement learning that seeks to enhance existing algorithms by incorporating causal relationships into the learning process. In this survey, we comprehensively review the literature on causal reinforcement learning. We first introduce the basic concepts of causality and reinforcement learning, and then explain how causality can address core challenges in non-causal reinforcement learning. We categorize and systematically review existing causal reinforcement learning approaches based on their target problems and methodologies. Finally, we outline open issues and future directions in this emerging field.
    Mapping "Brain Coral" Regions on Mars using Deep Learning. (arXiv:2311.12292v1 [astro-ph.EP])
    One of the main objectives of the Mars Exploration Program is to search for evidence of past or current life on the planet. To achieve this, Mars exploration has been focusing on regions that may have liquid or frozen water. A set of critical areas may have seen cycles of ice thawing in the relatively recent past in response to periodic changes in the obliquity of Mars. In this work, we use convolutional neural networks to detect surface regions containing "Brain Coral" terrain, a landform on Mars whose similarity in morphology and scale to sorted stone circles on Earth suggests that it may have formed as a consequence of freeze/thaw cycles. We use large images (~100-1000 megapixels) from the Mars Reconnaissance Orbiter to search for these landforms at resolutions close to a few tens of centimeters per pixel (~25--50 cm). Over 52,000 images (~28 TB) were searched (~5% of the Martian surface) where we found detections in over 200 images. To expedite the processing we leverage a classifier network (prior to segmentation) in the Fourier domain that can take advantage of JPEG compression by leveraging blocks of coefficients from a discrete cosine transform in lieu of decoding the entire image at the full spatial resolution. The hybrid pipeline approach maintains ~93% accuracy while cutting down on ~95% of the total processing time compared to running the segmentation network at the full resolution on every image. The timely processing of big data sets helps inform mission operations, geologic surveys to prioritize candidate landing sites, avoid hazardous areas, or map the spatial extent of certain terrain. The segmentation masks and source code are available on Github for the community to explore and build upon.
    Federated Learning via Consensus Mechanism on Heterogeneous Data: A New Perspective on Convergence. (arXiv:2311.12358v1 [cs.LG])
    Federated learning (FL) on heterogeneous data (non-IID data) has recently received great attention. Most existing methods focus on studying the convergence guarantees for the global objective. While these methods can guarantee the decrease of the global objective in each communication round, they fail to ensure risk decrease for each client. In this paper, to address the problem,we propose FedCOME, which introduces a consensus mechanism to enforce decreased risk for each client after each training round. In particular, we allow a slight adjustment to a client's gradient on the server side, which generates an acute angle between the corrected gradient and the original ones of other clients. We theoretically show that the consensus mechanism can guarantee the convergence of the global objective. To generalize the consensus mechanism to the partial participation FL scenario, we devise a novel client sampling strategy to select the most representative clients for the global data distribution. Training on these selected clients with the consensus mechanism could empirically lead to risk decrease for clients that are not selected. Finally, we conduct extensive experiments on four benchmark datasets to show the superiority of FedCOME against other state-of-the-art methods in terms of effectiveness, efficiency and fairness. For reproducibility, we make our source code publicly available at: \url{https://github.com/fedcome/fedcome}.
    PyDaddy: A Python package for discovering stochastic dynamical equations from timeseries data. (arXiv:2205.02645v3 [q-bio.QM] UPDATED)
    Stochastic differential equations (SDEs) are an important framework to model dynamics with randomness, as is common in most biological systems. The inverse problem of integrating these models with empirical data remains a major challenge. Here, we present a software package, PyDaDDy (Python Library for Data Driven Dynamics) that takes time series data as an input and outputs an interpretable SDE. We achieve this by combining traditional approaches from stochastic calculus literature with state-of-the-art equation discovery techniques. We validate our approach on synthetic datasets, and demonstrate the generality and applicability of the method on two real-world datasets of vastly different spatiotemporal scales: (i) collective movement of fish school where stochasticity plays a crucial role, and (ii) confined migration of a single cell, primarily following a relaxed oscillation. We make the method available as an easy-to-use, open-source Python package, PyDaddy (Python Library for Data Driven Dynamics).
    CartiMorph: a framework for automated knee articular cartilage morphometrics. (arXiv:2308.01981v3 [eess.IV] UPDATED)
    We introduce CartiMorph, a framework for automated knee articular cartilage morphometrics. It takes an image as input and generates quantitative metrics for cartilage subregions, including the percentage of full-thickness cartilage loss (FCL), mean thickness, surface area, and volume. CartiMorph leverages the power of deep learning models for hierarchical image feature representation. Deep learning models were trained and validated for tissue segmentation, template construction, and template-to-image registration. We established methods for surface-normal-based cartilage thickness mapping, FCL estimation, and rule-based cartilage parcellation. Our cartilage thickness map showed less error in thin and peripheral regions. We evaluated the effectiveness of the adopted segmentation model by comparing the quantitative metrics obtained from model segmentation and those from manual segmentation. The root-mean-squared deviation of the FCL measurements was less than 8%, and strong correlations were observed for the mean thickness (Pearson's correlation coefficient $\rho \in [0.82,0.97]$), surface area ($\rho \in [0.82,0.98]$) and volume ($\rho \in [0.89,0.98]$) measurements. We compared our FCL measurements with those from a previous study and found that our measurements deviated less from the ground truths. We observed superior performance of the proposed rule-based cartilage parcellation method compared with the atlas-based approach. CartiMorph has the potential to promote imaging biomarkers discovery for knee osteoarthritis.
    Post-Training Quantization with Low-precision Minifloats and Integers on FPGAs. (arXiv:2311.12359v1 [cs.CV])
    Post-Training Quantization (PTQ) is a powerful technique for model compression, reducing the precision of neural networks without additional training overhead. Recent works have investigated adopting 8-bit floating-point quantization (FP8) in the context of PTQ for model inference. However, the exploration of floating-point formats smaller than 8 bits and their comparison with integer quantization remains relatively limited. In this work, we present minifloats, which are reduced-precision floating-point formats capable of further reducing the memory footprint, latency, and energy cost of a model while approaching full-precision model accuracy. Our work presents a novel PTQ design-space exploration, comparing minifloat and integer quantization schemes across a range of 3 to 8 bits for both weights and activations. We examine the applicability of various PTQ techniques to minifloats, including weight equalization, bias correction, SmoothQuant, gradient-based learned rounding, and the GPTQ method. Our experiments validate the effectiveness of low-precision minifloats when compared to their integer counterparts across a spectrum of accuracy-precision trade-offs on a set of reference deep learning vision workloads. Finally, we evaluate our results against an FPGA-based hardware cost model, showing that integer quantization often remains the Pareto-optimal option, given its relatively smaller hardware resource footprint.
    Learning from Synthetic Human Group Activities. (arXiv:2306.16772v3 [cs.CV] UPDATED)
    The study of complex human interactions and group activities has become a focal point in human-centric computer vision. However, progress in related tasks is often hindered by the challenges of obtaining large-scale labeled datasets from real-world scenarios. To address the limitation, we introduce M3Act, a synthetic data generator for multi-view multi-group multi-person human atomic actions and group activities. Powered by the Unity engine, M3Act features multiple semantic groups, highly diverse and photorealistic images, and a comprehensive set of annotations, which facilitates the learning of human-centered tasks across single-person, multi-person, and multi-group conditions. We demonstrate the advantages of M3Act across three core experiments using various input modalities. First, adding our synthetic data significantly improves the performance of MOTRv2 on DanceTrack, leading to a hop on the leaderboard from 10th to 2nd place. With M3Act, we achieve tracking results on par with MOTRv2*, which is trained with 62.5% more real-world data. Second, M3Act improves the benchmark performances on CAD2 by 5.59% and 7.43% on group activity and atomic action accuracy respectively. Moreover, M3Act opens new research for controllable 3D group activity generation. We define multiple metrics and propose a competitive baseline for the novel task.
    LGL-BCI: A Lightweight Geometric Learning Framework for Motor Imagery-Based Brain-Computer Interfaces. (arXiv:2310.08051v3 [cs.LG] UPDATED)
    Brain-Computer Interfaces (BCIs) are a groundbreaking technology for interacting with external devices using brain signals. Despite advancements, electroencephalogram (EEG)-based Motor Imagery (MI) tasks face challenges like amplitude and phase variability, and complex spatial correlations, with a need for smaller model size and faster inference. This study introduces the LGL-BCI framework, employing a Geometric Deep Learning Framework for EEG processing in non-Euclidean metric spaces, particularly the Symmetric Positive Definite (SPD) Manifold space. LGL-BCI offers robust EEG data representation and captures spatial correlations. We propose an EEG channel selection solution via a feature decomposition algorithm to reduce SPD matrix dimensionality, with a lossless transformation boosting inference speed. Extensive experiments show LGL-BCI's superior accuracy and efficiency compared to current solutions, highlighting geometric deep learning's potential in MI-BCI applications. The efficiency, assessed on two public EEG datasets and two real-world EEG devices, significantly outperforms the state-of-the-art solution in accuracy ($82.54\%$ versus $62.22\%$) with fewer parameters (64.9M compared to 183.7M).
    Source Free Unsupervised Graph Domain Adaptation. (arXiv:2112.00955v3 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) have achieved great success on a variety of tasks with graph-structural data, among which node classification is an essential one. Unsupervised Graph Domain Adaptation (UGDA) shows its practical value of reducing the labeling cost for node classification. It leverages knowledge from a labeled graph (i.e., source domain) to tackle the same task on another unlabeled graph (i.e., target domain). Most existing UGDA methods heavily rely on the labeled graph in the source domain. They utilize labels from the source domain as the supervision signal and are jointly trained on both the source graph and the target graph. However, in some real-world scenarios, the source graph is inaccessible because of privacy issues. Therefore, we propose a novel scenario named Source Free Unsupervised Graph Domain Adaptation (SFUGDA). In this scenario, the only information we can leverage from the source domain is the well-trained source model, without any exposure to the source graph and its labels. As a result, existing UGDA methods are not feasible anymore. To address the non-trivial adaptation challenges in this practical scenario, we propose a model-agnostic algorithm called SOGA for domain adaptation to fully exploit the discriminative ability of the source model while preserving the consistency of structural proximity on the target graph. We prove the effectiveness of the proposed algorithm both theoretically and empirically. The experimental results on four cross-domain tasks show consistent improvements in the Macro-F1 score and Macro-AUC.
    Editing Personality for LLMs. (arXiv:2310.02168v2 [cs.CL] UPDATED)
    This paper introduces an innovative task focused on editing the personality traits of Large Language Models (LLMs). This task seeks to adjust the models' responses to opinion-related questions on specified topics since an individual's personality often manifests in the form of their expressed opinions, thereby showcasing different personality traits. Specifically, we construct a new benchmark dataset PersonalityEdit to address this task. Drawing on the theory in Social Psychology, we isolate three representative traits, namely Neuroticism, Extraversion, and Agreeableness, as the foundation for our benchmark. We then gather data using GPT-4, generating responses that not only align with a specified topic but also embody the targeted personality trait. We conduct comprehensive experiments involving various baselines and discuss the representation of personality behavior in LLMs. Our intriguing findings uncover potential challenges of the proposed task, illustrating several remaining issues. We anticipate that our work can provide the NLP community with insights. Code and datasets will be released at https://github.com/zjunlp/EasyEdit.
    Supervised and Unsupervised Deep Learning Approaches for EEG Seizure Prediction. (arXiv:2304.14922v2 [eess.SP] UPDATED)
    Epilepsy affects more than 50 million people worldwide, making it one of the world's most prevalent neurological diseases. The main symptom of epilepsy is seizures, which occur abruptly and can cause serious injury or death. The ability to predict the occurrence of an epileptic seizure could alleviate many risks and stresses people with epilepsy face. We formulate the problem of detecting preictal (or pre-seizure) with reference to normal EEG as a precursor to incoming seizure. To this end, we developed several supervised deep learning approaches model to identify preictal EEG from normal EEG. We further develop novel unsupervised deep learning approaches to train the models on only normal EEG, and detecting pre-seizure EEG as an anomalous event. These deep learning models were trained and evaluated on two large EEG seizure datasets in a person-specific manner. We found that both supervised and unsupervised approaches are feasible; however, their performance varies depending on the patient, approach and architecture. This new line of research has the potential to develop therapeutic interventions and save human lives.
    A Generalized Surface Loss for Reducing the Hausdorff Distance in Medical Imaging Segmentation. (arXiv:2302.03868v2 [eess.IV] UPDATED)
    Within medical imaging segmentation, the Dice coefficient and Hausdorff-based metrics are standard measures of success for deep learning models. However, modern loss functions for medical image segmentation often only consider the Dice coefficient or similar region-based metrics during training. As a result, segmentation architectures trained over such loss functions run the risk of achieving high accuracy for the Dice coefficient but low accuracy for Hausdorff-based metrics. Low accuracy on Hausdorff-based metrics can be problematic for applications such as tumor segmentation, where such benchmarks are crucial. For example, high Dice scores accompanied by significant Hausdorff errors could indicate that the predictions fail to detect small tumors. We propose the Generalized Surface Loss function, a novel loss function to minimize Hausdorff-based metrics with more desirable numerical properties than current methods and with weighting terms for class imbalance. Our loss function outperforms other losses when tested on the LiTS and BraTS datasets using the state-of-the-art nnUNet architecture. These results suggest we can improve medical imaging segmentation accuracy with our novel loss function.
    Recursive Algorithmic Reasoning. (arXiv:2307.00337v2 [cs.LG] UPDATED)
    Learning models that execute algorithms can enable us to address a key problem in deep learning: generalizing to out-of-distribution data. However, neural networks are currently unable to execute recursive algorithms because they do not have arbitrarily large memory to store and recall state. To address this, we (1) propose a way to augment graph neural networks (GNNs) with a stack, and (2) develop an approach for capturing intermediate algorithm trajectories that improves algorithmic alignment with recursive algorithms over previous methods. The stack allows the network to learn to store and recall a portion of the state of the network at a particular time, analogous to the action of a call stack in a recursive algorithm. This augmentation permits the network to reason recursively. We empirically demonstrate that our proposals significantly improve generalization to larger input graphs over prior work on depth-first search (DFS).
    Regression-Based Analysis of Multimodal Single-Cell Data Integration Strategies. (arXiv:2311.12711v1 [cs.LG])
    Multimodal single-cell technologies enable the simultaneous collection of diverse data types from individual cells, enhancing our understanding of cellular states. However, the integration of these datatypes and modeling the interrelationships between modalities presents substantial computational and analytical challenges in disease biomarker detection and drug discovery. Established practices rely on isolated methodologies to investigate individual molecular aspects separately, often resulting in inaccurate analyses. To address these obstacles, distinct Machine Learning Techniques are leveraged, each of its own kind to model the co-variation of DNA to RNA, and finally to surface proteins in single cells during hematopoietic stem cell development, which simplifies understanding of underlying cellular mechanisms and immune responses. Experiments conducted on a curated subset of a 300,000-cell time course dataset, highlights the exceptional performance of Echo State Networks, boasting a remarkable state-of-the-art correlation score of 0.94 and 0.895 on Multi-omic and CiteSeq datasets. Beyond the confines of this study, these findings hold promise for advancing comprehension of cellular differentiation and function, leveraging the potential of Machine Learning.
    Content Augmented Graph Neural Networks. (arXiv:2311.12741v1 [cs.LG])
    In recent years, graph neural networks (GNNs) have become a popular tool for solving various problems over graphs. In these models, the link structure of the graph is typically exploited and nodes' embeddings are iteratively updated based on adjacent nodes. Nodes' contents are used solely in the form of feature vectors, served as nodes' first-layer embeddings. However, the filters or convolutions, applied during iterations/layers to these initial embeddings lead to their impact diminish and contribute insignificantly to the final embeddings. In order to address this issue, in this paper we propose augmenting nodes' embeddings by embeddings generating from their content, at higher GNN layers. More precisely, we propose models wherein a structural embedding using a GNN and a content embedding are computed for each node. These two are combined using a combination layer to form the embedding of a node at a given layer. We suggest methods such as using an auto-encoder or building a content graph, to generate content embeddings. In the end, by conducting experiments over several real-world datasets, we demonstrate the high accuracy and performance of our models.
    Carbohydrate NMR chemical shift predictions using E(3) equivariant graph neural networks. (arXiv:2311.12657v1 [cs.LG])
    Carbohydrates, vital components of biological systems, are well-known for their structural diversity. Nuclear Magnetic Resonance (NMR) spectroscopy plays a crucial role in understanding their intricate molecular arrangements and is essential in assessing and verifying the molecular structure of organic molecules. An important part of this process is to predict the NMR chemical shift from the molecular structure. This work introduces a novel approach that leverages E(3) equivariant graph neural networks to predict carbohydrate NMR spectra. Notably, our model achieves a substantial reduction in mean absolute error, up to threefold, compared to traditional models that rely solely on two-dimensional molecular structure. Even with limited data, the model excels, highlighting its robustness and generalization capabilities. The implications are far-reaching and go beyond an advanced understanding of carbohydrate structures and spectral interpretation. For example, it could accelerate research in pharmaceutical applications, biochemistry, and structural biology, offering a faster and more reliable analysis of molecular structures. Furthermore, our approach is a key step towards a new data-driven era in spectroscopy, potentially influencing spectroscopic techniques beyond NMR.
    Kernel t-distributed stochastic neighbor embedding. (arXiv:2307.07081v2 [cs.LG] UPDATED)
    This paper presents a kernelized version of the t-SNE algorithm, capable of mapping high-dimensional data to a low-dimensional space while preserving the pairwise distances between the data points in a non-Euclidean metric. This can be achieved using a kernel trick only in the high dimensional space or in both spaces, leading to an end-to-end kernelized version. The proposed kernelized version of the t-SNE algorithm can offer new views on the relationships between data points, which can improve performance and accuracy in particular applications, such as classification problems involving kernel methods. The differences between t-SNE and its kernelized version are illustrated for several datasets, showing a neater clustering of points belonging to different classes.
    Fair Polylog-Approximate Low-Cost Hierarchical Clustering. (arXiv:2311.12501v1 [cs.LG])
    Research in fair machine learning, and particularly clustering, has been crucial in recent years given the many ethical controversies that modern intelligent systems have posed. Ahmadian et al. [2020] established the study of fairness in \textit{hierarchical} clustering, a stronger, more structured variant of its well-known flat counterpart, though their proposed algorithm that optimizes for Dasgupta's [2016] famous cost function was highly theoretical. Knittel et al. [2023] then proposed the first practical fair approximation for cost, however they were unable to break the polynomial-approximate barrier they posed as a hurdle of interest. We break this barrier, proposing the first truly polylogarithmic-approximate low-cost fair hierarchical clustering, thus greatly bridging the gap between the best fair and vanilla hierarchical clustering approximations.
    Rockafellian Relaxation and Stochastic Optimization under Perturbations. (arXiv:2204.04762v4 [math.OC] UPDATED)
    In practice, optimization models are often prone to unavoidable inaccuracies due to dubious assumptions and corrupted data. Traditionally, this placed special emphasis on risk-based and robust formulations, and their focus on ``conservative" decisions. We develop, in contrast, an ``optimistic" framework based on Rockafellian relaxations in which optimization is conducted not only over the original decision space but also jointly with a choice of model perturbation. The framework enables us to address challenging problems with ambiguous probability distributions from the areas of two-stage stochastic optimization without relatively complete recourse, probability functions lacking continuity properties, expectation constraints, and outlier analysis. We are also able to circumvent the fundamental difficulty in stochastic optimization that convergence of distributions fails to guarantee convergence of expectations. The framework centers on the novel concepts of exact and limit-exact Rockafellians, with interpretations of ``negative'' regularization emerging in certain settings. We illustrate the role of Phi-divergence, examine rates of convergence under changing distributions, and explore extensions to first-order optimality conditions. The main development is free of assumptions about convexity, smoothness, and even continuity of objective functions. Numerical results in the setting of computer vision and text analytics with label noise illustrate the framework.
    Random Fourier Signature Features. (arXiv:2311.12214v1 [stat.ML])
    Tensor algebras give rise to one of the most powerful measures of similarity for sequences of arbitrary length called the signature kernel accompanied with attractive theoretical guarantees from stochastic analysis. Previous algorithms to compute the signature kernel scale quadratically in terms of the length and the number of the sequences. To mitigate this severe computational bottleneck, we develop a random Fourier feature-based acceleration of the signature kernel acting on the inherently non-Euclidean domain of sequences. We show uniform approximation guarantees for the proposed unbiased estimator of the signature kernel, while keeping its computation linear in the sequence length and number. In addition, combined with recent advances on tensor projections, we derive two even more scalable time series features with favourable concentration properties and computational complexity both in time and memory. Our empirical results show that the reduction in computational cost comes at a negligible price in terms of accuracy on moderate-sized datasets, and it enables one to scale to large datasets up to a million time series.
    Heterogeneous Domain Adaptation with Positive and Unlabeled Data. (arXiv:2304.07955v2 [cs.LG] UPDATED)
    Heterogeneous unsupervised domain adaptation (HUDA) is the most challenging domain adaptation setting where the feature spaces of source and target domains are heterogeneous, and the target domain has only unlabeled data. Existing HUDA methods assume that both positive and negative examples are available in the source domain, which may not be satisfied in some real applications. This paper addresses a new challenging setting called positive and unlabeled heterogeneous unsupervised domain adaptation (PU-HUDA), a HUDA setting where the source domain only has positives. PU-HUDA can also be viewed as an extension of PU learning where the positive and unlabeled examples are sampled from different domains. A naive combination of existing HUDA and PU learning methods is ineffective in PU-HUDA due to the gap in label distribution between the source and target domains. To overcome this issue, we propose a novel method, predictive adversarial domain adaptation (PADA), which can predict likely positive examples from the unlabeled target data and simultaneously align the feature spaces to reduce the distribution divergence between the whole source data and the likely positive target data. PADA achieves this by a unified adversarial training framework for learning a classifier to predict positive examples and a feature transformer to transform the target feature space to that of the source. Specifically, they are both trained to fool a common discriminator that determines whether the likely positive examples are from the target or source domain. We experimentally show that PADA outperforms several baseline methods, such as the naive combination of HUDA and PU learning.
    Exploring Graph Classification Techniques Under Low Data Constraints: A Comprehensive Study. (arXiv:2311.12737v1 [cs.LG])
    This survey paper presents a brief overview of recent research on graph data augmentation and few-shot learning. It covers various techniques for graph data augmentation, including node and edge perturbation, graph coarsening, and graph generation, as well as the latest developments in few-shot learning, such as meta-learning and model-agnostic meta-learning. The paper explores these areas in depth and delves into further sub classifications. Rule based approaches and learning based approaches are surveyed under graph augmentation techniques. Few-Shot Learning on graphs is also studied in terms of metric learning techniques and optimization-based techniques. In all, this paper provides an extensive array of techniques that can be employed in solving graph processing problems faced in low-data scenarios.
    Utilizing Language Models for Tour Itinerary Recommendation. (arXiv:2311.12355v1 [cs.IR])
    Tour itinerary recommendation involves planning a sequence of relevant Point-of-Interest (POIs), which combines challenges from the fields of both Operations Research (OR) and Recommendation Systems (RS). As an OR problem, there is the need to maximize a certain utility (e.g., popularity of POIs in the tour) while adhering to some constraints (e.g., maximum time for the tour). As a RS problem, it is heavily related to problem or filtering or ranking a subset of POIs that are relevant to a user and recommending it as part of an itinerary. In this paper, we explore the use of language models for the task of tour itinerary recommendation and planning. This task has the unique requirement of recommending personalized POIs relevant to users and planning these POIs as an itinerary that satisfies various constraints. We discuss some approaches in this area, such as using word embedding techniques like Word2Vec and GloVe for learning POI embeddings and transformer-based techniques like BERT for generating itineraries.
    Fair Text Classification with Wasserstein Independence. (arXiv:2311.12689v1 [cs.CL])
    Group fairness is a central research topic in text classification, where reaching fair treatment between sensitive groups (e.g. women vs. men) remains an open challenge. This paper presents a novel method for mitigating biases in neural text classification, agnostic to the model architecture. Considering the difficulty to distinguish fair from unfair information in a text encoder, we take inspiration from adversarial training to induce Wasserstein independence between representations learned to predict our target label and the ones learned to predict some sensitive attribute. Our approach provides two significant advantages. Firstly, it does not require annotations of sensitive attributes in both testing and training data. This is more suitable for real-life scenarios compared to existing methods that require annotations of sensitive attributes at train time. Second, our approach exhibits a comparable or better fairness-accuracy trade-off compared to existing methods.
    Active Bird2Vec: Towards End-to-End Bird Sound Monitoring with Transformers. (arXiv:2308.07121v2 [cs.SD] UPDATED)
    We propose a shift towards end-to-end learning in bird sound monitoring by combining self-supervised (SSL) and deep active learning (DAL). Leveraging transformer models, we aim to bypass traditional spectrogram conversions, enabling direct raw audio processing. ActiveBird2Vec is set to generate high-quality bird sound representations through SSL, potentially accelerating the assessment of environmental changes and decision-making processes for wind farms. Additionally, we seek to utilize the wide variety of bird vocalizations through DAL, reducing the reliance on extensively labeled datasets by human experts. We plan to curate a comprehensive set of tasks through Huggingface Datasets, enhancing future comparability and reproducibility of bioacoustic research. A comparative analysis between various transformer models will be conducted to evaluate their proficiency in bird sound recognition tasks. We aim to accelerate the progression of avian bioacoustic research and contribute to more effective conservation strategies.
    Power grid operational risk assessment using graph neural network surrogates. (arXiv:2311.12309v1 [cs.LG])
    We investigate the utility of graph neural networks (GNNs) as proxies of power grid operational decision-making algorithms (optimal power flow (OPF) and security-constrained unit commitment (SCUC)) to enable rigorous quantification of the operational risk. To conduct principled risk analysis, numerous Monte Carlo (MC) samples are drawn from the (foretasted) probability distributions of spatio-temporally correlated stochastic grid variables. The corresponding OPF and SCUC solutions, which are needed to quantify the risk, are generated using traditional OPF and SCUC solvers to generate data for training GNN model(s). The GNN model performance is evaluated in terms of the accuracy of predicting quantities of interests (QoIs) derived from the decision variables in OPF and SCUC. Specifically, we focus on thermal power generation and load shedding at system and individual zone level. We also perform reliability and risk quantification based on GNN predictions and compare with that obtained from OPF/SCUC solutions. Our results demonstrate that GNNs are capable of providing fast and accurate prediction of QoIs and thus can be good surrogate models for OPF and SCUC. The excellent accuracy of GNN-based reliability and risk assessment further suggests that GNN surrogate has the potential to be applied in real-time and hours-ahead risk quantification.
    Moderating Model Marketplaces: Platform Governance Puzzles for AI Intermediaries. (arXiv:2311.12573v1 [cs.CY])
    The AI development community is increasingly making use of hosting intermediaries such as Hugging Face provide easy access to user-uploaded models and training data. These model marketplaces lower technical deployment barriers for hundreds of thousands of users, yet can be used in numerous potentially harmful and illegal ways. In this article, we explain ways in which AI systems, which can both `contain' content and be open-ended tools, present one of the trickiest platform governance challenges seen to date. We provide case studies of several incidents across three illustrative platforms -- Hugging Face, GitHub and Civitai -- to examine how model marketplaces moderate models. Building on this analysis, we outline important (and yet nevertheless limited) practices that industry has been developing to respond to moderation demands: licensing, access and use restrictions, automated content moderation, and open policy development. While the policy challenge at hand is a considerable one, we conclude with some ideas as to how platforms could better mobilize resources to act as a careful, fair, and proportionate regulatory access point.
    SelfOcc: Self-Supervised Vision-Based 3D Occupancy Prediction. (arXiv:2311.12754v1 [cs.CV])
    3D occupancy prediction is an important task for the robustness of vision-centric autonomous driving, which aims to predict whether each point is occupied in the surrounding 3D space. Existing methods usually require 3D occupancy labels to produce meaningful results. However, it is very laborious to annotate the occupancy status of each voxel. In this paper, we propose SelfOcc to explore a self-supervised way to learn 3D occupancy using only video sequences. We first transform the images into the 3D space (e.g., bird's eye view) to obtain 3D representation of the scene. We directly impose constraints on the 3D representations by treating them as signed distance fields. We can then render 2D images of previous and future frames as self-supervision signals to learn the 3D representations. We propose an MVS-embedded strategy to directly optimize the SDF-induced weights with multiple depth proposals. Our SelfOcc outperforms the previous best method SceneRF by 58.7% using a single frame as input on SemanticKITTI and is the first self-supervised work that produces reasonable 3D occupancy for surround cameras on Occ3D. SelfOcc produces high-quality depth and achieves state-of-the-art results on novel depth synthesis, monocular depth estimation, and surround-view depth estimation on the SemanticKITTI, KITTI-2015, and nuScenes, respectively. Code: https://github.com/huang-yh/SelfOcc.
    Graph Neural Ordinary Differential Equations-based method for Collaborative Filtering. (arXiv:2311.12329v1 [cs.LG])
    Graph Convolution Networks (GCNs) are widely considered state-of-the-art for collaborative filtering. Although several GCN-based methods have been proposed and achieved state-of-the-art performance in various tasks, they can be computationally expensive and time-consuming to train if too many layers are created. However, since the linear GCN model can be interpreted as a differential equation, it is possible to transfer it to an ODE problem. This inspired us to address the computational limitations of GCN-based models by designing a simple and efficient NODE-based model that can skip some GCN layers to reach the final state, thus avoiding the need to create many layers. In this work, we propose a Graph Neural Ordinary Differential Equation-based method for Collaborative Filtering (GODE-CF). This method estimates the final embedding by utilizing the information captured by one or two GCN layers. To validate our approach, we conducted experiments on multiple datasets. The results demonstrate that our model outperforms competitive baselines, including GCN-based models and other state-of-the-art CF methods. Notably, our proposed GODE-CF model has several advantages over traditional GCN-based models. It is simple, efficient, and has a fast training time, making it a practical choice for real-world situations.
    Attacks of fairness in Federated Learning. (arXiv:2311.12715v1 [cs.LG])
    Federated Learning is an important emerging distributed training paradigm that keeps data private on clients. It is now well understood that by controlling only a small subset of FL clients, it is possible to introduce a backdoor to a federated learning model, in the presence of certain attributes. In this paper, we present a new type of attack that compromises the fairness of the trained model. Fairness is understood to be the attribute-level performance distribution of a trained model. It is particularly salient in domains where, for example, skewed accuracy discrimination between subpopulations could have disastrous consequences. We find that by employing a threat model similar to that of a backdoor attack, an attacker is able to influence the aggregated model to have an unfair performance distribution between any given set of attributes. Furthermore, we find that this attack is possible by controlling only a single client. While combating naturally induced unfairness in FL has previously been discussed in depth, its artificially induced kind has been neglected. We show that defending against attacks on fairness should be a critical consideration in any situation where unfairness in a trained model could benefit a user who participated in its training.
    Bridging Algorithmic Information Theory and Machine Learning: A New Approach to Kernel Learning. (arXiv:2311.12624v1 [cs.LG])
    Machine Learning (ML) and Algorithmic Information Theory (AIT) look at Complexity from different points of view. We explore the interface between AIT and Kernel Methods (that are prevalent in ML) by adopting an AIT perspective on the problem of learning kernels from data, in kernel ridge regression, through the method of Sparse Kernel Flows. In particular, by looking at the differences and commonalities between Minimal Description Length (MDL) and Regularization in Machine Learning (RML), we prove that the method of Sparse Kernel Flows is the natural approach to adopt to learn kernels from data. This paper shows that it is not necessary to use the statistical route to derive Sparse Kernel Flows and that one can directly work with code-lengths and complexities that are concepts that show up in AIT.
    Topological properties of basins of attraction and expressiveness of width bounded neural networks. (arXiv:2011.04923v5 [cs.LG] UPDATED)
    In Radhakrishnan et al. [2020], the authors empirically show that autoencoders trained with usual SGD methods shape out basins of attraction around their training data. We consider network functions of width not exceeding the input dimension and prove that in this situation basins of attraction are bounded and their complement cannot have bounded components. Our conditions in these results are met in several experiments of the latter work and we thus address a question posed therein. We also show that under some more restrictive conditions the basins of attraction are path-connected. The tightness of the conditions in our results is demonstrated by means of several examples. Finally, the arguments used to prove the above results allow us to derive a root cause why scalar-valued neural network functions that fulfill our bounded width condition are not dense in spaces of continuous functions.
    Kuro Siwo: 12.1 billion $m^2$ under the water. A global multi-temporal satellite dataset for rapid flood mapping. (arXiv:2311.12056v1 [cs.CV])
    Global floods, exacerbated by climate change, pose severe threats to human life, infrastructure, and the environment. This urgency is highlighted by recent catastrophic events in Pakistan and New Zealand, underlining the critical need for precise flood mapping for guiding restoration efforts, understanding vulnerabilities, and preparing for future events. While Synthetic Aperture Radar (SAR) offers day-and-night, all-weather imaging capabilities, harnessing it for deep learning is hindered by the absence of a large annotated dataset. To bridge this gap, we introduce Kuro Siwo, a meticulously curated multi-temporal dataset, spanning 32 flood events globally. Our dataset maps more than 63 billion m2 of land, with 12.1 billion of them being either a flooded area or a permanent water body. Kuro Siwo stands out for its unparalleled annotation quality to facilitate rapid flood mapping in a supervised setting. We also augment learning by including a large unlabeled set of SAR samples, aimed at self-supervised pretraining. We provide an extensive benchmark and strong baselines for a diverse set of flood events from Europe, America, Africa and Australia. Our benchmark demonstrates the quality of Kuro Siwo annotations, training models that can achieve $\approx$ 85% and $\approx$ 87% in F1-score for flooded areas and general water detection respectively. This work calls on the deep learning community to develop solution-driven algorithms for rapid flood mapping, with the potential to aid civil protection and humanitarian agencies amid climate change challenges. Our code and data will be made available at https://github.com/Orion-AI-Lab/KuroSiwo
    Shortcut Learning in Deep Neural Networks. (arXiv:2004.07780v5 [cs.CV] UPDATED)
    Deep learning has triggered the current rise of artificial intelligence and is the workhorse of today's machine intelligence. Numerous success stories have rapidly spread all over science, industry and society, but its limitations have only recently come into focus. In this perspective we seek to distill how many of deep learning's problems can be seen as different symptoms of the same underlying problem: shortcut learning. Shortcuts are decision rules that perform well on standard benchmarks but fail to transfer to more challenging testing conditions, such as real-world scenarios. Related issues are known in Comparative Psychology, Education and Linguistics, suggesting that shortcut learning may be a common characteristic of learning systems, biological and artificial alike. Based on these observations, we develop a set of recommendations for model interpretation and benchmarking, highlighting recent advances in machine learning to improve robustness and transferability from the lab to real-world applications.
    Self supervised convolutional kernel based handcrafted feature harmonization: Enhanced left ventricle hypertension disease phenotyping on echocardiography. (arXiv:2310.08897v2 [eess.IV] UPDATED)
    Radiomics, a medical imaging technique, extracts quantitative handcrafted features from images to predict diseases. Harmonization in those features ensures consistent feature extraction across various imaging devices and protocols. Methods for harmonization include standardized imaging protocols, statistical adjustments, and evaluating feature robustness. Myocardial diseases such as Left Ventricular Hypertrophy (LVH) and Hypertensive Heart Disease (HHD) are diagnosed via echocardiography, but variable imaging settings pose challenges. Harmonization techniques are crucial for applying handcrafted features in disease diagnosis in such scenario. Self-supervised learning (SSL) enhances data understanding within limited datasets and adapts to diverse data settings. ConvNeXt-V2 integrates convolutional layers into SSL, displaying superior performance in various tasks. This study focuses on convolutional filters within SSL, using them as preprocessing to convert images into feature maps for handcrafted feature harmonization. Our proposed method excelled in harmonization evaluation and exhibited superior LVH classification performance compared to existing methods.
    Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks. (arXiv:2311.12786v1 [cs.LG])
    Fine-tuning large pre-trained models has become the de facto strategy for developing both task-specific and general-purpose machine learning systems, including developing models that are safe to deploy. Despite its clear importance, there has been minimal work that explains how fine-tuning alters the underlying capabilities learned by a model during pretraining: does fine-tuning yield entirely novel capabilities or does it just modulate existing ones? We address this question empirically in synthetic, controlled settings where we can use mechanistic interpretability tools (e.g., network pruning and probing) to understand how the model's underlying capabilities are changing. We perform an extensive analysis of the effects of fine-tuning in these settings, and show that: (i) fine-tuning rarely alters the underlying model capabilities; (ii) a minimal transformation, which we call a 'wrapper', is typically learned on top of the underlying model capabilities, creating the illusion that they have been modified; and (iii) further fine-tuning on a task where such hidden capabilities are relevant leads to sample-efficient 'revival' of the capability, i.e., the model begins reusing these capability after only a few gradient steps. This indicates that practitioners can unintentionally remove a model's safety wrapper merely by fine-tuning it on a, e.g., superficially unrelated, downstream task. We additionally perform analysis on language models trained on the TinyStories dataset to support our claims in a more realistic setup.
    Neural Bregman Divergences for Distance Learning. (arXiv:2206.04763v2 [cs.LG] UPDATED)
    Many metric learning tasks, such as triplet learning, nearest neighbor retrieval, and visualization, are treated primarily as embedding tasks where the ultimate metric is some variant of the Euclidean distance (e.g., cosine or Mahalanobis), and the algorithm must learn to embed points into the pre-chosen space. The study of non-Euclidean geometries is often not explored, which we believe is due to a lack of tools for learning non-Euclidean measures of distance. Recent work has shown that Bregman divergences can be learned from data, opening a promising approach to learning asymmetric distances. We propose a new approach to learning arbitrary Bergman divergences in a differentiable manner via input convex neural networks and show that it overcomes significant limitations of previous works. We also demonstrate that our method more faithfully learns divergences over a set of both new and previously studied tasks, including asymmetric regression, ranking, and clustering. Our tests further extend to known asymmetric, but non-Bregman tasks, where our method still performs competitively despite misspecification, showing the general utility of our approach for asymmetric learning.
    Machine-learning-accelerated simulations to enable automatic surface reconstruction. (arXiv:2305.07251v2 [cond-mat.mtrl-sci] UPDATED)
    Understanding material surfaces and interfaces is vital in applications like catalysis or electronics. By combining energies from electronic structure with statistical mechanics, ab initio simulations can in principle predict the structure of material surfaces as a function of thermodynamic variables. However, accurate energy simulations are prohibitive when coupled to the vast phase space that must be statistically sampled. Here, we present a bi-faceted computational loop to predict surface phase diagrams of multi-component materials that accelerates both the energy scoring and statistical sampling methods. Fast, scalable, and data-efficient machine learning interatomic potentials are trained on high-throughput density-functional theory calculations through closed-loop active learning. Markov-chain Monte Carlo sampling in the semi-grand canonical ensemble is enabled by using virtual surface sites. The predicted surfaces for GaN(0001), Si(111), and SrTiO3(001) are in agreement with past work and suggest that the proposed strategy can model complex material surfaces and discover previously unreported surface terminations.
    Machine-Guided Discovery of a Real-World Rogue Wave Model. (arXiv:2311.12579v1 [physics.geo-ph])
    Big data and large-scale machine learning have had a profound impact on science and engineering, particularly in fields focused on forecasting and prediction. Yet, it is still not clear how we can use the superior pattern matching abilities of machine learning models for scientific discovery. This is because the goals of machine learning and science are generally not aligned. In addition to being accurate, scientific theories must also be causally consistent with the underlying physical process and allow for human analysis, reasoning, and manipulation to advance the field. In this paper, we present a case study on discovering a new symbolic model for oceanic rogue waves from data using causal analysis, deep learning, parsimony-guided model selection, and symbolic regression. We train an artificial neural network on causal features from an extensive dataset of observations from wave buoys, while selecting for predictive performance and causal invariance. We apply symbolic regression to distill this black-box model into a mathematical equation that retains the neural network's predictive capabilities, while allowing for interpretation in the context of existing wave theory. The resulting model reproduces known behavior, generates well-calibrated probabilities, and achieves better predictive scores on unseen data than current theory. This showcases how machine learning can facilitate inductive scientific discovery, and paves the way for more accurate rogue wave forecasting.
    Influencer Videos: Unboxing the Mystique. (arXiv:2012.12311v3 [cs.LG] UPDATED)
    Influencer marketing has become a very popular tool to reach customers. Despite the rapid growth in influencer videos, there has been little research on the effectiveness of their constituent features in explaining video engagement. We study YouTube influencers and analyze their unstructured video data across text, audio and images using an "interpretable deep learning" framework that accomplishes both goals of prediction and interpretation. Our prediction-based approach analyzes unstructured data and finds that "what is said" in words (text) is more influential than "how it is said" in imagery (images) or acoustics (audio). Our novel interpretation-based approach is implemented after completion of model prediction by analyzing the same source of unstructured data to measure importance attributed to the video features. We eliminate several spurious relationships in two steps, identifying a subset of relationships which are confirmed using theory. We uncover novel findings that establish distinct associations for measures of shallow and deep engagement based on the dual-system framework of human thinking. Our approach is validated using simulated data, and we discuss the learnings from our findings for influencers and brands.
    Quantum Inception Score. (arXiv:2311.12163v1 [quant-ph])
    Motivated by the great success of classical generative models in machine learning, enthusiastic exploration of their quantum version has recently started. To depart on this journey, it is important to develop a relevant metric to evaluate the quality of quantum generative models; in the classical case, one such examples is the inception score. In this paper, we propose the quantum inception score, which relates the quality to the classical capacity of the quantum channel that classifies a given dataset. We prove that, under this proposed measure, the quantum generative models provide better quality than their classical counterparts because of the presence of quantum coherence and entanglement. Finally, we harness the quantum fluctuation theorem to characterize the physical limitation of the quality of quantum generative models.
    Deep learning-based detection of morphological features associated with hypoxia in H&E breast cancer whole slide images. (arXiv:2311.12601v1 [cs.CV])
    Hypoxia occurs when tumour cells outgrow their blood supply, leading to regions of low oxygen levels within the tumour. Calculating hypoxia levels can be an important step in understanding the biology of tumours, their clinical progression and response to treatment. This study demonstrates a novel application of deep learning to evaluate hypoxia in the context of breast cancer histomorphology. More precisely, we show that Weakly Supervised Deep Learning (WSDL) models can accurately detect hypoxia associated features in routine Hematoxylin and Eosin (H&E) whole slide images (WSI). We trained and evaluated a deep Multiple Instance Learning model on tiles from WSI H&E tissue from breast cancer primary sites (n=240) obtaining on average an AUC of 0.87 on a left-out test set. We also showed significant differences between features of hypoxic and normoxic tissue regions as distinguished by the WSDL models. Such DL hypoxia H&E WSI detection models could potentially be extended to other tumour types and easily integrated into the pathology workflow without requiring additional costly assays.
    Learning to Optimise Wind Farms with Graph Transformers. (arXiv:2311.12750v1 [cs.LG])
    This work proposes a novel data-driven model capable of providing accurate predictions for the power generation of all wind turbines in wind farms of arbitrary layout, yaw angle configurations and wind conditions. The proposed model functions by encoding a wind farm into a fully-connected graph and processing the graph representation through a graph transformer. The graph transformer surrogate is shown to generalise well and is able to uncover latent structural patterns within the graph representation of wind farms. It is demonstrated how the resulting surrogate model can be used to optimise yaw angle configurations using genetic algorithms, achieving similar levels of accuracy to industrially-standard wind farm simulation tools while only taking a fraction of the computational cost.
    Summary of the DISPLACE Challenge 2023 -- DIarization of SPeaker and LAnguage in Conversational Environments. (arXiv:2311.12564v1 [eess.AS])
    In multi-lingual societies, where multiple languages are spoken in a small geographic vicinity, informal conversations often involve mix of languages. Existing speech technologies may be inefficient in extracting information from such conversations, where the speech data is rich in diversity with multiple languages and speakers. The DISPLACE (DIarization of SPeaker and LAnguage in Conversational Environments) challenge constitutes an open-call for evaluating and bench-marking the speaker and language diarization technologies on this challenging condition. The challenge entailed two tracks: Track-1 focused on speaker diarization (SD) in multilingual situations while, Track-2 addressed the language diarization (LD) in a multi-speaker scenario. Both the tracks were evaluated using the same underlying audio data. To facilitate this evaluation, a real-world dataset featuring multilingual, multi-speaker conversational far-field speech was recorded and distributed. Furthermore, a baseline system was made available for both SD and LD task which mimicked the state-of-art in these tasks. The challenge garnered a total of $42$ world-wide registrations and received a total of $19$ combined submissions for Track-1 and Track-2. This paper describes the challenge, details of the datasets, tasks, and the baseline system. Additionally, the paper provides a concise overview of the submitted systems in both tracks, with an emphasis given to the top performing systems. The paper also presents insights and future perspectives for SD and LD tasks, focusing on the key challenges that the systems need to overcome before wide-spread commercial deployment on such conversations.
    Decentralised Q-Learning for Multi-Agent Markov Decision Processes with a Satisfiability Criterion. (arXiv:2311.12613v1 [eess.SY])
    In this paper, we propose a reinforcement learning algorithm to solve a multi-agent Markov decision process (MMDP). The goal, inspired by Blackwell's Approachability Theorem, is to lower the time average cost of each agent to below a pre-specified agent-specific bound. For the MMDP, we assume the state dynamics to be controlled by the joint actions of agents, but the per-stage costs to only depend on the individual agent's actions. We combine the Q-learning algorithm for a weighted combination of the costs of each agent, obtained by a gossip algorithm with the Metropolis-Hastings or Multiplicative Weights formalisms to modulate the averaging matrix of the gossip. We use multiple timescales in our algorithm and prove that under mild conditions, it approximately achieves the desired bounds for each of the agents. We also demonstrate the empirical performance of this algorithm in the more general setting of MMDPs having jointly controlled per-stage costs.
    Orthogonally weighted $\ell_{2,1}$ regularization for rank-aware joint sparse recovery: algorithm and analysis. (arXiv:2311.12282v1 [math.NA])
    We propose and analyze an efficient algorithm for solving the joint sparse recovery problem using a new regularization-based method, named orthogonally weighted $\ell_{2,1}$ ($\mathit{ow}\ell_{2,1}$), which is specifically designed to take into account the rank of the solution matrix. This method has applications in feature extraction, matrix column selection, and dictionary learning, and it is distinct from commonly used $\ell_{2,1}$ regularization and other existing regularization-based approaches because it can exploit the full rank of the row-sparse solution matrix, a key feature in many applications. We provide a proof of the method's rank-awareness, establish the existence of solutions to the proposed optimization problem, and develop an efficient algorithm for solving it, whose convergence is analyzed. We also present numerical experiments to illustrate the theory and demonstrate the effectiveness of our method on real-life problems.
    A Supervised Contrastive Learning Pretrain-Finetune Approach for Time Series. (arXiv:2311.12290v1 [cs.LG])
    Foundation models have recently gained attention within the field of machine learning thanks to its efficiency in broad data processing. While researchers had attempted to extend this success to time series models, the main challenge is effectively extracting representations and transferring knowledge from pretraining datasets to the target finetuning dataset. To tackle this issue, we introduce a novel pretraining procedure that leverages supervised contrastive learning to distinguish features within each pretraining dataset. This pretraining phase enables a probabilistic similarity metric, which assesses the likelihood of a univariate sample being closely related to one of the pretraining datasets. Subsequently, using this similarity metric as a guide, we propose a fine-tuning procedure designed to enhance the accurate prediction of the target data by aligning it more closely with the learned dynamics of the pretraining datasets. Our experiments have shown promising results which demonstrate the efficacy of our approach.
    Learning Causal Representations from General Environments: Identifiability and Intrinsic Ambiguity. (arXiv:2311.12267v1 [cs.LG])
    This paper studies causal representation learning, the task of recovering high-level latent variables and their causal relationships from low-level data that we observe, assuming access to observations generated from multiple environments. While existing works are able to prove full identifiability of the underlying data generating process, they typically assume access to single-node, hard interventions which is rather unrealistic in practice. The main contribution of this paper is characterize a notion of identifiability which is provably the best one can achieve when hard interventions are not available. First, for linear causal models, we provide identifiability guarantee for data observed from general environments without assuming any similarities between them. While the causal graph is shown to be fully recovered, the latent variables are only identified up to an effect-domination ambiguity (EDA). We then propose an algorithm, LiNGCReL which is guaranteed to recover the ground-truth model up to EDA, and we demonstrate its effectiveness via numerical experiments. Moving on to general non-parametric causal models, we prove the same idenfifiability guarantee assuming access to groups of soft interventions. Finally, we provide counterparts of our identifiability results, indicating that EDA is basically inevitable in our setting.
    nach0: Multimodal Natural and Chemical Languages Foundation Model. (arXiv:2311.12410v1 [cs.CL])
    Large Language Models (LLMs) have substantially driven scientific progress in various domains, and many papers have demonstrated their ability to tackle complex problems with creative solutions. Our paper introduces a new foundation model, nach0, capable of solving various chemical and biological tasks: biomedical question answering, named entity recognition, molecular generation, molecular synthesis, attributes prediction, and others. nach0 is a multi-domain and multi-task encoder-decoder LLM pre-trained on unlabeled text from scientific literature, patents, and molecule strings to incorporate a range of chemical and linguistic knowledge. We employed instruction tuning, where specific task-related instructions are utilized to fine-tune nach0 for the final set of tasks. To train nach0 effectively, we leverage the NeMo framework, enabling efficient parallel optimization of both base and large model versions. Extensive experiments demonstrate that our model outperforms state-of-the-art baselines on single-domain and cross-domain tasks. Furthermore, it can generate high-quality outputs in molecular and textual formats, showcasing its effectiveness in multi-domain setups.
    Looped Transformers are Better at Learning Learning Algorithms. (arXiv:2311.12424v1 [cs.LG])
    Transformers have demonstrated effectiveness in \emph{in-context solving} data-fitting problems from various (latent) models, as reported by Garg et al. However, the absence of an inherent iterative structure in the transformer architecture presents a challenge in emulating the iterative algorithms, which are commonly employed in traditional machine learning methods. To address this, we propose the utilization of \emph{looped} transformer architecture and its associated training methodology, with the aim of incorporating iterative characteristics into the transformer architectures. Experimental results suggest that the looped transformer achieves performance comparable to the standard transformer in solving various data-fitting problems, while utilizing less than 10\% of the parameter count.
    Delta Score: Improving the Binding Assessment of Structure-Based Drug Design Methods. (arXiv:2311.12035v1 [q-bio.QM])
    Structure-based drug design (SBDD) stands at the forefront of drug discovery, emphasizing the creation of molecules that target specific binding pockets. Recent advances in this area have witnessed the adoption of deep generative models and geometric deep learning techniques, modeling SBDD as a conditional generation task where the target structure serves as context. Historically, evaluation of these models centered on docking scores, which quantitatively depict the predicted binding affinity between a molecule and its target pocket. Though state-of-the-art models purport that a majority of their generated ligands exceed the docking score of ground truth ligands in test sets, it begs the question: Do these scores align with real-world biological needs? In this paper, we introduce the delta score, a novel evaluation metric grounded in tangible pharmaceutical requisites. Our experiments reveal that molecules produced by current deep generative models significantly lag behind ground truth reference ligands when assessed with the delta score. This novel metric not only complements existing benchmarks but also provides a pivotal direction for subsequent research in the domain.
    Leveraging healthy population variability in deep learning unsupervised anomaly detection in brain FDG PET. (arXiv:2311.12081v1 [eess.IV])
    Unsupervised anomaly detection is a popular approach for the analysis of neuroimaging data as it allows to identify a wide variety of anomalies from unlabelled data. It relies on building a subject-specific model of healthy appearance to which a subject's image can be compared to detect anomalies. In the literature, it is common for anomaly detection to rely on analysing the residual image between the subject's image and its pseudo-healthy reconstruction. This approach however has limitations partly due to the pseudo-healthy reconstructions being imperfect and to the lack of natural thresholding mechanism. Our proposed method, inspired by Z-scores, leverages the healthy population variability to overcome these limitations. Our experiments conducted on FDG PET scans from the ADNI database demonstrate the effectiveness of our approach in accurately identifying Alzheimer's disease related anomalies.
    In-Context Learning Functions with Varying Number of Minima. (arXiv:2311.12538v1 [cs.LG])
    Large Language Models (LLMs) have proven effective at In-Context Learning (ICL), an ability that allows them to create predictors from labeled examples. Few studies have explored the interplay between ICL and specific properties of functions it attempts to approximate. In our study, we use a formal framework to explore ICL and propose a new task of approximating functions with varying number of minima. We implement a method that allows for producing functions with given inputs as minima. We find that increasing the number of minima degrades ICL performance. At the same time, our evaluation shows that ICL outperforms 2-layer Neural Network (2NN) model. Furthermore, ICL learns faster than 2NN in all settings. We validate the findings through a set of few-shot experiments across various hyperparameter configurations.
    SecureBERT and LLAMA 2 Empowered Control Area Network Intrusion Detection and Classification. (arXiv:2311.12074v1 [cs.CR])
    Numerous studies have proved their effective strength in detecting Control Area Network (CAN) attacks. In the realm of understanding the human semantic space, transformer-based models have demonstrated remarkable effectiveness. Leveraging pre-trained transformers has become a common strategy in various language-related tasks, enabling these models to grasp human semantics more comprehensively. To delve into the adaptability evaluation on pre-trained models for CAN intrusion detection, we have developed two distinct models: CAN-SecureBERT and CAN-LLAMA2. Notably, our CAN-LLAMA2 model surpasses the state-of-the-art models by achieving an exceptional performance 0.999993 in terms of balanced accuracy, precision detection rate, F1 score, and a remarkably low false alarm rate of 3.10e-6. Impressively, the false alarm rate is 52 times smaller than that of the leading model, MTH-IDS (Multitiered Hybrid Intrusion Detection System). Our study underscores the promise of employing a Large Language Model as the foundational model, while incorporating adapters for other cybersecurity-related tasks and maintaining the model's inherent language-related capabilities.
    Differentiable Sampling of Categorical Distributions Using the CatLog-Derivative Trick. (arXiv:2311.12569v1 [cs.LG])
    Categorical random variables can faithfully represent the discrete and uncertain aspects of data as part of a discrete latent variable model. Learning in such models necessitates taking gradients with respect to the parameters of the categorical probability distributions, which is often intractable due to their combinatorial nature. A popular technique to estimate these otherwise intractable gradients is the Log-Derivative trick. This trick forms the basis of the well-known REINFORCE gradient estimator and its many extensions. While the Log-Derivative trick allows us to differentiate through samples drawn from categorical distributions, it does not take into account the discrete nature of the distribution itself. Our first contribution addresses this shortcoming by introducing the CatLog-Derivative trick - a variation of the Log-Derivative trick tailored towards categorical distributions. Secondly, we use the CatLog-Derivative trick to introduce IndeCateR, a novel and unbiased gradient estimator for the important case of products of independent categorical distributions with provably lower variance than REINFORCE. Thirdly, we empirically show that IndeCateR can be efficiently implemented and that its gradient estimates have significantly lower bias and variance for the same number of samples compared to the state of the art.
    One-shot backpropagation for multi-step prediction in physics-based system identification -- EXTENDED VERSION. (arXiv:2310.20567v2 [eess.SY] UPDATED)
    The aim of this paper is to present a novel physics-based framework for the identification of dynamical systems, in which the physical and structural insights are reflected directly into a backpropagation-based learning algorithm. The main result is a method to compute in closed form the gradient of a multi-step loss function, while enforcing physical properties and constraints. The derived algorithm has been exploited to identify the unknown inertia matrix of a space debris, and the results show the reliability of the method in capturing the physical adherence of the estimated parameters.
    Harnessing FPGA Technology for Enhanced Biomedical Computation. (arXiv:2311.12439v1 [cs.LG])
    This research delves into sophisticated neural network frameworks like Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), Long Short-Term Memory Networks (LSTMs), and Deep Belief Networks (DBNs) for improved analysis of ECG signals via Field Programmable Gate Arrays (FPGAs). The MIT-BIH Arrhythmia Database serves as the foundation for training and evaluating our models, with added Gaussian noise to heighten the algorithms' resilience. The developed architectures incorporate various layers for specific processing and categorization functions, employing strategies such as the EarlyStopping callback and Dropout layer to prevent overfitting. Additionally, this paper details the creation of a tailored Tensor Compute Unit (TCU) accelerator for the PYNQ Z1 platform. It provides a thorough methodology for implementing FPGA-based machine learning, encompassing the configuration of the Tensil toolchain in Docker, selection of architectures, PS-PL configuration, and the compilation and deployment of models. By evaluating performance indicators like latency and throughput, we showcase the efficacy of FPGAs in advanced biomedical computing. This study ultimately serves as a comprehensive guide to optimizing neural network operations on FPGAs across various fields.
    On Counterfactual Data Augmentation Under Confounding. (arXiv:2305.18183v2 [cs.LG] UPDATED)
    Counterfactual data augmentation has recently emerged as a method to mitigate confounding biases in the training data. These biases, such as spurious correlations, arise due to various observed and unobserved confounding variables in the data generation process. In this paper, we formally analyze how confounding biases impact downstream classifiers and present a causal viewpoint to the solutions based on counterfactual data augmentation. We explore how removing confounding biases serves as a means to learn invariant features, ultimately aiding in generalization beyond the observed data distribution. Additionally, we present a straightforward yet powerful algorithm for generating counterfactual images, which effectively mitigates the influence of confounding effects on downstream classifiers. Through experiments on MNIST variants and the CelebA datasets, we demonstrate how our simple augmentation method helps existing state-of-the-art methods achieve good results.
    SSVEP-DAN: A Data Alignment Network for SSVEP-based Brain Computer Interfaces. (arXiv:2311.12666v1 [cs.LG])
    Steady-state visual-evoked potential (SSVEP)-based brain-computer interfaces (BCIs) offer a non-invasive means of communication through high-speed speller systems. However, their efficiency heavily relies on individual training data obtained during time-consuming calibration sessions. To address the challenge of data insufficiency in SSVEP-based BCIs, we present SSVEP-DAN, the first dedicated neural network model designed for aligning SSVEP data across different domains, which can encompass various sessions, subjects, or devices. Our experimental results across multiple cross-domain scenarios demonstrate SSVEP-DAN's capability to transform existing source SSVEP data into supplementary calibration data, significantly enhancing SSVEP decoding accuracy in scenarios with limited calibration data. We envision SSVEP-DAN as a catalyst for practical SSVEP-based BCI applications with minimal calibration. The source codes in this work are available at: https://github.com/CECNL/SSVEP-DAN.
    survex: an R package for explaining machine learning survival models. (arXiv:2308.16113v2 [cs.LG] UPDATED)
    Due to their flexibility and superior performance, machine learning models frequently complement and outperform traditional statistical survival models. However, their widespread adoption is hindered by a lack of user-friendly tools to explain their internal operations and prediction rationales. To tackle this issue, we introduce the survex R package, which provides a cohesive framework for explaining any survival model by applying explainable artificial intelligence techniques. The capabilities of the proposed software encompass understanding and diagnosing survival models, which can lead to their improvement. By revealing insights into the decision-making process, such as variable effects and importances, survex enables the assessment of model reliability and the detection of biases. Thus, transparency and responsibility may be promoted in sensitive areas, such as biomedical research and healthcare applications.
    Hyb-NeRF: A Multiresolution Hybrid Encoding for Neural Radiance Fields. (arXiv:2311.12490v1 [cs.CV])
    Recent advances in Neural radiance fields (NeRF) have enabled high-fidelity scene reconstruction for novel view synthesis. However, NeRF requires hundreds of network evaluations per pixel to approximate a volume rendering integral, making it slow to train. Caching NeRFs into explicit data structures can effectively enhance rendering speed but at the cost of higher memory usage. To address these issues, we present Hyb-NeRF, a novel neural radiance field with a multi-resolution hybrid encoding that achieves efficient neural modeling and fast rendering, which also allows for high-quality novel view synthesis. The key idea of Hyb-NeRF is to represent the scene using different encoding strategies from coarse-to-fine resolution levels. Hyb-NeRF exploits memory-efficiency learnable positional features at coarse resolutions and the fast optimization speed and local details of hash-based feature grids at fine resolutions. In addition, to further boost performance, we embed cone tracing-based features in our learnable positional encoding that eliminates encoding ambiguity and reduces aliasing artifacts. Extensive experiments on both synthetic and real-world datasets show that Hyb-NeRF achieves faster rendering speed with better rending quality and even a lower memory footprint in comparison to previous state-of-the-art methods.
    Image Transformation for IoT Time-Series Data: A Review. (arXiv:2311.12742v1 [cs.LG])
    In the era of the Internet of Things (IoT), where smartphones, built-in systems, wireless sensors, and nearly every smart device connect through local networks or the internet, billions of smart things communicate with each other and generate vast amounts of time-series data. As IoT time-series data is high-dimensional and high-frequency, time-series classification or regression has been a challenging issue in IoT. Recently, deep learning algorithms have demonstrated superior performance results in time-series data classification in many smart and intelligent IoT applications. However, it is hard to explore the hidden dynamic patterns and trends in time-series. Recent studies show that transforming IoT data into images improves the performance of the learning model. In this paper, we present a review of these studies which use image transformation/encoding techniques in IoT domain. We examine the studies according to their encoding techniques, data types, and application areas. Lastly, we emphasize the challenges and future dimensions of image transformation.
    Formal Verification of Long Short-Term Memory based Audio Classifiers: A Star based Approach. (arXiv:2311.12130v1 [cs.SD])
    Formally verifying audio classification systems is essential to ensure accurate signal classification across real-world applications like surveillance, automotive voice commands, and multimedia content management, preventing potential errors with serious consequences. Drawing from recent research, this study advances the utilization of star-set-based formal verification, extended through reachability analysis, tailored explicitly for Long Short-Term Memory architectures and their Convolutional variations within the audio classification domain. By conceptualizing the classification process as a sequence of set operations, the star set-based reachability approach streamlines the exploration of potential operational states attainable by the system. The paper serves as an encompassing case study, validating and verifying sequence audio classification analytics within real-world contexts. It accentuates the necessity for robustness verification to ensure precise and dependable predictions, particularly in light of the impact of noise on the accuracy of output classifications.
    Provable Representation with Efficient Planning for Partially Observable Reinforcement Learning. (arXiv:2311.12244v1 [cs.LG])
    In real-world reinforcement learning problems, the state information is often only partially observable, which breaks the basic assumption in Markov decision processes, and thus, leads to inferior performances. Partially Observable Markov Decision Processes have been introduced to explicitly take the issue into account for learning, exploration, and planning, but presenting significant computational and statistical challenges. To address these difficulties, we exploit the representation view, which leads to a coherent design framework for a practically tractable reinforcement learning algorithm upon partial observations. We provide a theoretical analysis for justifying the statistical efficiency of the proposed algorithm. We also empirically demonstrate the proposed algorithm can surpass state-of-the-art performance with partial observations across various benchmarks, therefore, pushing reliable reinforcement learning towards more practical applications.
    PhysGaussian: Physics-Integrated 3D Gaussians for Generative Dynamics. (arXiv:2311.12198v1 [cs.GR])
    We introduce PhysGaussian, a new method that seamlessly integrates physically grounded Newtonian dynamics within 3D Gaussians to achieve high-quality novel motion synthesis. Employing a custom Material Point Method (MPM), our approach enriches 3D Gaussian kernels with physically meaningful kinematic deformation and mechanical stress attributes, all evolved in line with continuum mechanics principles. A defining characteristic of our method is the seamless integration between physical simulation and visual rendering: both components utilize the same 3D Gaussian kernels as their discrete representations. This negates the necessity for triangle/tetrahedron meshing, marching cubes, "cage meshes," or any other geometry embedding, highlighting the principle of "what you see is what you simulate (WS$^2$)." Our method demonstrates exceptional versatility across a wide variety of materials--including elastic entities, metals, non-Newtonian fluids, and granular materials--showcasing its strong capabilities in creating diverse visual content with novel viewpoints and movements. Our project page is at: https://xpandora.github.io/PhysGaussian/
    Stable Diffusion For Aerial Object Detection. (arXiv:2311.12345v1 [cs.CV])
    Aerial object detection is a challenging task, in which one major obstacle lies in the limitations of large-scale data collection and the long-tail distribution of certain classes. Synthetic data offers a promising solution, especially with recent advances in diffusion-based methods like stable diffusion (SD). However, the direct application of diffusion methods to aerial domains poses unique challenges: stable diffusion's optimization for rich ground-level semantics doesn't align with the sparse nature of aerial objects, and the extraction of post-synthesis object coordinates remains problematic. To address these challenges, we introduce a synthetic data augmentation framework tailored for aerial images. It encompasses sparse-to-dense region of interest (ROI) extraction to bridge the semantic gap, fine-tuning the diffusion model with low-rank adaptation (LORA) to circumvent exhaustive retraining, and finally, a Copy-Paste method to compose synthesized objects with backgrounds, providing a nuanced approach to aerial object detection through synthetic data.
    Inverse Problems with Learned Forward Operators. (arXiv:2311.12528v1 [math.NA])
    Solving inverse problems requires knowledge of the forward operator, but accurate models can be computationally expensive and hence cheaper variants are desired that do not compromise reconstruction quality. This chapter reviews reconstruction methods in inverse problems with learned forward operators that follow two different paradigms. The first one is completely agnostic to the forward operator and learns its restriction to the subspace spanned by the training data. The framework of regularisation by projection is then used to find a reconstruction. The second one uses a simplified model of the physics of the measurement process and only relies on the training data to learn a model correction. We present the theory of these two approaches and compare them numerically. A common theme emerges: both methods require, or at least benefit from, training data not only for the forward operator, but also for its adjoint.
    Discovering Effective Policies for Land-Use Planning. (arXiv:2311.12304v1 [cs.NE])
    How areas of land are allocated for different uses, such as forests, urban, and agriculture, has a large effect on carbon balance, and therefore climate change. Based on available historical data on changes in land use and a simulation of carbon emissions/absorption, a surrogate model can be learned that makes it possible to evaluate the different options available to decision-makers efficiently. An evolutionary search process can then be used to discover effective land-use policies for specific locations. Such a system was built on the Project Resilience platform and evaluated with the Land-Use Harmonization dataset and the BLUE simulator. It generates Pareto fronts that trade off carbon impact and amount of change customized to different locations, thus providing a potentially useful tool for land-use planning.
    Heuristics for Detecting CoinJoin Transactions on the Bitcoin Blockchain. (arXiv:2311.12491v1 [cs.CR])
    This research delves into the intricacies of Bitcoin, a decentralized peer-to-peer network, and its associated blockchain, which records all transactions since its inception. While this ensures integrity and transparency, the transparent nature of Bitcoin potentially compromises users' privacy rights. To address this concern, users have adopted CoinJoin, a method that amalgamates multiple transaction intents into a single, larger transaction to bolster transactional privacy. This process complicates individual transaction tracing and disrupts many established blockchain analysis heuristics. Despite its significance, limited research has been conducted on identifying CoinJoin transactions. Particularly noteworthy are varied CoinJoin implementations such as JoinMarket, Wasabi, and Whirlpool, each presenting distinct challenges due to their unique transaction structures. This study delves deeply into the open-source implementations of these protocols, aiming to develop refined heuristics for identifying their transactions on the blockchain. Our exhaustive analysis covers transactions up to block 760,000, offering a comprehensive insight into CoinJoin transactions and their implications for Bitcoin blockchain analysis.
    ROOT-SGD: Sharp Nonasymptotics and Asymptotic Efficiency in a Single Algorithm. (arXiv:2008.12690v2 [math.OC] UPDATED)
    We study the problem of solving strongly convex and smooth unconstrained optimization problems using stochastic first-order algorithms. We devise a novel algorithm, referred to as \emph{Recursive One-Over-T SGD} (\ROOTSGD), based on an easily implementable, recursive averaging of past stochastic gradients. We prove that it simultaneously achieves state-of-the-art performance in both a finite-sample, nonasymptotic sense and an asymptotic sense. On the nonasymptotic side, we prove risk bounds on the last iterate of \ROOTSGD with leading-order terms that match the optimal statistical risk with a unity pre-factor, along with a higher-order term that scales at the sharp rate of $O(n^{-3/2})$ under the Lipschitz condition on the Hessian matrix. On the asymptotic side, we show that when a mild, one-point Hessian continuity condition is imposed, the rescaled last iterate of (multi-epoch) \ROOTSGD converges asymptotically to a Gaussian limit with the Cram\'{e}r-Rao optimal asymptotic covariance, for a broad range of step-size choices.
    Resilient Control of Networked Microgrids using Vertical Federated Reinforcement Learning: Designs and Real-Time Test-Bed Validations. (arXiv:2311.12264v1 [eess.SY])
    Improving system-level resiliency of networked microgrids is an important aspect with increased population of inverter-based resources (IBRs). This paper (1) presents resilient control design in presence of adversarial cyber-events, and proposes a novel federated reinforcement learning (Fed-RL) approach to tackle (a) model complexities, unknown dynamical behaviors of IBR devices, (b) privacy issues regarding data sharing in multi-party-owned networked grids, and (2) transfers learned controls from simulation to hardware-in-the-loop test-bed, thereby bridging the gap between simulation and real world. With these multi-prong objectives, first, we formulate a reinforcement learning (RL) training setup generating episodic trajectories with adversaries (attack signal) injected at the primary controllers of the grid forming (GFM) inverters where RL agents (or controllers) are being trained to mitigate the injected attacks. For networked microgrids, the horizontal Fed-RL method involving distinct independent environments is not appropriate, leading us to develop vertical variant Federated Soft Actor-Critic (FedSAC) algorithm to grasp the interconnected dynamics of networked microgrid. Next, utilizing OpenAI Gym interface, we built a custom simulation set-up in GridLAB-D/HELICS co-simulation platform, named Resilient RL Co-simulation (ResRLCoSIM), to train the RL agents with IEEE 123-bus benchmark test systems comprising 3 interconnected microgrids. Finally, the learned policies in simulation world are transferred to the real-time hardware-in-the-loop test-bed set-up developed using high-fidelity Hypersim platform. Experiments show that the simulator-trained RL controllers produce convincing results with the real-time test-bed set-up, validating the minimization of sim-to-real gap.
    One Size Fits All for Semantic Shifts: Adaptive Prompt Tuning for Continual Learning. (arXiv:2311.12048v1 [cs.LG])
    In real-world continual learning scenarios, tasks often exhibit intricate and unpredictable semantic shifts, posing challenges for fixed prompt management strategies. We identify the inadequacy of universal and specific prompting in handling these dynamic shifts. Universal prompting is ineffective for tasks with abrupt semantic changes, while specific prompting struggles with overfitting under mild semantic shifts. To overcome these limitations, we propose an adaptive prompting approach that tailors minimal yet sufficient prompts based on the task semantics. Our methodology, SemPrompt, incorporates a two-level semantic grouping process: macroscopic semantic assignment and microscopic semantic refinement. This process ensures optimal prompt utilization for varying task semantics, improving the efficiency and effectiveness of learning in real-world CL settings. Our experimental results demonstrate that SemPrompt consistently outperforms existing methods in adapting to diverse semantic shifts in tasks.  ( 2 min )
    Masked Autoencoders Are Robust Neural Architecture Search Learners. (arXiv:2311.12086v1 [cs.LG])
    Neural Architecture Search (NAS) currently relies heavily on labeled data, which is both expensive and time-consuming to acquire. In this paper, we propose a novel NAS framework based on Masked Autoencoders (MAE) that eliminates the need for labeled data during the search process. By replacing the supervised learning objective with an image reconstruction task, our approach enables the robust discovery of network architectures without compromising performance and generalization ability. Additionally, we address the problem of performance collapse encountered in the widely-used Differentiable Architecture Search (DARTS) method in the unsupervised paradigm by introducing a multi-scale decoder. Through extensive experiments conducted on various search spaces and datasets, we demonstrate the effectiveness and robustness of the proposed method, providing empirical evidence of its superiority over baseline approaches.  ( 2 min )
    MaskFlow: Object-Aware Motion Estimation. (arXiv:2311.12476v1 [cs.CV])
    We introduce a novel motion estimation method, MaskFlow, that is capable of estimating accurate motion fields, even in very challenging cases with small objects, large displacements and drastic appearance changes. In addition to lower-level features, that are used in other Deep Neural Network (DNN)-based motion estimation methods, MaskFlow draws from object-level features and segmentations. These features and segmentations are used to approximate the objects' translation motion field. We propose a novel and effective way of incorporating the incomplete translation motion field into a subsequent motion estimation network for refinement and completion. We also produced a new challenging synthetic dataset with motion field ground truth, and also provide extra ground truth for the object-instance matchings and corresponding segmentation masks. We demonstrate that MaskFlow outperforms state of the art methods when evaluated on our new challenging dataset, whilst still producing comparable results on the popular FlyingThings3D benchmark dataset.  ( 2 min )
  • Open

    Towards Data-Algorithm Dependent Generalization: a Case Study on Overparameterized Linear Regression. (arXiv:2202.06054v4 [cs.LG] UPDATED)
    One of the major open problems in machine learning is to characterize generalization in the overparameterized regime, where most traditional generalization bounds become inconsistent even for overparameterized linear regression. In many scenarios, this failure can be attributed to obscuring the crucial interplay between the training algorithm and the underlying data distribution. This paper demonstrate that the generalization behavior of overparameterized model should be analyzed in a both data-relevant and algorithm-relevant manner. To make a formal characterization, We introduce a notion called data-algorithm compatibility, which considers the generalization behavior of the entire data-dependent training trajectory, instead of traditional last-iterate analysis. We validate our claim by studying the setting of solving overparameterized linear regression with gradient descent. Specifically, we perform a data-dependent trajectory analysis and derive a sufficient condition for compatibility in such a setting. Our theoretical results demonstrate that if we take early stopping iterates into consideration, generalization can hold with significantly weaker restrictions on the problem instance than the previous last-iterate analysis.
    Topological properties of basins of attraction and expressiveness of width bounded neural networks. (arXiv:2011.04923v5 [cs.LG] UPDATED)
    In Radhakrishnan et al. [2020], the authors empirically show that autoencoders trained with usual SGD methods shape out basins of attraction around their training data. We consider network functions of width not exceeding the input dimension and prove that in this situation basins of attraction are bounded and their complement cannot have bounded components. Our conditions in these results are met in several experiments of the latter work and we thus address a question posed therein. We also show that under some more restrictive conditions the basins of attraction are path-connected. The tightness of the conditions in our results is demonstrated by means of several examples. Finally, the arguments used to prove the above results allow us to derive a root cause why scalar-valued neural network functions that fulfill our bounded width condition are not dense in spaces of continuous functions.
    Differentiable Sampling of Categorical Distributions Using the CatLog-Derivative Trick. (arXiv:2311.12569v1 [cs.LG])
    Categorical random variables can faithfully represent the discrete and uncertain aspects of data as part of a discrete latent variable model. Learning in such models necessitates taking gradients with respect to the parameters of the categorical probability distributions, which is often intractable due to their combinatorial nature. A popular technique to estimate these otherwise intractable gradients is the Log-Derivative trick. This trick forms the basis of the well-known REINFORCE gradient estimator and its many extensions. While the Log-Derivative trick allows us to differentiate through samples drawn from categorical distributions, it does not take into account the discrete nature of the distribution itself. Our first contribution addresses this shortcoming by introducing the CatLog-Derivative trick - a variation of the Log-Derivative trick tailored towards categorical distributions. Secondly, we use the CatLog-Derivative trick to introduce IndeCateR, a novel and unbiased gradient estimator for the important case of products of independent categorical distributions with provably lower variance than REINFORCE. Thirdly, we empirically show that IndeCateR can be efficiently implemented and that its gradient estimates have significantly lower bias and variance for the same number of samples compared to the state of the art.
    A New Type Of Upper And Lower Bounds On Right-Tail Probabilities Of Continuous Random Variables. (arXiv:2311.12612v1 [math.PR])
    In this paper, I present a completely new type of upper and lower bounds on the right-tail probabilities of continuous random variables with unbounded support and with semi-bounded support from the left. The presented upper and lower right-tail bounds depend only on the probability density function (PDF), its first derivative, and two parameters that are used for tightening the bounds. These tail bounds hold under certain conditions that depend on the PDF, its first and second derivatives, and the two parameters. The new tail bounds are shown to be tight for a wide range of continuous random variables via numerical examples.
    On Counterfactual Data Augmentation Under Confounding. (arXiv:2305.18183v2 [cs.LG] UPDATED)
    Counterfactual data augmentation has recently emerged as a method to mitigate confounding biases in the training data. These biases, such as spurious correlations, arise due to various observed and unobserved confounding variables in the data generation process. In this paper, we formally analyze how confounding biases impact downstream classifiers and present a causal viewpoint to the solutions based on counterfactual data augmentation. We explore how removing confounding biases serves as a means to learn invariant features, ultimately aiding in generalization beyond the observed data distribution. Additionally, we present a straightforward yet powerful algorithm for generating counterfactual images, which effectively mitigates the influence of confounding effects on downstream classifiers. Through experiments on MNIST variants and the CelebA datasets, we demonstrate how our simple augmentation method helps existing state-of-the-art methods achieve good results.
    Predict, Refine, Synthesize: Self-Guiding Diffusion Models for Probabilistic Time Series Forecasting. (arXiv:2307.11494v2 [cs.LG] UPDATED)
    Diffusion models have achieved state-of-the-art performance in generative modeling tasks across various domains. Prior works on time series diffusion models have primarily focused on developing conditional models tailored to specific forecasting or imputation tasks. In this work, we explore the potential of task-agnostic, unconditional diffusion models for several time series applications. We propose TSDiff, an unconditionally-trained diffusion model for time series. Our proposed self-guidance mechanism enables conditioning TSDiff for downstream tasks during inference, without requiring auxiliary networks or altering the training procedure. We demonstrate the effectiveness of our method on three different time series tasks: forecasting, refinement, and synthetic data generation. First, we show that TSDiff is competitive with several task-specific conditional forecasting methods (predict). Second, we leverage the learned implicit probability density of TSDiff to iteratively refine the predictions of base forecasters with reduced computational overhead over reverse diffusion (refine). Notably, the generative performance of the model remains intact -- downstream forecasters trained on synthetic samples from TSDiff outperform forecasters that are trained on samples from other state-of-the-art generative time series models, occasionally even outperforming models trained on real data (synthesize).
    Feature Engineering with Regularity Structures. (arXiv:2108.05879v2 [stat.ML] UPDATED)
    We investigate the use of models from the theory of regularity structures as features in machine learning tasks. A model is a polynomial function of a space-time signal designed to well-approximate solutions to partial differential equations (PDEs), even in low regularity regimes. Models can be seen as natural multi-dimensional generalisations of signatures of paths; our work therefore aims to extend the recent use of signatures in data science beyond the context of time-ordered data. We provide a flexible definition of a model feature vector associated to a space-time signal, along with two algorithms which illustrate ways in which these features can be combined with linear regression. We apply these algorithms in several numerical experiments designed to learn solutions to PDEs with a given forcing and boundary data. Our experiments include semi-linear parabolic and wave equations with forcing, and Burgers' equation with no forcing. We find an advantage in favour of our algorithms when compared to several alternative methods. Additionally, in the experiment with Burgers' equation, we find non-trivial predictive power when noise is added to the observations.
    Infinite forecast combinations based on Dirichlet process. (arXiv:2311.12379v1 [cs.LG])
    Forecast combination integrates information from various sources by consolidating multiple forecast results from the target time series. Instead of the need to select a single optimal forecasting model, this paper introduces a deep learning ensemble forecasting model based on the Dirichlet process. Initially, the learning rate is sampled with three basis distributions as hyperparameters to convert the infinite mixture into a finite one. All checkpoints are collected to establish a deep learning sub-model pool, and weight adjustment and diversity strategies are developed during the combination process. The main advantage of this method is its ability to generate the required base learners through a single training process, utilizing the decaying strategy to tackle the challenge posed by the stochastic nature of gradient descent in determining the optimal learning rate. To ensure the method's generalizability and competitiveness, this paper conducts an empirical analysis using the weekly dataset from the M4 competition and explores sensitivity to the number of models to be combined. The results demonstrate that the ensemble model proposed offers substantial improvements in prediction accuracy and stability compared to a single benchmark model.
    Data thinning for convolution-closed distributions. (arXiv:2301.07276v3 [stat.ME] UPDATED)
    We propose data thinning, an approach for splitting an observation into two or more independent parts that sum to the original observation, and that follow the same distribution as the original observation, up to a (known) scaling of a parameter. This very general proposal is applicable to any convolution-closed distribution, a class that includes the Gaussian, Poisson, negative binomial, gamma, and binomial distributions, among others. Data thinning has a number of applications to model selection, evaluation, and inference. For instance, cross-validation via data thinning provides an attractive alternative to the usual approach of cross-validation via sample splitting, especially in settings in which the latter is not applicable. In simulations and in an application to single-cell RNA-sequencing data, we show that data thinning can be used to validate the results of unsupervised learning approaches, such as k-means clustering and principal components analysis, for which traditional sample splitting is unattractive or unavailable.
    An inexact LPA for DC composite optimization and application to matrix completions with outliers. (arXiv:2303.16822v3 [math.OC] UPDATED)
    This paper concerns a class of DC composite optimization problems which, as an extension of convex composite optimization problems and DC programs with nonsmooth components, often arises in robust factorization models of low-rank matrix recovery. For this class of nonconvex and nonsmooth problems, we propose an inexact linearized proximal algorithm (iLPA) by computing in each step an inexact minimizer of a strongly convex majorization constructed with a partial linearization of their objective functions at the current iterate, and establish the convergence of the generated iterate sequence under the Kurdyka-\L\"ojasiewicz (KL) property of a potential function. In particular, by leveraging the composite structure, we provide a verifiable condition for the potential function to have the KL property of exponent $1/2$ at the limit point, so for the iterate sequence to have a local R-linear convergence rate. Finally, we apply the proposed iLPA to a robust factorization model for matrix completions with outliers and non-uniform sampling, and numerical comparison with the Polyak subgradient method confirms its superiority in terms of computing time and quality of solutions.
    Variational Elliptical Processes. (arXiv:2311.12566v1 [cs.LG])
    We present elliptical processes, a family of non-parametric probabilistic models that subsume Gaussian processes and Student's t processes. This generalization includes a range of new heavy-tailed behaviors while retaining computational tractability. Elliptical processes are based on a representation of elliptical distributions as a continuous mixture of Gaussian distributions. We parameterize this mixture distribution as a spline normalizing flow, which we train using variational inference. The proposed form of the variational posterior enables a sparse variational elliptical process applicable to large-scale problems. We highlight advantages compared to Gaussian processes through regression and classification experiments. Elliptical processes can supersede Gaussian processes in several settings, including cases where the likelihood is non-Gaussian or when accurate tail modeling is essential.
    Explainable Anomaly Detection using Masked Latent Generative Modeling. (arXiv:2311.12550v1 [cs.LG])
    We present a novel time series anomaly detection method that achieves excellent detection accuracy while offering a superior level of explainability. Our proposed method, TimeVQVAE-AD, leverages masked generative modeling adapted from the cutting-edge time series generation method known as TimeVQVAE. The prior model is trained on the discrete latent space of a time-frequency domain. Notably, the dimensional semantics of the time-frequency domain are preserved in the latent space, enabling us to compute anomaly scores across different frequency bands, which provides a better insight into the detected anomalies. Additionally, the generative nature of the prior model allows for sampling likely normal states for detected anomalies, enhancing the explainability of the detected anomalies through counterfactuals. Our experimental evaluation on the UCR Time Series Anomaly archive demonstrates that TimeVQVAE-AD significantly surpasses the existing methods in terms of detection accuracy and explainability.
    On the Out-of-Distribution Coverage of Combining Split Conformal Prediction and Bayesian Deep Learning. (arXiv:2311.12688v1 [cs.LG])
    Bayesian deep learning and conformal prediction are two methods that have been used to convey uncertainty and increase safety in machine learning systems. We focus on combining Bayesian deep learning with split conformal prediction and how this combination effects out-of-distribution coverage; particularly in the case of multiclass image classification. We suggest that if the model is generally underconfident on the calibration set, then the resultant conformal sets may exhibit worse out-of-distribution coverage compared to simple predictive credible sets. Conversely, if the model is overconfident on the calibration set, the use of conformal prediction may improve out-of-distribution coverage. We evaluate prediction sets as a result of combining split conformal methods and neural networks trained with (i) stochastic gradient descent, (ii) deep ensembles, and (iii) mean-field variational inference. Our results suggest that combining Bayesian deep learning models with split conformal prediction can, in some cases, cause unintended consequences such as reducing out-of-distribution coverage.
    survex: an R package for explaining machine learning survival models. (arXiv:2308.16113v2 [cs.LG] UPDATED)
    Due to their flexibility and superior performance, machine learning models frequently complement and outperform traditional statistical survival models. However, their widespread adoption is hindered by a lack of user-friendly tools to explain their internal operations and prediction rationales. To tackle this issue, we introduce the survex R package, which provides a cohesive framework for explaining any survival model by applying explainable artificial intelligence techniques. The capabilities of the proposed software encompass understanding and diagnosing survival models, which can lead to their improvement. By revealing insights into the decision-making process, such as variable effects and importances, survex enables the assessment of model reliability and the detection of biases. Thus, transparency and responsibility may be promoted in sensitive areas, such as biomedical research and healthcare applications.
    Online Regularization towards Always-Valid High-Dimensional Dynamic Pricing. (arXiv:2007.02470v3 [stat.ML] UPDATED)
    Devising dynamic pricing policy with always valid online statistical learning procedure is an important and as yet unresolved problem. Most existing dynamic pricing policy, which focus on the faithfulness of adopted customer choice models, exhibit a limited capability for adapting the online uncertainty of learned statistical model during pricing process. In this paper, we propose a novel approach for designing dynamic pricing policy based regularized online statistical learning with theoretical guarantees. The new approach overcomes the challenge of continuous monitoring of online Lasso procedure and possesses several appealing properties. In particular, we make the decisive observation that the always-validity of pricing decisions builds and thrives on the online regularization scheme. Our proposed online regularization scheme equips the proposed optimistic online regularized maximum likelihood pricing (OORMLP) pricing policy with three major advantages: encode market noise knowledge into pricing process optimism; empower online statistical learning with always-validity over all decision points; envelop prediction error process with time-uniform non-asymptotic oracle inequalities. This type of non-asymptotic inference results allows us to design more sample-efficient and robust dynamic pricing algorithms in practice. In theory, the proposed OORMLP algorithm exploits the sparsity structure of high-dimensional models and secures a logarithmic regret in a decision horizon. These theoretical advances are made possible by proposing an optimistic online Lasso procedure that resolves dynamic pricing problems at the process level, based on a novel use of non-asymptotic martingale concentration. In experiments, we evaluate OORMLP in different synthetic and real pricing problem settings, and demonstrate that OORMLP advances the state-of-the-art methods.  ( 3 min )
    Individualized Dynamic Latent Factor Model with Application to Mobile Health Data. (arXiv:2311.12392v1 [stat.ME])
    Mobile health has emerged as a major success in tracking individual health status, due to the popularity and power of smartphones and wearable devices. This has also brought great challenges in handling heterogeneous, multi-resolution data which arise ubiquitously in mobile health due to irregular multivariate measurements collected from individuals. In this paper, we propose an individualized dynamic latent factor model for irregular multi-resolution time series data to interpolate unsampled measurements of time series with low resolution. One major advantage of the proposed method is the capability to integrate multiple irregular time series and multiple subjects by mapping the multi-resolution data to the latent space. In addition, the proposed individualized dynamic latent factor model is applicable to capturing heterogeneous longitudinal information through individualized dynamic latent factors. In theory, we provide the integrated interpolation error bound of the proposed estimator and derive the convergence rate with B-spline approximation methods. Both the simulation studies and the application to smartwatch data demonstrate the superior performance of the proposed method compared to existing methods.  ( 2 min )
    Multi-Objective Optimization Using the R2 Utility. (arXiv:2305.11774v2 [math.OC] UPDATED)
    The goal of multi-objective optimization is to identify a collection of points which describe the best possible trade-offs between the multiple objectives. In order to solve this vector-valued optimization problem, practitioners often appeal to the use of scalarization functions in order to transform the multi-objective problem into a collection of single-objective problems. This set of scalarized problems can then be solved using traditional single-objective optimization techniques. In this work, we formalise this convention into a general mathematical framework. We show how this strategy effectively recasts the original multi-objective optimization problem into a single-objective optimization problem defined over sets. An appropriate class of objective functions for this new problem is the R2 utility function, which is defined as a weighted integral over the scalarized optimization problems. We show that this utility function is a monotone and submodular set function, which can be optimised effectively using greedy optimization algorithms. We analyse the performance of these greedy algorithms both theoretically and empirically. Our analysis largely focusses on Bayesian optimization, which is a popular probabilistic framework for black-box optimization.  ( 2 min )
    Learning optimal smooth invariant subspaces for data approximation. (arXiv:2311.12544v1 [math.OC])
    In this article, we consider the problem of approximating a finite set of data (usually huge in applications) by invariant subspaces generated through a small set of smooth functions. The invariance is either by translations under a full-rank lattice or through the action of crystallographic groups. Smoothness is ensured by stipulating that the generators belong to a Paley-Wiener space, that is selected in an optimal way based on the characteristics of the given data. To complete our investigation, we analyze the fundamental role played by the lattice in the process of approximation.  ( 2 min )
    Predictive Density Combination Using a Tree-Based Synthesis Function. (arXiv:2311.12671v1 [econ.EM])
    Bayesian predictive synthesis (BPS) provides a method for combining multiple predictive distributions based on agent/expert opinion analysis theory and encompasses a range of existing density forecast pooling methods. The key ingredient in BPS is a ``synthesis'' function. This is typically specified parametrically as a dynamic linear regression. In this paper, we develop a nonparametric treatment of the synthesis function using regression trees. We show the advantages of our tree-based approach in two macroeconomic forecasting applications. The first uses density forecasts for GDP growth from the euro area's Survey of Professional Forecasters. The second combines density forecasts of US inflation produced by many regression models involving different predictors. Both applications demonstrate the benefits -- in terms of improved forecast accuracy and interpretability -- of modeling the synthesis function nonparametrically.  ( 2 min )
    A Unified Framework for Pattern Recovery in Penalized and Thresholded Estimation and its Geometry. (arXiv:2307.10158v3 [math.ST] UPDATED)
    We consider the framework of penalized estimation where the penalty term is given by a real-valued polyhedral gauge, which encompasses methods such as LASSO (and many variants thereof such as the generalized LASSO), SLOPE, OSCAR, PACS and others. Each of these estimators can uncover a different structure or ``pattern'' of the unknown parameter vector. We define a general notion of patterns based on subdifferentials and formalize an approach to measure pattern complexity. For pattern recovery, we provide a minimal condition for a particular pattern to be detected by the procedure with positive probability, the so-called accessibility condition. Using our approach, we also introduce the stronger noiseless recovery condition. For the LASSO, it is well known that the irrepresentability condition is necessary for pattern recovery with probability larger than $1/2$ and we show that the noiseless recovery plays exactly the same role, thereby extending and unifying the irrepresentability condition of the LASSO to a broad class of penalized estimators. We also show that the noiseless recovery condition can be relaxed when turning to thresholded penalized estimators, extending the idea of the thresholded LASSO: we prove that the accessibility condition is already sufficient (and necessary) for sure pattern recovery by thresholded penalized estimation provided that the signal of the pattern is large enough. Throughout the article, we demonstrate how our findings can be interpreted through a geometrical lens.  ( 3 min )
    ROOT-SGD: Sharp Nonasymptotics and Asymptotic Efficiency in a Single Algorithm. (arXiv:2008.12690v2 [math.OC] UPDATED)
    We study the problem of solving strongly convex and smooth unconstrained optimization problems using stochastic first-order algorithms. We devise a novel algorithm, referred to as \emph{Recursive One-Over-T SGD} (\ROOTSGD), based on an easily implementable, recursive averaging of past stochastic gradients. We prove that it simultaneously achieves state-of-the-art performance in both a finite-sample, nonasymptotic sense and an asymptotic sense. On the nonasymptotic side, we prove risk bounds on the last iterate of \ROOTSGD with leading-order terms that match the optimal statistical risk with a unity pre-factor, along with a higher-order term that scales at the sharp rate of $O(n^{-3/2})$ under the Lipschitz condition on the Hessian matrix. On the asymptotic side, we show that when a mild, one-point Hessian continuity condition is imposed, the rescaled last iterate of (multi-epoch) \ROOTSGD converges asymptotically to a Gaussian limit with the Cram\'{e}r-Rao optimal asymptotic covariance, for a broad range of step-size choices.  ( 2 min )
    Bridging Algorithmic Information Theory and Machine Learning: A New Approach to Kernel Learning. (arXiv:2311.12624v1 [cs.LG])
    Machine Learning (ML) and Algorithmic Information Theory (AIT) look at Complexity from different points of view. We explore the interface between AIT and Kernel Methods (that are prevalent in ML) by adopting an AIT perspective on the problem of learning kernels from data, in kernel ridge regression, through the method of Sparse Kernel Flows. In particular, by looking at the differences and commonalities between Minimal Description Length (MDL) and Regularization in Machine Learning (RML), we prove that the method of Sparse Kernel Flows is the natural approach to adopt to learn kernels from data. This paper shows that it is not necessary to use the statistical route to derive Sparse Kernel Flows and that one can directly work with code-lengths and complexities that are concepts that show up in AIT.  ( 2 min )
    Langevin dynamics based algorithm e-TH$\varepsilon$O POULA for stochastic optimization problems with discontinuous stochastic gradient. (arXiv:2210.13193v2 [math.OC] UPDATED)
    We introduce a new Langevin dynamics based algorithm, called e-TH$\varepsilon$O POULA, to solve optimization problems with discontinuous stochastic gradients which naturally appear in real-world applications such as quantile estimation, vector quantization, CVaR minimization, and regularized optimization problems involving ReLU neural networks. We demonstrate both theoretically and numerically the applicability of the e-TH$\varepsilon$O POULA algorithm. More precisely, under the conditions that the stochastic gradient is locally Lipschitz in average and satisfies a certain convexity at infinity condition, we establish non-asymptotic error bounds for e-TH$\varepsilon$O POULA in Wasserstein distances and provide a non-asymptotic estimate for the expected excess risk, which can be controlled to be arbitrarily small. Three key applications in finance and insurance are provided, namely, multi-period portfolio optimization, transfer learning in multi-period portfolio optimization, and insurance claim prediction, which involve neural networks with (Leaky)-ReLU activation functions. Numerical experiments conducted using real-world datasets illustrate the superior empirical performance of e-TH$\varepsilon$O POULA compared to SGLD, TUSLA, ADAM, and AMSGrad in terms of model accuracy.  ( 2 min )
    PyTorch Geometric Signed Directed: A Software Package on Graph Neural Networks for Signed and Directed Graphs. (arXiv:2202.10793v5 [cs.LG] UPDATED)
    Networks are ubiquitous in many real-world applications (e.g., social networks encoding trust/distrust relationships, correlation networks arising from time series data). While many networks are signed or directed, or both, there is a lack of unified software packages on graph neural networks (GNNs) specially designed for signed and directed networks. In this paper, we present PyTorch Geometric Signed Directed (PyGSD), a software package which fills this gap. Along the way, we evaluate the implemented methods with experiments with a view to providing insights into which method to choose for a given task. The deep learning framework consists of easy-to-use GNN models, synthetic and real-world data, as well as task-specific evaluation metrics and loss functions for signed and directed networks. As an extension library for PyG, our proposed software is maintained with open-source releases, detailed documentation, continuous integration, unit tests and code coverage checks. The GitHub repository of the library is https://github.com/SherylHYX/pytorch_geometric_signed_directed.  ( 3 min )
    Flexible variable selection in the presence of missing data. (arXiv:2202.12989v4 [stat.ME] UPDATED)
    In many applications, it is of interest to identify a parsimonious set of features, or panel, from multiple candidates that achieves a desired level of performance in predicting a response. This task is often complicated in practice by missing data arising from the sampling design or other random mechanisms. Most recent work on variable selection in missing data contexts relies in some part on a finite-dimensional statistical model, e.g., a generalized or penalized linear model. In cases where this model is misspecified, the selected variables may not all be truly scientifically relevant and can result in panels with suboptimal classification performance. To address this limitation, we propose a nonparametric variable selection algorithm combined with multiple imputation to develop flexible panels in the presence of missing-at-random data. We outline strategies based on the proposed algorithm that achieve control of commonly used error rates. Through simulations, we show that our proposal has good operating characteristics and results in panels with higher classification and variable selection performance compared to several existing penalized regression approaches in cases where a generalized linear model is misspecified. Finally, we use the proposed method to develop biomarker panels for separating pancreatic cysts with differing malignancy potential in a setting where complicated missingness in the biomarkers arose due to limited specimen volumes.  ( 3 min )
    Optimality in Mean Estimation: Beyond Worst-Case, Beyond Sub-Gaussian, and Beyond $1+\alpha$ Moments. (arXiv:2311.12784v1 [math.ST])
    There is growing interest in improving our algorithmic understanding of fundamental statistical problems such as mean estimation, driven by the goal of understanding the limits of what we can extract from valuable data. The state of the art results for mean estimation in $\mathbb{R}$ are 1) the optimal sub-Gaussian mean estimator by [LV22], with the tight sub-Gaussian constant for all distributions with finite but unknown variance, and 2) the analysis of the median-of-means algorithm by [BCL13] and a lower bound by [DLLO16], characterizing the big-O optimal errors for distributions for which only a $1+\alpha$ moment exists for $\alpha \in (0,1)$. Both results, however, are optimal only in the worst case. We initiate the fine-grained study of the mean estimation problem: Can algorithms leverage useful features of the input distribution to beat the sub-Gaussian rate, without explicit knowledge of such features? We resolve this question with an unexpectedly nuanced answer: "Yes in limited regimes, but in general no". For any distribution $p$ with a finite mean, we construct a distribution $q$ whose mean is well-separated from $p$'s, yet $p$ and $q$ are not distinguishable with high probability, and $q$ further preserves $p$'s moments up to constants. The main consequence is that no reasonable estimator can asymptotically achieve better than the sub-Gaussian error rate for any distribution, matching the worst-case result of [LV22]. More generally, we introduce a new definitional framework to analyze the fine-grained optimality of algorithms, which we call "neighborhood optimality", interpolating between the unattainably strong "instance optimality" and the trivially weak "admissibility" definitions. Applying the new framework, we show that median-of-means is neighborhood optimal, up to constant factors. It is open to find a neighborhood-optimal estimator without constant factor slackness.  ( 3 min )
    Quantum Inception Score. (arXiv:2311.12163v1 [quant-ph])
    Motivated by the great success of classical generative models in machine learning, enthusiastic exploration of their quantum version has recently started. To depart on this journey, it is important to develop a relevant metric to evaluate the quality of quantum generative models; in the classical case, one such examples is the inception score. In this paper, we propose the quantum inception score, which relates the quality to the classical capacity of the quantum channel that classifies a given dataset. We prove that, under this proposed measure, the quantum generative models provide better quality than their classical counterparts because of the presence of quantum coherence and entanglement. Finally, we harness the quantum fluctuation theorem to characterize the physical limitation of the quality of quantum generative models.  ( 2 min )
    An efficient likelihood-free Bayesian inference method based on sequential neural posterior estimation. (arXiv:2311.12530v1 [stat.ML])
    Sequential neural posterior estimation (SNPE) techniques have been recently proposed for dealing with simulation-based models with intractable likelihoods. Unlike approximate Bayesian computation, SNPE techniques learn the posterior from sequential simulation using neural network-based conditional density estimators. This paper reclaims SNPE-B proposed by Lueckmann et al. (2017), which suffers from inefficiency and slow inference due to inefficient utilization of simulated data and high variance of parameter updates. To address these issues, we firstly introduce a concentrated loss function based on an adaptive calibration kernel that reweights the simulated data appropriately to improve the data efficiency. Moreover, we provide a theoretical analysis of the variance of associated Monte Carlo estimators. Based on this analysis, we then propose several variance reduction techniques to further accelerate the process of learning. Numerical experiments demonstrate that our method outperforms the original method together with other existing competitors on certain tasks.  ( 2 min )
    Random Fourier Signature Features. (arXiv:2311.12214v1 [stat.ML])
    Tensor algebras give rise to one of the most powerful measures of similarity for sequences of arbitrary length called the signature kernel accompanied with attractive theoretical guarantees from stochastic analysis. Previous algorithms to compute the signature kernel scale quadratically in terms of the length and the number of the sequences. To mitigate this severe computational bottleneck, we develop a random Fourier feature-based acceleration of the signature kernel acting on the inherently non-Euclidean domain of sequences. We show uniform approximation guarantees for the proposed unbiased estimator of the signature kernel, while keeping its computation linear in the sequence length and number. In addition, combined with recent advances on tensor projections, we derive two even more scalable time series features with favourable concentration properties and computational complexity both in time and memory. Our empirical results show that the reduction in computational cost comes at a negligible price in terms of accuracy on moderate-sized datasets, and it enables one to scale to large datasets up to a million time series.  ( 2 min )
    Learning Causal Representations from General Environments: Identifiability and Intrinsic Ambiguity. (arXiv:2311.12267v1 [cs.LG])
    This paper studies causal representation learning, the task of recovering high-level latent variables and their causal relationships from low-level data that we observe, assuming access to observations generated from multiple environments. While existing works are able to prove full identifiability of the underlying data generating process, they typically assume access to single-node, hard interventions which is rather unrealistic in practice. The main contribution of this paper is characterize a notion of identifiability which is provably the best one can achieve when hard interventions are not available. First, for linear causal models, we provide identifiability guarantee for data observed from general environments without assuming any similarities between them. While the causal graph is shown to be fully recovered, the latent variables are only identified up to an effect-domination ambiguity (EDA). We then propose an algorithm, LiNGCReL which is guaranteed to recover the ground-truth model up to EDA, and we demonstrate its effectiveness via numerical experiments. Moving on to general non-parametric causal models, we prove the same idenfifiability guarantee assuming access to groups of soft interventions. Finally, we provide counterparts of our identifiability results, indicating that EDA is basically inevitable in our setting.  ( 2 min )
    Provable Representation with Efficient Planning for Partially Observable Reinforcement Learning. (arXiv:2311.12244v1 [cs.LG])
    In real-world reinforcement learning problems, the state information is often only partially observable, which breaks the basic assumption in Markov decision processes, and thus, leads to inferior performances. Partially Observable Markov Decision Processes have been introduced to explicitly take the issue into account for learning, exploration, and planning, but presenting significant computational and statistical challenges. To address these difficulties, we exploit the representation view, which leads to a coherent design framework for a practically tractable reinforcement learning algorithm upon partial observations. We provide a theoretical analysis for justifying the statistical efficiency of the proposed algorithm. We also empirically demonstrate the proposed algorithm can surpass state-of-the-art performance with partial observations across various benchmarks, therefore, pushing reliable reinforcement learning towards more practical applications.  ( 2 min )
    A unified consensus-based parallel ADMM algorithm for high-dimensional regression with combined regularizations. (arXiv:2311.12319v1 [stat.ML])
    The parallel alternating direction method of multipliers (ADMM) algorithm is widely recognized for its effectiveness in handling large-scale datasets stored in a distributed manner, making it a popular choice for solving statistical learning models. However, there is currently limited research on parallel algorithms specifically designed for high-dimensional regression with combined (composite) regularization terms. These terms, such as elastic-net, sparse group lasso, sparse fused lasso, and their nonconvex variants, have gained significant attention in various fields due to their ability to incorporate prior information and promote sparsity within specific groups or fused variables. The scarcity of parallel algorithms for combined regularizations can be attributed to the inherent nonsmoothness and complexity of these terms, as well as the absence of closed-form solutions for certain proximal operators associated with them. In this paper, we propose a unified constrained optimization formulation based on the consensus problem for these types of convex and nonconvex regression problems and derive the corresponding parallel ADMM algorithms. Furthermore, we prove that the proposed algorithm not only has global convergence but also exhibits linear convergence rate. Extensive simulation experiments, along with a financial example, serve to demonstrate the reliability, stability, and scalability of our algorithm. The R package for implementing the proposed algorithms can be obtained at https://github.com/xfwu1016/CPADMM.  ( 2 min )
    The limitation of neural nets for approximation and optimization. (arXiv:2311.12253v1 [cs.LG])
    We are interested in assessing the use of neural networks as surrogate models to approximate and minimize objective functions in optimization problems. While neural networks are widely used for machine learning tasks such as classification and regression, their application in solving optimization problems has been limited. Our study begins by determining the best activation function for approximating the objective functions of popular nonlinear optimization test problems, and the evidence provided shows that~SiLU has the best performance. We then analyze the accuracy of function value, gradient, and Hessian approximations for such objective functions obtained through interpolation/regression models and neural networks. When compared to interpolation/regression models, neural networks can deliver competitive zero- and first-order approximations (at a high training cost) but underperform on second-order approximation. However, it is shown that combining a neural net activation function with the natural basis for quadratic interpolation/regression can waive the necessity of including cross terms in the natural basis, leading to models with fewer parameters to determine. Lastly, we provide evidence that the performance of a state-of-the-art derivative-free optimization algorithm can hardly be improved when the gradient of an objective function is approximated using any of the surrogate models considered, including neural networks.  ( 2 min )
    PAPAL: A Provable PArticle-based Primal-Dual ALgorithm for Mixed Nash Equilibrium. (arXiv:2303.00970v2 [math.OC] UPDATED)
    We consider the non-convex non-concave objective function in two-player zero-sum continuous games. The existence of pure Nash equilibrium requires stringent conditions, posing a major challenge for this problem. To circumvent this difficulty, we examine the problem of identifying a mixed Nash equilibrium, where strategies are randomized and characterized by probability distributions over continuous domains.To this end, we propose PArticle-based Primal-dual ALgorithm (PAPAL) tailored for a weakly entropy-regularized min-max optimization over probability distributions. This algorithm employs the stochastic movements of particles to represent the updates of random strategies for the $\epsilon$-mixed Nash equilibrium. We offer a comprehensive convergence analysis of the proposed algorithm, demonstrating its effectiveness. In contrast to prior research that attempted to update particle importance without movements, PAPAL is the first implementable particle-based algorithm accompanied by non-asymptotic quantitative convergence results, running time, and sample complexity guarantees. Our framework contributes novel insights into the particle-based algorithms for continuous min-max optimization in the general non-convex non-concave setting.  ( 2 min )

  • Open

    Amazon EC2 DL2q instance for cost-efficient, high-performance AI inference is now generally available
    This is a guest post by A.K Roy from Qualcomm AI. Amazon Elastic Compute Cloud (Amazon EC2) DL2q instances, powered by Qualcomm AI 100 Standard accelerators, can be used to cost-efficiently deploy deep learning (DL) workloads in the cloud. They can also be used to develop and validate performance and accuracy of DL workloads that […]  ( 9 min )
    Your guide to generative AI and ML at AWS re:Invent 2023
    Yes, the AWS re:Invent season is upon us and as always, the place to be is Las Vegas! You marked your calendars, you booked your hotel, and you even purchased the airfare. Now all you need is some guidance on generative AI and machine learning (ML) sessions to attend at this twelfth edition of re:Invent. And although generative AI has appeared in previous events, this year we’re taking it to the next level. In addition to several exciting announcements during keynotes, most of the sessions in our track will feature generative AI in one form or another, so we can truly call our track “Generative AI and ML.” In this post, we give you a sense of how the track is organized and highlight a few sessions we think you’ll like. And although our track focuses on generative AI, many other tracks have related sessions. Use the “Generative AI” tag as you are browsing the session catalog to find them.  ( 16 min )
    Build a contextual chatbot for financial services using Amazon SageMaker JumpStart, Llama 2 and Amazon OpenSearch Serverless with Vector Engine
    The financial service (FinServ) industry has unique generative AI requirements related to domain-specific data, data security, regulatory controls, and industry compliance standards. In addition, customers are looking for choices to select the most performant and cost-effective machine learning (ML) model and the ability to perform necessary customization (fine-tuning) to fit their business use cases. Amazon […]  ( 11 min )
    Build well-architected IDP solutions with a custom lens – Part 1: Operational excellence
    The IDP Well-Architected Lens is intended for all AWS customers who use AWS to run intelligent document processing (IDP) solutions and are searching for guidance on how to build secure, efficient, and reliable IDP solutions on AWS. Building a production-ready solution in the cloud involves a series of trade-offs between resources, time, customer expectation, and […]  ( 14 min )
    Build well-architected IDP solutions with a custom lens – Part 2: Security
    Building a production-ready solution in AWS involves a series of trade-offs between resources, time, customer expectation, and business outcome. The AWS Well-Architected Framework helps you understand the benefits and risks of decisions you make while building workloads on AWS. By using the Framework, you will learn current operational and architectural recommendations for designing and operating […]  ( 11 min )
    Build well-architected IDP solutions with a custom lens – Part 3: Reliability
    The IDP Well-Architected Custom Lens is intended for all AWS customers who use AWS to run intelligent document processing (IDP) solutions and are searching for guidance on how to build a secure, efficient, and reliable IDP solution on AWS. Building a production-ready solution in the cloud involves a series of trade-offs between resources, time, customer […]  ( 13 min )
    Build well-architected IDP solutions with a custom lens – Part 4: Performance efficiency
    When a customer has a production-ready intelligent document processing (IDP) workload, we often receive requests for a Well-Architected review. To build an enterprise solution, developer resources, cost, time and user-experience have to be balanced to achieve the desired business outcome. The AWS Well-Architected Framework provides a systematic way for organizations to learn operational and architectural […]  ( 10 min )
    Build well-architected IDP solutions with a custom lens – Part 5: Cost optimization
    Building a production-ready solution in the cloud involves a series of trade-off between resources, time, customer expectation, and business outcome. The AWS Well-Architected Framework helps you understand the benefits and risks of decisions you make while building workloads on AWS. An intelligent document processing (IDP) project usually combines optical character recognition (OCR) and natural language […]  ( 13 min )
    Build well-architected IDP solutions with a custom lens – Part 6: Sustainability
    An intelligent document processing (IDP) project typically combines optical character recognition (OCR) and natural language processing (NLP) to automatically read and understand documents. Customers across all industries run IDP workloads on AWS to deliver business value by automating use cases such as KYC forms, tax documents, invoices, insurance claims, delivery reports, inventory reports, and more. […]  ( 11 min )
    How Amazon Search M5 saved 30% for LLM training cost by using AWS Trainium
    For decades, Amazon has pioneered and innovated machine learning (ML), bringing delightful experiences to its customers. From the earliest days, Amazon has used ML for various use cases such as book recommendations, search, and fraud detection. Similar to the rest of the industry, the advancements of accelerated hardware have allowed Amazon teams to pursue model […]  ( 11 min )
    Geospatial generative AI with Amazon Bedrock and Amazon Location Service
    Today, geospatial workflows typically consist of loading data, transforming it, and then producing visual insights like maps, text, or charts. Generative AI can automate these tasks through autonomous agents. In this post, we discuss how to use foundation models from Amazon Bedrock to power agents to complete geospatial tasks. These agents can perform various tasks […]  ( 11 min )
  • Open

    One-Minute Daily AI News 11/22/2023
    Anthropic just released Claude 2.1, an improvement on its flagship large language model that keeps it competitive with the GPT series — and now has the useful added feature of “being developed by a company not actively at war with itself.”[1] Google has announced that its Bard AI chatbot can now answer questions about YouTube videos.[2] United States Endorses Responsible AI Measures for Global Militaries.[3] TikTok parent company used AI to optimize Linux kernel, boosting performance and efficiency.[4] Sources: [1] https://techcrunch.com/2023/11/21/anthropic-claude-2-1/ [2] https://techcrunch.com/2023/11/22/googles-bard-ai-chatbot-can-now-answer-questions-about-youtube-videos/ [3] https://www.defense.gov/News/News-Stories/Article/Article/3597093/united-states-endorses-responsible-ai-measures-for-global-militaries/ [4] https://www.tomshardware.com/news/chinese-company-uses-ai-to-optimize-linux-kernel submitted by /u/Excellent-Target-847 [link] [comments]  ( 1 min )
    Don't listen to Legends about a guy, but to the guy himself
    submitted by /u/the_anonymizer [link] [comments]
    Sam Altman Returns to OpenAI
    ​ https://preview.redd.it/ol8wihp76y1c1.png?width=2338&format=png&auto=webp&s=86bbc780575d54531d3a287da22c3be25a0b80d1 After 5 days of being on his ouster, Sam Altman has successfully been reinstated. Just this morning, OpenAI announced the news on Twitter (X): ​ https://preview.redd.it/h1h0k6na6y1c1.png?width=1202&format=png&auto=webp&s=d0df7444668d3a916cba4f3ca467a4fbe1009d8b This announcement was supported by many big people such as the OpenAI employees who said on Twitter (X), "we are so back." ​ https://preview.redd.it/9qgse8nc6y1c1.png?width=1202&format=png&auto=webp&s=1f039c8f5f52451604195771f8dc3fa3c348f5db And Satya Nadella, Microsoft’s chief executive, said on Twitter (X) that he was “encouraged by the changes to OpenAI board,” calling it a “first essential step on a path to more stable, well-informed, and effective governance.” ​ https://preview.redd.it/3q0iakxe6y1c1.png?width=1202&format=png&auto=webp&s=44a50a986681bdffd0361f9547f5bd4948e5b710 Check out more in this New York Times Article: https://www.nytimes.com/2023/11/22/technology/openai-sam-altman-returns.html submitted by /u/ThatNoCodeGuy [link] [comments]  ( 1 min )
    Ilya Sutskever: The next Oppenhiemer?
    If AGI is accomplished by OpenAI will Ilya be considered the next Oppenhiemer? submitted by /u/Alone_Ad6784 [link] [comments]  ( 1 min )
    ChatGPT, invent future doctrines for superintelligent AI.
    submitted by /u/Philipp [link] [comments]  ( 1 min )
    Google Earth automatic tracing possible?
    I'd like to know if I (or someone with actual talent, since I have next to none) can create an ai tool to quickly measure the perimeters of buildings? Or if something like this already exists (right now, I just do it all manually, very time consuming)... Of note, I have essentially 0 programming experience, even tho ai fascinates me, I just can't seem to grasp how to utilize it. submitted by /u/FormerSBO [link] [comments]
    Is there a FOSS machine learning voice cloning tool for TTS?
    Basically, I'd like to try cloning my own voice for use with text-to-speech. Instead of paying to use a web-based service, I was wondering if I can basically download a model, train it with my own voice and then generate speech via text with it? I'm pretty knowledgeable with computers being in IT but I'm entirely inexperienced in machine learning and AI other than using pre-made services like ChatGPT, Google Bard, UberDuck TTS, Gigapixel AI, Mid journey, DALL-E and so-on and so-forth. If there is such a tool like this with good quality (even if I have to provide a LOT of audio to train it, that's ok), could you please point me in the direction I should be looking? If possible to a tutorial as well, hahah. Thank you! submitted by /u/IDE_IS_LIFE [link] [comments]
    AI Tool to Take an Excel Workbook and Turn into an Illustration and Design
    I've been searching for an AI tool that can take our firm's projection models and convert them into some kind of design... Any thoughts? submitted by /u/ninyaha [link] [comments]  ( 1 min )
    Sam happily returning to OpenAI more powerful than ever after this weekend
    submitted by /u/ptitrainvaloin [link] [comments]
    Each neuron has 10,000 connections too other neurons, and there are 100 billion neurons in the human brain. How close is AI to that level of connectivity?
    As we venture deeper into the era of artificial intelligence, we're continually debating several aspects of its ability, emergence, potential, trajectory, and timeline. With so many companies/people doing so many things, it's hard to keep up with, or even understand what is actually going on. I personally subscribe to the idea that hardware needs to evolve along with the coding architecture/algorithms that current AI is made of. But I don't know how to translate the terms and numbers used to describe the progress in the first place, which is true for most people. The relatively few of you out there who do know what these numbers and definitions mean have conversations and arguments that go above most people's heads. So I'm making this post to ask, as well as create conversation around how the progress of AI, in terms of connectivity, compares to the human brain. I am really hoping we can do this in such a way for everyone to understand. I'm also aware that mimicking the human brain isn't necessarily the goal with AI, and that the definition of AGI doesn't require it to be human-like, but rather human-level, generally. Thanks for reading. I'm looking forward to the responses. :) submitted by /u/SlowCrates [link] [comments]  ( 1 min )
    "AGI: Shaping Tomorrow's World"
    submitted by /u/Fit-Code-5141 [link] [comments]
    A developer made 140,000$ in 3 months with his AI wrapper before Stripe shut him down.
    submitted by /u/No-Net-4704 [link] [comments]  ( 1 min )
    The Rube Goldberg Singularity Machine.
    submitted by /u/Philipp [link] [comments]  ( 1 min )
    Breaking! Hogwarts Internal Surveillance Leak
    submitted by /u/Sea_Permit5660 [link] [comments]  ( 1 min )
    How AI image generators produced images using the same prompts.
    So I tried to use 5 image generators for this prompt: "A fairy with a cute dog with mountains in the background in anime" and look which one produced an output that is in HD and crisp in quality. https://preview.redd.it/4n7enbruhv1c1.png?width=1287&format=png&auto=webp&s=0480202d121d401c26d0823298a3d1d8e157542a submitted by /u/Objective_Pipe_7388 [link] [comments]  ( 1 min )
    Growing up with Generative AI.
    Remember growing up with MS Paint, MS Word, or Solitaire as the norm? Well, today's kids are growing up with tools that can create any image or art piece with a simple command. They also have access to powerful GPTs that can write thousands of Harry Potter books in just minutes. The best part? These tools come pre-installed for free on every basic Computer. Just like MS Paint was. Makes you kind of wonder how that generation will see the world. Imagine how simple things like art class and what they're going to be when your PC can create, not manipulate, but create anything you can imagine.. Or literature class. Took me back a bit, thought I'd share and see what you guys think. Full article and credit to : https://www.linkedin.com/pulse/growing-up-generative-ai-emmanuel-kasigazi-oboyf?utm_source=share&utm_medium=member_android&utm_campaign=share_via submitted by /u/Pay-Me-No-Mind [link] [comments]  ( 1 min )
    OpenAI Episode 5: Sam Altman to return as OpenAI CEO with new board members
    submitted by /u/Excellent-Target-847 [link] [comments]  ( 1 min )
    Sam Altman has officially returned as CEO of OpenAI.
    submitted by /u/blaine__ [link] [comments]  ( 1 min )
    One-Minute Daily AI News 11/21/2023
    Microsoft releases Orca 2, a pair of small language models that outperform larger counterparts.[1] OpenAI staff make jokes about CEO drama as they announce new ChatGPT voice feature.[2] OpenAI staff threaten to quit en masse unless Sam Altman is reinstated.[3] Australia to force social media companies to crack down on ‘emerging harms’ of AI deep fakes and hate speech.[4] IBM launches free Artificial Intelligence training courses.[5] Sam Altman to return as OpenAI CEO with new board members.[6] Sources: [1] https://venturebeat.com/ai/microsoft-releases-orca-2-a-pair-of-small-language-models-that-outperform-larger-counterparts/ [2] https://siliconangle.com/2023/11/21/openai-staff-make-jokes-ceo-drama-announce-new-chatgpt-voice-feature/ [3] https://www.theguardian.com/technology/2023/nov/20/openai-staff-walkout-sam-altman-board [4] https://www.theguardian.com/australia-news/2023/nov/22/australia-social-media-ai-laws-crackdown-esafety [5] https://www.rte.ie/news/2023/1122/1417785-ibm-ai-training/ [6] https://www.washingtonpost.com/technology/2023/11/22/sam-altman-back-openai/ submitted by /u/Excellent-Target-847 [link] [comments]  ( 1 min )
    Debate: How much will AI change movies & music? A writer says "some", an engineer says "all".
    submitted by /u/slhamlet [link] [comments]
  • Open

    [D] RAG Evaluation: LlamaIndex is generally better than OpenAI Assistants
    TLDR; OpenAI's Assistants API: Showcases high efficiency with single documents but faces challenges when managing multiple documents. LlamaIndex: Exhibits consistent performance across both single and multiple document formats, outshining OpenAI in more complex scenarios. https://www.tonic.ai/blog/rag-evaluation-series-validating-rag-performance-openai-vs-llamaindex submitted by /u/ian4321 [link] [comments]  ( 9 min )
    [D] Looking for mentor to collaborate and publish at top vision conference venues
    I'm a PhD student working in computer vision, having a rather "aloof" PI who isn't really interested in guiding his students. Furthermore, everyone else in my department is working on unrelated areas, so I don't have an opportunity to collaborate and learn from them. To cut a long story short, I don't wish to quit the PhD because I really like the field, but I don't have the option to switch PIs. Also, I come from a non-CS background, and I know from first-hand experience how hard it is to get into a CS PhD program with a non-traditional background. With the above premise, I'd really like to do good research and publish at top vision venues like CVPR, ICCV, ECCV, and I'm looking for mentors/collaborators who have experience doing this: - I won't expect much handholding and am capable of doing things with just a little guidance: Though not highly impactful work, I recently worked all on my own (from idea to experiments to writeup) on a paper and got it accepted at a vision conference. In addition to that, I have co-authored a couple other papers as well. - I won't take much of your time: A short weekly call to discuss progress and next steps is all I need. - The idea is that we work on something that interests both of us: I'm open to working on anything as long as it's vision related. You probably have side ideas that you have had little time to experiment on. I can do the grunt work of implementations and experiments and if it turns out to be publication-worthy, that's a win for both of us. If this sounds like something you're interested in, please reach out. :) submitted by /u/fullgoopy_alchemist [link] [comments]  ( 10 min )
    [P] Classification of Active Parallel Pumps for the Highest Efficiency in a Parallel Pumping System
    I have data of the system efficiency at different flows and pressures for a parallel pumping system with either 1 or 2 active pumps. I have data of the efficiency for both 1 and 2 active pumps in the somewhat same flow/pressure area. To optimize the efficiency, my first thought was to remove the data point of the output label (number of active pumps) with the lowest efficiency when comparing it to the surrounding data points. After removing the necessary data points, i'd train a neural network to output the optimal number of active pumps for a given pressure and flow. I'm very new to Machine Learning, so i was wondering if there is another way where i don't have to sort the data this way myself, but rather let some sort of algorithm do it for me? Also, is this the same approach you'd use? submitted by /u/vision_dev [link] [comments]  ( 9 min )
    [D] Unlocking Minds: The Pulse of America on AI Integration - A Nationwide Survey on Attitudes, Hopes, and Fears!
    Hello, thanks for visiting this Reddit page. Please take a moment to complete the survey. The purpose of this survey is to collect data from the community to present to my college class. https://www.surveymonkey.com/r/JRQKPSY submitted by /u/Guilty_Chipmunk_3471 [link] [comments]  ( 9 min )
    [D] Estimating transition probabilities and their ranges
    Hello everyone, I hope this subreddit is the right place to seek help! I have a system with multiple states (N) that can transition from one state to another at every discrete time increment, or stay in the same one. I want to obtain a good estimate of the transition probabilities of the system. I have some data that allows the creation of a transition matrix, treating the problem as a Markov chain. However, there are extra covariates that I would like to use to further "segment" the states. By doing so, I may end up with quite little data, and I'm not confident enough that I would be able to represent the actual system accurately. One solution I thought of was to create a multinomial classifier that, given these extra covariates, provides a probability for each (next) state. However, I…  ( 10 min )
    [D] Instance weighting vs weighted loss function (cost-sensitive learning)
    Hi everyone, For those of you who are familiar with cost-sensitive learning, I know that there are multiple methods including instance reweighting and weighted loss function. Can someone please explain to me how they are different? for example, for a logistic regression, how is instance weighting different that weighted cross entropy (in any possible aspect)? I think I made a giant mistake in my end of studies project and I would appreciate any clarification and explanation as it is really urgent! thanks everyone! submitted by /u/Embarrassed_Trust135 [link] [comments]  ( 1 min )
    [D] Any Open Source Tools for Collecting RLHF data in live AI chats?
    Hey folks, I've been looking into tools designed to collect user feedback on responses given by an AI model in a chat. The picture I have in my mind is a chat interface where the user interacts with a trained and chat-finetuned model. For each response the user has the option to rate it as good/bad (possibly more) and optionally provide what the correct answer should have been. Annotated conversations then get stored and can be later used to further fine-tune the model with RLHF. Essentially, the kind of interface ChatGPT has with the little thumbs up / thumbs down buttons for every response. The key aspect of the tool that I am trying to find is that it's a live chat with a model, that can handle actual user queries with the added option of rating every response. My current search has led me to a couple of data annotation companies and a single open-source tool. I am not looking for a paid data annotation platform or data annotators, at least not at the moment. The single open source tool I found is called Xtreme1, but the documentation around RLHF data annotation seems to be missing and it looks to be a tool where you can post-process the data, where as I am looking to give users the option to provide feedback right in the chat. Does anybody know of any open source tools that can help with that? I am perfectly fine with spending some time putting a few different tools together if that's what it takes, but don't have the necessary front-end expertise to implement something usable on my own. ​ ​ submitted by /u/lightSpeedBrick [link] [comments]  ( 10 min )
    Blog post with reviewing research articles on prompt engineering [R]
    Hello, I've written a blog article about prompt engineering techniques if you are interested. In it, I review various techniques for crafting prompts for Large Language Models such as GPT or LLAMA. https://thethoughtprocess.xyz/language/en/science-en/computer-science/artificial-intelligence/2023/11/22/prompt-engineering-how-to-talk-to-ais-like-chatgpt/ submitted by /u/Addicted_to_math [link] [comments]  ( 9 min )
    [D] The Status of Open Source Code LLMs
    I've been pondering something recently. Did you notice that achieving over 70% on the well-known HumanEval pass@1 hasn't been making major headlines? Models like WizardCoderV2, Phind, Deepseek, and XwinCoder have all surpassed the 67% reported in GPT-4’s report. Some of them are even closely tailing the 82% of GPT-4 API’s. So, are these models really performing that well? Here's something intriguing: I found this image in the latest release of XwinCoder’s repo: Xwin-LM/Xwin-Coder at main · Xwin-LM/Xwin-LM (github.com) ​ Results in XwinCoder repo It shows that GPT-4 achieves a 60% pass@1 on APPS-introductory, which is higher than CodeLLaMA-34B’s pass@100 (56.3) and XwinCoder-34B’s pass@5 (43.0). Interesting, isn't it? This suggests that judging a model based on a single benchmark might not provide the full picture. This leads me to a couple of questions: What exactly is the gap here? How can we definitively say one model outperforms another? How are other recent models performing on benchmarks like APPS and DS1000? I'm interested in hearing your thoughts on this. Has anyone experimented with these new models? What was your experience like? submitted by /u/RealAGIFan [link] [comments]
    [D] Using LLMs as an evaluator
    Hey everyone, i made this blog post of how I built an evaluator using OpenAI Assistants. An evaluator is basically an LLM agent to evaluate LLM applications. I thought it would be nice to share and get your personal take on this subject:) https://medium.com/@jeffreyip54/why-openai-assistants-is-a-big-win-for-llm-evaluation-6821e2a872db By the way, i mentioned my company in the post but please ignore it, not intending to promote anything. submitted by /u/Ok_Constant_9886 [link] [comments]  ( 9 min )
    [D] How are Tome.ai and these other LLM powered presentation builders working under the hood?
    I would assume they just have some prompt they are handing to GPT4 (or some other model), with instructions to build it into a deck outline. But how is that deck outline produced by the LLM then getting translated into visuals in powerpoint? submitted by /u/Alert_Invite_9205 [link] [comments]  ( 9 min )
    AcademicGPT: Empowering Academic Research
    submitted by /u/Jean-Porte [link] [comments]  ( 9 min )
    [D] How do you keep up ?
    I started my PhD in NLP a year or so before the advent of Transformers, and finished it just as ChatGPT was unveiled (literally defended a week before). Halfway through, I felt the sudden acceleration of NLP, where there was so much everywhere all at once. Before, knowing one's domain, and the state-of-the-art GCN, CNN or Bert architectures, was enough. Since, I've been working in a semi-related area (computer assisted humanities) as a data engineer/software developer/ML engineer (it's a small team so many hats). Not much in terms of latest news, so I tried recently to get up to speed with the recent developments. But there are so many ! Everywhere. Even just in NLP, not considering all the other fields such as reinforcement learning, computer vision, all the fundamentals of ML etc. It is damn near impossible to gather an in-depth understanding of a model as they are so complex, and numerous. All of them are built on top of other ones, so you also need to read up on those to understand anything. I follow some people on LinkedIn who just give new names every week or so. Going to look for papers in top conferences is also daunting as there is no guarantee that a paper with an award will translate to an actual system, while companies churn out new architectures without the research paper/methodology being made public. It's overwhelming. So I guess my question is two fold : how does one get up to speed after a year of not being too much in the field ? And how does one keep up after that ? submitted by /u/CursedCrystalCoconut [link] [comments]  ( 10 min )
    [P] New iteration of predictive algorithm (deodel2)
    The deodel project implements a family of shallow predictive algorithms closely related to nearest neighbors. The new iteration improves the processing of numerical attributes. The original version simply discretized the values and compared them as if they were categorical. This didn't take into account the distance separating the two compared values. Whether adjacent or at the extreme of their range, they would be categorized only as a mismatch. Deodel2 takes this into account and substantially improves accuracy for datasets containing many continuous attribute values. If the dataset consists only of categorical attributes, the accuracy will be identical for the two versions. As most datasets do contain lots of numerical attributes, the accuracy increases substantially and approaches that of ensemble algorithms like Random Forest. https://github.com/c4pub/deodel It should be useful in establishing a predictive accuracy baseline for raw/unclean datasets. submitted by /u/eppursim1 [link] [comments]  ( 9 min )
    [R] GPQA: A Graduate-Level Google-Proof Q&A Benchmark
    Paper: https://arxiv.org/abs/2311.12022 Code and data: https://github.com/idavidrein/gpqa/ Abstract: We present GPQA, a challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. We ensure that the questions are high-quality and extremely difficult: experts who have or are pursuing PhDs in the corresponding domains reach 65% accuracy (74% when discounting clear mistakes the experts identified in retrospect), while highly skilled non-expert validators only reach 34% accuracy, despite spending on average over 30 minutes with unrestricted access to the web (i.e., the questions are "Google-proof"). The questions are also difficult for state-of-the-art AI systems, with our strongest GPT-4 based baseline achieving 39% accuracy. If we are to use future AI systems to help us answer very hard questions, for example, when developing new scientific knowledge, we need to develop scalable oversight methods that enable humans to supervise their outputs, which may be difficult even if the supervisors are themselves skilled and knowledgeable. The difficulty of GPQA both for skilled non-experts and frontier AI systems should enable realistic scalable oversight experiments, which we hope can help devise ways for human experts to reliably get truthful information from AI systems that surpass human capabilities. ​ submitted by /u/APaperADay [link] [comments]  ( 9 min )
    [D] Are neural processes still relevant?
    There was some hype around neural processes which seem like a marriage between non-parametric models like GPs and NN but have not heard anything about them recently. Have they been superseded by something else? I have some spatio temporal data that I am looking to play with and wonder if this is a good way to dive into them but want to convince myself that they are still a somewhat relevant family of models to spend time on. submitted by /u/deluded_soul [link] [comments]  ( 9 min )
    [R] Exponentially Faster Language Modelling
    TL;DR: Organize your neurons into a tree to get 78x faster inference (theoretical limit is 341x). This was demonstrated on BERT-base, where this change preserved 96% of its downstream GLUE performance. For a quick comparison, DistilBERT offers 1.6x acceleration while preserving 97% of GLUE performance. This is a HuggingFace Featured Paper from 11/21/2023. Paper: https://arxiv.org/abs/2311.10770 Code: https://github.com/pbelcak/UltraFastBERT Model: https://huggingface.co/pbelcak/UltraFastBERT-1x11-long Abstract: Language models only really need to use an exponential fraction of their neurons for individual inferences. As proof, we present UltraFastBERT, a BERT variant that uses 0.3% of its neurons during inference while performing on par with similar BERT models. UltraFastBERT selectively engages just 12 out of 4095 neurons for each layer inference. This is achieved by replacing feedforward networks with fast feedforward networks (FFFs). While no truly efficient implementation currently exists to unlock the full acceleration potential of conditional neural execution, we provide high-level CPU code achieving 78x speedup over the optimized baseline feedforward implementation, and a PyTorch implementation delivering 40x speedup over the equivalent batched feedforward inference. We publish our training code, benchmarking setup, and model weights. This exponential acceleration was achieved on a 180mn BERT model. Just imagine how amazing the speedup would be on a multi-bn parameter model such as LLaMA if the tree trick (i.e. "fast feedforward networks") continues to scale up to larger layer sizes... submitted by /u/lexected [link] [comments]  ( 9 min )
    [D] DallE3 paper
    What are your thoughts on the DallE3 “paper” which doesn’t cover technical or architectural details? The only useful takeaway seems to be “higher quality data is better” and “image captioning models that provide a great amount of detail can create good datasets.” submitted by /u/Username912773 [link] [comments]  ( 9 min )
    OpenAI: "We have reached an agreement in principle for Sam to return to OpenAI as CEO" [N]
    OpenAI announcement: "We have reached an agreement in principle for Sam to return to OpenAI as CEO with a new initial board of Bret Taylor (Chair), Larry Summers, and Adam D'Angelo. We are collaborating to figure out the details. Thank you so much for your patience through this." https://twitter.com/OpenAI/status/1727205556136579362 submitted by /u/we_are_mammals [link] [comments]  ( 9 min )
    [D] Exploring Optimal Approaches for Expanding Context Windows in Language Models
    I'm looking for insights and advice on extending the context window of the LLMs (most specifically Mistral). Whether you're a researcher, developer, or enthusiast in the field, I'd love to hear about your experiences and recommendations. Are there any specific techniques, methodologies, or tools you've found effective in extending the context window for LLMs? Additionally, if you've encountered challenges in this area, how did you overcome them? Any resources, papers, or community discussions you can point me to would be greatly appreciated. submitted by /u/Infamous-Belt8671 [link] [comments]  ( 9 min )
    [R] Researching quantum machine learning applications for drug discovery/development and medical imaging.
    I'm currently on a research team engaged in the National Science Foundation Regional I-Corps program focused on quantum machine learning applications in life scieces. In drug discovery, our research indicates that quantum technology holds the promise of significantly improving binding affinity predictions, potentially revolutionizing the drug discovery process. Our recent publication (https://arxiv.org/abs/2301.06331) outlines an approach to drug discovery that combines the precision of advanced machine learning models with the computational efficiency of quantum computing. In medical imaging, our preliminary findings suggest that quantum technology has the capacity to revolutionize medical imaging analysis, offering faster and more accurate results that could significantly enhance the precision of diagnoses and treatment outcomes. As we delve deeper into our research, we are seeking insights from industry professionals to better understand the current challenges in each of these industries. Here are a few of my questions. If you’re interested in discussing in more depth, please DM me! For Drug Discovery: Are you currently using any AI-based tool to perform target identification and validation? What types of classical models are being used? What computational analysis is currently performed before a new drug goes through clinical trials - What problems do you face in target validation? Can accurate binding affinity predictions speed-up the target validation process? For Medical Imaging: What are the current challenges for making accurate diagnostics with medical imaging? What is the current state-of-the-art for computer-assisted tools that automatically generate diagnostics? What are their weaknesses? What are the most challenging diseases to diagnose based on medical imaging? Thanks in advance :) submitted by /u/ingenii_quantum_ml [link] [comments]  ( 1 min )
  • Open

    Research Focus: Week of November 22, 2023
    A new deep-learning compiler for dynamic sparsity; Tongue Tap could make tongue gestures viable for VR/AR headsets; Ranking LLM-Generated Loop Invariants for Program Verification; Assessing the limits of zero-shot foundation models in single-cell biology. The post Research Focus: Week of November 22, 2023 appeared first on Microsoft Research.  ( 10 min )
  • Open

    Improving simulations of clouds and their effects on climate
    Posted by Tapio Schneider, Visiting Researcher, and Yi-fan Chen, Engineering Lead, Google Research Today’s climate models successfully capture broad global warming trends. However, because of uncertainties about processes that are small in scale yet globally important, such as clouds and ocean turbulence, these models’ predictions of upcoming climate changes are not very accurate in detail. For example, predictions of the time by which the global mean surface temperature of Earth will have warmed 2℃, relative to preindustrial times, vary by 40–50 years (a full human generation) among today’s models. As a result, we do not have the accurate and geographically granular predictions we need to plan resilient infrastructure, adapt supply chains to climate disruption, and assess the risks of…  ( 91 min )
  • Open

    Teenage Dream: Aspiring Computer Science Major Experiences NVIDIA Life With Make-A-Wish Visit
    A calendar packed with meetings, calls and lab visits may sound like a typical workday for many — but for Luca Lofranco, whose greatest wish was to experience what it’s like to work at NVIDIA, it was a dream come true. Eighteen-year-old Lofranco recently traveled from his hometown near Toronto, Canada, to spend the day Read article >  ( 6 min )
    AI-Powered Tech Company Helps Grocers Start Afresh in Supply Chain Management
    Talk about going after low-hanging fruit. Afresh is an AI startup that helps grocery stores and retailers reduce food waste by making supply chains more efficient. In the latest episode of NVIDIA’s AI Podcast, host Noah Kravitz spoke with the company’s cofounder and president, Nathan Fenner, about its mission, offerings and the greater challenge of Read article >  ( 5 min )
  • Open

    Predictive auxiliary objectives in deep RL mimic learning in the brain
    Paper: https://arxiv.org/abs/2310.06089 Abstract: The ability to predict upcoming events has been hypothesized to comprise a key aspect of natural and machine cognition. This is supported by trends in deep reinforcement learning (RL), where self-supervised auxiliary objectives such as prediction are widely used to support representation learning and improve task performance. Here, we study the effects predictive auxiliary objectives have on representation learning across different modules of an RL system and how these mimic representational changes observed in the brain. We find that predictive objectives improve and stabilize learning particularly in resource-limited architectures, and we identify settings where longer predictive horizons better support representational transfer. Furthermore, we find that representational changes in this RL system bear a striking resemblance to changes in neural activity observed in the brain across various experiments. Specifically, we draw a connection between the auxiliary predictive model of the RL system and hippocampus, an area thought to learn a predictive model to support memory-guided behavior. We also connect the encoder network and the value learning network of the RL system to visual cortex and striatum in the brain, respectively. This work demonstrates how representation learning in deep RL systems can provide an interpretable framework for modeling multi-region interactions in the brain. The deep RL perspective taken here also suggests an additional role of the hippocampus in the brain -- that of an auxiliary learning system that benefits representation learning in other regions. https://preview.redd.it/bxmygmmt7u1c1.png?width=985&format=png&auto=webp&s=9e5c9dd99249adc48805c4f89e1e4f3cfd94b8e7 submitted by /u/APaperADay [link] [comments]
  • Open

    Designing Interpretable ML System to Enhance Trustworthy AI in Healthcare: A Systematic Review of the Last Decade to A Proposed Robust Framework. (arXiv:2311.11055v1 [cs.AI])
    AI-based medical technologies, including wearables, telemedicine, LLMs, and digital care twins, significantly impact healthcare. Ensuring AI results are accurate and interpretable is crucial, especially for clinicians. This paper reviews processes and challenges of interpretable ML (IML) and explainable AI (XAI) in healthcare. Objectives include reviewing XAI processes, methods, applications, and challenges, with a focus on quality control. The IML process is classified into data pre-processing interpretability, interpretable modeling, and post-processing interpretability. The paper aims to establish the importance of robust interpretability in healthcare through experimental results, providing insights for creating communicable clinician-AI tools. Research questions, eligibility criteria, and goals were identified following PRISMA and PICO methods. PubMed, Scopus, and Web of Science were systematically searched using specific strings. The survey introduces a step-by-step roadmap for implementing XAI in clinical applications, addressing existing gaps and acknowledging XAI model limitations.  ( 2 min )
    A Quadratic Speedup in Finding Nash Equilibria of Quantum Zero-Sum Games. (arXiv:2311.10859v1 [quant-ph])
    Recent developments in domains such as non-local games, quantum interactive proofs, and quantum generative adversarial networks have renewed interest in quantum game theory and, specifically, quantum zero-sum games. Central to classical game theory is the efficient algorithmic computation of Nash equilibria, which represent optimal strategies for both players. In 2008, Jain and Watrous proposed the first classical algorithm for computing equilibria in quantum zero-sum games using the Matrix Multiplicative Weight Updates (MMWU) method to achieve a convergence rate of $\mathcal{O}(d/\epsilon^2)$ iterations to $\epsilon$-Nash equilibria in the $4^d$-dimensional spectraplex. In this work, we propose a hierarchy of quantum optimization algorithms that generalize MMWU via an extra-gradient mechanism. Notably, within this proposed hierarchy, we introduce the Optimistic Matrix Multiplicative Weights Update (OMMWU) algorithm and establish its average-iterate convergence complexity as $\mathcal{O}(d/\epsilon)$ iterations to $\epsilon$-Nash equilibria. This quadratic speed-up relative to Jain and Watrous' original algorithm sets a new benchmark for computing $\epsilon$-Nash equilibria in quantum zero-sum games.  ( 2 min )
    Time-Series Forecasting: Unleashing Long-Term Dependencies with Fractionally Differenced Data. (arXiv:2309.13409v3 [cs.LG] UPDATED)
    This study introduces a novel forecasting strategy that leverages the power of fractional differencing (FD) to capture both short- and long-term dependencies in time series data. Unlike traditional integer differencing methods, FD preserves memory in series while stabilizing it for modeling purposes. By applying FD to financial data from the SPY index and incorporating sentiment analysis from news reports, this empirical analysis explores the effectiveness of FD in conjunction with binary classification of target variables. Supervised classification algorithms were employed to validate the performance of FD series. The results demonstrate the superiority of FD over integer differencing, as confirmed by Receiver Operating Characteristic/Area Under the Curve (ROCAUC) and Mathews Correlation Coefficient (MCC) evaluations.  ( 2 min )
    Meta-Path Learning for Multi-relational Graph Neural Networks. (arXiv:2309.17113v2 [cs.LG] UPDATED)
    Existing multi-relational graph neural networks use one of two strategies for identifying informative relations: either they reduce this problem to low-level weight learning, or they rely on handcrafted chains of relational dependencies, called meta-paths. However, the former approach faces challenges in the presence of many relations (e.g., knowledge graphs), while the latter requires substantial domain expertise to identify relevant meta-paths. In this work we propose a novel approach to learn meta-paths and meta-path GNNs that are highly accurate based on a small number of informative meta-paths. Key element of our approach is a scoring function for measuring the potential informativeness of a relation in the incremental construction of the meta-path. Our experimental evaluation shows that the approach manages to correctly identify relevant meta-paths even with a large number of relations, and substantially outperforms existing multi-relational GNNs on synthetic and real-world experiments.  ( 2 min )
    SALSA-CLRS: A Sparse and Scalable Benchmark for Algorithmic Reasoning. (arXiv:2309.12253v2 [cs.LG] UPDATED)
    We introduce an extension to the CLRS algorithmic learning benchmark, prioritizing scalability and the utilization of sparse representations. Many algorithms in CLRS require global memory or information exchange, mirrored in its execution model, which constructs fully connected (not sparse) graphs based on the underlying problem. Despite CLRS's aim of assessing how effectively learned algorithms can generalize to larger instances, the existing execution model becomes a significant constraint due to its demanding memory requirements and runtime (hard to scale). However, many important algorithms do not demand a fully connected graph; these algorithms, primarily distributed in nature, align closely with the message-passing paradigm employed by Graph Neural Networks. Hence, we propose SALSA-CLRS, an extension of the current CLRS benchmark specifically with scalability and sparseness in mind. Our approach includes adapted algorithms from the original CLRS benchmark and introduces new problems from distributed and randomized algorithms. Moreover, we perform a thorough empirical evaluation of our benchmark. Code is publicly available at https://github.com/jkminder/SALSA-CLRS.  ( 2 min )
    Intrinsically motivated graph exploration using network theories of human curiosity. (arXiv:2307.04962v3 [cs.LG] UPDATED)
    Intrinsically motivated exploration has proven useful for reinforcement learning, even without additional extrinsic rewards. When the environment is naturally represented as a graph, how to guide exploration best remains an open question. In this work, we propose a novel approach for exploring graph-structured data motivated by two theories of human curiosity: the information gap theory and the compression progress theory. The theories view curiosity as an intrinsic motivation to optimize for topological features of subgraphs induced by nodes visited in the environment. We use these proposed features as rewards for graph neural-network-based reinforcement learning. On multiple classes of synthetically generated graphs, we find that trained agents generalize to longer exploratory walks and larger environments than are seen during training. Our method computes more efficiently than the greedy evaluation of the relevant topological properties. The proposed intrinsic motivations bear particular relevance for recommender systems. We demonstrate that next-node recommendations considering curiosity are more predictive of human choices than PageRank centrality in several real-world graph environments, including MovieLens, Amazon Books, and Wikipedia.  ( 3 min )
    Online Two-stage Thermal History Prediction Method for Metal Additive Manufacturing of Thin Walls. (arXiv:2310.16125v2 [cs.LG] UPDATED)
    This paper aims to propose an online two-stage thermal history prediction method, which could be integrated into a metal AM process for performance control. Based on the similarity of temperature curves (curve segments of a temperature profile of one point) between any two successive layers, the first stage of the proposed method designs a layer-to-layer prediction model to estimate the temperature curves of the yet-to-print layer from measured temperatures of certain points on the previously printed layer. With measured/predicted temperature profiles of several points on the same layer, the second stage proposes a reduced order model (ROM) (intra-layer prediction model) to decompose and construct the temperature profiles of all points on the same layer, which could be used to build the temperature field of the entire layer. The training of ROM is performed with an extreme learning machine (ELM) for computational efficiency. Fifteen wire arc AM experiments and nine simulations are designed for thin walls with a fixed length and unidirectional printing of each layer. The test results indicate that the proposed prediction method could construct the thermal history of a yet-to-print layer within 0.1 seconds on a low-cost desktop computer. Meanwhile, the method has acceptable generalization capability in most cases from lower layers to higher layers in the same simulation, as well as from one simulation to a new simulation on different AM process parameters. More importantly, after fine-tuning the proposed method with limited experimental data, the relative errors of all predicted temperature profiles on a new experiment are smaller than 0.09, which demonstrates the applicability and generalization of the proposed two-stage thermal history prediction method in online applications for metal AM.  ( 3 min )
    Beyond IID weights: sparse and low-rank deep Neural Networks are also Gaussian Processes. (arXiv:2310.16597v2 [stat.ML] UPDATED)
    The infinitely wide neural network has been proven a useful and manageable mathematical model that enables the understanding of many phenomena appearing in deep learning. One example is the convergence of random deep networks to Gaussian processes that allows a rigorous analysis of the way the choice of activation function and network weights impacts the training dynamics. In this paper, we extend the seminal proof of Matthews et al. (2018) to a larger class of initial weight distributions (which we call PSEUDO-IID), including the established cases of IID and orthogonal weights, as well as the emerging low-rank and structured sparse settings celebrated for their computational speed-up benefits. We show that fully-connected and convolutional networks initialized with PSEUDO-IID distributions are all effectively equivalent up to their variance. Using our results, one can identify the Edge-of-Chaos for a broader class of neural networks and tune them at criticality in order to enhance their training.  ( 2 min )
    Building a Safer Maritime Environment Through Multi-Path Long-Term Vessel Trajectory Forecasting. (arXiv:2310.18948v2 [cs.LG] UPDATED)
    Maritime transportation is paramount in achieving global economic growth, entailing concurrent ecological obligations in sustainability and safeguarding endangered marine species, most notably preserving large whale populations. In this regard, the Automatic Identification System (AIS) data plays a significant role by offering real-time streaming data on vessel movement, allowing enhanced traffic monitoring. This study explores using AIS data to prevent vessel-to-whale collisions by forecasting long-term vessel trajectories from engineered AIS data sequences. For such a task, we have developed an encoder-decoder model architecture using Bidirectional Long Short-Term Memory Networks (Bi-LSTM) to predict the next 12 hours of vessel trajectories using 1 to 3 hours of AIS data as input. We feed the model with probabilistic features engineered from historical AIS data that refer to each trajectory's potential route and destination. The model then predicts the vessel's trajectory, considering these additional features by leveraging convolutional layers for spatial feature learning and a position-aware attention mechanism that increases the importance of recent timesteps of a sequence during temporal feature learning. The probabilistic features have an F1 Score of approximately 85% and 75% for each feature type, respectively, demonstrating their effectiveness in augmenting information to the neural network. We test our model on the Gulf of St. Lawrence, a region known to be the habitat of North Atlantic Right Whales (NARW). Our model achieved a high R2 score of over 98% using various techniques and features. It stands out among other approaches as it can make complex decisions during turnings and path selection. Our study highlights the potential of data engineering and trajectory forecasting models for marine life species preservation.  ( 3 min )
    Investigating Uncertainty Calibration of Aligned Language Models under the Multiple-Choice Setting. (arXiv:2310.11732v2 [cs.LG] UPDATED)
    Despite the significant progress made in practical applications of aligned language models (LMs), they tend to be overconfident in output answers compared to the corresponding pre-trained LMs. In this work, we systematically evaluate the impact of the alignment process on logit-based uncertainty calibration of LMs under the multiple-choice setting. We first conduct a thoughtful empirical study on how aligned LMs differ in calibration from their pre-trained counterparts. Experimental results reveal that there are two distinct uncertainties in LMs under the multiple-choice setting, which are responsible for the answer decision and the format preference of the LMs, respectively. Then, we investigate the role of these two uncertainties on aligned LM's calibration through fine-tuning in simple synthetic alignment schemes and conclude that one reason for aligned LMs' overconfidence is the conflation of these two types of uncertainty. Furthermore, we examine the utility of common post-hoc calibration methods for aligned LMs and propose an easy-to-implement and sample-efficient method to calibrate aligned LMs. We hope our findings could provide insights into the design of more reliable alignment processes for LMs.  ( 2 min )
    Balancing stability and plasticity in continual learning: the readout-decomposition of activation change (RDAC) framework. (arXiv:2310.04741v3 [cs.LG] UPDATED)
    Continual learning (CL) algorithms strive to acquire new knowledge while preserving prior information. However, this stability-plasticity trade-off remains a central challenge. This paper introduces a framework that dissects this trade-off, offering valuable insights into CL algorithms. The Readout-Decomposition of Activation Change (RDAC) framework first addresses the stability-plasticity dilemma and its relation to catastrophic forgetting. It relates learning-induced activation changes in the range of prior readouts to the degree of stability and changes in the null space to the degree of plasticity. In deep non-linear networks tackling split-CIFAR-110 tasks, the framework clarifies the stability-plasticity trade-offs of the popular regularization algorithms Synaptic intelligence (SI), Elastic-weight consolidation (EWC), and learning without Forgetting (LwF), and replay-based algorithms Gradient episodic memory (GEM), and data replay. GEM and data replay preserved stability and plasticity, while SI, EWC, and LwF traded off plasticity for stability. The inability of the regularization algorithms to maintain plasticity was linked to them restricting the change of activations in the null space of the prior readout. Additionally, for one-hidden-layer linear neural networks, we derived a gradient decomposition algorithm to restrict activation change only in the range of the prior readouts, to maintain high stability while not further sacrificing plasticity. Results demonstrate that the algorithm maintained stability without significant plasticity loss. The RDAC framework informs the behavior of existing CL algorithms and paves the way for novel CL approaches. Finally, it sheds light on the connection between learning-induced activation/representation changes and the stability-plasticity dilemma, also offering insights into representational drift in biological systems.  ( 3 min )
    Online Learning with Set-Valued Feedback. (arXiv:2306.06247v3 [cs.LG] UPDATED)
    We study a variant of online multiclass classification where the learner predicts a single label but receives a \textit{set of labels} as feedback. In this model, the learner is penalized for not outputting a label contained in the revealed set. We show that unlike online multiclass learning with single-label feedback, deterministic and randomized online learnability are \textit{not equivalent} in the realizable setting under set-valued feedback. In addition, we show that deterministic and randomized realizable learnability are equivalent if the Helly number of the collection of sets that can be revealed as feedback is finite. In light of this separation, we give two new combinatorial dimensions, named the Set Littlestone and Measure Shattering dimension, whose finiteness characterizes deterministic and randomized realizable learnability respectively. Additionally, these dimensions lower- and upper bound the deterministic and randomized minimax regret in the realizable setting. Going beyond the realizable setting, we prove that the Measure shattering dimension continues to characterize learnability and quantify minimax regret in the agnostic setting. Finally, we use our results to establish bounds on the minimax regret for three practical learning settings: online multilabel ranking, online multilabel classification, and real-valued prediction with interval-valued response.  ( 2 min )
    Hidden Classification Layers: Enhancing linear separability between classes in neural networks layers. (arXiv:2306.06146v2 [cs.LG] UPDATED)
    In the context of classification problems, Deep Learning (DL) approaches represent state of art. Many DL approaches are based on variations of standard multi-layer feed-forward neural networks. These are also referred to as deep networks. The basic idea is that each hidden neural layer accomplishes a data transformation which is expected to make the data representation "somewhat more linearly separable" than the previous one to obtain a final data representation which is as linearly separable as possible. However, determining the appropriate neural network parameters that can perform these transformations is a critical problem. In this paper, we investigate the impact on deep network classifier performances of a training approach favouring solutions where data representations at the hidden layers have a higher degree of linear separability between the classes with respect to standard methods. To this aim, we propose a neural network architecture which induces an error function involving the outputs of all the network layers. Although similar approaches have already been partially discussed in the past literature, here we propose a new architecture with a novel error function and an extensive experimental analysis. This experimental analysis was made in the context of image classification tasks considering four widely used datasets. The results show that our approach improves the accuracy on the test set in all the considered cases.  ( 3 min )
    FedMFS: Federated Multimodal Fusion Learning with Selective Modality Communication. (arXiv:2310.07048v2 [cs.LG] UPDATED)
    Multimodal federated learning (FL) aims to enrich model training in FL settings where devices are collecting measurements across multiple modalities (e.g., sensors measuring pressure, motion, and other types of data). However, key challenges to multimodal FL remain unaddressed, particularly in heterogeneous network settings: (i) the set of modalities collected by each device will be diverse, and (ii) communication limitations prevent devices from uploading all their locally trained modality models to the server. In this paper, we propose Federated Multimodal Fusion learning with Selective modality communication (FedMFS), a new multimodal fusion FL methodology that can tackle the above mentioned challenges. The key idea is the introduction of a modality selection criterion for each device, which weighs (i) the impact of the modality, gauged by Shapley value analysis, against (ii) the modality model size as a gauge for communication overhead. This enables FedMFS to flexibly balance performance against communication costs, depending on resource constraints and application requirements. Experiments on the real-world ActionSense dataset demonstrate the ability of FedMFS to achieve comparable accuracy to several baselines while reducing the communication overhead by over 4x.  ( 2 min )
    Battle of the Backbones: A Large-Scale Comparison of Pretrained Models across Computer Vision Tasks. (arXiv:2310.19909v2 [cs.CV] UPDATED)
    Neural network based computer vision systems are typically built on a backbone, a pretrained or randomly initialized feature extractor. Several years ago, the default option was an ImageNet-trained convolutional neural network. However, the recent past has seen the emergence of countless backbones pretrained using various algorithms and datasets. While this abundance of choice has led to performance increases for a range of systems, it is difficult for practitioners to make informed decisions about which backbone to choose. Battle of the Backbones (BoB) makes this choice easier by benchmarking a diverse suite of pretrained models, including vision-language models, those trained via self-supervised learning, and the Stable Diffusion backbone, across a diverse set of computer vision tasks ranging from classification to object detection to OOD generalization and more. Furthermore, BoB sheds light on promising directions for the research community to advance computer vision by illuminating strengths and weakness of existing approaches through a comprehensive analysis conducted on more than 1500 training runs. While vision transformers (ViTs) and self-supervised learning (SSL) are increasingly popular, we find that convolutional neural networks pretrained in a supervised fashion on large training sets still perform best on most tasks among the models we consider. Moreover, in apples-to-apples comparisons on the same architectures and similarly sized pretraining datasets, we find that SSL backbones are highly competitive, indicating that future works should perform SSL pretraining with advanced architectures and larger pretraining datasets. We release the raw results of our experiments along with code that allows researchers to put their own backbones through the gauntlet here: https://github.com/hsouri/Battle-of-the-Backbones  ( 3 min )
    Selectivity Drives Productivity: Efficient Dataset Pruning for Enhanced Transfer Learning. (arXiv:2310.08782v3 [cs.LG] UPDATED)
    Massive data is often considered essential for deep learning applications, but it also incurs significant computational and infrastructural costs. Therefore, dataset pruning (DP) has emerged as an effective way to improve data efficiency by identifying and removing redundant training samples without sacrificing performance. In this work, we aim to address the problem of DP for transfer learning, i.e., how to prune a source dataset for improved pretraining efficiency and lossless finetuning accuracy on downstream target tasks. To our best knowledge, the problem of DP for transfer learning remains open, as previous studies have primarily addressed DP and transfer learning as separate problems. By contrast, we establish a unified viewpoint to integrate DP with transfer learning and find that existing DP methods are not suitable for the transfer learning paradigm. We then propose two new DP methods, label mapping and feature mapping, for supervised and self-supervised pretraining settings respectively, by revisiting the DP problem through the lens of source-target domain mapping. Furthermore, we demonstrate the effectiveness of our approach on numerous transfer learning tasks. We show that source data classes can be pruned by up to 40% ~ 80% without sacrificing downstream performance, resulting in a significant 2 ~ 5 times speed-up during the pretraining stage. Besides, our proposal exhibits broad applicability and can improve other computationally intensive transfer learning techniques, such as adversarial pretraining. Codes are available at https://github.com/OPTML-Group/DP4TL.  ( 3 min )
    Lag-Llama: Towards Foundation Models for Time Series Forecasting. (arXiv:2310.08278v2 [cs.LG] UPDATED)
    Aiming to build foundation models for time-series forecasting and study their scaling behavior, we present here our work-in-progress on Lag-Llama, a general-purpose univariate probabilistic time-series forecasting model trained on a large collection of time-series data. The model shows good zero-shot prediction capabilities on unseen "out-of-distribution" time-series datasets, outperforming supervised baselines. We use smoothly broken power-laws to fit and predict model scaling behavior. The open source code is made available at https://github.com/kashif/pytorch-transformer-ts.
    Effective Bilevel Optimization via Minimax Reformulation. (arXiv:2305.13153v2 [cs.LG] UPDATED)
    Bilevel optimization has found successful applications in various machine learning problems, including hyper-parameter optimization, data cleaning, and meta-learning. However, its huge computational cost presents a significant challenge for its utilization in large-scale problems. This challenge arises due to the nested structure of the bilevel formulation, where each hyper-gradient computation necessitates a costly inner optimization procedure. To address this issue, we propose a reformulation of bilevel optimization as a minimax problem, effectively decoupling the outer-inner dependency. Under mild conditions, we show these two problems are equivalent. Furthermore, we introduce a multi-stage gradient descent and ascent (GDA) algorithm to solve the resulting minimax problem with convergence guarantees. Extensive experimental results demonstrate that our method outperforms state-of-the-art bilevel methods while significantly reducing the computational cost.  ( 2 min )
    SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial Understanding. (arXiv:2310.15308v2 [cs.CV] UPDATED)
    The landscape of publicly available vision foundation models (VFMs), such as CLIP and Segment Anything Model (SAM), is expanding rapidly. VFMs are endowed with distinct capabilities stemming from their pre-training objectives. For instance, CLIP excels in semantic understanding, while SAM specializes in spatial understanding for segmentation. In this work, we introduce a simple recipe to efficiently merge VFMs into a unified model that absorbs their expertise. Our method integrates techniques of multi-task learning, continual learning, and distillation. Further, it demands significantly less computational cost compared to traditional multi-task training from scratch, and it only needs a small fraction of the pre-training datasets that were initially used to train individual models. By applying our method to SAM and CLIP, we obtain SAM-CLIP: a unified model that combines the capabilities of SAM and CLIP into a single vision transformer. Compared with deploying SAM and CLIP independently, our merged model, SAM-CLIP, reduces storage and compute costs for inference, making it well-suited for edge device applications. We show that SAM-CLIP not only retains the foundational strengths of SAM and CLIP, but also introduces synergistic functionalities, notably in zero-shot semantic segmentation, where SAM-CLIP establishes new state-of-the-art results on 5 benchmarks. It outperforms previous models that are specifically designed for this task by a large margin, including +6.8% and +5.9% mean IoU improvement on Pascal-VOC and COCO-Stuff datasets, respectively.
    Towards Hierarchical Regional Transformer-based Multiple Instance Learning. (arXiv:2308.12634v2 [cs.CV] UPDATED)
    The classification of gigapixel histopathology images with deep multiple instance learning models has become a critical task in digital pathology and precision medicine. In this work, we propose a Transformer-based multiple instance learning approach that replaces the traditional learned attention mechanism with a regional, Vision Transformer inspired self-attention mechanism. We present a method that fuses regional patch information to derive slide-level predictions and show how this regional aggregation can be stacked to hierarchically process features on different distance levels. To increase predictive accuracy, especially for datasets with small, local morphological features, we introduce a method to focus the image processing on high attention regions during inference. Our approach is able to significantly improve performance over the baseline on two histopathology datasets and points towards promising directions for further research.
    Neural Algorithmic Reasoning for Combinatorial Optimisation. (arXiv:2306.06064v2 [cs.NE] UPDATED)
    Solving NP-hard/complete combinatorial problems with neural networks is a challenging research area that aims to surpass classical approximate algorithms. The long-term objective is to outperform hand-designed heuristics for NP-hard/complete problems by learning to generate superior solutions solely from training data. Current neural-based methods for solving CO problems often overlook the inherent "algorithmic" nature of the problems. In contrast, heuristics designed for CO problems, e.g. TSP, frequently leverage well-established algorithms, such as those for finding the minimum spanning tree. In this paper, we propose leveraging recent advancements in neural algorithmic reasoning to improve the learning of CO problems. Specifically, we suggest pre-training our neural model on relevant algorithms before training it on CO instances. Our results demonstrate that by using this learning setup, we achieve superior performance compared to non-algorithmically informed deep learning models.
    Let the Flows Tell: Solving Graph Combinatorial Optimization Problems with GFlowNets. (arXiv:2305.17010v3 [cs.LG] UPDATED)
    Combinatorial optimization (CO) problems are often NP-hard and thus out of reach for exact algorithms, making them a tempting domain to apply machine learning methods. The highly structured constraints in these problems can hinder either optimization or sampling directly in the solution space. On the other hand, GFlowNets have recently emerged as a powerful machinery to efficiently sample from composite unnormalized densities sequentially and have the potential to amortize such solution-searching processes in CO, as well as generate diverse solution candidates. In this paper, we design Markov decision processes (MDPs) for different combinatorial problems and propose to train conditional GFlowNets to sample from the solution space. Efficient training techniques are also developed to benefit long-range credit assignment. Through extensive experiments on a variety of different CO tasks with synthetic and realistic data, we demonstrate that GFlowNet policies can efficiently find high-quality solutions. Our implementation is open-sourced at https://github.com/zdhNarsil/GFlowNet-CombOpt.
    Efficient Graphics Representation with Differentiable Indirection. (arXiv:2309.08387v2 [cs.GR] UPDATED)
    We introduce differentiable indirection -- a novel learned primitive that employs differentiable multi-scale lookup tables as an effective substitute for traditional compute and data operations across the graphics pipeline. We demonstrate its flexibility on a number of graphics tasks, i.e., geometric and image representation, texture mapping, shading, and radiance field representation. In all cases, differentiable indirection seamlessly integrates into existing architectures, trains rapidly, and yields both versatile and efficient results.
    Revisiting and Advancing Adversarial Training Through A Simple Baseline. (arXiv:2306.07613v2 [cs.CV] UPDATED)
    In this paper, we delve into the essential components of adversarial training which is a pioneering defense technique against adversarial attacks. We indicate that some factors such as the loss function, learning rate scheduler, and data augmentation, which are independent of the model architecture, will influence adversarial robustness and generalization. When these factors are controlled for, we introduce a simple baseline approach, termed SimpleAT, that performs competitively with recent methods and mitigates robust overfitting. We conduct extensive experiments on CIFAR-10/100 and Tiny-ImageNet, which validate the robustness of SimpleAT against state-of-the-art adversarial attackers such as AutoAttack. Our results also demonstrate that SimpleAT exhibits good performance in the presence of various image corruptions, such as those found in the CIFAR-10-C. In addition, we empirically show that SimpleAT is capable of reducing the variance in model predictions, which is considered the primary contributor to robust overfitting. Our results also reveal the connections between SimpleAT and many advanced state-of-the-art adversarial defense methods.  ( 2 min )
    Symmetry-preserving graph attention network to solve routing problems at multiple resolutions. (arXiv:2310.15543v2 [cs.LG] UPDATED)
    Travelling Salesperson Problems (TSPs) and Vehicle Routing Problems (VRPs) have achieved reasonable improvement in accuracy and computation time with the adaptation of Machine Learning (ML) methods. However, none of the previous works completely respects the symmetries arising from TSPs and VRPs including rotation, translation, permutation, and scaling. In this work, we introduce the first-ever completely equivariant model and training to solve combinatorial problems. Furthermore, it is essential to capture the multiscale structure (i.e. from local to global information) of the input graph, especially for the cases of large and long-range graphs, while previous methods are limited to extracting only local information that can lead to a local or sub-optimal solution. To tackle the above limitation, we propose a Multiresolution scheme in combination with Equivariant Graph Attention network (mEGAT) architecture, which can learn the optimal route based on low-level and high-level graph resolutions in an efficient way. In particular, our approach constructs a hierarchy of coarse-graining graphs from the input graph, in which we try to solve the routing problems on simple low-level graphs first, then utilize that knowledge for the more complex high-level graphs. Experimentally, we have shown that our model outperforms existing baselines and proved that symmetry preservation and multiresolution are important recipes for solving combinatorial problems in a data-driven manner. Our source code is publicly available at https://github.com/HySonLab/Multires-NP-hard
    Efficient and Accurate Physics-aware Multiplex Graph Neural Networks for 3D Small Molecules and Macromolecule Complexes. (arXiv:2206.02789v3 [q-bio.BM] UPDATED)
    Recent advances in applying Graph Neural Networks (GNNs) to molecular science have showcased the power of learning three-dimensional (3D) structure representations with GNNs. However, most existing GNNs suffer from the limitations of insufficient modeling of diverse interactions, computational expensive operations, and ignorance of vectorial values. Here, we tackle these limitations by proposing a novel GNN model, Physics-aware Multiplex Graph Neural Network (PaxNet), to efficiently and accurately learn the representations of 3D molecules for both small organic compounds and macromolecule complexes. PaxNet separates the modeling of local and non-local interactions inspired by molecular mechanics, and reduces the expensive angle-related computations. Besides scalar properties, PaxNet can also predict vectorial properties by learning an associated vector for each atom. To evaluate the performance of PaxNet, we compare it with state-of-the-art baselines in two tasks. On small molecule dataset for predicting quantum chemical properties, PaxNet reduces the prediction error by 15% and uses 73% less memory than the best baseline. On macromolecule dataset for predicting protein-ligand binding affinities, PaxNet outperforms the best baseline while reducing the memory consumption by 33% and the inference time by 85%. Thus, PaxNet provides a universal, robust and accurate method for large-scale machine learning of molecules. Our code is available at https://github.com/zetayue/Physics-aware-Multiplex-GNN.  ( 3 min )
    Online Combinatorial Linear Optimization via a Frank-Wolfe-based Metarounding Algorithm. (arXiv:2310.12629v2 [cs.DS] UPDATED)
    Metarounding is an approach to convert an approximation algorithm for linear optimization over some combinatorial classes to an online linear optimization algorithm for the same class. We propose a new metarounding algorithm under a natural assumption that a relax-based approximation algorithm exists for the combinatorial class. Our algorithm is much more efficient in both theoretical and practical aspects.  ( 2 min )
    Structural Node Embeddings with Homomorphism Counts. (arXiv:2308.15283v2 [cs.LG] UPDATED)
    Graph homomorphism counts, first explored by Lov\'asz in 1967, have recently garnered interest as a powerful tool in graph-based machine learning. Grohe (PODS 2020) proposed the theoretical foundations for using homomorphism counts in machine learning on graph level as well as node level tasks. By their very nature, these capture local structural information, which enables the creation of robust structural embeddings. While a first approach for graph level tasks has been made by Nguyen and Maehara (ICML 2020), we experimentally show the effectiveness of homomorphism count based node embeddings. Enriched with node labels, node weights, and edge weights, these offer an interpretable representation of graph data, allowing for enhanced explainability of machine learning models. We propose a theoretical framework for isomorphism-invariant homomorphism count based embeddings which lend themselves to a wide variety of downstream tasks. Our approach capitalises on the efficient computability of graph homomorphism counts for bounded treewidth graph classes, rendering it a practical solution for real-world applications. We demonstrate their expressivity through experiments on benchmark datasets. Although our results do not match the accuracy of state-of-the-art neural architectures, they are comparable to other advanced graph learning models. Remarkably, our approach demarcates itself by ensuring explainability for each individual feature. By integrating interpretable machine learning algorithms like SVMs or Random Forests, we establish a seamless, end-to-end explainable pipeline. Our study contributes to the advancement of graph-based techniques that offer both performance and interpretability.  ( 2 min )
    ERUDITE: Human-in-the-Loop IoT for an Adaptive Personalized Learning System. (arXiv:2303.04292v2 [cs.HC] UPDATED)
    Thanks to the rapid growth in wearable technologies and recent advancement in machine learning and signal processing, monitoring complex human contexts becomes feasible, paving the way to develop human-in-the-loop IoT systems that naturally evolve to adapt to the human and environment state autonomously. Nevertheless, a central challenge in designing many of these IoT systems arises from the requirement to infer the human mental state, such as intention, stress, cognition load, or learning ability. While different human contexts can be inferred from the fusion of different sensor modalities that can correlate to a particular mental state, the human brain provides a richer sensor modality that gives us more insights into the required human context. This paper proposes ERUDITE, a human-in-the-loop IoT system for the learning environment that exploits recent wearable neurotechnology to decode brain signals. Through insights from concept learning theory, ERUDITE can infer the human state of learning and understand when human learning increases or declines. By quantifying human learning as an input sensory signal, ERUDITE can provide adequate personalized feedback to humans in a learning environment to enhance their learning experience. ERUDITE is evaluated across $15$ participants and showed that by using the brain signals as a sensor modality to infer the human learning state and providing personalized adaptation to the learning environment, the participants' learning performance increased on average by $26\%$. Furthermore, we showed that ERUDITE can be deployed on an edge-based prototype to evaluate its practicality and scalability.  ( 3 min )
    SURF: A Generalization Benchmark for GNNs Predicting Fluid Dynamics. (arXiv:2310.20049v3 [cs.LG] UPDATED)
    Simulating fluid dynamics is crucial for the design and development process, ranging from simple valves to complex turbomachinery. Accurately solving the underlying physical equations is computationally expensive. Therefore, learning-based solvers that model interactions on meshes have gained interest due to their promising speed-ups. However, it is unknown to what extent these models truly understand the underlying physical principles and can generalize rather than interpolate. Generalization is a key requirement for a general-purpose fluid simulator, which should adapt to different topologies, resolutions, or thermodynamic ranges. We propose SURF, a benchmark designed to test the $\textit{generalization}$ of learned graph-based fluid simulators. SURF comprises individual datasets and provides specific performance and generalization metrics for evaluating and comparing different models. We empirically demonstrate the applicability of SURF by thoroughly investigating the two state-of-the-art graph-based models, yielding new insights into their generalization.  ( 2 min )
    Tracking Most Significant Shifts in Nonparametric Contextual Bandits. (arXiv:2307.05341v2 [stat.ML] UPDATED)
    We study nonparametric contextual bandits where Lipschitz mean reward functions may change over time. We first establish the minimax dynamic regret rate in this less understood setting in terms of number of changes $L$ and total-variation $V$, both capturing all changes in distribution over context space, and argue that state-of-the-art procedures are suboptimal in this setting. Next, we tend to the question of an adaptivity for this setting, i.e. achieving the minimax rate without knowledge of $L$ or $V$. Quite importantly, we posit that the bandit problem, viewed locally at a given context $X_t$, should not be affected by reward changes in other parts of context space $\cal X$. We therefore propose a notion of change, which we term experienced significant shifts, that better accounts for locality, and thus counts considerably less changes than $L$ and $V$. Furthermore, similar to recent work on non-stationary MAB (Suk & Kpotufe, 2022), experienced significant shifts only count the most significant changes in mean rewards, e.g., severe best-arm changes relevant to observed contexts. Our main result is to show that this more tolerant notion of change can in fact be adapted to.  ( 2 min )
    Provably Safe Reinforcement Learning: Conceptual Analysis, Survey, and Benchmarking. (arXiv:2205.06750v3 [cs.LG] UPDATED)
    Ensuring the safety of reinforcement learning (RL) algorithms is crucial to unlock their potential for many real-world tasks. However, vanilla RL and most safe RL approaches do not guarantee safety. In recent years, several methods have been proposed to provide hard safety guarantees for RL, which is essential for applications where unsafe actions could have disastrous consequences. Nevertheless, there is no comprehensive comparison of these provably safe RL methods. Therefore, we introduce a categorization of existing provably safe RL methods, present the conceptual foundations for both continuous and discrete action spaces, and empirically benchmark existing methods. We categorize the methods based on how they adapt the action: action replacement, action projection, and action masking. Our experiments on an inverted pendulum and a quadrotor stabilization task indicate that action replacement is the best-performing approach for these applications despite its comparatively simple realization. Furthermore, adding a reward penalty, every time the safety verification is engaged, improved training performance in our experiments. Finally, we provide practical guidance on selecting provably safe RL approaches depending on the safety specification, RL algorithm, and type of action space.  ( 2 min )
    Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation. (arXiv:2308.15363v4 [cs.DB] UPDATED)
    Large language models (LLMs) have emerged as a new paradigm for Text-to-SQL task. However, the absence of a systematical benchmark inhibits the development of designing effective, efficient and economic LLM-based Text-to-SQL solutions. To address this challenge, in this paper, we first conduct a systematical and extensive comparison over existing prompt engineering methods, including question representation, example selection and example organization, and with these experimental results, we elaborate their pros and cons. Based on these findings, we propose a new integrated solution, named DAIL-SQL, which refreshes the Spider leaderboard with 86.6% execution accuracy and sets a new bar. To explore the potential of open-source LLM, we investigate them in various scenarios, and further enhance their performance with supervised fine-tuning. Our explorations highlight open-source LLMs' potential in Text-to-SQL, as well as the advantages and disadvantages of the supervised fine-tuning. Additionally, towards an efficient and economic LLM-based Text-to-SQL solution, we emphasize the token efficiency in prompt engineering and compare the prior studies under this metric. We hope that our work provides a deeper understanding of Text-to-SQL with LLMs, and inspires further investigations and broad applications.  ( 2 min )
    WATUNet: A Deep Neural Network for Segmentation of Volumetric Sweep Imaging Ultrasound. (arXiv:2311.10857v1 [eess.IV])
    Objective. Limited access to breast cancer diagnosis globally leads to delayed treatment. Ultrasound, an effective yet underutilized method, requires specialized training for sonographers, which hinders its widespread use. Approach. Volume sweep imaging (VSI) is an innovative approach that enables untrained operators to capture high-quality ultrasound images. Combined with deep learning, like convolutional neural networks (CNNs), it can potentially transform breast cancer diagnosis, enhancing accuracy, saving time and costs, and improving patient outcomes. The widely used UNet architecture, known for medical image segmentation, has limitations, such as vanishing gradients and a lack of multi-scale feature extraction and selective region attention. In this study, we present a novel segmentation model known as Wavelet_Attention_UNet (WATUNet). In this model, we incorporate wavelet gates (WGs) and attention gates (AGs) between the encoder and decoder instead of a simple connection to overcome the limitations mentioned, thereby improving model performance. Main results. Two datasets are utilized for the analysis. The public "Breast Ultrasound Images" (BUSI) dataset of 780 images and a VSI dataset of 3818 images. Both datasets contained segmented lesions categorized into three types: no mass, benign mass, and malignant mass. Our segmentation results show superior performance compared to other deep networks. The proposed algorithm attained a Dice coefficient of 0.94 and an F1 score of 0.94 on the VSI dataset and scored 0.93 and 0.94 on the public dataset, respectively.  ( 3 min )
    Language Decision Transformers with Exponential Tilt for Interactive Text Environments. (arXiv:2302.05507v2 [cs.CL] UPDATED)
    Text-based game environments are challenging because agents must deal with long sequences of text, execute compositional actions using text and learn from sparse rewards. We address these challenges by proposing Language Decision Transformers (LDTs), a framework that is based on transformer language models and decision transformers (DTs). Our LDTs extend DTs with 3 components: (1) exponential tilt to guide the agent towards high obtainable goals, (2) novel goal conditioning methods yielding better results than the traditional return-to-go (sum of all future rewards), and (3) a model of future observations that improves agent performance. LDTs are the first to address offline RL with DTs on these challenging games. Our experiments show that LDTs achieve the highest scores among many different types of agents on some of the most challenging Jericho games, such as Enchanter.  ( 2 min )
    Single-Pass Contrastive Learning Can Work for Both Homophilic and Heterophilic Graph. (arXiv:2211.10890v4 [cs.LG] UPDATED)
    Existing graph contrastive learning (GCL) techniques typically require two forward passes for a single instance to construct the contrastive loss, which is effective for capturing the low-frequency signals of node features. Such a dual-pass design has shown empirical success on homophilic graphs, but its effectiveness on heterophilic graphs, where directly connected nodes typically have different labels, is unknown. In addition, existing GCL approaches fail to provide strong performance guarantees. Coupled with the unpredictability of GCL approaches on heterophilic graphs, their applicability in real-world contexts is limited. Then, a natural question arises: Can we design a GCL method that works for both homophilic and heterophilic graphs with a performance guarantee? To answer this question, we theoretically study the concentration property of features obtained by neighborhood aggregation on homophilic and heterophilic graphs, introduce the single-pass augmentation-free graph contrastive learning loss based on the property, and provide performance guarantees for the minimizer of the loss on downstream tasks. As a direct consequence of our analysis, we implement the Single-Pass Graph Contrastive Learning method (SP-GCL). Empirically, on 14 benchmark datasets with varying degrees of homophily, the features learned by the SP-GCL can match or outperform existing strong baselines with significantly less computational overhead, which demonstrates the usefulness of our findings in real-world cases.  ( 3 min )
    Stable Linear Subspace Identification: A Machine Learning Approach. (arXiv:2311.03197v2 [eess.SY] UPDATED)
    Machine Learning (ML) and linear System Identification (SI) have been historically developed independently. In this paper, we leverage well-established ML tools - especially the automatic differentiation framework - to introduce SIMBa, a family of discrete linear multi-step-ahead state-space SI methods using backpropagation. SIMBa relies on a novel Linear-Matrix-Inequality-based free parametrization of Schur matrices to ensure the stability of the identified model. We show how SIMBa generally outperforms traditional linear state-space SI methods, and sometimes significantly, although at the price of a higher computational burden. This performance gap is particularly remarkable compared to other SI methods with stability guarantees, where the gain is frequently above 25% in our investigations, hinting at SIMBa's ability to simultaneously achieve state-of-the-art fitting performance and enforce stability. Interestingly, these observations hold for a wide variety of input-output systems and on both simulated and real-world data, showcasing the flexibility of the proposed approach. We postulate that this new SI paradigm presents a great extension potential to identify structured nonlinear models from data, and we hence open-source SIMBa on https://github.com/Cemempamoi/simba.
    Modified Genetic Algorithm for Feature Selection and Hyper Parameter Optimization: Case of XGBoost in Spam Prediction. (arXiv:2310.19845v1 [cs.LG] CROSS LISTED)
    Recently, spam on online social networks has attracted attention in the research and business world. Twitter has become the preferred medium to spread spam content. Many research efforts attempted to encounter social networks spam. Twitter brought extra challenges represented by the feature space size, and imbalanced data distributions. Usually, the related research works focus on part of these main challenges or produce black-box models. In this paper, we propose a modified genetic algorithm for simultaneous dimensionality reduction and hyper parameter optimization over imbalanced datasets. The algorithm initialized an eXtreme Gradient Boosting classifier and reduced the features space of tweets dataset; to generate a spam prediction model. The model is validated using a 50 times repeated 10-fold stratified cross-validation, and analyzed using nonparametric statistical tests. The resulted prediction model attains on average 82.32\% and 92.67\% in terms of geometric mean and accuracy respectively, utilizing less than 10\% of the total feature space. The empirical results show that the modified genetic algorithm outperforms $Chi^2$ and $PCA$ feature selection methods. In addition, eXtreme Gradient Boosting outperforms many machine learning algorithms, including BERT-based deep learning model, in spam prediction. Furthermore, the proposed approach is applied to SMS spam modeling and compared to related works.
    Online Arbitrary Shaped Clustering through Correlated Gaussian Functions. (arXiv:2302.06335v2 [cs.LG] UPDATED)
    There is no convincing evidence that backpropagation is a biologically plausible mechanism, and further studies of alternative learning methods are needed. A novel online clustering algorithm is presented that can produce arbitrary shaped clusters from inputs in an unsupervised manner, and requires no prior knowledge of the number of clusters in the input data. This is achieved by finding correlated outputs from functions that capture commonly occurring input patterns. The algorithm can be deemed more biologically plausible than model optimization through backpropagation, although practical applicability may require additional research. However, the method yields satisfactory results on several toy datasets on a noteworthy range of hyperparameters.  ( 2 min )
    Mitigating Source Bias for Fairer Weak Supervision. (arXiv:2303.17713v2 [cs.LG] UPDATED)
    Weak supervision enables efficient development of training sets by reducing the need for ground truth labels. However, the techniques that make weak supervision attractive -- such as integrating any source of signal to estimate unknown labels -- also entail the danger that the produced pseudolabels are highly biased. Surprisingly, given everyday use and the potential for increased bias, weak supervision has not been studied from the point of view of fairness. We begin such a study, starting with the observation that even when a fair model can be built from a dataset with access to ground-truth labels, the corresponding dataset labeled via weak supervision can be arbitrarily unfair. To address this, we propose and empirically validate a model for source unfairness in weak supervision, then introduce a simple counterfactual fairness-based technique that can mitigate these biases. Theoretically, we show that it is possible for our approach to simultaneously improve both accuracy and fairness -- in contrast to standard fairness approaches that suffer from tradeoffs. Empirically, we show that our technique improves accuracy on weak supervision baselines by as much as 32\% while reducing demographic parity gap by 82.5\%. A simple extension of our method aimed at maximizing performance produces state-of-the-art performance in five out of ten datasets in the WRENCH benchmark.
    Finding emergence in data: causal emergence inspired dynamics learning. (arXiv:2308.09952v2 [physics.soc-ph] UPDATED)
    Modelling complex dynamical systems in a data-driven manner is challenging due to the presence of emergent behaviors and properties that cannot be directly captured by micro-level observational data. Therefore, it is crucial to develop a model that can effectively capture emergent dynamics at the macro-level and quantify emergence based on the available data. Drawing inspiration from the theory of causal emergence, this paper introduces a machine learning framework aimed at learning macro-dynamics within an emergent latent space. The framework achieves this by maximizing the effective information (EI) to obtain a macro-dynamics model with stronger causal effects. Experimental results on both simulated and real data demonstrate the effectiveness of the proposed framework. Not only does it successfully capture emergent patterns, but it also learns the coarse-graining strategy and quantifies the degree of causal emergence in the data. Furthermore, experiments conducted on environments different from the training dataset highlight the superior generalization ability of our model.
    Exposition on over-squashing problem on GNNs: Current Methods, Benchmarks and Challenges. (arXiv:2311.07073v2 [cs.LG] UPDATED)
    Graph-based message-passing neural networks (MPNNs) have achieved remarkable success in both node and graph-level learning tasks. However, several identified problems, including over-smoothing (OSM), limited expressive power, and over-squashing (OSQ), still limit the performance of MPNNs. In particular, OSQ serves as the latest identified problem, where MPNNs gradually lose their learning accuracy when long-range dependencies between graph nodes are required. In this work, we provide an exposition on the OSQ problem by summarizing different formulations of OSQ from current literature, as well as the three different categories of approaches for addressing the OSQ problem. In addition, we also discuss the alignment between OSQ and expressive power and the trade-off between OSQ and OSM. Furthermore, we summarize the empirical methods leveraged from existing works to verify the efficiency of OSQ mitigation approaches, with illustrations of their computational complexities. Lastly, we list some open questions that are of interest for further exploration of the OSQ problem along with potential directions from the best of our knowledge.
    Attribution Patching Outperforms Automated Circuit Discovery. (arXiv:2310.10348v2 [cs.LG] UPDATED)
    Automated interpretability research has recently attracted attention as a potential research direction that could scale explanations of neural network behavior to large models. Existing automated circuit discovery work applies activation patching to identify subnetworks responsible for solving specific tasks (circuits). In this work, we show that a simple method based on attribution patching outperforms all existing methods while requiring just two forward passes and a backward pass. We apply a linear approximation to activation patching to estimate the importance of each edge in the computational subgraph. Using this approximation, we prune the least important edges of the network. We survey the performance and limitations of this method, finding that averaged over all tasks our method has greater AUC from circuit recovery than other methods.
    Graph Attention-based Deep Reinforcement Learning for solving the Chinese Postman Problem with Load-dependent costs. (arXiv:2310.15516v2 [cs.LG] UPDATED)
    Recently, Deep reinforcement learning (DRL) models have shown promising results in solving routing problems. However, most DRL solvers are commonly proposed to solve node routing problems, such as the Traveling Salesman Problem (TSP). Meanwhile, there has been limited research on applying neural methods to arc routing problems, such as the Chinese Postman Problem (CPP), since they often feature irregular and complex solution spaces compared to TSP. To fill these gaps, this paper proposes a novel DRL framework to address the CPP with load-dependent costs (CPP-LC) (Corberan et al., 2018), which is a complex arc routing problem with load constraints. The novelty of our method is two-fold. First, we formulate the CPP-LC as a Markov Decision Process (MDP) sequential model. Subsequently, we introduce an autoregressive model based on DRL, namely Arc-DRL, consisting of an encoder and decoder to address the CPP-LC challenge effectively. Such a framework allows the DRL model to work efficiently and scalably to arc routing problems. Furthermore, we propose a new bio-inspired meta-heuristic solution based on Evolutionary Algorithm (EA) for CPP-LC. Extensive experiments show that Arc-DRL outperforms existing meta-heuristic methods such as Iterative Local Search (ILS) and Variable Neighborhood Search (VNS) proposed by (Corberan et al., 2018) on large benchmark datasets for CPP-LC regarding both solution quality and running time; while the EA gives the best solution quality with much more running time. We release our C++ implementations for metaheuristics such as EA, ILS and VNS along with the code for data generation and our generated data at https://github.com/HySonLab/Chinese_Postman_Problem
    Finite element inspired networks: Learning interpretable deformable object dynamics from partial observations. (arXiv:2307.07975v2 [cs.RO] UPDATED)
    Accurate simulation of deformable linear object (DLO) dynamics is challenging if the task at hand requires a human-interpretable model that also yields fast predictions. To arrive at such a model, we draw inspiration from the rigid finite element method (R-FEM) and model a DLO as a serial chain of rigid bodies whose internal state is unrolled through time by a dynamics network. As this state is not observed directly, the dynamics network is trained jointly with a physics-informed encoder which maps observed motion variables to the DLO's hidden state. To encourage that the state acquires a physically meaningful representation, we leverage the forward kinematics of the underlying R-FEM model as a decoder. Through robot experiments we demonstrate that the proposed architecture provides an easy-to-handle, yet capable DLO dynamics model yielding physically interpretable predictions from partial observations. The project code is available at: \url{https://tinyurl.com/fei-networks}
    Correspondence learning between morphologically different robots via task demonstrations. (arXiv:2310.13458v2 [cs.RO] UPDATED)
    We observe a large variety of robots in terms of their bodies, sensors, and actuators. Given the commonalities in the skill sets, teaching each skill to each different robot independently is inefficient and not scalable when the large variety in the robotic landscape is considered. If we can learn the correspondences between the sensorimotor spaces of different robots, we can expect a skill that is learned in one robot can be more directly and easily transferred to other robots. In this paper, we propose a method to learn correspondences among two or more robots that may have different morphologies. To be specific, besides robots with similar morphologies with different degrees of freedom, we show that a fixed-based manipulator robot with joint control and a differential drive mobile robot can be addressed within the proposed framework. To set up the correspondence among the robots considered, an initial base task is demonstrated to the robots to achieve the same goal. Then, a common latent representation is learned along with the individual robot policies for achieving the goal. After the initial learning stage, the observation of a new task execution by one robot becomes sufficient to generate a latent space representation pertaining to the other robots to achieve the same task. We verified our system in a set of experiments where the correspondence between robots is learned (1) when the robots need to follow the same paths to achieve the same task, (2) when the robots need to follow different trajectories to achieve the same task, and (3) when complexities of the required sensorimotor trajectories are different for the robots. We also provide a proof-of-the-concept realization of correspondence learning between a real manipulator robot and a simulated mobile robot.
    StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models. (arXiv:2306.07691v2 [eess.AS] UPDATED)
    In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis. StyleTTS 2 differs from its predecessor by modeling styles as a latent random variable through diffusion models to generate the most suitable style for the text without requiring reference speech, achieving efficient latent diffusion while benefiting from the diverse speech synthesis offered by diffusion models. Furthermore, we employ large pre-trained SLMs, such as WavLM, as discriminators with our novel differentiable duration modeling for end-to-end training, resulting in improved speech naturalness. StyleTTS 2 surpasses human recordings on the single-speaker LJSpeech dataset and matches it on the multispeaker VCTK dataset as judged by native English speakers. Moreover, when trained on the LibriTTS dataset, our model outperforms previous publicly available models for zero-shot speaker adaptation. This work achieves the first human-level TTS on both single and multispeaker datasets, showcasing the potential of style diffusion and adversarial training with large SLMs. The audio demos and source code are available at https://styletts2.github.io/.
    GQFedWAvg: Optimization-Based Quantized Federated Learning in General Edge Computing Systems. (arXiv:2306.07497v2 [cs.LG] UPDATED)
    The optimal implementation of federated learning (FL) in practical edge computing systems has been an outstanding problem. In this paper, we propose an optimization-based quantized FL algorithm, which can appropriately fit a general edge computing system with uniform or nonuniform computing and communication resources at the workers. Specifically, we first present a new random quantization scheme and analyze its properties. Then, we propose a general quantized FL algorithm, namely GQFedWAvg. Specifically, GQFedWAvg applies the proposed quantization scheme to quantize wisely chosen model update-related vectors and adopts a generalized mini-batch stochastic gradient descent (SGD) method with the weighted average local model updates in global model aggregation. Besides, GQFedWAvg has several adjustable algorithm parameters to flexibly adapt to the computing and communication resources at the server and workers. We also analyze the convergence of GQFedWAvg. Next, we optimize the algorithm parameters of GQFedWAvg to minimize the convergence error under the time and energy constraints. We successfully tackle the challenging non-convex problem using general inner approximation (GIA) and multiple delicate tricks. Finally, we interpret GQFedWAvg's function principle and show its considerable gains over existing FL algorithms using numerical results.
    Generating Medical Prescriptions with Conditional Transformer. (arXiv:2310.19727v2 [cs.CL] UPDATED)
    Access to real-world medication prescriptions is essential for medical research and healthcare quality improvement. However, access to real medication prescriptions is often limited due to the sensitive nature of the information expressed. Additionally, manually labelling these instructions for training and fine-tuning Natural Language Processing (NLP) models can be tedious and expensive. We introduce a novel task-specific model architecture, Label-To-Text-Transformer (\textbf{LT3}), tailored to generate synthetic medication prescriptions based on provided labels, such as a vocabulary list of medications and their attributes. LT3 is trained on a set of around 2K lines of medication prescriptions extracted from the MIMIC-III database, allowing the model to produce valuable synthetic medication prescriptions. We evaluate LT3's performance by contrasting it with a state-of-the-art Pre-trained Language Model (PLM), T5, analysing the quality and diversity of generated texts. We deploy the generated synthetic data to train the SpacyNER model for the Named Entity Recognition (NER) task over the n2c2-2018 dataset. The experiments show that the model trained on synthetic data can achieve a 96-98\% F1 score at Label Recognition on Drug, Frequency, Route, Strength, and Form. LT3 codes and data will be shared at \url{https://github.com/HECTA-UoM/Label-To-Text-Transformer}
    Provably Efficient Iterated CVaR Reinforcement Learning with Function Approximation. (arXiv:2307.02842v2 [cs.LG] UPDATED)
    Risk-sensitive reinforcement learning (RL) aims to optimize policies that balance the expected reward and risk. In this paper, we investigate a novel risk-sensitive RL formulation with an Iterated Conditional Value-at-Risk (CVaR) objective under linear and general function approximations. This new formulation, named ICVaR-RL with function approximation, provides a principled way to guarantee safety at each decision step. For ICVaR-RL with linear function approximation, we propose a computationally efficient algorithm ICVaR-L, which achieves an $\widetilde{O}(\sqrt{\alpha^{-(H+1)}(d^2H^4+dH^6)K})$ regret, where $\alpha$ is the risk level, $d$ is the dimension of state-action features, $H$ is the length of each episode, and $K$ is the number of episodes. We also establish a matching lower bound $\Omega(\sqrt{\alpha^{-(H-1)}d^2K})$ to validate the optimality of ICVaR-L with respect to $d$ and $K$. For ICVaR-RL with general function approximation, we propose algorithm ICVaR-G, which achieves an $\widetilde{O}(\sqrt{\alpha^{-(H+1)}DH^4K})$ regret, where $D$ is a dimensional parameter that depends on the eluder dimension and covering number. Furthermore, our analysis provides several novel techniques for risk-sensitive RL, including an efficient approximation of the CVaR operator, a new ridge regression with CVaR-adapted features, and a refined elliptical potential lemma.
    A benchmark of categorical encoders for binary classification. (arXiv:2307.09191v3 [cs.LG] UPDATED)
    Categorical encoders transform categorical features into numerical representations that are indispensable for a wide range of machine learning models. Existing encoder benchmark studies lack generalizability because of their limited choice of (1) encoders, (2) experimental factors, and (3) datasets. Additionally, inconsistencies arise from the adoption of varying aggregation strategies. This paper is the most comprehensive benchmark of categorical encoders to date, including an extensive evaluation of 32 configurations of encoders from diverse families, with 36 combinations of experimental factors, and on 50 datasets. The study shows the profound influence of dataset selection, experimental factors, and aggregation strategies on the benchmark's conclusions -- aspects disregarded in previous encoder benchmarks.
    StyleTTS: A Style-Based Generative Model for Natural and Diverse Text-to-Speech Synthesis. (arXiv:2205.15439v2 [eess.AS] UPDATED)
    Text-to-Speech (TTS) has recently seen great progress in synthesizing high-quality speech owing to the rapid development of parallel TTS systems, but producing speech with naturalistic prosodic variations, speaking styles and emotional tones remains challenging. Moreover, since duration and speech are generated separately, parallel TTS models still have problems finding the best monotonic alignments that are crucial for naturalistic speech synthesis. Here, we propose StyleTTS, a style-based generative model for parallel TTS that can synthesize diverse speech with natural prosody from a reference speech utterance. With novel Transferable Monotonic Aligner (TMA) and duration-invariant data augmentation schemes, our method significantly outperforms state-of-the-art models on both single and multi-speaker datasets in subjective tests of speech naturalness and speaker similarity. Through self-supervised learning of the speaking styles, our model can synthesize speech with the same prosodic and emotional tone as any given reference speech without the need for explicitly labeling these categories.
    Reflection-Equivariant Diffusion for 3D Structure Determination from Isotopologue Rotational Spectra in Natural Abundance. (arXiv:2310.11609v2 [cs.LG] UPDATED)
    Structure determination is necessary to identify unknown organic molecules, such as those in natural products, forensic samples, the interstellar medium, and laboratory syntheses. Rotational spectroscopy enables structure determination by providing accurate 3D information about small organic molecules via their moments of inertia. Using these moments, Kraitchman analysis determines isotopic substitution coordinates, which are the unsigned $|x|,|y|,|z|$ coordinates of all atoms with natural isotopic abundance, including carbon, nitrogen, and oxygen. While unsigned substitution coordinates can verify guesses of structures, the missing $+/-$ signs make it challenging to determine the actual structure from the substitution coordinates alone. To tackle this inverse problem, we develop KREED (Kraitchman REflection-Equivariant Diffusion), a generative diffusion model that infers a molecule's complete 3D structure from its molecular formula, moments of inertia, and unsigned substitution coordinates of heavy atoms. KREED's top-1 predictions identify the correct 3D structure with >98% accuracy on the QM9 and GEOM datasets when provided with substitution coordinates of all heavy atoms with natural isotopic abundance. When substitution coordinates are restricted to only a subset of carbons, accuracy is retained at 91% on QM9 and 32% on GEOM. On a test set of experimentally measured substitution coordinates gathered from the literature, KREED predicts the correct all-atom 3D structure in 25 of 33 cases, demonstrating experimental applicability for context-free 3D structure determination with rotational spectroscopy.
    Balanced Training for Sparse GANs. (arXiv:2302.14670v2 [cs.LG] UPDATED)
    Over the past few years, there has been growing interest in developing larger and deeper neural networks, including deep generative models like generative adversarial networks (GANs). However, GANs typically come with high computational complexity, leading researchers to explore methods for reducing the training and inference costs. One such approach gaining popularity in supervised learning is dynamic sparse training (DST), which maintains good performance while enjoying excellent training efficiency. Despite its potential benefits, applying DST to GANs presents challenges due to the adversarial nature of the training process. In this paper, we propose a novel metric called the balance ratio (BR) to study the balance between the sparse generator and discriminator. We also introduce a new method called balanced dynamic sparse training (ADAPT), which seeks to control the BR during GAN training to achieve a good trade-off between performance and computational cost. Our proposed method shows promising results on multiple datasets, demonstrating its effectiveness.
    Graph-Based Methods for Discrete Choice. (arXiv:2205.11365v2 [cs.LG] UPDATED)
    Choices made by individuals have widespread impacts--for instance, people choose between political candidates to vote for, between social media posts to share, and between brands to purchase--moreover, data on these choices are increasingly abundant. Discrete choice models are a key tool for learning individual preferences from such data. Additionally, social factors like conformity and contagion influence individual choice. Traditional methods for incorporating these factors into choice models do not account for the entire social network and require hand-crafted features. To overcome these limitations, we use graph learning to study choice in networked contexts. We identify three ways in which graph learning techniques can be used for discrete choice: learning chooser representations, regularizing choice model parameters, and directly constructing predictions from a network. We design methods in each category and test them on real-world choice datasets, including county-level 2016 US election results and Android app installation and usage data. We show that incorporating social network structure can improve the predictions of the standard econometric choice model, the multinomial logit. We provide evidence that app installations are influenced by social context, but we find no such effect on app usage among the same participants, which instead is habit-driven. In the election data, we highlight the additional insights a discrete choice framework provides over classification or regression, the typical approaches. On synthetic data, we demonstrate the sample complexity benefit of using social information in choice models.
    PopArt: Efficient Sparse Regression and Experimental Design for Optimal Sparse Linear Bandits. (arXiv:2210.15345v3 [stat.ML] UPDATED)
    In sparse linear bandits, a learning agent sequentially selects an action and receive reward feedback, and the reward function depends linearly on a few coordinates of the covariates of the actions. This has applications in many real-world sequential decision making problems. In this paper, we propose a simple and computationally efficient sparse linear estimation method called PopArt that enjoys a tighter $\ell_1$ recovery guarantee compared to Lasso (Tibshirani, 1996) in many problems. Our bound naturally motivates an experimental design criterion that is convex and thus computationally efficient to solve. Based on our novel estimator and design criterion, we derive sparse linear bandit algorithms that enjoy improved regret upper bounds upon the state of the art (Hao et al., 2020), especially w.r.t. the geometry of the given action set. Finally, we prove a matching lower bound for sparse linear bandits in the data-poor regime, which closes the gap between upper and lower bounds in prior work.
    SORTAD: Self-Supervised Optimized Random Transformations for Anomaly Detection in Tabular Data. (arXiv:2311.11018v1 [cs.LG])
    We consider a self-supervised approach to anomaly detection in tabular data. Random transformations are applied to the data, and then each transformation is identified based on its output. These predicted transformations are used to identify anomalies. In tabular data this approach faces many challenges that are related to the uncorrelated nature of the data. These challenges affect the transformations that should be used, as well as the use of their predictions. To this end, we propose SORTAD, a novel algorithm that is tailor-made to solve these challenges. SORTAD optimally chooses random transformations that help the classification process, and have a scoring function that is more sensitive to the changes in the transformations classification prediction encountered in tabular data. SORTAD achieved state-of-the-art results on multiple commonly used anomaly detection data sets, as well as in the overall results across all data sets tested.
    Geometric Algebra Transformer. (arXiv:2305.18415v3 [cs.LG] UPDATED)
    Problems involving geometric data arise in physics, chemistry, robotics, computer vision, and many other fields. Such data can take numerous forms, for instance points, direction vectors, translations, or rotations, but to date there is no single architecture that can be applied to such a wide variety of geometric types while respecting their symmetries. In this paper we introduce the Geometric Algebra Transformer (GATr), a general-purpose architecture for geometric data. GATr represents inputs, outputs, and hidden states in the projective geometric (or Clifford) algebra, which offers an efficient 16-dimensional vector-space representation of common geometric objects as well as operators acting on them. GATr is equivariant with respect to E(3), the symmetry group of 3D Euclidean space. As a Transformer, GATr is versatile, efficient, and scalable. We demonstrate GATr in problems from n-body modeling to wall-shear-stress estimation on large arterial meshes to robotic motion planning. GATr consistently outperforms both non-geometric and equivariant baselines in terms of error, data efficiency, and scalability.  ( 2 min )
    Knowledge Augmented Machine Learning with Applications in Autonomous Driving: A Survey. (arXiv:2205.04712v3 [cs.LG] UPDATED)
    The availability of representative datasets is an essential prerequisite for many successful artificial intelligence and machine learning models. However, in real life applications these models often encounter scenarios that are inadequately represented in the data used for training. There are various reasons for the absence of sufficient data, ranging from time and cost constraints to ethical considerations. As a consequence, the reliable usage of these models, especially in safety-critical applications, is still a tremendous challenge. Leveraging additional, already existing sources of knowledge is key to overcome the limitations of purely data-driven approaches. Knowledge augmented machine learning approaches offer the possibility of compensating for deficiencies, errors, or ambiguities in the data, thus increasing the generalization capability of the applied models. Even more, predictions that conform with knowledge are crucial for making trustworthy and safe decisions even in underrepresented scenarios. This work provides an overview of existing techniques and methods in the literature that combine data-driven models with existing knowledge. The identified approaches are structured according to the categories knowledge integration, extraction and conformity. In particular, we address the application of the presented methods in the field of autonomous driving.
    Hyperbolic Space with Hierarchical Margin Boosts Fine-Grained Learning from Coarse Labels. (arXiv:2311.11019v1 [cs.CV])
    Learning fine-grained embeddings from coarse labels is a challenging task due to limited label granularity supervision, i.e., lacking the detailed distinctions required for fine-grained tasks. The task becomes even more demanding when attempting few-shot fine-grained recognition, which holds practical significance in various applications. To address these challenges, we propose a novel method that embeds visual embeddings into a hyperbolic space and enhances their discriminative ability with a hierarchical cosine margins manner. Specifically, the hyperbolic space offers distinct advantages, including the ability to capture hierarchical relationships and increased expressive power, which favors modeling fine-grained objects. Based on the hyperbolic space, we further enforce relatively large/small similarity margins between coarse/fine classes, respectively, yielding the so-called hierarchical cosine margins manner. While enforcing similarity margins in the regular Euclidean space has become popular for deep embedding learning, applying it to the hyperbolic space is non-trivial and validating the benefit for coarse-to-fine generalization is valuable. Extensive experiments conducted on five benchmark datasets showcase the effectiveness of our proposed method, yielding state-of-the-art results surpassing competing methods.
    On Momentum-Based Gradient Methods for Bilevel Optimization with Nonconvex Lower-Level. (arXiv:2303.03944v4 [math.OC] UPDATED)
    Bilevel optimization is a popular two-level hierarchical optimization, which has been widely applied to many machine learning tasks such as hyperparameter learning, meta learning and continual learning. Although many bilevel optimization methods recently have been developed, the bilevel methods are not well studied when the lower-level problem is nonconvex. To fill this gap, in the paper, we study a class of nonconvex bilevel optimization problems, where both upper-level and lower-level problems are nonconvex, and the lower-level problem satisfies Polyak-{\L}ojasiewicz (PL) condition. We propose an efficient momentum-based gradient bilevel method (MGBiO) to solve these deterministic problems. Meanwhile, we propose a class of efficient momentum-based stochastic gradient bilevel methods (MSGBiO and VR-MSGBiO) to solve these stochastic problems. Moreover, we provide a useful convergence analysis framework for our methods. Specifically, under some mild conditions, we prove that our MGBiO method has a sample (or gradient) complexity of $O(\epsilon^{-2})$ for finding an $\epsilon$-stationary solution of the deterministic bilevel problems (i.e., $\|\nabla F(x)\|\leq \epsilon$), which improves the existing best results by a factor of $O(\epsilon^{-1})$. Meanwhile, we prove that our MSGBiO and VR-MSGBiO methods have sample complexities of $\tilde{O}(\epsilon^{-4})$ and $\tilde{O}(\epsilon^{-3})$, respectively, in finding an $\epsilon$-stationary solution of the stochastic bilevel problems (i.e., $\mathbb{E}\|\nabla F(x)\|\leq \epsilon$), which improves the existing best results by a factor of $\tilde{O}(\epsilon^{-3})$. Extensive experimental results on bilevel PL game and hyper-representation learning demonstrate the efficiency of our algorithms. This paper commemorates the mathematician Boris Polyak (1935 -2023).
    Efficient learning of nonlinear prediction models with time-series privileged information. (arXiv:2209.07067v4 [cs.LG] UPDATED)
    In domains where sample sizes are limited, efficient learning algorithms are critical. Learning using privileged information (LuPI) offers increased sample efficiency by allowing prediction models access to auxiliary information at training time which is unavailable when the models are used. In recent work, it was shown that for prediction in linear-Gaussian dynamical systems, a LuPI learner with access to intermediate time series data is never worse and often better in expectation than any unbiased classical learner. We provide new insights into this analysis and generalize it to nonlinear prediction tasks in latent dynamical systems, extending theoretical guarantees to the case where the map connecting latent variables and observations is known up to a linear transform. In addition, we propose algorithms based on random features and representation learning for the case when this map is unknown. A suite of empirical results confirm theoretical findings and show the potential of using privileged time-series information in nonlinear prediction.
    A Whole New Ball Game: A Primal Accelerated Method for Matrix Games and Minimizing the Maximum of Smooth Functions. (arXiv:2311.10886v1 [cs.DS])
    We design algorithms for minimizing $\max_{i\in[n]} f_i(x)$ over a $d$-dimensional Euclidean or simplex domain. When each $f_i$ is $1$-Lipschitz and $1$-smooth, our method computes an $\epsilon$-approximate solution using $\widetilde{O}(n \epsilon^{-1/3} + \epsilon^{-2})$ gradient and function evaluations, and $\widetilde{O}(n \epsilon^{-4/3})$ additional runtime. For large $n$, our evaluation complexity is optimal up to polylogarithmic factors. In the special case where each $f_i$ is linear -- which corresponds to finding a near-optimal primal strategy in a matrix game -- our method finds an $\epsilon$-approximate solution in runtime $\widetilde{O}(n (d/\epsilon)^{2/3} + nd + d\epsilon^{-2})$. For $n>d$ and $\epsilon=1/\sqrt{n}$ this improves over all existing first-order methods. When additionally $d = \omega(n^{8/11})$ our runtime also improves over all known interior point methods. Our algorithm combines three novel primitives: (1) A dynamic data structure which enables efficient stochastic gradient estimation in small $\ell_2$ or $\ell_1$ balls. (2) A mirror descent algorithm tailored to our data structure implementing an oracle which minimizes the objective over these balls. (3) A simple ball oracle acceleration framework suitable for non-Euclidean geometry.  ( 2 min )
    Model-adapted Fourier sampling for generative compressed sensing. (arXiv:2310.04984v2 [cs.IT] UPDATED)
    We study generative compressed sensing when the measurement matrix is randomly subsampled from a unitary matrix (with the DFT as an important special case). It was recently shown that $\textit{O}(kdn\| \boldsymbol{\alpha}\|_{\infty}^{2})$ uniformly random Fourier measurements are sufficient to recover signals in the range of a neural network $G:\mathbb{R}^k \to \mathbb{R}^n$ of depth $d$, where each component of the so-called local coherence vector $\boldsymbol{\alpha}$ quantifies the alignment of a corresponding Fourier vector with the range of $G$. We construct a model-adapted sampling strategy with an improved sample complexity of $\textit{O}(kd\| \boldsymbol{\alpha}\|_{2}^{2})$ measurements. This is enabled by: (1) new theoretical recovery guarantees that we develop for nonuniformly random sampling distributions and then (2) optimizing the sampling distribution to minimize the number of measurements needed for these guarantees. This development offers a sample complexity applicable to natural signal classes, which are often almost maximally coherent with low Fourier frequencies. Finally, we consider a surrogate sampling scheme, and validate its performance in recovery experiments using the CelebA dataset.
    Channel and Gradient-Importance Aware Device Scheduling for Over-the-Air Federated Learning. (arXiv:2305.16854v2 [cs.LG] UPDATED)
    Federated learning (FL) is a popular privacy-preserving distributed training scheme, where multiple devices collaborate to train machine learning models by uploading local model updates. To improve communication efficiency, over-the-air computation (AirComp) has been applied to FL, which leverages analog modulation to harness the superposition property of radio waves such that numerous devices can upload their model updates concurrently for aggregation. However, the uplink channel noise incurs considerable model aggregation distortion, which is critically determined by the device scheduling and compromises the learned model performance. In this paper, we propose a probabilistic device scheduling framework for over-the-air FL, named PO-FL, to mitigate the negative impact of channel noise, where each device is scheduled according to a certain probability and its model update is reweighted using this probability in aggregation. We prove the unbiasedness of this aggregation scheme and demonstrate the convergence of PO-FL on both convex and non-convex loss functions. Our convergence bounds unveil that the device scheduling affects the learning performance through the communication distortion and global update variance. Based on the convergence analysis, we further develop a channel and gradient-importance aware algorithm to optimize the device scheduling probabilities in PO-FL. Extensive simulation results show that the proposed PO-FL framework with channel and gradient-importance awareness achieves faster convergence and produces better models than baseline methods.  ( 3 min )
    Bridging Data-Driven and Knowledge-Driven Approaches for Safety-Critical Scenario Generation in Automated Vehicle Validation. (arXiv:2311.10937v1 [cs.LG])
    Automated driving vehicles~(ADV) promise to enhance driving efficiency and safety, yet they face intricate challenges in safety-critical scenarios. As a result, validating ADV within generated safety-critical scenarios is essential for both development and performance evaluations. This paper investigates the complexities of employing two major scenario-generation solutions: data-driven and knowledge-driven methods. Data-driven methods derive scenarios from recorded datasets, efficiently generating scenarios by altering the existing behavior or trajectories of traffic participants but often falling short in considering ADV perception; knowledge-driven methods provide effective coverage through expert-designed rules, but they may lead to inefficiency in generating safety-critical scenarios within that coverage. To overcome these challenges, we introduce BridgeGen, a safety-critical scenario generation framework, designed to bridge the benefits of both methodologies. Specifically, by utilizing ontology-based techniques, BridgeGen models the five scenario layers in the operational design domain (ODD) from knowledge-driven methods, ensuring broad coverage, and incorporating data-driven strategies to efficiently generate safety-critical scenarios. An optimized scenario generation toolkit is developed within BridgeGen. This expedites the crafting of safety-critical scenarios through a combination of traditional optimization and reinforcement learning schemes. Extensive experiments conducted using Carla simulator demonstrate the effectiveness of BridgeGen in generating diverse safety-critical scenarios.
    DenseNet and Support Vector Machine classifications of major depressive disorder using vertex-wise cortical features. (arXiv:2311.11046v1 [q-bio.QM])
    Major depressive disorder (MDD) is a complex psychiatric disorder that affects the lives of hundreds of millions of individuals around the globe. Even today, researchers debate if morphological alterations in the brain are linked to MDD, likely due to the heterogeneity of this disorder. The application of deep learning tools to neuroimaging data, capable of capturing complex non-linear patterns, has the potential to provide diagnostic and predictive biomarkers for MDD. However, previous attempts to demarcate MDD patients and healthy controls (HC) based on segmented cortical features via linear machine learning approaches have reported low accuracies. In this study, we used globally representative data from the ENIGMA-MDD working group containing an extensive sample of people with MDD (N=2,772) and HC (N=4,240), which allows a comprehensive analysis with generalizable results. Based on the hypothesis that integration of vertex-wise cortical features can improve classification performance, we evaluated the classification of a DenseNet and a Support Vector Machine (SVM), with the expectation that the former would outperform the latter. As we analyzed a multi-site sample, we additionally applied the ComBat harmonization tool to remove potential nuisance effects of site. We found that both classifiers exhibited close to chance performance (balanced accuracy DenseNet: 51%; SVM: 53%), when estimated on unseen sites. Slightly higher classification performance (balanced accuracy DenseNet: 58%; SVM: 55%) was found when the cross-validation folds contained subjects from all sites, indicating site effect. In conclusion, the integration of vertex-wise morphometric features and the use of the non-linear classifier did not lead to the differentiability between MDD and HC. Our results support the notion that MDD classification on this combination of features and classifiers is unfeasible.
    Gates Are Not What You Need in RNNs. (arXiv:2108.00527v2 [cs.LG] UPDATED)
    Recurrent neural networks have flourished in many areas. Consequently, we can see new RNN cells being developed continuously, usually by creating or using gates in a new, original way. But what if we told you that gates in RNNs are redundant? In this paper, we propose a new recurrent cell called Residual Recurrent Unit (RRU) which beats traditional cells and does not employ a single gate. It is based on the residual shortcut connection, linear transformations, ReLU, and normalization. To evaluate our cell's effectiveness, we compare its performance against the widely-used GRU and LSTM cells and the recently proposed Mogrifier LSTM on several tasks including, polyphonic music modeling, language modeling, and sentiment analysis. Our experiments show that RRU outperforms the traditional gated units on most of these tasks. Also, it has better robustness to parameter selection, allowing immediate application in new tasks without much tuning. We have implemented the RRU in TensorFlow, and the code is made available at https://github.com/LUMII-Syslab/RRU .  ( 2 min )
    INSPECT: A Multimodal Dataset for Pulmonary Embolism Diagnosis and Prognosis. (arXiv:2311.10798v1 [cs.LG])
    Synthesizing information from multiple data sources plays a crucial role in the practice of modern medicine. Current applications of artificial intelligence in medicine often focus on single-modality data due to a lack of publicly available, multimodal medical datasets. To address this limitation, we introduce INSPECT, which contains de-identified longitudinal records from a large cohort of patients at risk for pulmonary embolism (PE), along with ground truth labels for multiple outcomes. INSPECT contains data from 19,402 patients, including CT images, radiology report impression sections, and structured electronic health record (EHR) data (i.e. demographics, diagnoses, procedures, vitals, and medications). Using INSPECT, we develop and release a benchmark for evaluating several baseline modeling approaches on a variety of important PE related tasks. We evaluate image-only, EHR-only, and multimodal fusion models. Trained models and the de-identified dataset are made available for non-commercial use under a data use agreement. To the best of our knowledge, INSPECT is the largest multimodal dataset integrating 3D medical imaging and EHR for reproducible methods evaluation and research.  ( 2 min )
    Benchmarking Robustness of Adaptation Methods on Pre-trained Vision-Language Models. (arXiv:2306.02080v3 [cs.CV] UPDATED)
    Various adaptation methods, such as LoRA, prompts, and adapters, have been proposed to enhance the performance of pre-trained vision-language models in specific domains. The robustness of these adaptation methods against distribution shifts have not been studied. In this study, we assess the robustness of 11 widely-used adaptation methods across 4 vision-language datasets under multimodal corruptions. Concretely, we introduce 7 benchmark datasets, including 96 visual and 87 textual corruptions, to investigate the robustness of different adaptation methods, the impact of available adaptation examples, and the influence of trainable parameter size during adaptation. Our analysis reveals that: 1) Adaptation methods are more sensitive to text corruptions than visual corruptions. 2) Full fine-tuning does not consistently provide the highest robustness; instead, adapters can achieve better robustness with comparable clean performance. 3) Contrary to expectations, our findings indicate that increasing the number of adaptation data and parameters does not guarantee enhanced robustness; instead it results in even lower robustness. We hope this study could benefit future research in the development of robust multimodal adaptation methods. The benchmark, code, and dataset used in this study can be accessed at https://adarobustness.github.io .  ( 2 min )
    A Variational Autoencoder for Heterogeneous Temporal and Longitudinal Data. (arXiv:2204.09369v2 [cs.LG] UPDATED)
    The variational autoencoder (VAE) is a popular deep latent variable model used to analyse high-dimensional datasets by learning a low-dimensional latent representation of the data. It simultaneously learns a generative model and an inference network to perform approximate posterior inference. Recently proposed extensions to VAEs that can handle temporal and longitudinal data have applications in healthcare, behavioural modelling, and predictive maintenance. However, these extensions do not account for heterogeneous data (i.e., data comprising of continuous and discrete attributes), which is common in many real-life applications. In this work, we propose the heterogeneous longitudinal VAE (HL-VAE) that extends the existing temporal and longitudinal VAEs to heterogeneous data. HL-VAE provides efficient inference for high-dimensional datasets and includes likelihood models for continuous, count, categorical, and ordinal data while accounting for missing observations. We demonstrate our model's efficacy through simulated as well as clinical datasets, and show that our proposed model achieves competitive performance in missing value imputation and predictive accuracy.  ( 2 min )
    Towards a Standardized Reinforcement Learning Framework for AAM Contingency Management. (arXiv:2311.10805v1 [cs.LG])
    Advanced Air Mobility (AAM) is the next generation of air transportation that includes new entrants such as electric vertical takeoff and landing (eVTOL) aircraft, increasingly autonomous flight operations, and small UAS package delivery. With these new vehicles and operational concepts comes a desire to increase densities far beyond what occurs today in and around urban areas, to utilize new battery technology, and to move toward more autonomously-piloted aircraft. To achieve these goals, it becomes essential to introduce new safety management system capabilities that can rapidly assess risk as it evolves across a span of complex hazards and, if necessary, mitigate risk by executing appropriate contingencies via supervised or automated decision-making during flights. Recently, reinforcement learning has shown promise for real-time decision making across a wide variety of applications including contingency management. In this work, we formulate the contingency management problem as a Markov Decision Process (MDP) and integrate the contingency management MDP into the AAM-Gym simulation framework. This enables rapid prototyping of reinforcement learning algorithms and evaluation of existing systems, thus providing a community benchmark for future algorithm development. We report baseline statistical information for the environment and provide example performance metrics.  ( 2 min )
    Learning Task Embeddings for Teamwork Adaptation in Multi-Agent Reinforcement Learning. (arXiv:2207.02249v2 [cs.MA] UPDATED)
    Successful deployment of multi-agent reinforcement learning often requires agents to adapt their behaviour. In this work, we discuss the problem of teamwork adaptation in which a team of agents needs to adapt their policies to solve novel tasks with limited fine-tuning. Motivated by the intuition that agents need to be able to identify and distinguish tasks in order to adapt their behaviour to the current task, we propose to learn multi-agent task embeddings (MATE). These task embeddings are trained using an encoder-decoder architecture optimised for reconstruction of the transition and reward functions which uniquely identify tasks. We show that a team of agents is able to adapt to novel tasks when provided with task embeddings. We propose three MATE training paradigms: independent MATE, centralised MATE, and mixed MATE which vary in the information used for the task encoding. We show that the embeddings learned by MATE identify tasks and provide useful information which agents leverage during adaptation to novel tasks.  ( 2 min )
    Reinforcement Learning with Maskable Stock Representation for Portfolio Management in Customizable Stock Pools. (arXiv:2311.10801v1 [q-fin.PM])
    Portfolio management (PM) is a fundamental financial trading task, which explores the optimal periodical reallocation of capitals into different stocks to pursue long-term profits. Reinforcement learning (RL) has recently shown its potential to train profitable agents for PM through interacting with financial markets. However, existing work mostly focuses on fixed stock pools, which is inconsistent with investors' practical demand. Specifically, the target stock pool of different investors varies dramatically due to their discrepancy on market states and individual investors may temporally adjust stocks they desire to trade (e.g., adding one popular stocks), which lead to customizable stock pools (CSPs). Existing RL methods require to retrain RL agents even with a tiny change of the stock pool, which leads to high computational cost and unstable performance. To tackle this challenge, we propose EarnMore, a rEinforcement leARNing framework with Maskable stOck REpresentation to handle PM with CSPs through one-shot training in a global stock pool (GSP). Specifically, we first introduce a mechanism to mask out the representation of the stocks outside the target pool. Second, we learn meaningful stock representations through a self-supervised masking and reconstruction process. Third, a re-weighting mechanism is designed to make the portfolio concentrate on favorable stocks and neglect the stocks outside the target pool. Through extensive experiments on 8 subset stock pools of the US stock market, we demonstrate that EarnMore significantly outperforms 14 state-of-the-art baselines in terms of 6 popular financial metrics with over 40% improvement on profit.  ( 3 min )
    Graph Encoder Ensemble for Simultaneous Vertex Embedding and Community Detection. (arXiv:2301.11290v2 [cs.SI] UPDATED)
    In this paper, we introduce a novel and computationally efficient method for vertex embedding, community detection, and community size determination. Our approach leverages a normalized one-hot graph encoder and a rank-based cluster size measure. Through extensive simulations, we demonstrate the excellent numerical performance of our proposed graph encoder ensemble algorithm.  ( 2 min )
    Reduce, Reuse, Recycle: Compositional Generation with Energy-Based Diffusion Models and MCMC. (arXiv:2302.11552v4 [cs.LG] UPDATED)
    Since their introduction, diffusion models have quickly become the prevailing approach to generative modeling in many domains. They can be interpreted as learning the gradients of a time-varying sequence of log-probability density functions. This interpretation has motivated classifier-based and classifier-free guidance as methods for post-hoc control of diffusion models. In this work, we build upon these ideas using the score-based interpretation of diffusion models, and explore alternative ways to condition, modify, and reuse diffusion models for tasks involving compositional generation and guidance. In particular, we investigate why certain types of composition fail using current techniques and present a number of solutions. We conclude that the sampler (not the model) is responsible for this failure and propose new samplers, inspired by MCMC, which enable successful compositional generation. Further, we propose an energy-based parameterization of diffusion models which enables the use of new compositional operators and more sophisticated, Metropolis-corrected samplers. Intriguingly we find these samplers lead to notable improvements in compositional generation across a wide set of problems such as classifier-guided ImageNet modeling and compositional text-to-image generation.  ( 3 min )
    A DeepONet multi-fidelity approach for residual learning in reduced order modeling. (arXiv:2302.12682v3 [math.NA] UPDATED)
    In the present work, we introduce a novel approach to enhance the precision of reduced order models by exploiting a multi-fidelity perspective and DeepONets. Reduced models provide a real-time numerical approximation by simplifying the original model. The error introduced by the such operation is usually neglected and sacrificed in order to reach a fast computation. We propose to couple the model reduction to a machine learning residual learning, such that the above-mentioned error can be learned by a neural network and inferred for new predictions. We emphasize that the framework maximizes the exploitation of high-fidelity information, using it for building the reduced order model and for learning the residual. In this work, we explore the integration of proper orthogonal decomposition (POD), and gappy POD for sensors data, with the recent DeepONet architecture. Numerical investigations for a parametric benchmark function and a nonlinear parametric Navier-Stokes problem are presented.  ( 2 min )
    Facial recognition technology and human raters can predict political orientation from images of expressionless faces even when controlling for demographics and self-presentation. (arXiv:2303.16343v2 [cs.CV] UPDATED)
    Carefully standardized facial images of 591 participants were taken in the laboratory, while controlling for self-presentation, facial expression, head orientation, and image properties. They were presented to human raters and a facial recognition algorithm: both humans (r=.21) and the algorithm (r=.22) could predict participants' scores on a political orientation scale (Cronbach's alpha=.94) decorrelated with age, gender, and ethnicity. These effects are on par with how well job interviews predict job success, or alcohol drives aggressiveness. Algorithm's predictive accuracy was even higher (r=.31) when it leveraged information on participants' age, gender, and ethnicity. Moreover, the associations between facial appearance and political orientation seem to generalize beyond our sample: The predictive model derived from standardized images (while controlling for age, gender, and ethnicity) could predict political orientation (r=.13) from naturalistic images of 3,401 politicians from the U.S., UK, and Canada. The analysis of facial features associated with political orientation revealed that conservatives tended to have larger lower faces. The predictability of political orientation from standardized images has critical implications for privacy, the regulation of facial recognition technology, and understanding the origins and consequences of political orientation.  ( 3 min )
    Extraction and Summarization of Explicit Video Content using Multi-Modal Deep Learning. (arXiv:2311.10899v1 [cs.CV])
    With the increase in video-sharing platforms across the internet, it is difficult for humans to moderate the data for explicit content. Hence, an automated pipeline to scan through video data for explicit content has become the need of the hour. We propose a novel pipeline that uses multi-modal deep learning to first extract the explicit segments of input videos and then summarize their content using text to determine its age appropriateness and age rating. We also evaluate our pipeline's effectiveness in the end using standard metrics.
    Equivariant Neural Operator Learning with Graphon Convolution. (arXiv:2311.10908v1 [cs.LG])
    We propose a general architecture that combines the coefficient learning scheme with a residual operator layer for learning mappings between continuous functions in the 3D Euclidean space. Our proposed model is guaranteed to achieve SE(3)-equivariance by design. From the graph spectrum view, our method can be interpreted as convolution on graphons (dense graphs with infinitely many nodes), which we term InfGCN. By leveraging both the continuous graphon structure and the discrete graph structure of the input data, our model can effectively capture the geometric information while preserving equivariance. Through extensive experiments on large-scale electron density datasets, we observed that our model significantly outperformed the current state-of-the-art architectures. Multiple ablation studies were also carried out to demonstrate the effectiveness of the proposed architecture.
    Comparing Generalization in Learning with Limited Numbers of Exemplars: Transformer vs. RNN in Attractor Dynamics. (arXiv:2311.10763v1 [cs.CL])
    ChatGPT, a widely-recognized large language model (LLM), has recently gained substantial attention for its performance scaling, attributed to the billions of web-sourced natural language sentences used for training. Its underlying architecture, Transformer, has found applications across diverse fields, including video, audio signals, and robotic movement. %The crucial question this raises concerns the Transformer's generalization-in-learning (GIL) capacity. However, this raises a crucial question about Transformer's generalization in learning (GIL) capacity. Is ChatGPT's success chiefly due to the vast dataset used for training, or is there more to the story? To investigate this, we compared Transformer's GIL capabilities with those of a traditional Recurrent Neural Network (RNN) in tasks involving attractor dynamics learning. For performance evaluation, the Dynamic Time Warping (DTW) method has been employed. Our simulation results suggest that under conditions of limited data availability, Transformer's GIL abilities are markedly inferior to those of RNN.
    Learning Deterministic Finite Automata from Confidence Oracles. (arXiv:2311.10963v1 [cs.FL])
    We discuss the problem of learning a deterministic finite automaton (DFA) from a confidence oracle. That is, we are given access to an oracle $Q$ with incomplete knowledge of some target language $L$ over an alphabet $\Sigma$; the oracle maps a string $x\in\Sigma^*$ to a score in the interval $[-1,1]$ indicating its confidence that the string is in the language. The interpretation is that the sign of the score signifies whether $x\in L$, while the magnitude $|Q(x)|$ represents the oracle's confidence. Our goal is to learn a DFA representation of the oracle that preserves the information that it is confident in. The learned DFA should closely match the oracle wherever it is highly confident, but it need not do this when the oracle is less sure of itself.
    Enhancing Robust Representation in Adversarial Training: Alignment and Exclusion Criteria. (arXiv:2310.03358v2 [cs.CV] UPDATED)
    Deep neural networks are vulnerable to adversarial noise. Adversarial Training (AT) has been demonstrated to be the most effective defense strategy to protect neural networks from being fooled. However, we find AT omits to learning robust features, resulting in poor performance of adversarial robustness. To address this issue, we highlight two criteria of robust representation: (1) Exclusion: \emph{the feature of examples keeps away from that of other classes}; (2) Alignment: \emph{the feature of natural and corresponding adversarial examples is close to each other}. These motivate us to propose a generic framework of AT to gain robust representation, by the asymmetric negative contrast and reverse attention. Specifically, we design an asymmetric negative contrast based on predicted probabilities, to push away examples of different classes in the feature space. Moreover, we propose to weight feature by parameters of the linear classifier as the reverse attention, to obtain class-aware feature and pull close the feature of the same class. Empirical evaluations on three benchmark datasets show our methods greatly advance the robustness of AT and achieve state-of-the-art performance.  ( 2 min )
    Analysis of frequent trading effects of various machine learning models. (arXiv:2311.10719v1 [q-fin.TR])
    In recent years, high-frequency trading has emerged as a crucial strategy in stock trading. This study aims to develop an advanced high-frequency trading algorithm and compare the performance of three different mathematical models: the combination of the cross-entropy loss function and the quasi-Newton algorithm, the FCNN model, and the vector machine. The proposed algorithm employs neural network predictions to generate trading signals and execute buy and sell operations based on specific conditions. By harnessing the power of neural networks, the algorithm enhances the accuracy and reliability of the trading strategy. To assess the effectiveness of the algorithm, the study evaluates the performance of the three mathematical models. The combination of the cross-entropy loss function and the quasi-Newton algorithm is a widely utilized logistic regression approach. The FCNN model, on the other hand, is a deep learning algorithm that can extract and classify features from stock data. Meanwhile, the vector machine is a supervised learning algorithm recognized for achieving improved classification results by mapping data into high-dimensional spaces. By comparing the performance of these three models, the study aims to determine the most effective approach for high-frequency trading. This research makes a valuable contribution by introducing a novel methodology for high-frequency trading, thereby providing investors with a more accurate and reliable stock trading strategy.
    Pre- to Post-Contrast Breast MRI Synthesis for Enhanced Tumour Segmentation. (arXiv:2311.10879v1 [eess.IV])
    Despite its benefits for tumour detection and treatment, the administration of contrast agents in dynamic contrast-enhanced MRI (DCE-MRI) is associated with a range of issues, including their invasiveness, bioaccumulation, and a risk of nephrogenic systemic fibrosis. This study explores the feasibility of producing synthetic contrast enhancements by translating pre-contrast T1-weighted fat-saturated breast MRI to their corresponding first DCE-MRI sequence leveraging the capabilities of a generative adversarial network (GAN). Additionally, we introduce a Scaled Aggregate Measure (SAMe) designed for quantitatively evaluating the quality of synthetic data in a principled manner and serving as a basis for selecting the optimal generative model. We assess the generated DCE-MRI data using quantitative image quality metrics and apply them to the downstream task of 3D breast tumour segmentation. Our results highlight the potential of post-contrast DCE-MRI synthesis in enhancing the robustness of breast tumour segmentation models via data augmentation. Our code is available at https://github.com/RichardObi/pre_post_synthesis.
    Attention Mechanism for Lithium-Ion Battery Lifespan Prediction: Temporal and Cyclic Attention. (arXiv:2311.10792v1 [cs.LG])
    Accurately predicting the lifespan of lithium-ion batteries (LIBs) is pivotal for optimizing usage and preventing accidents. Previous studies in constructing prediction models often relied on inputs challenging to measure in real-time operations and failed to capture intra-cycle and inter-cycle data patterns, essential features for accurate predictions, comprehensively. In this study, we employ attention mechanisms (AM) to develop data-driven models for predicting LIB lifespan using easily measurable inputs such as voltage, current, temperature, and capacity data. The developed model integrates recurrent neural network (RNN) and convolutional neural network (CNN) components, featuring two types of attention mechanisms: temporal attention (TA) and cyclic attention (CA). The inclusion of TA aims to identify important time steps within each cycle by scoring the hidden states of the RNN, whereas CA strives to capture key features of inter-cycle correlations through self-attention (SA). This enhances model accuracy and elucidates critical features in the input data. To validate our method, we apply it to publicly available cycling data consisting of three batches of cycling modes. The calculated TA scores highlight the rest phase as a key characteristic distinguishing LIB data among different batches. Additionally, CA scores reveal variations in the importance of cycles across batches. By leveraging CA scores, we explore the potential to reduce the number of cycles in the input data. The single-head and multi-head attentions enable us to decrease the input dimension from 100 to 50 and 30 cycles, respectively.
    The Hidden Linear Structure in Score-Based Models and its Application. (arXiv:2311.10892v1 [cs.AI])
    Score-based models have achieved remarkable results in the generative modeling of many domains. By learning the gradient of smoothed data distribution, they can iteratively generate samples from complex distribution e.g. natural images. However, is there any universal structure in the gradient field that will eventually be learned by any neural network? Here, we aim to find such structures through a normative analysis of the score function. First, we derived the closed-form solution to the scored-based model with a Gaussian score. We claimed that for well-trained diffusion models, the learned score at a high noise scale is well approximated by the linear score of Gaussian. We demonstrated this through empirical validation of pre-trained images diffusion model and theoretical analysis of the score function. This finding enabled us to precisely predict the initial diffusion trajectory using the analytical solution and to accelerate image sampling by 15-30\% by skipping the initial phase without sacrificing image quality. Our finding of the linear structure in the score-based model has implications for better model design and data pre-processing.
    Polynomial-Time Solutions for ReLU Network Training: A Complexity Classification via Max-Cut and Zonotopes. (arXiv:2311.10972v1 [cs.LG])
    We investigate the complexity of training a two-layer ReLU neural network with weight decay regularization. Previous research has shown that the optimal solution of this problem can be found by solving a standard cone-constrained convex program. Using this convex formulation, we prove that the hardness of approximation of ReLU networks not only mirrors the complexity of the Max-Cut problem but also, in certain special cases, exactly corresponds to it. In particular, when $\epsilon\leq\sqrt{84/83}-1\approx 0.006$, we show that it is NP-hard to find an approximate global optimizer of the ReLU network objective with relative error $\epsilon$ with respect to the objective value. Moreover, we develop a randomized algorithm which mirrors the Goemans-Williamson rounding of semidefinite Max-Cut relaxations. To provide polynomial-time approximations, we classify training datasets into three categories: (i) For orthogonal separable datasets, a precise solution can be obtained in polynomial-time. (ii) When there is a negative correlation between samples of different classes, we give a polynomial-time approximation with relative error $\sqrt{\pi/2}-1\approx 0.253$. (iii) For general datasets, the degree to which the problem can be approximated in polynomial-time is governed by a geometric factor that controls the diameter of two zonotopes intrinsic to the dataset. To our knowledge, these results present the first polynomial-time approximation guarantees along with first hardness of approximation results for regularized ReLU networks.
    Deep Neuromorphic Networks with Superconducting Single Flux Quanta. (arXiv:2311.10721v1 [cs.ET])
    Conventional semiconductor-based integrated circuits are gradually approaching fundamental scaling limits. Many prospective solutions have recently emerged to supplement or replace both the technology on which basic devices are built and the architecture of data processing. Neuromorphic circuits are a promising approach to computing where techniques used by the brain to achieve high efficiency are exploited. Many existing neuromorphic circuits rely on unconventional and useful properties of novel technologies to better mimic the operation of the brain. One such technology is single flux quantum (SFQ) logic -- a cryogenic superconductive technology in which the data are represented by quanta of magnetic flux (fluxons) produced and processed by Josephson junctions embedded within inductive loops. The movement of a fluxon within a circuit produces a quantized voltage pulse (SFQ pulse), resembling a neuronal spiking event. These circuits routinely operate at clock frequencies of tens to hundreds of gigahertz, making SFQ a natural technology for processing high frequency pulse trains. Prior proposals for SFQ neural networks often require energy-expensive fluxon conversions, involve heterogeneous technologies, or exclusively focus on device level behavior. In this paper, a design methodology for deep single flux quantum neuromorphic networks is presented. Synaptic and neuronal circuits based on SFQ technology are presented and characterized. Based on these primitives, a deep neuromorphic XOR network is evaluated as a case study, both at the architectural and circuit levels, achieving wide classification margins. The proposed methodology does not employ unconventional superconductive devices or semiconductor transistors. The resulting networks are tunable by an external current, making this proposed system an effective approach for scalable cryogenic neuromorphic computing.
    Multiparameter Persistent Homology for Molecular Property Prediction. (arXiv:2311.10808v1 [q-bio.BM])
    In this study, we present a novel molecular fingerprint generation method based on multiparameter persistent homology. This approach reveals the latent structures and relationships within molecular geometry, and detects topological features that exhibit persistence across multiple scales along multiple parameters, such as atomic mass, partial charge, and bond type, and can be further enhanced by incorporating additional parameters like ionization energy, electron affinity, chirality and orbital hybridization. The proposed fingerprinting method provides fresh perspectives on molecular structure that are not easily discernible from single-parameter or single-scale analysis. Besides, in comparison with traditional graph neural networks, multiparameter persistent homology has the advantage of providing a more comprehensive and interpretable characterization of the topology of the molecular data. We have established theoretical stability guarantees for multiparameter persistent homology, and have conducted extensive experiments on the Lipophilicity, FreeSolv, and ESOL datasets to demonstrate its effectiveness in predicting molecular properties.
    Text Sanitization Beyond Specific Domains: Zero-Shot Redaction & Substitution with Large Language Models. (arXiv:2311.10785v1 [cs.CL])
    In the context of information systems, text sanitization techniques are used to identify and remove sensitive data to comply with security and regulatory requirements. Even though many methods for privacy preservation have been proposed, most of them are focused on the detection of entities from specific domains (e.g., credit card numbers, social security numbers), lacking generality and requiring customization for each desirable domain. Moreover, removing words is, in general, a drastic measure, as it can degrade text coherence and contextual information. Less severe measures include substituting a word for a safe alternative, yet it can be challenging to automatically find meaningful substitutions. We present a zero-shot text sanitization technique that detects and substitutes potentially sensitive information using Large Language Models. Our evaluation shows that our method excels at protecting privacy while maintaining text coherence and contextual information, preserving data utility for downstream tasks.
    Degeneration of kernel regression with Matern kernels into low-order polynomial regression in high dimension. (arXiv:2311.10790v1 [physics.comp-ph])
    Kernel methods such as kernel ridge regression and Gaussian process regressions with Matern type kernels have been increasingly used, in particular, to fit potential energy surfaces (PES) and density functionals, and for materials informatics. When the dimensionality of the feature space is high, these methods are used with necessarily sparse data. In this regime, the optimal length parameter of a Matern-type kernel tends to become so large that the method effectively degenerates into a low-order polynomial regression and therefore loses any advantage over such regression. This is demonstrated theoretically as well as numerically on the examples of six- and fifteen-dimensional molecular PES using squared exponential and simple exponential kernels. The results shed additional light on the success of polynomial approximations such as PIP for medium size molecules and on the importance of orders-of-coupling based models for preserving the advantages of kernel methods with Matern type kernels or on the use of physically-motivated (reproducing) kernels.
    Classification Methods Based on Machine Learning for the Analysis of Fetal Health Data. (arXiv:2311.10962v1 [cs.LG])
    The persistent battle to decrease childhood mortality serves as a commonly employed benchmark for gauging advancements in the field of medicine. Globally, the under-5 mortality rate stands at approximately 5 million, with a significant portion of these deaths being avoidable. Given the significance of this problem, Machine learning-based techniques have emerged as a prominent tool for assessing fetal health. In this work, we have analyzed the classification performance of various machine learning models for fetal health analysis. Classification performance of various machine learning models, such as support vector machine (SVM), random forest(RF), and attentive interpretable tabular learning (TabNet) have been assessed on fetal health. Moreover, dimensionality reduction techniques, such as Principal component analysis (PCA) and Linear discriminant analysis (LDA) have been implemented to obtain better classification performance with less number of features. A TabNet model on a fetal health dataset provides a classification accuracy of 94.36%. In general, this technology empowers doctors and healthcare experts to achieve precise fetal health classification and identify the most influential features in the process.
    Near-Optimal Fair Resource Allocation for Strategic Agents without Money: A Data-Driven Approach. (arXiv:2311.10927v1 [cs.GT])
    We study learning-based design of fair allocation mechanisms for divisible resources, using proportional fairness (PF) as a benchmark. The learning setting is a significant departure from the classic mechanism design literature, in that, we need to learn fair mechanisms solely from data. In particular, we consider the challenging problem of learning one-shot allocation mechanisms -- without the use of money -- that incentivize strategic agents to be truthful when reporting their valuations. It is well-known that the mechanism that directly seeks to optimize PF is not incentive compatible, meaning that the agents can potentially misreport their preferences to gain increased allocations. We introduce the notion of "exploitability" of a mechanism to measure the relative gain in utility from misreport, and make the following important contributions in the paper: (i) Using sophisticated techniques inspired by differentiable convex programming literature, we design a numerically efficient approach for computing the exploitability of the PF mechanism. This novel contribution enables us to quantify the gap that needs to be bridged to approximate PF via incentive compatible mechanisms. (ii) Next, we modify the PF mechanism to introduce a trade-off between fairness and exploitability. By properly controlling this trade-off using data, we show that our proposed mechanism, ExPF-Net, provides a strong approximation to the PF mechanism while maintaining low exploitability. This mechanism, however, comes with a high computational cost. (iii) To address the computational challenges, we propose another mechanism ExS-Net, which is end-to-end parameterized by a neural network. ExS-Net enjoys similar (slightly inferior) performance and significantly accelerated training and inference time performance. (iv) Extensive numerical simulations demonstrate the robustness and efficacy of the proposed mechanisms.
    MVSA-Net: Multi-View State-Action Recognition for Robust and Deployable Trajectory Generation. (arXiv:2311.08393v2 [cs.CV] UPDATED)
    The learn-from-observation (LfO) paradigm is a human-inspired mode for a robot to learn to perform a task simply by watching it being performed. LfO can facilitate robot integration on factory floors by minimizing disruption and reducing tedious programming. A key component of the LfO pipeline is a transformation of the depth camera frames to the corresponding task state and action pairs, which are then relayed to learning techniques such as imitation or inverse reinforcement learning for understanding the task parameters. While several existing computer vision models analyze videos for activity recognition, SA-Net specifically targets robotic LfO from RGB-D data. However, SA-Net and many other models analyze frame data captured from a single viewpoint. Their analysis is therefore highly sensitive to occlusions of the observed task, which are frequent in deployments. An obvious way of reducing occlusions is to simultaneously observe the task from multiple viewpoints and synchronously fuse the multiple streams in the model. Toward this, we present multi-view SA-Net, which generalizes the SA-Net model to allow the perception of multiple viewpoints of the task activity, integrate them, and better recognize the state and action in each frame. Performance evaluations on two distinct domains establish that MVSA-Net recognizes the state-action pairs under occlusion more accurately compared to single-view MVSA-Net and other baselines. Our ablation studies further evaluate its performance under different ambient conditions and establish the contribution of the architecture components. As such, MVSA-Net offers a significantly more robust and deployable state-action trajectory generation compared to previous methods.
    Multiple View Geometry Transformers for 3D Human Pose Estimation. (arXiv:2311.10983v1 [cs.CV])
    In this work, we aim to improve the 3D reasoning ability of Transformers in multi-view 3D human pose estimation. Recent works have focused on end-to-end learning-based transformer designs, which struggle to resolve geometric information accurately, particularly during occlusion. Instead, we propose a novel hybrid model, MVGFormer, which has a series of geometric and appearance modules organized in an iterative manner. The geometry modules are learning-free and handle all viewpoint-dependent 3D tasks geometrically which notably improves the model's generalization ability. The appearance modules are learnable and are dedicated to estimating 2D poses from image signals end-to-end which enables them to achieve accurate estimates even when occlusion occurs, leading to a model that is both accurate and generalizable to new cameras and geometries. We evaluate our approach for both in-domain and out-of-domain settings, where our model consistently outperforms state-of-the-art methods, and especially does so by a significant margin in the out-of-domain setting. We will release the code and models: https://github.com/XunshanMan/MVGFormer.
    ADAMM: Anomaly Detection of Attributed Multi-graphs with Metadata: A Unified Neural Network Approach. (arXiv:2311.07355v2 [cs.LG] UPDATED)
    Given a complex graph database of node- and edge-attributed multi-graphs as well as associated metadata for each graph, how can we spot the anomalous instances? Many real-world problems can be cast as graph inference tasks where the graph representation could capture complex relational phenomena (e.g., transactions among financial accounts in a journal entry), along with metadata reflecting tabular features (e.g. approver, effective date, etc.). While numerous anomaly detectors based on Graph Neural Networks (GNNs) have been proposed, none are capable of directly handling directed graphs with multi-edges and self-loops. Furthermore, the simultaneous handling of relational and tabular features remains an unexplored area. In this work we propose ADAMM, a novel graph neural network model that handles directed multi-graphs, providing a unified end-to-end architecture that fuses metadata and graph-level representation learning through an unsupervised anomaly detection objective. Experiments on datasets from two different domains, namely, general-ledger journal entries from different firms (accounting) as well as human GPS trajectories from thousands of individuals (urban mobility) validate ADAMM's generality and detection effectiveness of expert-guided and ground-truth anomalies. Notably, ADAMM outperforms existing baselines that handle the two data modalities (graph and metadata) separately with post hoc synthesis efforts.
    Safety-aware Causal Representation for Trustworthy Reinforcement Learning in Autonomous Driving. (arXiv:2311.10747v1 [cs.RO])
    In the domain of autonomous driving, the Learning from Demonstration (LfD) paradigm has exhibited notable efficacy in addressing sequential decision-making problems. However, consistently achieving safety in varying traffic contexts, especially in safety-critical scenarios, poses a significant challenge due to the long-tailed and unforeseen scenarios absent from offline datasets. In this paper, we introduce the saFety-aware strUctured Scenario representatION (FUSION), a pioneering methodology conceived to facilitate the learning of an adaptive end-to-end driving policy by leveraging structured scenario information. FUSION capitalizes on the causal relationships between decomposed reward, cost, state, and action space, constructing a framework for structured sequential reasoning under dynamic traffic environments. We conduct rigorous evaluations in two typical real-world settings of distribution shift in autonomous vehicles, demonstrating the good balance between safety cost and utility reward of FUSION compared to contemporary state-of-the-art safety-aware LfD baselines. Empirical evidence under diverse driving scenarios attests that FUSION significantly enhances the safety and generalizability of autonomous driving agents, even in the face of challenging and unseen environments. Furthermore, our ablation studies reveal noticeable improvements in the integration of causal representation into the safe offline RL problem.
    SEA++: Multi-Graph-based High-Order Sensor Alignment for Multivariate Time-Series Unsupervised Domain Adaptation. (arXiv:2311.10806v1 [cs.LG])
    Unsupervised Domain Adaptation (UDA) methods have been successful in reducing label dependency by minimizing the domain discrepancy between a labeled source domain and an unlabeled target domain. However, these methods face challenges when dealing with Multivariate Time-Series (MTS) data. MTS data typically consist of multiple sensors, each with its own unique distribution. This characteristic makes it hard to adapt existing UDA methods, which mainly focus on aligning global features while overlooking the distribution discrepancies at the sensor level, to reduce domain discrepancies for MTS data. To address this issue, a practical domain adaptation scenario is formulated as Multivariate Time-Series Unsupervised Domain Adaptation (MTS-UDA). In this paper, we propose SEnsor Alignment (SEA) for MTS-UDA, aiming to reduce domain discrepancy at both the local and global sensor levels. At the local sensor level, we design endo-feature alignment, which aligns sensor features and their correlations across domains. To reduce domain discrepancy at the global sensor level, we design exo-feature alignment that enforces restrictions on global sensor features. We further extend SEA to SEA++ by enhancing the endo-feature alignment. Particularly, we incorporate multi-graph-based high-order alignment for both sensor features and their correlations. Extensive empirical results have demonstrated the state-of-the-art performance of our SEA and SEA++ on public MTS datasets for MTS-UDA.
    Adaptive Modelling Approach for Row-Type Dependent Predictive Analysis (RTDPA): A Framework for Designing Machine Learning Models for Credit Risk Analysis in Banking Sector. (arXiv:2311.10799v1 [cs.LG])
    In many real-world datasets, rows may have distinct characteristics and require different modeling approaches for accurate predictions. In this paper, we propose an adaptive modeling approach for row-type dependent predictive analysis(RTDPA). Our framework enables the development of models that can effectively handle diverse row types within a single dataset. Our dataset from XXX bank contains two different risk categories, personal loan and agriculture loan. each of them are categorised into four classes standard, sub-standard, doubtful and loss. We performed tailored data pre processing and feature engineering to different row types. We selected traditional machine learning predictive models and advanced ensemble techniques. Our findings indicate that all predictive approaches consistently achieve a precision rate of no less than 90%. For RTDPA, the algorithms are applied separately for each row type, allowing the models to capture the specific patterns and characteristics of each row type. This approach enables targeted predictions based on the row type, providing a more accurate and tailored classification for the given dataset.Additionally, the suggested model consistently offers decision makers valuable and enduring insights that are strategic in nature in banking sector.
    Multi Time Scale World Models. (arXiv:2310.18534v2 [cs.LG] UPDATED)
    Intelligent agents use internal world models to reason and make predictions about different courses of their actions at many scales. Devising learning paradigms and architectures that allow machines to learn world models that operate at multiple levels of temporal abstractions while dealing with complex uncertainty predictions is a major technical hurdle. In this work, we propose a probabilistic formalism to learn multi-time scale world models which we call the Multi Time Scale State Space (MTS3) model. Our model uses a computationally efficient inference scheme on multiple time scales for highly accurate long-horizon predictions and uncertainty estimates over several seconds into the future. Our experiments, which focus on action conditional long horizon future predictions, show that MTS3 outperforms recent methods on several system identification benchmarks including complex simulated and real-world dynamical systems.  ( 2 min )
    Sustainable Concrete via Bayesian Optimization. (arXiv:2310.18288v3 [cs.LG] UPDATED)
    Eight percent of global carbon dioxide emissions can be attributed to the production of cement, the main component of concrete, which is also the dominant source of CO2 emissions in the construction of data centers. The discovery of lower-carbon concrete formulae is therefore of high significance for sustainability. However, experimenting with new concrete formulae is time consuming and labor intensive, as one usually has to wait to record the concrete's 28-day compressive strength, a quantity whose measurement can by its definition not be accelerated. This provides an opportunity for experimental design methodology like Bayesian Optimization (BO) to accelerate the search for strong and sustainable concrete formulae. Herein, we 1) propose modeling steps that make concrete strength amenable to be predicted accurately by a Gaussian process model with relatively few measurements, 2) formulate the search for sustainable concrete as a multi-objective optimization problem, and 3) leverage the proposed model to carry out multi-objective BO with real-world strength measurements of the algorithmically proposed mixes. Our experimental results show improved trade-offs between the mixtures' global warming potential (GWP) and their associated compressive strengths, compared to mixes based on current industry practices. Our methods are open-sourced at github.com/facebookresearch/SustainableConcrete.  ( 2 min )
    Verified Compositional Neuro-Symbolic Control for Stochastic Systems with Temporal Logic Tasks. (arXiv:2311.10863v1 [cs.RO])
    Several methods have been proposed recently to learn neural network (NN) controllers for autonomous agents, with unknown and stochastic dynamics, tasked with complex missions captured by Linear Temporal Logic (LTL). Due to the sample-inefficiency of the majority of these works, compositional learning methods have been proposed decomposing the LTL specification into smaller sub-tasks. Then, separate controllers are learned and composed to satisfy the original task. A key challenge within these approaches is that they often lack safety guarantees or the provided guarantees are impractical. This paper aims to address this challenge. Particularly, we consider autonomous systems with unknown and stochastic dynamics and LTL-encoded tasks. We assume that the system is equipped with a finite set of base skills modeled by trained NN feedback controllers. Our goal is to check if there exists a temporal composition of the trained NN controllers - and if so, to compute it - that will yield a composite system behavior that satisfies the assigned LTL task with probability one. We propose a new approach that relies on a novel integration of automata theory and data-driven reachability analysis tools for NN-controlled stochastic systems. The resulting neuro-symbolic controller allows the agent to generate safe behaviors for unseen complex temporal logic tasks in a zero-shot fashion by leveraging its base skills. We show correctness of the proposed method and we provide conditions under which it is complete. To the best of our knowledge, this is the first work that designs verified temporal compositions of NN controllers for unknown and stochastic systems. Finally, we provide extensive numerical simulations and hardware experiments on robot navigation tasks to demonstrate the proposed method.
    Wideband Audio Waveform Evaluation Networks: Efficient, Accurate Estimation of Speech Qualities. (arXiv:2206.13272v2 [eess.AS] UPDATED)
    Wideband Audio Waveform Evaluation Networks (WAWEnets) are convolutional neural networks that operate directly on wideband audio waveforms in order to produce evaluations of those waveforms. In the present work these evaluations give qualities of telecommunications speech (e.g., noisiness, intelligibility, overall speech quality). WAWEnets are no-reference networks because they do not require ``reference'' (original or undistorted) versions of the waveforms they evaluate. Our initial WAWEnet publication introduced four WAWEnets and each emulated the output of an established full-reference speech quality or intelligibility estimation algorithm. We have updated the WAWEnet architecture to be more efficient and effective. Here we present a single WAWEnet that closely tracks seven different quality and intelligibility values. We create a second network that additionally tracks four subjective speech quality dimensions. We offer a third network that focuses on just subjective quality scores and achieves very high levels of agreement. This work has leveraged 334 hours of speech in 13 languages, over two million full-reference target values and over 93,000 subjective mean opinion scores. We also interpret the operation of WAWEnets and identify the key to their operation using the language of signal processing: ReLUs strategically move spectral information from non-DC components into the DC component. The DC values of 96 output signals define a vector in a 96-D latent space and this vector is then mapped to a quality or intelligibility value for the input waveform.  ( 3 min )
    Open-Ended Instructable Embodied Agents with Memory-Augmented Large Language Models. (arXiv:2310.15127v2 [cs.AI] UPDATED)
    Pre-trained and frozen large language models (LLMs) can effectively map simple scene rearrangement instructions to programs over a robot's visuomotor functions through appropriate few-shot example prompting. To parse open-domain natural language and adapt to a user's idiosyncratic procedures, not known during prompt engineering time, fixed prompts fall short. In this paper, we introduce HELPER, an embodied agent equipped with an external memory of language-program pairs that parses free-form human-robot dialogue into action programs through retrieval-augmented LLM prompting: relevant memories are retrieved based on the current dialogue, instruction, correction, or VLM description, and used as in-context prompt examples for LLM querying. The memory is expanded during deployment to include pairs of user's language and action plans, to assist future inferences and personalize them to the user's language and routines. HELPER sets a new state-of-the-art in the TEACh benchmark in both Execution from Dialog History (EDH) and Trajectory from Dialogue (TfD), with a 1.7x improvement over the previous state-of-the-art for TfD. Our models, code, and video results can be found in our project's website: https://helper-agent-llm.github.io.
    Reward Teaching for Federated Multi-armed Bandits. (arXiv:2305.02441v2 [stat.ML] UPDATED)
    Most of the existing federated multi-armed bandits (FMAB) designs are based on the presumption that clients will implement the specified design to collaborate with the server. In reality, however, it may not be possible to modify the clients' existing protocols. To address this challenge, this work focuses on clients who always maximize their individual cumulative rewards, and introduces a novel idea of ``reward teaching'', where the server guides the clients towards global optimality through implicit local reward adjustments. Under this framework, the server faces two tightly coupled tasks of bandit learning and target teaching, whose combination is non-trivial and challenging. A phased approach, called Teaching-After-Learning (TAL), is first designed to encourage and discourage clients' explorations separately. General performance analyses of TAL are established when the clients' strategies satisfy certain mild requirements. With novel technical approaches developed to analyze the warm-start behaviors of bandit algorithms, particularized guarantees of TAL with clients running UCB or epsilon-greedy strategies are then obtained. These results demonstrate that TAL achieves logarithmic regrets while only incurring logarithmic adjustment costs, which is order-optimal w.r.t. a natural lower bound. As a further extension, the Teaching-While-Learning (TWL) algorithm is developed with the idea of successive arm elimination to break the non-adaptive phase separation in TAL. Rigorous analyses demonstrate that when facing clients with UCB1, TWL outperforms TAL in terms of the dependencies on sub-optimality gaps thanks to its adaptive design. Experimental results demonstrate the effectiveness and generality of the proposed algorithms.
    On Practical Robust Reinforcement Learning: Practical Uncertainty Set and Double-Agent Algorithm. (arXiv:2305.06657v3 [cs.LG] UPDATED)
    Robust reinforcement learning (RRL) aims at seeking a robust policy to optimize the worst case performance over an uncertainty set of Markov decision processes (MDPs). This set contains some perturbed MDPs from a nominal MDP (N-MDP) that generate samples for training, which reflects some potential mismatches between training (i.e., N-MDP) and true environments. In this paper we present an elaborated uncertainty set by excluding some implausible MDPs from the existing sets. Under this uncertainty set, we develop a sample-based RRL algorithm (named ARQ-Learning) for tabular setting and characterize its finite-time error bound. Also, it is proved that ARQ-Learning converges as fast as the standard Q-Learning and robust Q-Learning while ensuring better robustness. We introduce an additional pessimistic agent which can tackle the major bottleneck for the extension of ARQ-Learning into the cases with larger or continuous state spaces. Incorporating this idea into RL algorithms, we propose double-agent algorithms for model-free RRL. Via experiments, we demonstrate the effectiveness of the proposed algorithms.
    Diffusion Model-Augmented Behavioral Cloning. (arXiv:2302.13335v3 [cs.LG] UPDATED)
    Imitation learning addresses the challenge of learning by observing an expert's demonstrations without access to reward signals from environments. Most existing imitation learning methods that do not require interacting with environments either model the expert distribution as the conditional probability p(a|s) (e.g., behavioral cloning, BC) or the joint probability p(s, a). Despite its simplicity, modeling the conditional probability with BC usually struggles with generalization. While modeling the joint probability can lead to improved generalization performance, the inference procedure is often time-consuming and the model can suffer from manifold overfitting. This work proposes an imitation learning framework that benefits from modeling both the conditional and joint probability of the expert distribution. Our proposed diffusion model-augmented behavioral cloning (DBC) employs a diffusion model trained to model expert behaviors and learns a policy to optimize both the BC loss (conditional) and our proposed diffusion model loss (joint). DBC outperforms baselines in various continuous control tasks in navigation, robot arm manipulation, dexterous manipulation, and locomotion. We design additional experiments to verify the limitations of modeling either the conditional probability or the joint probability of the expert distribution as well as compare different generative models. Ablation studies justify the effectiveness of our design choices.
    Many learning agents interacting with an agent-based market model. (arXiv:2303.07393v3 [q-fin.TR] UPDATED)
    We consider the dynamics and the interactions of multiple reinforcement learning optimal execution trading agents interacting with a reactive Agent-Based Model (ABM) of a financial market in event time. The model represents a market ecology with 3-trophic levels represented by: optimal execution learning agents, minimally intelligent liquidity takers, and fast electronic liquidity providers. The optimal execution agent classes include buying and selling agents that can either use a combination of limit orders and market orders, or only trade using market orders. The reward function explicitly balances trade execution slippage against the penalty of not executing the order timeously. This work demonstrates how multiple competing learning agents impact a minimally intelligent market simulation as functions of the number of agents, the size of agents' initial orders, and the state spaces used for learning. We use phase space plots to examine the dynamics of the ABM, when various specifications of learning agents are included. Further, we examine whether the inclusion of optimal execution agents that can learn is able to produce dynamics with the same complexity as empirical data. We find that the inclusion of optimal execution agents changes the stylised facts produced by ABM to conform more with empirical data, and are a necessary inclusion for ABMs investigating market micro-structure. However, including execution agents to chartist-fundamentalist-noise ABMs is insufficient to recover the complexity observed in empirical data.
    Interpreting Black-box Machine Learning Models for High Dimensional Datasets. (arXiv:2208.13405v3 [cs.LG] UPDATED)
    Deep neural networks (DNNs) have been shown to outperform traditional machine learning algorithms in a broad variety of application domains due to their effectiveness in modeling complex problems and handling high-dimensional datasets. Many real-life datasets, however, are of increasingly high dimensionality, where a large number of features may be irrelevant for both supervised and unsupervised learning tasks. The inclusion of such features would not only introduce unwanted noise but also increase computational complexity. Furthermore, due to high non-linearity and dependency among a large number of features, DNN models tend to be unavoidably opaque and perceived as black-box methods because of their not well-understood internal functioning. Their algorithmic complexity is often simply beyond the capacities of humans to understand the interplay among myriads of hyperparameters. A well-interpretable model can identify statistically significant features and explain the way they affect the model's outcome. In this paper, we propose an efficient method to improve the interpretability of black-box models for classification tasks in the case of high-dimensional datasets. First, we train a black-box model on a high-dimensional dataset to learn the embeddings on which the classification is performed. To decompose the inner working principles of the black-box model and to identify top-k important features, we employ different probing and perturbing techniques. We then approximate the behavior of the black-box model by means of an interpretable surrogate model on the top-k feature space. Finally, we derive decision rules and local explanations from the surrogate model to explain individual decisions. Our approach outperforms state-of-the-art methods like TabNet and XGboost when tested on different datasets with varying dimensionality between 50 and 20,000 w.r.t metrics and explainability.
    Relationship between Batch Size and Number of Steps Needed for Nonconvex Optimization of Stochastic Gradient Descent using Armijo Line Search. (arXiv:2307.13831v3 [cs.LG] UPDATED)
    Stochastic gradient descent (SGD) is the simplest deep learning optimizer with which to train deep neural networks. While SGD can use various learning rates, such as constant or diminishing rates, the previous numerical results showed that SGD performs better than other deep learning optimizers using when it uses learning rates given by line search methods. In this paper, we perform a convergence analysis on SGD with a learning rate given by an Armijo line search for nonconvex optimization. The analysis indicates that the upper bound of the expectation of the squared norm of the full gradient becomes small when the number of steps and the batch size are large. Next, we show that, for SGD with the Armijo-line-search learning rate, the number of steps needed for nonconvex optimization is a monotone decreasing convex function of the batch size; that is, the number of steps needed for nonconvex optimization decreases as the batch size increases. Furthermore, we show that the stochastic first-order oracle (SFO) complexity, which is the stochastic gradient computation cost, is a convex function of the batch size; that is, there exists a critical batch size that minimizes the SFO complexity. Finally, we provide numerical results that support our theoretical results. The numerical results indicate that the number of steps needed for training deep neural networks decreases as the batch size increases and that there exist the critical batch sizes that can be estimated from the theoretical results.
    Assurance for Deployed Continual Learning Systems. (arXiv:2311.10787v1 [cs.LG])
    The future success of the Navy will depend, in part, on artificial intelligence. In practice, many artificially intelligent algorithms, and in particular deep learning models, rely on continual learning to maintain performance in dynamic environments. The software requires adaptation to maintain its initial level of performance in unseen situations. However, if not monitored properly, continual learning may lead to several issues including catastrophic forgetting in which a trained model forgets previously learned tasks when being retrained on new data. The authors created a new framework for safely performing continual learning with the goal of pairing this safety framework with a deep learning computer vision algorithm to allow for safe and high-performing automatic deck tracking on carriers and amphibious assault ships. The safety framework includes several features, such as an ensemble of convolutional neural networks to perform image classification, a manager to record confidences and determine the best answer from the ensemble, a model of the environment to predict when the system may fail to meet minimum performance metrics, a performance monitor to log system and domain performance and check against requirements, and a retraining component to update the ensemble and manager to maintain performance. The authors validated the proposed method using extensive simulation studies based on dynamic image classification. The authors showed the safety framework could probabilistically detect out of distribution data. The results also show the framework can detect when the system is no longer performing safely and can significantly extend the working envelope of an image classifier.
    Accelerating L-shaped Two-stage Stochastic SCUC with Learning Integrated Benders Decomposition. (arXiv:2311.10835v1 [math.OC])
    Benders decomposition is widely used to solve large mixed-integer problems. This paper takes advantage of machine learning and proposes enhanced variants of Benders decomposition for solving two-stage stochastic security-constrained unit commitment (SCUC). The problem is decomposed into a master problem and subproblems corresponding to a load scenario. The goal is to reduce the computational costs and memory usage of Benders decomposition by creating tighter cuts and reducing the size of the master problem. Three approaches are proposed, namely regression Benders, classification Benders, and regression-classification Benders. A regressor reads load profile scenarios and predicts subproblem objective function proxy variables to form tighter cuts for the master problem. A criterion is defined to measure the level of usefulness of cuts with respect to their contribution to lower bound improvement. Useful cuts that contain the necessary information to form the feasible region are identified with and without a classification learner. Useful cuts are iteratively added to the master problem, and non-useful cuts are discarded to reduce the computational burden of each Benders iteration. Simulation studies on multiple test systems show the effectiveness of the proposed learning-aided Benders decomposition for solving two-stage SCUC as compared to conventional multi-cut Benders decomposition.
    oneDNN Graph Compiler: A Hybrid Approach for High-Performance Deep Learning Compilation. (arXiv:2301.01333v2 [cs.LG] UPDATED)
    With the rapid development of deep learning models and hardware support for dense computing, the deep learning workload characteristics changed significantly from a few hot spots on compute-intensive operations to a broad range of operations scattered across the models. Accelerating a few compute-intensive operations using the expert-tuned implementation of primitives does not fully exploit the performance potential of AI hardware. Various efforts have been made to compile a full deep neural network (DNN) graph. One of the biggest challenges is to achieve high-performance tensor compilation by generating expert level performance code for the dense compute-intensive operations and applying compilation optimization at the scope of DNN computation graph across multiple compute-intensive operations. We present oneDNN Graph Compiler, a tensor compiler that employs a hybrid approach of using techniques from both compiler optimization and expert-tuned kernels for high performance code generation of the deep neural network graph. oneDNN Graph Compiler addresses unique optimization challenges in the deep learning domain, such as low-precision computation, aggressive fusion of graph operations, optimization for static tensor shapes and memory layout, constant weight optimization, and memory buffer reuse. Experimental results demonstrate significant performance gains over existing tensor compiler and primitives library for performance-critical DNN computation graphs and end-to-end models on Intel Xeon Scalable Processors.
    Adversarial Examples Are Not Real Features. (arXiv:2310.18936v2 [cs.LG] UPDATED)
    The existence of adversarial examples has been a mystery for years and attracted much interest. A well-known theory by \citet{ilyas2019adversarial} explains adversarial vulnerability from a data perspective by showing that one can extract non-robust features from adversarial examples and these features alone are useful for classification. However, the explanation remains quite counter-intuitive since non-robust features are mostly noise features to humans. In this paper, we re-examine the theory from a larger context by incorporating multiple learning paradigms. Notably, we find that contrary to their good usefulness under supervised learning, non-robust features attain poor usefulness when transferred to other self-supervised learning paradigms, such as contrastive learning, masked image modeling, and diffusion models. It reveals that non-robust features are not really as useful as robust or natural features that enjoy good transferability between these paradigms. Meanwhile, for robustness, we also show that naturally trained encoders from robust features are largely non-robust under AutoAttack. Our cross-paradigm examination suggests that the non-robust features are not really useful but more like paradigm-wise shortcuts, and robust features alone might be insufficient to attain reliable model robustness. Code is available at \url{https://github.com/PKU-ML/AdvNotRealFeatures}.  ( 2 min )
    Learning Multiscale Non-stationary Causal Structures. (arXiv:2208.14989v2 [cs.LG] UPDATED)
    This paper addresses a gap in the current state of the art by providing a solution for modeling causal relationships that evolve over time and occur at different time scales. Specifically, we introduce the multiscale non-stationary directed acyclic graph (MN-DAG), a framework for modeling multivariate time series data. Our contribution is twofold. Firstly, we expose a probabilistic generative model by leveraging results from spectral and causality theories. Our model allows sampling an MN-DAG according to user-specified priors on the time-dependence and multiscale properties of the causal graph. Secondly, we devise a Bayesian method named Multiscale Non-stationary Causal Structure Learner (MN-CASTLE) that uses stochastic variational inference to estimate MN-DAGs. The method also exploits information from the local partial correlation between time series over different time resolutions. The data generated from an MN-DAG reproduces well-known features of time series in different domains, such as volatility clustering and serial correlation. Additionally, we show the superior performance of MN-CASTLE on synthetic data with different multiscale and non-stationary properties compared to baseline models. Finally, we apply MN-CASTLE to identify the drivers of the natural gas prices in the US market. Causal relationships have strengthened during the COVID-19 outbreak and the Russian invasion of Ukraine, a fact that baseline methods fail to capture. MN-CASTLE identifies the causal impact of critical economic drivers on natural gas prices, such as seasonal factors, economic uncertainty, oil prices, and gas storage deviations.
    Token-level Adaptation of LoRA Adapters for Downstream Task Generalization. (arXiv:2311.10847v1 [cs.CL])
    This paper introduces a method for adapting LoRA adapters in smaller-sized language models to arbitrary downstream tasks. Unlike standard mixture-of-expert architectures, our method employs a gradient-free routing function to choose a weighted combination of experts without increasing the compute requirements for training or inference. The results show that token-level adaptation of LoRA adapters outperforms the base Llama-2-7b model across mathematical (GSM8K), scientific (ARC-Challenge), reading comprehension (SQuAD), and coding (CodeAlpaca-20k) tasks. Further evaluations also show that the average performance of token-level adaptation outperforms individual models fine-tuned for each of the tasks with the best performance observed in adaptation of every-other token during inference. The code for this study is made available through a public repository.
    Explainable AI for engineering design: A unified approach of systems engineering and component-based deep learning. (arXiv:2108.13836v4 [cs.LG] UPDATED)
    Data-driven models created by machine learning gain in importance in all fields of design and engineering. They have high potential to assist decision-makers in creating novel artefacts with better performance and sustainability. However, limited generalization and the black-box nature of these models lead to limited explainability and reusability. To overcome this situation, we propose a component-based approach to create partial component models by machine learning (ML). This component-based approach aligns deep learning with systems engineering (SE). For the domain of energy efficient building design, we first demonstrate better generalization of the component-based method by analyzing prediction accuracy outside the training data. We observe a much higher accuracy (R2 = 0.94) compared to conventional monolithic methods (R2 = 0.71). Second, we illustrate explainability by exemplary demonstrating how sensitivity information from SE and rules from low-depth decision trees serve engineering. Third, we evaluate explainability by qualitative and quantitative methods demonstrating the matching of preliminary knowledge and data-driven derived strategies and show correctness of activations at component interfaces compared to white-box simulation results (envelope components: R2 = 0.92..0.99; zones: R2 = 0.78..0.93). The key for component-based explainability is that activations at interfaces between the components are interpretable engineering quantities. The large range of possible configurations in composing components allows the examination of novel unseen design cases with understandable data-driven models. The matching of parameter ranges of components by similar probability distribution produces reusable, well-generalizing, and trustworthy models. The approach adapts the model structure to engineering methods of systems engineering and to domain knowledge.
    Data Selection for Language Models via Importance Resampling. (arXiv:2302.03169v3 [cs.CL] UPDATED)
    Selecting a suitable pretraining dataset is crucial for both general-domain (e.g., GPT-3) and domain-specific (e.g., Codex) language models (LMs). We formalize this problem as selecting a subset of a large raw unlabeled dataset to match a desired target distribution given unlabeled target samples. Due to the scale and dimensionality of the raw text data, existing methods use simple heuristics or require human experts to manually curate data. Instead, we extend the classic importance resampling approach used in low-dimensions for LM data selection. We propose Data Selection with Importance Resampling (DSIR), an efficient and scalable framework that estimates importance weights in a reduced feature space for tractability and selects data with importance resampling according to these weights. We instantiate the DSIR framework with hashed n-gram features for efficiency, enabling the selection of 100M documents from the full Pile dataset in 4.5 hours. To measure whether hashed n-gram features preserve the aspects of the data that are relevant to the target, we define KL reduction, a data metric that measures the proximity between the selected pretraining data and the target on some feature space. Across 8 data selection methods (including expert selection), KL reduction on hashed n-gram features highly correlates with average downstream accuracy (r=0.82). When selecting data for continued pretraining on a specific domain, DSIR performs comparably to expert curation across 8 target distributions. When pretraining general-domain models (target is Wikipedia and books), DSIR improves over random selection and heuristic filtering baselines by 2-2.5% on the GLUE benchmark. Code is available at https://github.com/p-lambda/dsir.
    Wasserstein Convergence Guarantees for a General Class of Score-Based Generative Models. (arXiv:2311.11003v1 [cs.LG])
    Score-based generative models (SGMs) is a recent class of deep generative models with state-of-the-art performance in many applications. In this paper, we establish convergence guarantees for a general class of SGMs in 2-Wasserstein distance, assuming accurate score estimates and smooth log-concave data distribution. We specialize our result to several concrete SGMs with specific choices of forward processes modelled by stochastic differential equations, and obtain an upper bound on the iteration complexity for each model, which demonstrates the impacts of different choices of the forward processes. We also provide a lower bound when the data distribution is Gaussian. Numerically, we experiment SGMs with different forward processes, some of which are newly proposed in this paper, for unconditional image generation on CIFAR-10. We find that the experimental results are in good agreement with our theoretical predictions on the iteration complexity, and the models with our newly proposed forward processes can outperform existing models.
    Robustness Enhancement in Neural Networks with Alpha-Stable Training Noise. (arXiv:2311.10803v1 [cs.LG])
    With the increasing use of deep learning on data collected by non-perfect sensors and in non-perfect environments, the robustness of deep learning systems has become an important issue. A common approach for obtaining robustness to noise has been to train deep learning systems with data augmented with Gaussian noise. In this work, we challenge the common choice of Gaussian noise and explore the possibility of stronger robustness for non-Gaussian impulsive noise, specifically alpha-stable noise. Justified by the Generalized Central Limit Theorem and evidenced by observations in various application areas, alpha-stable noise is widely present in nature. By comparing the testing accuracy of models trained with Gaussian noise and alpha-stable noise on data corrupted by different noise, we find that training with alpha-stable noise is more effective than Gaussian noise, especially when the dataset is corrupted by impulsive noise, thus improving the robustness of the model. The generality of this conclusion is validated through experiments conducted on various deep learning models with image and time series datasets, and other benchmark corrupted datasets. Consequently, we propose a novel data augmentation method that replaces Gaussian noise, which is typically added to the training data, with alpha-stable noise.
    Compact and Intuitive Airfoil Parameterization Method through Physics-aware Variational Autoencoder. (arXiv:2311.10921v1 [cs.LG])
    Airfoil shape optimization plays a critical role in the design of high-performance aircraft. However, the high-dimensional nature of airfoil representation causes the challenging problem known as the "curse of dimensionality". To overcome this problem, numerous airfoil parameterization methods have been developed, which can be broadly classified as polynomial-based and data-driven approaches. Each of these methods has desirable characteristics such as flexibility, parsimony, feasibility, and intuitiveness, but a single approach that encompasses all of these attributes has yet to be found. For example, polynomial-based methods struggle to balance parsimony and flexibility, while data-driven methods lack in feasibility and intuitiveness. In recent years, generative models, such as generative adversarial networks and variational autoencoders, have shown promising potential in airfoil parameterization. However, these models still face challenges related to intuitiveness due to their black-box nature. To address this issue, we developed a novel airfoil parameterization method using physics-aware variational autoencoder. The proposed method not only explicitly separates the generation of thickness and camber distributions to produce smooth and non-intersecting airfoils, thereby improving feasibility, but it also directly aligns its latent dimensions with geometric features of the airfoil, significantly enhancing intuitiveness. Finally, extensive comparative studies were performed to demonstrate the effectiveness of our approach.
    uHD: Unary Processing for Lightweight and Dynamic Hyperdimensional Computing. (arXiv:2311.10778v1 [cs.AR])
    Hyperdimensional computing (HDC) is a novel computational paradigm that operates on long-dimensional vectors known as hypervectors. The hypervectors are constructed as long bit-streams and form the basic building blocks of HDC systems. In HDC, hypervectors are generated from scalar values without taking their bit significance into consideration. HDC has been shown to be efficient and robust in various data processing applications, including computer vision tasks. To construct HDC models for vision applications, the current state-of-the-art practice utilizes two parameters for data encoding: pixel intensity and pixel position. However, the intensity and position information embedded in high-dimensional vectors are generally not generated dynamically in the HDC models. Consequently, the optimal design of hypervectors with high model accuracy requires powerful computing platforms for training. A more efficient approach to generating hypervectors is to create them dynamically during the training phase, which results in accurate, low-cost, and highly performable vectors. To this aim, we use low-discrepancy sequences to generate intensity hypervectors only, while avoiding position hypervectors. By doing so, the multiplication step in vector encoding is eliminated, resulting in a power-efficient HDC system. For the first time in the literature, our proposed approach employs lightweight vector generators utilizing unary bit-streams for efficient encoding of data instead of using conventional comparator-based generators.
    Predicting Influenza A Viral Host Using PSSM and Word Embeddings. (arXiv:2201.01140v4 [cs.CL] UPDATED)
    The rapid mutation of the influenza virus threatens public health. Reassortment among viruses with different hosts can lead to a fatal pandemic. However, it is difficult to detect the original host of the virus during or after an outbreak as influenza viruses can circulate between different species. Therefore, early and rapid detection of the viral host would help reduce the further spread of the virus. We use various machine learning models with features derived from the position-specific scoring matrix (PSSM) and features learned from word embedding and word encoding to infer the origin host of viruses. The results show that the performance of the PSSM-based model reaches the MCC around 95%, and the F1 around 96%. The MCC obtained using the model with word embedding is around 96%, and the F1 is around 97%.  ( 2 min )
    BrainZ-BP: A Non-invasive Cuff-less Blood Pressure Estimation Approach Leveraging Brain Bio-impedance and Electrocardiogram. (arXiv:2311.10996v1 [cs.LG])
    Accurate and continuous blood pressure (BP) monitoring is essential to the early prevention of cardiovascular diseases. Non-invasive and cuff-less BP estimation algorithm has gained much attention in recent years. Previous studies have demonstrated that brain bio-impedance (BIOZ) is a promising technique for non-invasive intracranial pressure (ICP) monitoring. Clinically, treatment for patients with traumatic brain injuries (TBI) requires monitoring the ICP and BP of patients simultaneously. Estimating BP by brain BIOZ directly can reduce the number of sensors attached to the patients, thus improving their comfort. To address the issues, in this study, we explore the feasibility of leveraging brain BIOZ for BP estimation and propose a novel cuff-less BP estimation approach called BrainZ-BP. Two electrodes are placed on the forehead and occipital bone of the head in the anterior-posterior direction for brain BIOZ measurement. Various features including pulse transit time and morphological features of brain BIOZ are extracted and fed into four regression models for BP estimation. Results show that the mean absolute error, root mean square error, and correlation coefficient of random forest regression model are 2.17 mmHg, 3.91 mmHg, and 0.90 for systolic pressure estimation, and are 1.71 mmHg, 3.02 mmHg, and 0.89 for diastolic pressure estimation. The presented BrainZ-BP can be applied in the brain BIOZ-based ICP monitoring scenario to monitor BP simultaneously.  ( 2 min )
    timeXplain -- A Framework for Explaining the Predictions of Time Series Classifiers. (arXiv:2007.07606v2 [cs.LG] UPDATED)
    Modern time series classifiers display impressive predictive capabilities, yet their decision-making processes mostly remain black boxes to the user. At the same time, model-agnostic explainers, such as the recently proposed SHAP, promise to make the predictions of machine learning models interpretable, provided there are well-designed domain mappings. We bring both worlds together in our timeXplain framework, extending the reach of explainable artificial intelligence to time series classification and value prediction. We present novel domain mappings for the time domain, frequency domain, and time series statistics and analyze their explicative power as well as their limits. We employ a novel evaluation metric to experimentally compare timeXplain to several model-specific explanation approaches for state-of-the-art time series classifiers.  ( 2 min )
    PACOL: Poisoning Attacks Against Continual Learners. (arXiv:2311.10919v1 [cs.LG])
    Continual learning algorithms are typically exposed to untrusted sources that contain training data inserted by adversaries and bad actors. An adversary can insert a small number of poisoned samples, such as mislabeled samples from previously learned tasks, or intentional adversarial perturbed samples, into the training datasets, which can drastically reduce the model's performance. In this work, we demonstrate that continual learning systems can be manipulated by malicious misinformation and present a new category of data poisoning attacks specific for continual learners, which we refer to as {\em Poisoning Attacks Against Continual Learners} (PACOL). The effectiveness of labeling flipping attacks inspires PACOL; however, PACOL produces attack samples that do not change the sample's label and produce an attack that causes catastrophic forgetting. A comprehensive set of experiments shows the vulnerability of commonly used generative replay and regularization-based continual learning approaches against attack methods. We evaluate the ability of label-flipping and a new adversarial poison attack, namely PACOL proposed in this work, to force the continual learning system to forget the knowledge of a learned task(s). More specifically, we compared the performance degradation of continual learning systems trained on benchmark data streams with and without poisoning attacks. Moreover, we discuss the stealthiness of the attacks in which we test the success rate of data sanitization defense and other outlier detection-based defenses for filtering out adversarial samples.  ( 2 min )
    Stochastic Coded Federated Learning: Theoretical Analysis and Incentive Mechanism Design. (arXiv:2211.04132v2 [cs.DC] UPDATED)
    Federated learning (FL) has achieved great success as a privacy-preserving distributed training paradigm, where many edge devices collaboratively train a machine learning model by sharing the model updates instead of the raw data with a server. However, the heterogeneous computational and communication resources of edge devices give rise to stragglers that significantly decelerate the training process. To mitigate this issue, we propose a novel FL framework named stochastic coded federated learning (SCFL) that leverages coded computing techniques. In SCFL, before the training process starts, each edge device uploads a privacy-preserving coded dataset to the server, which is generated by adding Gaussian noise to the projected local dataset. During training, the server computes gradients on the global coded dataset to compensate for the missing model updates of the straggling devices. We design a gradient aggregation scheme to ensure that the aggregated model update is an unbiased estimate of the desired global update. Moreover, this aggregation scheme enables periodical model averaging to improve the training efficiency. We characterize the tradeoff between the convergence performance and privacy guarantee of SCFL. In particular, a more noisy coded dataset provides stronger privacy protection for edge devices but results in learning performance degradation. We further develop a contract-based incentive mechanism to coordinate such a conflict. The simulation results show that SCFL learns a better model within the given time and achieves a better privacy-performance tradeoff than the baseline methods. In addition, the proposed incentive mechanism grants better training performance than the conventional Stackelberg game approach.
    Landmark Attention: Random-Access Infinite Context Length for Transformers. (arXiv:2305.16300v2 [cs.CL] UPDATED)
    While Transformers have shown remarkable success in natural language processing, their attention mechanism's large memory requirements have limited their ability to handle longer contexts. Prior approaches, such as recurrent memory or retrieval-based augmentation, have either compromised the random-access flexibility of attention (i.e., the capability to select any token in the entire context) or relied on separate mechanisms for relevant context retrieval, which may not be compatible with the model's attention. In this paper, we present a novel approach that allows access to the complete context while retaining random-access flexibility, closely resembling running attention on the entire context. Our method uses a landmark token to represent each block of the input and trains the attention to use it for selecting relevant blocks, enabling retrieval of blocks directly through the attention mechanism instead of by relying on a separate mechanism. Our approach seamlessly integrates with specialized data structures and the system's memory hierarchy, enabling processing of arbitrarily long context lengths. We demonstrate that our method can obtain comparable performance with Transformer-XL while significantly reducing the number of retrieved tokens in each step. Finally, we show that fine-tuning LLaMA 7B with our method successfully extends its context length capacity to over 32k tokens, allowing for inference at the context lengths of GPT-4. We release the implementation of landmark attention and the code to reproduce our experiments at https://github.com/epfml/landmark-attention/.  ( 3 min )
    Measuring Five Accountable Talk Moves to Improve Instruction at Scale. (arXiv:2311.10749v1 [cs.CY])
    Providing consistent, individualized feedback to teachers on their instruction can improve student learning outcomes. Such feedback can especially benefit novice instructors who teach on online platforms and have limited access to instructional training. To build scalable measures of instruction, we fine-tune RoBERTa and GPT models to identify five instructional talk moves inspired by accountable talk theory: adding on, connecting, eliciting, probing and revoicing students' ideas. We fine-tune these models on a newly annotated dataset of 2500 instructor utterances derived from transcripts of small group instruction in an online computer science course, Code in Place. Although we find that GPT-3 consistently outperforms RoBERTa in terms of precision, its recall varies significantly. We correlate the instructors' use of each talk move with indicators of student engagement and satisfaction, including students' section attendance, section ratings, and assignment completion rates. We find that using talk moves generally correlates positively with student outcomes, and connecting student ideas has the largest positive impact. These results corroborate previous research on the effectiveness of accountable talk moves and provide exciting avenues for using these models to provide instructors with useful, scalable feedback.  ( 2 min )
    ML-KFHE: Multi-label ensemble classification algorithm exploiting sensor fusion properties of the Kalman filter. (arXiv:1904.10552v4 [cs.LG] UPDATED)
    Despite the success of ensemble classification methods in multi-class classification problems, ensemble methods based on approaches other than bagging have not been widely explored for multi-label classification problems. The Kalman Filter-based Heuristic Ensemble (KFHE) is an ensemble method that exploits the sensor fusion properties of the Kalman filter to combine several classifier models, and that has been shown to be very effective. This work proposes a multi-label version of KFHE, ML-KFHE, demonstrating the effectiveness of the KFHE method on multi-label datasets. Two variants are introduced based on the underlying component classifier algorithm, ML-KFHE-HOMER, and ML-KFHE-CC which uses HOMER and Classifier Chain (CC) as the underlying multi-label algorithms respectively. ML-KFHE-HOMER and ML-KFHE-CC sequentially train multiple HOMER and CC multi-label classifiers and aggregate their outputs using the sensor fusion properties of the Kalman filter. Extensive experiments and detailed analysis were performed on thirteen multi-label datasets and eight other algorithms, which included state-of-the-art ensemble methods. The results show, for both versions, the ML-KFHE framework improves the predictive performance significantly with respect to bagged combinations of HOMER (named E-HOMER), also introduced in this paper, and bagged combination of CC, Ensemble Classifier Chains (ECC), thus demonstrating the effectiveness of ML-KFHE. Also, the ML-KFHE-HOMER variant was found to perform consistently and significantly better than the compared multi-label methods including existing approaches based on ensembles.
    Taxonomic analysis of asteroids with artificial neural networks. (arXiv:2311.10954v1 [astro-ph.EP])
    We study the surface composition of asteroids with visible and/or infrared spectroscopy. For example, asteroid taxonomy is based on the spectral features or multiple color indices in visible and near-infrared wavelengths. The composition of asteroids gives key information to understand their origin and evolution. However, we lack compositional information for faint asteroids due to limits of ground-based observational instruments. In the near future, the Chinese Space Survey telescope (CSST) will provide multiple colors and spectroscopic data for asteroids of apparent magnitude brighter than 25 mag and 23 mag, respectively. For the aim of analysis of the CSST spectroscopic data, we applied an algorithm using artificial neural networks (ANNs) to establish a preliminary classification model for asteroid taxonomy according to the design of the survey module of CSST. Using the SMASS II spectra and the Bus-Binzel taxonomy system, our ANN classification tool composed of 5 individual ANNs is constructed, and the accuracy of this classification system is higher than 92 %. As the first application of our ANN tool, 64 spectra of 42 asteroids obtained in 2006 and 2007 by us with the 2.16-m telescope in the Xinglong station (Observatory Code 327) of National Astronomical Observatory of China are analyzed. The predicted labels of these spectra using our ANN tool are found to be reasonable when compared to their known taxonomic labels. Considering the accuracy and stability, our ANN tool can be applied to analyse the CSST asteroid spectra in the future.  ( 3 min )
    A novel post-hoc explanation comparison metric and applications. (arXiv:2311.10811v1 [cs.LG])
    Explanatory systems make the behavior of machine learning models more transparent, but are often inconsistent. To quantify the differences between explanatory systems, this paper presents the Shreyan Distance, a novel metric based on the weighted difference between ranked feature importance lists produced by such systems. This paper uses the Shreyan Distance to compare two explanatory systems, SHAP and LIME, for both regression and classification learning tasks. Because we find that the average Shreyan Distance varies significantly between these two tasks, we conclude that consistency between explainers not only depends on inherent properties of the explainers themselves, but also the type of learning task. This paper further contributes the XAISuite library, which integrates the Shreyan distance algorithm into machine learning pipelines.
    On the impact of activation and normalization in obtaining isometric embeddings at initialization. (arXiv:2305.18399v3 [cs.LG] UPDATED)
    In this paper, we explore the structure of the penultimate Gram matrix in deep neural networks, which contains the pairwise inner products of outputs corresponding to a batch of inputs. In several architectures it has been observed that this Gram matrix becomes degenerate with depth at initialization, which dramatically slows training. Normalization layers, such as batch or layer normalization, play a pivotal role in preventing the rank collapse issue. Despite promising advances, the existing theoretical results do not extend to layer normalization, which is widely used in transformers, and can not quantitatively characterize the role of non-linear activations. To bridge this gap, we prove that layer normalization, in conjunction with activation layers, biases the Gram matrix of a multilayer perceptron towards the identity matrix at an exponential rate with depth at initialization. We quantify this rate using the Hermite expansion of the activation function.  ( 2 min )
    Evaluating Graph Neural Networks for Link Prediction: Current Pitfalls and New Benchmarking. (arXiv:2306.10453v3 [cs.LG] UPDATED)
    Link prediction attempts to predict whether an unseen edge exists based on only a portion of edges of a graph. A flurry of methods have been introduced in recent years that attempt to make use of graph neural networks (GNNs) for this task. Furthermore, new and diverse datasets have also been created to better evaluate the effectiveness of these new models. However, multiple pitfalls currently exist that hinder our ability to properly evaluate these new methods. These pitfalls mainly include: (1) Lower than actual performance on multiple baselines, (2) A lack of a unified data split and evaluation metric on some datasets, and (3) An unrealistic evaluation setting that uses easy negative samples. To overcome these challenges, we first conduct a fair comparison across prominent methods and datasets, utilizing the same dataset and hyperparameter search settings. We then create a more practical evaluation setting based on a Heuristic Related Sampling Technique (HeaRT), which samples hard negative samples via multiple heuristics. The new evaluation setting helps promote new challenges and opportunities in link prediction by aligning the evaluation with real-world situations. Our implementation and data are available at https://github.com/Juanhui28/HeaRT  ( 2 min )
    Waypoint Transformer: Reinforcement Learning via Supervised Learning with Intermediate Targets. (arXiv:2306.14069v2 [cs.LG] UPDATED)
    Despite the recent advancements in offline reinforcement learning via supervised learning (RvS) and the success of the decision transformer (DT) architecture in various domains, DTs have fallen short in several challenging benchmarks. The root cause of this underperformance lies in their inability to seamlessly connect segments of suboptimal trajectories. To overcome this limitation, we present a novel approach to enhance RvS methods by integrating intermediate targets. We introduce the Waypoint Transformer (WT), using an architecture that builds upon the DT framework and conditioned on automatically-generated waypoints. The results show a significant increase in the final return compared to existing RvS methods, with performance on par or greater than existing state-of-the-art temporal difference learning-based methods. Additionally, the performance and stability improvements are largest in the most challenging environments and data configurations, including AntMaze Large Play/Diverse and Kitchen Mixed/Partial.  ( 2 min )
    Short-term Volatility Estimation for High Frequency Trades using Gaussian processes (GPs). (arXiv:2311.10935v1 [q-fin.ST])
    The fundamental theorem behind financial markets is that stock prices are intrinsically complex and stochastic. One of the complexities is the volatility associated with stock prices. Volatility is a tendency for prices to change unexpectedly [1]. Price volatility is often detrimental to the return economics, and thus, investors should factor it in whenever making investment decisions, choices, and temporal or permanent moves. It is, therefore, crucial to make necessary and regular short and long-term stock price volatility forecasts for the safety and economics of investors returns. These forecasts should be accurate and not misleading. Different models and methods, such as ARCH GARCH models, have been intuitively implemented to make such forecasts. However, such traditional means fail to capture the short-term volatility forecasts effectively. This paper, therefore, investigates and implements a combination of numeric and probabilistic models for short-term volatility and return forecasting for high-frequency trades. The essence is that one-day-ahead volatility forecasts were made with Gaussian Processes (GPs) applied to the outputs of a Numerical market prediction (NMP) model. Firstly, the stock price data from NMP was corrected by a GP. Since it is not easy to set price limits in a market due to its free nature and randomness, a Censored GP was used to model the relationship between the corrected stock prices and returns. Forecasting errors were evaluated using the implied and estimated data.  ( 2 min )
    SplatArmor: Articulated Gaussian splatting for animatable humans from monocular RGB videos. (arXiv:2311.10812v1 [cs.CV])
    We propose SplatArmor, a novel approach for recovering detailed and animatable human models by `armoring' a parameterized body model with 3D Gaussians. Our approach represents the human as a set of 3D Gaussians within a canonical space, whose articulation is defined by extending the skinning of the underlying SMPL geometry to arbitrary locations in the canonical space. To account for pose-dependent effects, we introduce a SE(3) field, which allows us to capture both the location and anisotropy of the Gaussians. Furthermore, we propose the use of a neural color field to provide color regularization and 3D supervision for the precise positioning of these Gaussians. We show that Gaussian splatting provides an interesting alternative to neural rendering based methods by leverging a rasterization primitive without facing any of the non-differentiability and optimization challenges typically faced in such approaches. The rasterization paradigms allows us to leverage forward skinning, and does not suffer from the ambiguities associated with inverse skinning and warping. We show compelling results on the ZJU MoCap and People Snapshot datasets, which underscore the effectiveness of our method for controllable human synthesis.  ( 2 min )
    Source-Free Domain Adaptation for SSVEP-based Brain-Computer Interfaces. (arXiv:2305.17403v2 [cs.LG] UPDATED)
    This paper presents a source free domain adaptation method for steady-state visually evoked potentials (SSVEP) based brain-computer interface (BCI) spellers. SSVEP-based BCI spellers assist individuals experiencing speech difficulties by enabling them to communicate at a fast rate. However, achieving a high information transfer rate (ITR) in most prominent methods requires an extensive calibration period before using the system, leading to discomfort for new users. We address this issue by proposing a novel method that adapts a powerful deep neural network (DNN) pre-trained on data from source domains (data from former users or participants of previous experiments) to the new user (target domain), based only on the unlabeled target data. This adaptation is achieved by minimizing our proposed custom loss function composed of self-adaptation and local-regularity terms. The self-adaptation term uses the pseudo-label strategy, while the novel local-regularity term exploits the data structure and forces the DNN to assign similar labels to adjacent instances. The proposed method priorities user comfort by removing the burden of calibration while maintaining an excellent character identification accuracy and ITR. In particular, our method achieves striking 201.15 bits/min and 145.02 bits/min ITRs on the benchmark and BETA datasets, respectively, and outperforms the state-of-the-art alternatives. Our code is available at https://github.com/osmanberke/SFDA-SSVEP-BCI  ( 2 min )
    Model Sparsity Can Simplify Machine Unlearning. (arXiv:2304.04934v10 [cs.LG] UPDATED)
    In response to recent data regulation requirements, machine unlearning (MU) has emerged as a critical process to remove the influence of specific examples from a given model. Although exact unlearning can be achieved through complete model retraining using the remaining dataset, the associated computational costs have driven the development of efficient, approximate unlearning techniques. Moving beyond data-centric MU approaches, our study introduces a novel model-based perspective: model sparsification via weight pruning, which is capable of reducing the gap between exact unlearning and approximate unlearning. We show in both theory and practice that model sparsity can boost the multi-criteria unlearning performance of an approximate unlearner, closing the approximation gap, while continuing to be efficient. This leads to a new MU paradigm, termed prune first, then unlearn, which infuses a sparse model prior into the unlearning process. Building on this insight, we also develop a sparsity-aware unlearning method that utilizes sparsity regularization to enhance the training process of approximate unlearning. Extensive experiments show that our proposals consistently benefit MU in various unlearning scenarios. A notable highlight is the 77% unlearning efficacy gain of fine-tuning (one of the simplest unlearning methods) when using sparsity-aware unlearning. Furthermore, we demonstrate the practical impact of our proposed MU methods in addressing other machine learning challenges, such as defending against backdoor attacks and enhancing transfer learning. Codes are available at https://github.com/OPTML-Group/Unlearn-Sparse.  ( 3 min )
    Unsupervised embedding of trajectories captures the latent structure of scientific migration. (arXiv:2012.02785v3 [cs.LG] UPDATED)
    Human migration and mobility drives major societal phenomena including epidemics, economies, innovation, and the diffusion of ideas. Although human mobility and migration have been heavily constrained by geographic distance throughout the history, advances and globalization are making other factors such as language and culture increasingly more important. Advances in neural embedding models, originally designed for natural language, provide an opportunity to tame this complexity and open new avenues for the study of migration. Here, we demonstrate the ability of the model word2vec to encode nuanced relationships between discrete locations from migration trajectories, producing an accurate, dense, continuous, and meaningful vector-space representation. The resulting representation provides a functional distance between locations, as well as a digital double that can be distributed, re-used, and itself interrogated to understand the many dimensions of migration. We show that the unique power of word2vec to encode migration patterns stems from its mathematical equivalence with the gravity model of mobility. Focusing on the case of scientific migration, we apply word2vec to a database of three million migration trajectories of scientists derived from the affiliations listed on their publication records. Using techniques that leverage its semantic structure, we demonstrate that embeddings can learn the rich structure that underpins scientific migration, such as cultural, linguistic, and prestige relationships at multiple levels of granularity. Our results provide a theoretical foundation and methodological framework for using neural embeddings to represent and understand migration both within and beyond science.  ( 3 min )
    Likelihood-Free Frequentist Inference: Bridging Classical Statistics and Machine Learning for Reliable Simulator-Based Inference. (arXiv:2107.03920v8 [stat.ML] UPDATED)
    Many areas of science make extensive use of computer simulators that implicitly encode intractable likelihood functions of complex systems. Classical statistical methods are poorly suited for these so-called likelihood-free inference (LFI) settings, especially outside asymptotic and low-dimensional regimes. At the same time, traditional LFI methods - such as Approximate Bayesian Computation or more recent machine learning techniques - do not guarantee confidence sets with nominal coverage in general settings (i.e., with high-dimensional data, finite sample sizes, and for any parameter value). In addition, there are no diagnostic tools to check the empirical coverage of confidence sets provided by such methods across the entire parameter space. In this work, we propose a unified and modular inference framework that bridges classical statistics and modern machine learning providing (i) a practical approach to the Neyman construction of confidence sets with frequentist finite-sample coverage for any value of the unknown parameters; and (ii) interpretable diagnostics that estimate the empirical coverage across the entire parameter space. We refer to the general framework as likelihood-free frequentist inference (LF2I). Any method that defines a test statistic can leverage LF2I to create valid confidence sets and diagnostics without costly Monte Carlo samples at fixed parameter settings. We study the power of two likelihood-based test statistics (ACORE and BFF) and demonstrate their empirical performance on high-dimensional, complex data. Code is available at https://github.com/lee-group-cmu/lf2i.  ( 3 min )
    Temporal Conditioning Spiking Latent Variable Models of the Neural Response to Natural Visual Scenes. (arXiv:2306.12045v5 [q-bio.NC] UPDATED)
    Developing computational models of neural response is crucial for understanding sensory processing and neural computations. Current state-of-the-art neural network methods use temporal filters to handle temporal dependencies, resulting in an unrealistic and inflexible processing paradigm. Meanwhile, these methods target trial-averaged firing rates and fail to capture important features in spike trains. This work presents the temporal conditioning spiking latent variable models (TeCoS-LVM) to simulate the neural response to natural visual stimuli. We use spiking neurons to produce spike outputs that directly match the recorded trains. This approach helps to avoid losing information embedded in the original spike trains. We exclude the temporal dimension from the model parameter space and introduce a temporal conditioning operation to allow the model to adaptively explore and exploit temporal dependencies in stimuli sequences in a {\it natural paradigm}. We show that TeCoS-LVM models can produce more realistic spike activities and accurately fit spike statistics than powerful alternatives. Additionally, learned TeCoS-LVM models can generalize well to longer time scales. Overall, while remaining computationally tractable, our model effectively captures key features of neural coding systems. It thus provides a useful tool for building accurate predictive computational accounts for various sensory perception circuits.  ( 3 min )
    Several fitness functions and entanglement gates in quantum kernel generation. (arXiv:2309.03307v3 [quant-ph] UPDATED)
    Quantum machine learning (QML) represents a promising frontier in the quantum technologies. In this pursuit of quantum advantage, the quantum kernel method for support vector machine has emerged as a powerful approach. Entanglement, a fundamental concept in quantum mechanics, assumes a central role in quantum computing. In this paper, we investigate the optimal number of entanglement gates in the quantum kernel feature maps by a multi-objective genetic algorithm. We distinct the fitness functions of genetic algorithm for non-local gates for entanglement and local gates to gain insights into the benefits of employing entanglement gates. Our experiments reveal that the optimal configuration of quantum circuits for the quantum kernel method incorporates a proportional number of non-local gates for entanglement. The result complements the prior literature on quantum kernel generation where non-local gates were largely suppressed. Furthermore, we demonstrate that the separability indexes of data can be leveraged to estimate the number of non-local gates required for the quantum support vector machine's feature maps. This insight can be helpful in selecting appropriate parameters, such as the entanglement parameter, in various quantum programming packages like https://qiskit.org/ based on data analysis. Our findings offer valuable guidance for enhancing the efficiency and accuracy of quantum machine learning algorithms.  ( 2 min )
    PDP: Parameter-free Differentiable Pruning is All You Need. (arXiv:2305.11203v3 [cs.LG] UPDATED)
    DNN pruning is a popular way to reduce the size of a model, improve the inference latency, and minimize the power consumption on DNN accelerators. However, existing approaches might be too complex, expensive or ineffective to apply to a variety of vision/language tasks, DNN architectures and to honor structured pruning constraints. In this paper, we propose an efficient yet effective train-time pruning scheme, Parameter-free Differentiable Pruning (PDP), which offers state-of-the-art qualities in model size, accuracy, and training cost. PDP uses a dynamic function of weights during training to generate soft pruning masks for the weights in a parameter-free manner for a given pruning target. While differentiable, the simplicity and efficiency of PDP make it universal enough to deliver state-of-the-art random/structured/channel pruning results on various vision and natural language tasks. For example, for MobileNet-v1, PDP can achieve 68.2% top-1 ImageNet1k accuracy at 86.6% sparsity, which is 1.7% higher accuracy than those from the state-of-the-art algorithms. Also, PDP yields over 83.1% accuracy on Multi-Genre Natural Language Inference with 90% sparsity for BERT, while the next best from the existing techniques shows 81.5% accuracy. In addition, PDP can be applied to structured pruning, such as N:M pruning and channel pruning. For 1:4 structured pruning of ResNet18, PDP improved the top-1 ImageNet1k accuracy by over 3.6% over the state-of-the-art. For channel pruning of ResNet50, PDP reduced the top-1 ImageNet1k accuracy by 0.6% from the state-of-the-art.  ( 3 min )
    High-performance deep spiking neural networks with 0.3 spikes per neuron. (arXiv:2306.08744v2 [cs.NE] UPDATED)
    Communication by rare, binary spikes is a key factor for the energy efficiency of biological brains. However, it is harder to train biologically-inspired spiking neural networks (SNNs) than artificial neural networks (ANNs). This is puzzling given that theoretical results provide exact mapping algorithms from ANNs to SNNs with time-to-first-spike (TTFS) coding. In this paper we analyze in theory and simulation the learning dynamics of TTFS-networks and identify a specific instance of the vanishing-or-exploding gradient problem. While two choices of SNN mappings solve this problem at initialization, only the one with a constant slope of the neuron membrane potential at threshold guarantees the equivalence of the training trajectory between SNNs and ANNs with rectified linear units. We demonstrate that training deep SNN models achieves the exact same performance as that of ANNs, surpassing previous SNNs on image classification datasets such as MNIST/Fashion-MNIST, CIFAR10/CIFAR100 and PLACES365. Our SNN accomplishes high-performance classification with less than 0.3 spikes per neuron, lending itself for an energy-efficient implementation. We show that fine-tuning SNNs with our robust gradient descent algorithm enables their optimization for hardware implementations with low latency and resilience to noise and quantization.  ( 2 min )
    A novel approach to measuring patent claim scope based on probabilities obtained from (large) language models. (arXiv:2309.10003v3 [cs.CL] UPDATED)
    This work proposes to measure the scope of a patent claim as the reciprocal of the self-information contained in this claim. A probability of occurrence of the claim is obtained from a language model and this probability is used to compute the self-information. Grounded in information theory, this approach is based on the assumption that an unlikely concept is more informative than a usual concept, insofar as it is more surprising. In turn, the more surprising the information required to defined the claim, the narrower its scope. Five language models are considered, ranging from simplest models (each word or character is assigned an identical probability) to intermediate models (using average word or character frequencies), to a large language model (GPT2). Interestingly, the scope resulting from the simplest language models is proportional to the reciprocal of the number of words or characters involved in the claim, a metric already used in previous works. Application is made to multiple series of patent claims directed to distinct inventions, where each series consists of claims devised to have a gradually decreasing scope. The performance of the language models is assessed with respect to several ad hoc tests. The more sophisticated the model, the better the results. I.e., the GPT2 probability model outperforms models based on word and character frequencies, which themselves outdo the simplest models based on word or character counts. Still, the character count appears to be a more reliable indicator than the word count.  ( 3 min )
    Earnings Prediction Using Recurrent Neural Networks. (arXiv:2311.10756v1 [q-fin.ST])
    Firm disclosures about future prospects are crucial for corporate valuation and compliance with global regulations, such as the EU's MAR and the US's SEC Rule 10b-5 and RegFD. To comply with disclosure obligations, issuers must identify nonpublic information with potential material impact on security prices as only new, relevant and unexpected information materially affects prices in efficient markets. Financial analysts, assumed to represent public knowledge on firms' earnings prospects, face limitations in offering comprehensive coverage and unbiased estimates. This study develops a neural network to forecast future firm earnings, using four decades of financial data, addressing analysts' coverage gaps and potentially revealing hidden insights. The model avoids selectivity and survivorship biases as it allows for missing data. Furthermore, the model is able to produce both fiscal-year-end and quarterly earnings predictions. Its performance surpasses benchmark models from the academic literature by a wide margin and outperforms analysts' forecasts for fiscal-year-end earnings predictions.  ( 2 min )
    Role Taxonomy of Units in Deep Neural Networks. (arXiv:2011.00789v2 [cs.CV] UPDATED)
    Identifying the role of network units in deep neural networks (DNNs) is critical in many aspects including giving understandings on the mechanisms of DNNs and building basic connections between deep learning and neuroscience. However, there remains unclear on which roles the units in DNNs with different generalization ability could present. To this end, we give role taxonomy of units in DNNs via introducing the retrieval-of-function test, where units are categorized into four types in terms of their functional preference on separately the training set and testing set. We show that ratios of the four categories are highly associated with the generalization ability of DNNs from two distinct perspectives, based on which we give signs of DNNs with well generalization.  ( 2 min )
    HungerGist: An Interpretable Predictive Model for Food Insecurity. (arXiv:2311.10953v1 [cs.AI])
    The escalating food insecurity in Africa, caused by factors such as war, climate change, and poverty, demonstrates the critical need for advanced early warning systems. Traditional methodologies, relying on expert-curated data encompassing climate, geography, and social disturbances, often fall short due to data limitations, hindering comprehensive analysis and potential discovery of new predictive factors. To address this, this paper introduces "HungerGist", a multi-task deep learning model utilizing news texts and NLP techniques. Using a corpus of over 53,000 news articles from nine African countries over four years, we demonstrate that our model, trained solely on news data, outperforms the baseline method trained on both traditional risk factors and human-curated keywords. In addition, our method has the ability to detect critical texts that contain interpretable signals known as "gists." Moreover, our examination of these gists indicates that this approach has the potential to reveal latent factors that would otherwise remain concealed in unstructured texts.  ( 2 min )
    Learning on Bandwidth Constrained Multi-Source Data with MIMO-inspired DPP MAP Inference. (arXiv:2306.02497v2 [cs.LG] UPDATED)
    This paper proposes a distributed version of Determinant Point Processing (DPP) inference to enhance multi-source data diversification under limited communication bandwidth. DPP is a popular probabilistic approach that improves data diversity by enforcing the repulsion of elements in the selected subsets. The well-studied Maximum A Posteriori (MAP) inference in DPP aims to identify the subset with the highest diversity quantified by DPP. However, this approach is limited by the presumption that all data samples are available at one point, which hinders its applicability to real-world applications such as traffic datasets where data samples are distributed across sources and communication between them is band-limited. Inspired by the techniques used in Multiple-Input Multiple-Output (MIMO) communication systems, we propose a strategy for performing MAP inference among distributed sources. Specifically, we show that a lower bound of the diversity-maximized distributed sample selection problem can be treated as a power allocation problem in MIMO systems. A determinant-preserved sparse representation of selected samples is used to perform sample precoding in local sources to be processed by DPP. Our method does not require raw data exchange among sources, but rather a band-limited feedback channel to send lightweight diversity measures, analogous to the CSI message in MIMO systems, from the center to data sources. The experiments show that our scalable approach can outperform baseline methods, including random selection, uninformed individual DPP with no feedback, and DPP with SVD-based feedback, in both i.i.d and non-i.i.d setups. Specifically, it achieves 1 to 6 log-difference diversity gain in the latent representation of CIFAR-10, CIFAR-100, StanfordCars, and GTSRB datasets.  ( 3 min )
    Gender-Based Comparative Study of Type 2 Diabetes Risk Factors in Kolkata, India: A Machine Learning Approach. (arXiv:2311.10731v1 [cs.LG])
    Type 2 diabetes mellitus represents a prevalent and widespread global health concern, necessitating a comprehensive assessment of its risk factors. This study aimed towards learning whether there is any differential impact of age, Lifestyle, BMI and Waist to height ratio on the risk of Type 2 diabetes mellitus in males and females in Kolkata, West Bengal, India based on a sample observed from the out-patient consultation department of Belle Vue Clinic in Kolkata. Various machine learning models like Logistic Regression, Random Forest, and Support Vector Classifier, were used to predict the risk of diabetes, and performance was compared based on different predictors. Our findings indicate a significant age-related increase in risk of diabetes for both males and females. Although exercising and BMI was found to have significant impact on the risk of Type 2 diabetes in males, in females both turned out to be statistically insignificant. For both males and females, predictive models based on WhtR demonstrated superior performance in risk assessment compared to those based on BMI. This study sheds light on the gender-specific differences in the risk factors for Type 2 diabetes, offering valuable insights that can be used towards more targeted healthcare interventions and public health strategies.
    Exploring Machine Learning Models for Federated Learning: A Review of Approaches, Performance, and Limitations. (arXiv:2311.10832v1 [cs.LG])
    In the growing world of artificial intelligence, federated learning is a distributed learning framework enhanced to preserve the privacy of individuals' data. Federated learning lays the groundwork for collaborative research in areas where the data is sensitive. Federated learning has several implications for real-world problems. In times of crisis, when real-time decision-making is critical, federated learning allows multiple entities to work collectively without sharing sensitive data. This distributed approach enables us to leverage information from multiple sources and gain more diverse insights. This paper is a systematic review of the literature on privacy-preserving machine learning in the last few years based on the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. Specifically, we have presented an extensive review of supervised/unsupervised machine learning algorithms, ensemble methods, meta-heuristic approaches, blockchain technology, and reinforcement learning used in the framework of federated learning, in addition to an overview of federated learning applications. This paper reviews the literature on the components of federated learning and its applications in the last few years. The main purpose of this work is to provide researchers and practitioners with a comprehensive overview of federated learning from the machine learning point of view. A discussion of some open problems and future research directions in federated learning is also provided.
    ExFake: Towards an Explainable Fake News Detection Based on Content and Social Context Information. (arXiv:2311.10784v1 [cs.CL])
    ExFake is an explainable fake news detection system based on content and context-level information. It is concerned with the veracity analysis of online posts based on their content, social context (i.e., online users' credibility and historical behaviour), and data coming from trusted entities such as fact-checking websites and named entities. Unlike state-of-the-art systems, an Explainable AI (XAI) assistant is also adopted to help online social networks (OSN) users develop good reflexes when faced with any doubted information that spreads on social networks. The trustworthiness of OSN users is also addressed by assigning a credibility score to OSN users, as OSN users are one of the main culprits for spreading fake news. Experimental analysis on a real-world dataset demonstrates that ExFake significantly outperforms other baseline methods for fake news detection.
    Differentiating patients with obstructive sleep apnea from healthy controls based on heart rate - blood pressure coupling quantified by entropy-based indices. (arXiv:2311.10752v1 [physics.med-ph])
    We introduce an entropy-based classification method for pairs of sequences (ECPS) for quantifying mutual dependencies in heart rate and beat-to-beat blood pressure recordings. The purpose of the method is to build a classifier for data in which each item consists of the two intertwined data series taken for each subject. The method is based on ordinal patterns, and uses entropy-like indices. Machine learning is used to select a subset of indices most suitable for our classification problem in order to build an optimal yet simple model for distinguishing between patients suffering from obstructive sleep apnea and a control group.
    Stratified-NMF for Heterogeneous Data. (arXiv:2311.10789v1 [cs.LG])
    Non-negative matrix factorization (NMF) is an important technique for obtaining low dimensional representations of datasets. However, classical NMF does not take into account data that is collected at different times or in different locations, which may exhibit heterogeneity. We resolve this problem by solving a modified NMF objective, Stratified-NMF, that simultaneously learns strata-dependent statistics and a shared topics matrix. We develop multiplicative update rules for this novel objective and prove convergence of the objective. Then, we experiment on synthetic data to demonstrate the efficiency and accuracy of the method. Lastly, we apply our method to three real world datasets and empirically investigate their learned features.
    The Next 700 ML-Enabled Compiler Optimizations. (arXiv:2311.10800v1 [cs.PL])
    There is a growing interest in enhancing compiler optimizations with ML models, yet interactions between compilers and ML frameworks remain challenging. Some optimizations require tightly coupled models and compiler internals,raising issues with modularity, performance and framework independence. Practical deployment and transparency for the end-user are also important concerns. We propose ML-Compiler-Bridge to enable ML model development within a traditional Python framework while making end-to-end integration with an optimizing compiler possible and efficient. We evaluate it on both research and production use cases, for training and inference, over several optimization problems, multiple compilers and its versions, and gym infrastructures.
    ToolTalk: Evaluating Tool-Usage in a Conversational Setting. (arXiv:2311.10775v1 [cs.CL])
    Large language models (LLMs) have displayed massive improvements in reasoning and decision-making skills and can hold natural conversations with users. Many recent works seek to augment LLM-based assistants with external tools so they can access private or up-to-date information and carry out actions on behalf of users. To better measure the performance of these assistants, this paper introduces ToolTalk, a benchmark consisting of complex user intents requiring multi-step tool usage specified through dialogue. ToolTalk contains 28 tools grouped into 7 plugins, and includes a complete simulated implementation of each tool, allowing for fully automated evaluation of assistants that rely on execution feedback. ToolTalk also emphasizes tools that externally affect the world rather than only tools for referencing or searching information. We evaluate GPT-3.5 and GPT-4 on ToolTalk resulting in success rates of 26% and 50% respectively. Our analysis of the errors reveals three major categories and suggests some future directions for improvement. We release ToolTalk at https://github.com/microsoft/ToolTalk.  ( 2 min )
    A Systematic Review of Aspect-based Sentiment Analysis (ABSA): Domains, Methods, and Trends. (arXiv:2311.10777v1 [cs.CL])
    Aspect-based Sentiment Analysis (ABSA) is a type of fine-grained sentiment analysis (SA) that identifies aspects and the associated opinions from a given text. In the digital era, ABSA gained increasing popularity and applications in mining opinionated text data to obtain insights and support decisions. ABSA research employs linguistic, statistical, and machine-learning approaches and utilises resources such as labelled datasets, aspect and sentiment lexicons and ontology. By its nature, ABSA is domain-dependent and can be sensitive to the impact of misalignment between the resource and application domains. However, to our knowledge, this topic has not been explored by the existing ABSA literature reviews. In this paper, we present a Systematic Literature Review (SLR) of ABSA studies with a focus on the research application domain, dataset domain, and the research methods to examine their relationships and identify trends over time. Our results suggest a number of potential systemic issues in the ABSA research literature, including the predominance of the ``product/service review'' dataset domain among the majority of studies that did not have a specific research application domain, coupled with the prevalence of dataset-reliant methods such as supervised machine learning. This review makes a number of unique contributions to the ABSA research field: 1) To our knowledge, it is the first SLR that links the research domain, dataset domain, and research method through a systematic perspective; 2) it is one of the largest scoped SLR on ABSA, with 519 eligible studies filtered from 4191 search results without time constraint; and 3) our review methodology adopted an innovative automatic filtering process based on PDF-mining, which enhanced screening quality and reliability. Suggestions and our review limitations are also discussed.  ( 3 min )
    Extending Neural Network Verification to a Larger Family of Piece-wise Linear Activation Functions. (arXiv:2311.10780v1 [cs.LG])
    In this paper, we extend an available neural network verification technique to support a wider class of piece-wise linear activation functions. Furthermore, we extend the algorithms, which provide in their original form exact respectively over-approximative results for bounded input sets represented as start sets, to allow also unbounded input set. We implemented our algorithms and demonstrated their effectiveness in some case studies.  ( 2 min )
    EIT: Earnest Insight Toolkit for Evaluating Students' Earnestness in Interactive Lecture Participation Exercises. (arXiv:2311.10746v1 [cs.CY])
    In today's rapidly evolving educational landscape, traditional modes of passive information delivery are giving way to transformative pedagogical approaches that prioritize active student engagement. Within the context of large-scale hybrid classrooms, the challenge lies in fostering meaningful and active interaction between students and course content. This study delves into the significance of measuring students' earnestness during interactive lecture participation exercises. By analyzing students' responses to interactive lecture poll questions, establishing a clear rubric for evaluating earnestness, and conducting a comprehensive assessment, we introduce EIT (Earnest Insight Toolkit), a tool designed to assess students' engagement within interactive lecture participation exercises - particularly in the context of large-scale hybrid classrooms. Through the utilization of EIT, our objective is to equip educators with valuable means of identifying at-risk students for enhancing intervention and support strategies, as well as measuring students' levels of engagement with course content.  ( 2 min )
    A BERT based Ensemble Approach for Sentiment Classification of Customer Reviews and its Application to Nudge Marketing in e-Commerce. (arXiv:2311.10782v1 [cs.CL])
    According to the literature, Product reviews are an important source of information for customers to support their buying decision. Product reviews improve customer trust and loyalty. Reviews help customers in understanding what other customers think about a particular product and helps in driving purchase decisions. Therefore, for an e-commerce platform it is important to understand the sentiments in customer reviews to understand their products and services, and it also allows them to potentially create positive consumer interaction as well as long lasting relationships. Reviews also provide innovative ways to market the products for an ecommerce company. One such approach is Nudge Marketing. Nudge marketing is a subtle way for an ecommerce company to help their customers make better decisions without hesitation.  ( 2 min )
    Harnessing Deep Q-Learning for Enhanced Statistical Arbitrage in High-Frequency Trading: A Comprehensive Exploration. (arXiv:2311.10718v1 [q-fin.TR])
    The realm of High-Frequency Trading (HFT) is characterized by rapid decision-making processes that capitalize on fleeting market inefficiencies. As the financial markets become increasingly competitive, there is a pressing need for innovative strategies that can adapt and evolve with changing market dynamics. Enter Reinforcement Learning (RL), a branch of machine learning where agents learn by interacting with their environment, making it an intriguing candidate for HFT applications. This paper dives deep into the integration of RL in statistical arbitrage strategies tailored for HFT scenarios. By leveraging the adaptive learning capabilities of RL, we explore its potential to unearth patterns and devise trading strategies that traditional methods might overlook. We delve into the intricate exploration-exploitation trade-offs inherent in RL and how they manifest in the volatile world of HFT. Furthermore, we confront the challenges of applying RL in non-stationary environments, typical of financial markets, and investigate methodologies to mitigate associated risks. Through extensive simulations and backtests, our research reveals that RL not only enhances the adaptability of trading strategies but also shows promise in improving profitability metrics and risk-adjusted returns. This paper, therefore, positions RL as a pivotal tool for the next generation of HFT-based statistical arbitrage, offering insights for both researchers and practitioners in the field.  ( 2 min )
    How False Data Affects Machine Learning Models in Electrochemistry?. (arXiv:2311.10795v1 [cs.LG])
    Recently, the selection of machine learning model based on only the data distribution without concerning the noise of the data. This study aims to distinguish, which models perform well under noisy data, and establish whether stacking machine learning models actually provide robustness to otherwise weak-to-noise models. The electrochemical data were tested with 12 standalone models and stacking model. This includes XGB, LGBM, RF, GB, ADA, NN, ELAS, LASS, RIDGE, SVM, KNN, DT, and the stacking model. It is found that linear models handle noise well with the average error of (slope) to 1.75 F g-1 up to error per 100% percent noise added; but it suffers from prediction accuracy due to having an average of 60.19 F g-1 estimated at minimal error at 0% noise added. Tree-based models fail in terms of noise handling (average slope is 55.24 F g-1 at 100% percent noise), but it can provide higher prediction accuracy (lowest error of 23.9 F g-1) than that of linear. To address the controversial between prediction accuracy and error handling, the stacking model was constructed, which is not only show high accuracy (intercept of 25.03 F g-1), but it also exhibits good noise handling (slope of 43.58 F g-1), making stacking models a relatively low risk and viable choice for beginner and experienced machine learning research in electrochemistry. Even though neural networks (NN) are gaining popularity in the electrochemistry field. However, this study presents that NN is not suitable for electrochemical data, and improper tuning resulting in a model that is susceptible to noise. Thus, STACK models should provide better benefits in that even with untuned base models, they can achieve an accurate and noise-tolerant model. Overall, this work provides insight into machine learning model selection for electrochemical data, which should aid the understanding of data science in chemistry context.  ( 3 min )
    Exponentially Faster Language Modelling. (arXiv:2311.10770v1 [cs.CL])
    Language models only really need to use an exponential fraction of their neurons for individual inferences. As proof, we present FastBERT, a BERT variant that uses 0.3\% of its neurons during inference while performing on par with similar BERT models. FastBERT selectively engages just 12 out of 4095 neurons for each layer inference. This is achieved by replacing feedforward networks with fast feedforward networks (FFFs). While no truly efficient implementation currently exists to unlock the full acceleration potential of conditional neural execution, we provide high-level CPU code achieving 78x speedup over the optimized baseline feedforward implementation, and a PyTorch implementation delivering 40x speedup over the equivalent batched feedforward inference. We publish our training code, benchmarking setup, and model weights.  ( 2 min )
    EdgeFM: Leveraging Foundation Model for Open-set Learning on the Edge. (arXiv:2311.10986v1 [cs.LG])
    Deep Learning (DL) models have been widely deployed on IoT devices with the help of advancements in DL algorithms and chips. However, the limited resources of edge devices make these on-device DL models hard to be generalizable to diverse environments and tasks. Although the recently emerged foundation models (FMs) show impressive generalization power, how to effectively leverage the rich knowledge of FMs on resource-limited edge devices is still not explored. In this paper, we propose EdgeFM, a novel edge-cloud cooperative system with open-set recognition capability. EdgeFM selectively uploads unlabeled data to query the FM on the cloud and customizes the specific knowledge and architectures for edge models. Meanwhile, EdgeFM conducts dynamic model switching at run-time taking into account both data uncertainty and dynamic network variations, which ensures the accuracy always close to the original FM. We implement EdgeFM using two FMs on two edge platforms. We evaluate EdgeFM on three public datasets and two self-collected datasets. Results show that EdgeFM can reduce the end-to-end latency up to 3.2x and achieve 34.3% accuracy increase compared with the baseline.
  • Open

    Reduce, Reuse, Recycle: Compositional Generation with Energy-Based Diffusion Models and MCMC. (arXiv:2302.11552v4 [cs.LG] UPDATED)
    Since their introduction, diffusion models have quickly become the prevailing approach to generative modeling in many domains. They can be interpreted as learning the gradients of a time-varying sequence of log-probability density functions. This interpretation has motivated classifier-based and classifier-free guidance as methods for post-hoc control of diffusion models. In this work, we build upon these ideas using the score-based interpretation of diffusion models, and explore alternative ways to condition, modify, and reuse diffusion models for tasks involving compositional generation and guidance. In particular, we investigate why certain types of composition fail using current techniques and present a number of solutions. We conclude that the sampler (not the model) is responsible for this failure and propose new samplers, inspired by MCMC, which enable successful compositional generation. Further, we propose an energy-based parameterization of diffusion models which enables the use of new compositional operators and more sophisticated, Metropolis-corrected samplers. Intriguingly we find these samplers lead to notable improvements in compositional generation across a wide set of problems such as classifier-guided ImageNet modeling and compositional text-to-image generation.
    PopArt: Efficient Sparse Regression and Experimental Design for Optimal Sparse Linear Bandits. (arXiv:2210.15345v3 [stat.ML] UPDATED)
    In sparse linear bandits, a learning agent sequentially selects an action and receive reward feedback, and the reward function depends linearly on a few coordinates of the covariates of the actions. This has applications in many real-world sequential decision making problems. In this paper, we propose a simple and computationally efficient sparse linear estimation method called PopArt that enjoys a tighter $\ell_1$ recovery guarantee compared to Lasso (Tibshirani, 1996) in many problems. Our bound naturally motivates an experimental design criterion that is convex and thus computationally efficient to solve. Based on our novel estimator and design criterion, we derive sparse linear bandit algorithms that enjoy improved regret upper bounds upon the state of the art (Hao et al., 2020), especially w.r.t. the geometry of the given action set. Finally, we prove a matching lower bound for sparse linear bandits in the data-poor regime, which closes the gap between upper and lower bounds in prior work.  ( 2 min )
    FinanceBench: A New Benchmark for Financial Question Answering. (arXiv:2311.11944v1 [cs.CL])
    FinanceBench is a first-of-its-kind test suite for evaluating the performance of LLMs on open book financial question answering (QA). It comprises 10,231 questions about publicly traded companies, with corresponding answers and evidence strings. The questions in FinanceBench are ecologically valid and cover a diverse set of scenarios. They are intended to be clear-cut and straightforward to answer to serve as a minimum performance standard. We test 16 state of the art model configurations (including GPT-4-Turbo, Llama2 and Claude2, with vector stores and long context prompts) on a sample of 150 cases from FinanceBench, and manually review their answers (n=2,400). The cases are available open-source. We show that existing LLMs have clear limitations for financial QA. Notably, GPT-4-Turbo used with a retrieval system incorrectly answered or refused to answer 81% of questions. While augmentation techniques such as using longer context window to feed in relevant evidence improve performance, they are unrealistic for enterprise settings due to increased latency and cannot support larger financial documents. We find that all models examined exhibit weaknesses, such as hallucinations, that limit their suitability for use by enterprises.  ( 2 min )
    Model-adapted Fourier sampling for generative compressed sensing. (arXiv:2310.04984v2 [cs.IT] UPDATED)
    We study generative compressed sensing when the measurement matrix is randomly subsampled from a unitary matrix (with the DFT as an important special case). It was recently shown that $\textit{O}(kdn\| \boldsymbol{\alpha}\|_{\infty}^{2})$ uniformly random Fourier measurements are sufficient to recover signals in the range of a neural network $G:\mathbb{R}^k \to \mathbb{R}^n$ of depth $d$, where each component of the so-called local coherence vector $\boldsymbol{\alpha}$ quantifies the alignment of a corresponding Fourier vector with the range of $G$. We construct a model-adapted sampling strategy with an improved sample complexity of $\textit{O}(kd\| \boldsymbol{\alpha}\|_{2}^{2})$ measurements. This is enabled by: (1) new theoretical recovery guarantees that we develop for nonuniformly random sampling distributions and then (2) optimizing the sampling distribution to minimize the number of measurements needed for these guarantees. This development offers a sample complexity applicable to natural signal classes, which are often almost maximally coherent with low Fourier frequencies. Finally, we consider a surrogate sampling scheme, and validate its performance in recovery experiments using the CelebA dataset.  ( 2 min )
    AMES: A Differentiable Embedding Space Selection Framework for Latent Graph Inference. (arXiv:2311.11891v1 [cs.LG])
    In real-world scenarios, although data entities may possess inherent relationships, the specific graph illustrating their connections might not be directly accessible. Latent graph inference addresses this issue by enabling Graph Neural Networks (GNNs) to operate on point cloud data, dynamically learning the necessary graph structure. These graphs are often derived from a latent embedding space, which can be modeled using Euclidean, hyperbolic, spherical, or product spaces. However, currently, there is no principled differentiable method for determining the optimal embedding space. In this work, we introduce the Attentional Multi-Embedding Selection (AMES) framework, a differentiable method for selecting the best embedding space for latent graph inference through backpropagation, considering a downstream task. Our framework consistently achieves comparable or superior results compared to previous methods for latent graph inference across five benchmark datasets. Importantly, our approach eliminates the need for conducting multiple experiments to identify the optimal embedding space. Furthermore, we explore interpretability techniques that track the gradient contributions of different latent graphs, shedding light on how our attention-based, fully differentiable approach learns to choose the appropriate latent space. In line with previous works, our experiments emphasize the advantages of hyperbolic spaces in enhancing performance. More importantly, our interpretability framework provides a general approach for quantitatively comparing embedding spaces across different tasks based on their contributions, a dimension that has been overlooked in previous literature on latent graph inference.
    Bounds on Representation-Induced Confounding Bias for Treatment Effect Estimation. (arXiv:2311.11321v1 [stat.ML])
    State-of-the-art methods for conditional average treatment effect (CATE) estimation make widespread use of representation learning. Here, the idea is to reduce the variance of the low-sample CATE estimation by a (potentially constrained) low-dimensional representation. However, low-dimensional representations can lose information about the observed confounders and thus lead to bias, because of which the validity of representation learning for CATE estimation is typically violated. In this paper, we propose a new, representation-agnostic framework for estimating bounds on the representation-induced confounding bias that comes from dimensionality reduction (or other constraints on the representations) in CATE estimation. First, we establish theoretically under which conditions CATEs are non-identifiable given low-dimensional (constrained) representations. Second, as our remedy, we propose to perform partial identification of CATEs or, equivalently, aim at estimating of lower and upper bounds of the representation-induced confounding bias. We demonstrate the effectiveness of our bounds in a series of experiments. In sum, our framework is of direct relevance in practice where the validity of CATE estimation is of importance.
    On the impact of activation and normalization in obtaining isometric embeddings at initialization. (arXiv:2305.18399v3 [cs.LG] UPDATED)
    In this paper, we explore the structure of the penultimate Gram matrix in deep neural networks, which contains the pairwise inner products of outputs corresponding to a batch of inputs. In several architectures it has been observed that this Gram matrix becomes degenerate with depth at initialization, which dramatically slows training. Normalization layers, such as batch or layer normalization, play a pivotal role in preventing the rank collapse issue. Despite promising advances, the existing theoretical results do not extend to layer normalization, which is widely used in transformers, and can not quantitatively characterize the role of non-linear activations. To bridge this gap, we prove that layer normalization, in conjunction with activation layers, biases the Gram matrix of a multilayer perceptron towards the identity matrix at an exponential rate with depth at initialization. We quantify this rate using the Hermite expansion of the activation function.
    Information Geometry of Wasserstein Statistics on Shapes and Affine Deformations. (arXiv:2307.12508v3 [math.ST] UPDATED)
    Information geometry and Wasserstein geometry are two main structures introduced in a manifold of probability distributions, and they capture its different characteristics. We study characteristics of Wasserstein geometry in the framework of Li and Zhao (2023) for the affine deformation statistical model, which is a multi-dimensional generalization of the location-scale model. We compare merits and demerits of estimators based on information geometry and Wasserstein geometry. The shape of a probability distribution and its affine deformation are separated in the Wasserstein geometry, showing its robustness against the waveform perturbation in exchange for the loss in Fisher efficiency. We show that the Wasserstein estimator is the moment estimator in the case of the elliptically symmetric affine deformation model. It coincides with the information-geometrical estimator (maximum-likelihood estimator) when and only when the waveform is Gaussian. The role of the Wasserstein efficiency is elucidated in terms of robustness against waveform change.
    Large Learning Rates Improve Generalization: But How Large Are We Talking About?. (arXiv:2311.11303v1 [cs.LG])
    Inspired by recent research that recommends starting neural networks training with large learning rates (LRs) to achieve the best generalization, we explore this hypothesis in detail. Our study clarifies the initial LR ranges that provide optimal results for subsequent training with a small LR or weight averaging. We find that these ranges are in fact significantly narrower than generally assumed. We conduct our main experiments in a simplified setup that allows precise control of the learning rate hyperparameter and validate our key findings in a more practical setting.
    ML-KFHE: Multi-label ensemble classification algorithm exploiting sensor fusion properties of the Kalman filter. (arXiv:1904.10552v4 [cs.LG] UPDATED)
    Despite the success of ensemble classification methods in multi-class classification problems, ensemble methods based on approaches other than bagging have not been widely explored for multi-label classification problems. The Kalman Filter-based Heuristic Ensemble (KFHE) is an ensemble method that exploits the sensor fusion properties of the Kalman filter to combine several classifier models, and that has been shown to be very effective. This work proposes a multi-label version of KFHE, ML-KFHE, demonstrating the effectiveness of the KFHE method on multi-label datasets. Two variants are introduced based on the underlying component classifier algorithm, ML-KFHE-HOMER, and ML-KFHE-CC which uses HOMER and Classifier Chain (CC) as the underlying multi-label algorithms respectively. ML-KFHE-HOMER and ML-KFHE-CC sequentially train multiple HOMER and CC multi-label classifiers and aggregate their outputs using the sensor fusion properties of the Kalman filter. Extensive experiments and detailed analysis were performed on thirteen multi-label datasets and eight other algorithms, which included state-of-the-art ensemble methods. The results show, for both versions, the ML-KFHE framework improves the predictive performance significantly with respect to bagged combinations of HOMER (named E-HOMER), also introduced in this paper, and bagged combination of CC, Ensemble Classifier Chains (ECC), thus demonstrating the effectiveness of ML-KFHE. Also, the ML-KFHE-HOMER variant was found to perform consistently and significantly better than the compared multi-label methods including existing approaches based on ensembles.
    Online Learning with Set-Valued Feedback. (arXiv:2306.06247v3 [cs.LG] UPDATED)
    We study a variant of online multiclass classification where the learner predicts a single label but receives a \textit{set of labels} as feedback. In this model, the learner is penalized for not outputting a label contained in the revealed set. We show that unlike online multiclass learning with single-label feedback, deterministic and randomized online learnability are \textit{not equivalent} in the realizable setting under set-valued feedback. In addition, we show that deterministic and randomized realizable learnability are equivalent if the Helly number of the collection of sets that can be revealed as feedback is finite. In light of this separation, we give two new combinatorial dimensions, named the Set Littlestone and Measure Shattering dimension, whose finiteness characterizes deterministic and randomized realizable learnability respectively. Additionally, these dimensions lower- and upper bound the deterministic and randomized minimax regret in the realizable setting. Going beyond the realizable setting, we prove that the Measure shattering dimension continues to characterize learnability and quantify minimax regret in the agnostic setting. Finally, we use our results to establish bounds on the minimax regret for three practical learning settings: online multilabel ranking, online multilabel classification, and real-valued prediction with interval-valued response.
    Geometric Algebra Transformer. (arXiv:2305.18415v3 [cs.LG] UPDATED)
    Problems involving geometric data arise in physics, chemistry, robotics, computer vision, and many other fields. Such data can take numerous forms, for instance points, direction vectors, translations, or rotations, but to date there is no single architecture that can be applied to such a wide variety of geometric types while respecting their symmetries. In this paper we introduce the Geometric Algebra Transformer (GATr), a general-purpose architecture for geometric data. GATr represents inputs, outputs, and hidden states in the projective geometric (or Clifford) algebra, which offers an efficient 16-dimensional vector-space representation of common geometric objects as well as operators acting on them. GATr is equivariant with respect to E(3), the symmetry group of 3D Euclidean space. As a Transformer, GATr is versatile, efficient, and scalable. We demonstrate GATr in problems from n-body modeling to wall-shear-stress estimation on large arterial meshes to robotic motion planning. GATr consistently outperforms both non-geometric and equivariant baselines in terms of error, data efficiency, and scalability.
    Learning Multiscale Non-stationary Causal Structures. (arXiv:2208.14989v2 [cs.LG] UPDATED)
    This paper addresses a gap in the current state of the art by providing a solution for modeling causal relationships that evolve over time and occur at different time scales. Specifically, we introduce the multiscale non-stationary directed acyclic graph (MN-DAG), a framework for modeling multivariate time series data. Our contribution is twofold. Firstly, we expose a probabilistic generative model by leveraging results from spectral and causality theories. Our model allows sampling an MN-DAG according to user-specified priors on the time-dependence and multiscale properties of the causal graph. Secondly, we devise a Bayesian method named Multiscale Non-stationary Causal Structure Learner (MN-CASTLE) that uses stochastic variational inference to estimate MN-DAGs. The method also exploits information from the local partial correlation between time series over different time resolutions. The data generated from an MN-DAG reproduces well-known features of time series in different domains, such as volatility clustering and serial correlation. Additionally, we show the superior performance of MN-CASTLE on synthetic data with different multiscale and non-stationary properties compared to baseline models. Finally, we apply MN-CASTLE to identify the drivers of the natural gas prices in the US market. Causal relationships have strengthened during the COVID-19 outbreak and the Russian invasion of Ukraine, a fact that baseline methods fail to capture. MN-CASTLE identifies the causal impact of critical economic drivers on natural gas prices, such as seasonal factors, economic uncertainty, oil prices, and gas storage deviations.
    Dueling Optimization with a Monotone Adversary. (arXiv:2311.11185v1 [cs.DS])
    We introduce and study the problem of dueling optimization with a monotone adversary, which is a generalization of (noiseless) dueling convex optimization. The goal is to design an online algorithm to find a minimizer $\mathbf{x}^{*}$ for a function $f\colon X \to \mathbb{R}$, where $X \subseteq \mathbb{R}^d$. In each round, the algorithm submits a pair of guesses, i.e., $\mathbf{x}^{(1)}$ and $\mathbf{x}^{(2)}$, and the adversary responds with any point in the space that is at least as good as both guesses. The cost of each query is the suboptimality of the worse of the two guesses; i.e., ${\max} \left( f(\mathbf{x}^{(1)}), f(\mathbf{x}^{(2)}) \right) - f(\mathbf{x}^{*})$. The goal is to minimize the number of iterations required to find an $\varepsilon$-optimal point and to minimize the total cost (regret) of the guesses over many rounds. Our main result is an efficient randomized algorithm for several natural choices of the function $f$ and set $X$ that incurs cost $O(d)$ and iteration complexity $O(d\log(1/\varepsilon)^2)$. Moreover, our dependence on $d$ is asymptotically optimal, as we show examples in which any randomized algorithm for this problem must incur $\Omega(d)$ cost and iteration complexity.
    The Minimax Risk in Histogram-Based Uniformity Testing under Missing Ball Alternatives. (arXiv:2305.18111v5 [math.ST] UPDATED)
    We study the problem of testing the goodness of fit of a discrete sample from many categories to the uniform distribution over the categories. As a class of alternative hypotheses, we consider the removal of an $\ell_p$ ball of radius $\epsilon$ around the uniform rate sequence for $p \leq 2$. When the number of samples $n$ and number of categories $N$ go to infinity while $\epsilon$ is small, the minimax risk $R_\epsilon^*$ in testing based on the sample's histogram (number of absent categories, singletons, collisions, ...) asymptotes to $2\Phi(-n N^{2-2/p} \epsilon^2/\sqrt{8N})$; $\Phi(x)$ is the normal CDF. This result allows the comparison of the many estimators previously proposed for this problem at the constant level, rather than at the rate of convergence of the risk or the scaling order of the sample complexity. The minimax test mostly relies on collisions in the very small sample limit but otherwise behaves like the chisquared test. Empirical studies over a range of problem parameters show that our estimate is accurate in finite samples and that the minimax test is significantly better than the chisquared test or a test that only uses collisions. Our analysis relies on the asymptotic normality of histogram ordinates, the equivalence between the minimax setting and a Bayesian setting, and the characterization of the least favorable prior by reducing a multi-dimensional optimization problem to a one-dimensional problem.
    Wasserstein Convergence Guarantees for a General Class of Score-Based Generative Models. (arXiv:2311.11003v1 [cs.LG])
    Score-based generative models (SGMs) is a recent class of deep generative models with state-of-the-art performance in many applications. In this paper, we establish convergence guarantees for a general class of SGMs in 2-Wasserstein distance, assuming accurate score estimates and smooth log-concave data distribution. We specialize our result to several concrete SGMs with specific choices of forward processes modelled by stochastic differential equations, and obtain an upper bound on the iteration complexity for each model, which demonstrates the impacts of different choices of the forward processes. We also provide a lower bound when the data distribution is Gaussian. Numerically, we experiment SGMs with different forward processes, some of which are newly proposed in this paper, for unconditional image generation on CIFAR-10. We find that the experimental results are in good agreement with our theoretical predictions on the iteration complexity, and the models with our newly proposed forward processes can outperform existing models.
    Outage Performance and Novel Loss Function for an ML-Assisted Resource Allocation: An Exact Analytical Framework. (arXiv:2305.09739v2 [eess.SP] UPDATED)
    We introduce a novel loss function to minimize the outage probability of an ML-based resource allocation system. A single-user multi-resource greedy allocation strategy constitutes our application scenario, for which an ML binary classification predictor assists in selecting a resource satisfying the established outage criterium. While other resource allocation policies may be suitable, they are not the focus of our study. Instead, our primary emphasis is on theoretically developing this loss function and leveraging it to train an ML model to address the outage probability challenge. With no access to future channel state information, this predictor foresees each resource's likely future outage status. When the predictor encounters a resource it believes will be satisfactory, it allocates it to the user. Our main result establishes exact and asymptotic expressions for this system's outage probability. These expressions reveal that focusing solely on the optimization of the per-resource outage probability conditioned on the ML predictor recommending resource allocation (a strategy that appears to be most appropriate) may produce inadequate predictors that reject every resource. They also reveal that focusing on standard metrics, like precision, false-positive rate, or recall, may not produce optimal predictors. With our result, we formulate a theoretically optimal, differentiable loss function to train our predictor. We then compare predictors trained using this and traditional loss functions namely, binary cross-entropy (BCE), mean squared error (MSE), and mean absolute error (MAE). In all scenarios, predictors trained using our novel loss function provide superior outage probability performance. Moreover, in some cases, our loss function outperforms predictors trained with BCE, MAE, and MSE by multiple orders of magnitude.
    Unveiling the Power of Self-Attention for Shipping Cost Prediction: The Rate Card Transformer. (arXiv:2311.11694v1 [cs.LG])
    Amazon ships billions of packages to its customers annually within the United States. Shipping cost of these packages are used on the day of shipping (day 0) to estimate profitability of sales. Downstream systems utilize these days 0 profitability estimates to make financial decisions, such as pricing strategies and delisting loss-making products. However, obtaining accurate shipping cost estimates on day 0 is complex for reasons like delay in carrier invoicing or fixed cost components getting recorded at monthly cadence. Inaccurate shipping cost estimates can lead to bad decision, such as pricing items too low or high, or promoting the wrong product to the customers. Current solutions for estimating shipping costs on day 0 rely on tree-based models that require extensive manual engineering efforts. In this study, we propose a novel architecture called the Rate Card Transformer (RCT) that uses self-attention to encode all package shipping information such as package attributes, carrier information and route plan. Unlike other transformer-based tabular models, RCT has the ability to encode a variable list of one-to-many relations of a shipment, allowing it to capture more information about a shipment. For example, RCT can encode properties of all products in a package. Our results demonstrate that cost predictions made by the RCT have 28.82% less error compared to tree-based GBDT model. Moreover, the RCT outperforms the state-of-the-art transformer-based tabular model, FTTransformer, by 6.08%. We also illustrate that the RCT learns a generalized manifold of the rate card that can improve the performance of tree-based models.
    Beyond IID weights: sparse and low-rank deep Neural Networks are also Gaussian Processes. (arXiv:2310.16597v2 [stat.ML] UPDATED)
    The infinitely wide neural network has been proven a useful and manageable mathematical model that enables the understanding of many phenomena appearing in deep learning. One example is the convergence of random deep networks to Gaussian processes that allows a rigorous analysis of the way the choice of activation function and network weights impacts the training dynamics. In this paper, we extend the seminal proof of Matthews et al. (2018) to a larger class of initial weight distributions (which we call PSEUDO-IID), including the established cases of IID and orthogonal weights, as well as the emerging low-rank and structured sparse settings celebrated for their computational speed-up benefits. We show that fully-connected and convolutional networks initialized with PSEUDO-IID distributions are all effectively equivalent up to their variance. Using our results, one can identify the Edge-of-Chaos for a broader class of neural networks and tune them at criticality in order to enhance their training.  ( 2 min )
    Likelihood-Free Frequentist Inference: Bridging Classical Statistics and Machine Learning for Reliable Simulator-Based Inference. (arXiv:2107.03920v8 [stat.ML] UPDATED)
    Many areas of science make extensive use of computer simulators that implicitly encode intractable likelihood functions of complex systems. Classical statistical methods are poorly suited for these so-called likelihood-free inference (LFI) settings, especially outside asymptotic and low-dimensional regimes. At the same time, traditional LFI methods - such as Approximate Bayesian Computation or more recent machine learning techniques - do not guarantee confidence sets with nominal coverage in general settings (i.e., with high-dimensional data, finite sample sizes, and for any parameter value). In addition, there are no diagnostic tools to check the empirical coverage of confidence sets provided by such methods across the entire parameter space. In this work, we propose a unified and modular inference framework that bridges classical statistics and modern machine learning providing (i) a practical approach to the Neyman construction of confidence sets with frequentist finite-sample coverage for any value of the unknown parameters; and (ii) interpretable diagnostics that estimate the empirical coverage across the entire parameter space. We refer to the general framework as likelihood-free frequentist inference (LF2I). Any method that defines a test statistic can leverage LF2I to create valid confidence sets and diagnostics without costly Monte Carlo samples at fixed parameter settings. We study the power of two likelihood-based test statistics (ACORE and BFF) and demonstrate their empirical performance on high-dimensional, complex data. Code is available at https://github.com/lee-group-cmu/lf2i.  ( 3 min )
    Measuring and Mitigating Biases in Motor Insurance Pricing. (arXiv:2311.11900v1 [stat.ML])
    The non-life insurance sector operates within a highly competitive and tightly regulated framework, confronting a pivotal juncture in the formulation of pricing strategies. Insurers are compelled to harness a range of statistical methodologies and available data to construct optimal pricing structures that align with the overarching corporate strategy while accommodating the dynamics of market competition. Given the fundamental societal role played by insurance, premium rates are subject to rigorous scrutiny by regulatory authorities. These rates must conform to principles of transparency, explainability, and ethical considerations. Consequently, the act of pricing transcends mere statistical calculations and carries the weight of strategic and societal factors. These multifaceted concerns may drive insurers to establish equitable premiums, taking into account various variables. For instance, regulations mandate the provision of equitable premiums, considering factors such as policyholder gender or mutualist group dynamics in accordance with respective corporate strategies. Age-based premium fairness is also mandated. In certain insurance domains, variables such as the presence of serious illnesses or disabilities are emerging as new dimensions for evaluating fairness. Regardless of the motivating factor prompting an insurer to adopt fairer pricing strategies for a specific variable, the insurer must possess the capability to define, measure, and ultimately mitigate any ethical biases inherent in its pricing practices while upholding standards of consistency and performance. This study seeks to provide a comprehensive set of tools for these endeavors and assess their effectiveness through practical application in the context of automobile insurance.
    Efficient Neural Networks for Tiny Machine Learning: A Comprehensive Review. (arXiv:2311.11883v1 [stat.ML])
    The field of Tiny Machine Learning (TinyML) has gained significant attention due to its potential to enable intelligent applications on resource-constrained devices. This review provides an in-depth analysis of the advancements in efficient neural networks and the deployment of deep learning models on ultra-low power microcontrollers (MCUs) for TinyML applications. It begins by introducing neural networks and discussing their architectures and resource requirements. It then explores MEMS-based applications on ultra-low power MCUs, highlighting their potential for enabling TinyML on resource-constrained devices. The core of the review centres on efficient neural networks for TinyML. It covers techniques such as model compression, quantization, and low-rank factorization, which optimize neural network architectures for minimal resource utilization on MCUs. The paper then delves into the deployment of deep learning models on ultra-low power MCUs, addressing challenges such as limited computational capabilities and memory resources. Techniques like model pruning, hardware acceleration, and algorithm-architecture co-design are discussed as strategies to enable efficient deployment. Lastly, the review provides an overview of current limitations in the field, including the trade-off between model complexity and resource constraints. Overall, this review paper presents a comprehensive analysis of efficient neural networks and deployment strategies for TinyML on ultra-low-power MCUs. It identifies future research directions for unlocking the full potential of TinyML applications on resource-constrained devices.
    Polynomial-Time Solutions for ReLU Network Training: A Complexity Classification via Max-Cut and Zonotopes. (arXiv:2311.10972v1 [cs.LG])
    We investigate the complexity of training a two-layer ReLU neural network with weight decay regularization. Previous research has shown that the optimal solution of this problem can be found by solving a standard cone-constrained convex program. Using this convex formulation, we prove that the hardness of approximation of ReLU networks not only mirrors the complexity of the Max-Cut problem but also, in certain special cases, exactly corresponds to it. In particular, when $\epsilon\leq\sqrt{84/83}-1\approx 0.006$, we show that it is NP-hard to find an approximate global optimizer of the ReLU network objective with relative error $\epsilon$ with respect to the objective value. Moreover, we develop a randomized algorithm which mirrors the Goemans-Williamson rounding of semidefinite Max-Cut relaxations. To provide polynomial-time approximations, we classify training datasets into three categories: (i) For orthogonal separable datasets, a precise solution can be obtained in polynomial-time. (ii) When there is a negative correlation between samples of different classes, we give a polynomial-time approximation with relative error $\sqrt{\pi/2}-1\approx 0.253$. (iii) For general datasets, the degree to which the problem can be approximated in polynomial-time is governed by a geometric factor that controls the diameter of two zonotopes intrinsic to the dataset. To our knowledge, these results present the first polynomial-time approximation guarantees along with first hardness of approximation results for regularized ReLU networks.  ( 2 min )
    Optimal Locally Private Nonparametric Classification with Public Data. (arXiv:2311.11369v1 [stat.ML])
    In this work, we investigate the problem of public data-assisted non-interactive LDP (Local Differential Privacy) learning with a focus on non-parametric classification. Under the posterior drift assumption, we for the first time derive the mini-max optimal convergence rate with LDP constraint. Then, we present a novel approach, the locally private classification tree, which attains the mini-max optimal convergence rate. Furthermore, we design a data-driven pruning procedure that avoids parameter tuning and produces a fast converging estimator. Comprehensive experiments conducted on synthetic and real datasets show the superior performance of our proposed method. Both our theoretical and experimental findings demonstrate the effectiveness of public data compared to private data, which leads to practical suggestions for prioritizing non-private data collection.  ( 2 min )
    Gaussian Interpolation Flows. (arXiv:2311.11475v1 [stat.ML])
    Gaussian denoising has emerged as a powerful principle for constructing simulation-free continuous normalizing flows for generative modeling. Despite their empirical successes, theoretical properties of these flows and the regularizing effect of Gaussian denoising have remained largely unexplored. In this work, we aim to address this gap by investigating the well-posedness of simulation-free continuous normalizing flows built on Gaussian denoising. Through a unified framework termed Gaussian interpolation flow, we establish the Lipschitz regularity of the flow velocity field, the existence and uniqueness of the flow, and the Lipschitz continuity of the flow map and the time-reversed flow map for several rich classes of target distributions. This analysis also sheds light on the auto-encoding and cycle-consistency properties of Gaussian interpolation flows. Additionally, we delve into the stability of these flows in source distributions and perturbations of the velocity field, using the quadratic Wasserstein distance as a metric. Our findings offer valuable insights into the learning techniques employed in Gaussian interpolation flows for generative modeling, providing a solid theoretical foundation for end-to-end error analyses of learning GIFs with empirical observations.  ( 2 min )
    On the Hardness of Learning to Stabilize Linear Systems. (arXiv:2311.11151v1 [eess.SY])
    Inspired by the work of Tsiamis et al. \cite{tsiamis2022learning}, in this paper we study the statistical hardness of learning to stabilize linear time-invariant systems. Hardness is measured by the number of samples required to achieve a learning task with a given probability. The work in \cite{tsiamis2022learning} shows that there exist system classes that are hard to learn to stabilize with the core reason being the hardness of identification. Here we present a class of systems that can be easy to identify, thanks to a non-degenerate noise process that excites all modes, but the sample complexity of stabilization still increases exponentially with the system dimension. We tie this result to the hardness of co-stabilizability for this class of systems using ideas from robust control.  ( 2 min )
    Exponentially Convergent Algorithms for Supervised Matrix Factorization. (arXiv:2311.11182v1 [stat.ML])
    Supervised matrix factorization (SMF) is a classical machine learning method that simultaneously seeks feature extraction and classification tasks, which are not necessarily a priori aligned objectives. Our goal is to use SMF to learn low-rank latent factors that offer interpretable, data-reconstructive, and class-discriminative features, addressing challenges posed by high-dimensional data. Training SMF model involves solving a nonconvex and possibly constrained optimization with at least three blocks of parameters. Known algorithms are either heuristic or provide weak convergence guarantees for special cases. In this paper, we provide a novel framework that 'lifts' SMF as a low-rank matrix estimation problem in a combined factor space and propose an efficient algorithm that provably converges exponentially fast to a global minimizer of the objective with arbitrary initialization under mild assumptions. Our framework applies to a wide range of SMF-type problems for multi-class classification with auxiliary features. To showcase an application, we demonstrate that our algorithm successfully identified well-known cancer-associated gene groups for various cancers.  ( 2 min )
    Self-Supervised Pretraining for Heterogeneous Hypergraph Neural Networks. (arXiv:2311.11368v1 [cs.LG])
    Recently, pretraining methods for the Graph Neural Networks (GNNs) have been successful at learning effective representations from unlabeled graph data. However, most of these methods rely on pairwise relations in the graph and do not capture the underling higher-order relations between entities. Hypergraphs are versatile and expressive structures that can effectively model higher-order relationships among entities in the data. Despite the efforts to adapt GNNs to hypergraphs (HyperGNN), there are currently no fully self-supervised pretraining methods for HyperGNN on heterogeneous hypergraphs. In this paper, we present SPHH, a novel self-supervised pretraining framework for heterogeneous HyperGNNs. Our method is able to effectively capture higher-order relations among entities in the data in a self-supervised manner. SPHH is consist of two self-supervised pretraining tasks that aim to simultaneously learn both local and global representations of the entities in the hypergraph by using informative representations derived from the hypergraph structure. Overall, our work presents a significant advancement in the field of self-supervised pretraining of HyperGNNs, and has the potential to improve the performance of various graph-based downstream tasks such as node classification and link prediction tasks which are mapped to hypergraph configuration. Our experiments on two real-world benchmarks using four different HyperGNN models show that our proposed SPHH framework consistently outperforms state-of-the-art baselines in various downstream tasks. The results demonstrate that SPHH is able to improve the performance of various HyperGNN models in various downstream tasks, regardless of their architecture or complexity, which highlights the robustness of our framework.  ( 3 min )
    Towards a Post-Market Monitoring Framework for Machine Learning-based Medical Devices: A case study. (arXiv:2311.11463v1 [cs.LG])
    After a machine learning (ML)-based system is deployed in clinical practice, performance monitoring is important to ensure the safety and effectiveness of the algorithm over time. The goal of this work is to highlight the complexity of designing a monitoring strategy and the need for a systematic framework that compares the multitude of monitoring options. One of the main decisions is choosing between using real-world (observational) versus interventional data. Although the former is the most convenient source of monitoring data, it exhibits well-known biases, such as confounding, selection, and missingness. In fact, when the ML algorithm interacts with its environment, the algorithm itself may be a primary source of bias. On the other hand, a carefully designed interventional study that randomizes individuals can explicitly eliminate such biases, but the ethics, feasibility, and cost of such an approach must be carefully considered. Beyond the decision of the data source, monitoring strategies vary in the performance criteria they track, the interpretability of the test statistics, the strength of their assumptions, and their speed at detecting performance decay. As a first step towards developing a framework that compares the various monitoring options, we consider a case study of an ML-based risk prediction algorithm for postoperative nausea and vomiting (PONV). Bringing together tools from causal inference and statistical process control, we walk through the basic steps of defining candidate monitoring criteria, describing potential sources of bias and the causal model, and specifying and comparing candidate monitoring procedures. We hypothesize that these steps can be applied more generally, as causal inference can address other sources of biases as well.  ( 3 min )
    Precision at the indistinguishability threshold: a method for evaluating classification algorithms. (arXiv:2311.11422v1 [cs.LG])
    There exist a wide range of single number metrics for assessing performance of classification algorithms, including AUC and the F1-score (Wikipedia lists 17 such metrics, with 27 different names). In this article, I propose a new metric to answer the following question: when an algorithm is tuned so that it can no longer distinguish labelled cats from real cats, how often does a randomly chosen image that has been labelled as containing a cat actually contain a cat? The steps to construct this metric are as follows. First, we set a threshold score such that when the algorithm is shown two randomly-chosen images -- one that has a score greater than the threshold (i.e. a picture labelled as containing a cat) and another from those pictures that really does contain a cat -- the probability that the image with the highest score is the one chosen from the set of real cat images is 50\%. At this decision threshold, the set of positively labelled images are indistinguishable from the set of images which are positive. Then, as a second step, we measure performance by asking how often a randomly chosen picture from those labelled as containing a cat actually contains a cat. This metric can be thought of as {\it precision at the indistinguishability threshold}. While this new metric doesn't address the tradeoff between precision and recall inherent to all such metrics, I do show why this method avoids pitfalls that can occur when using, for example AUC, and it is better motivated than, for example, the F1-score.  ( 3 min )
    Provably Efficient CVaR RL in Low-rank MDPs. (arXiv:2311.11965v1 [cs.LG])
    We study risk-sensitive Reinforcement Learning (RL), where we aim to maximize the Conditional Value at Risk (CVaR) with a fixed risk tolerance $\tau$. Prior theoretical work studying risk-sensitive RL focuses on the tabular Markov Decision Processes (MDPs) setting. To extend CVaR RL to settings where state space is large, function approximation must be deployed. We study CVaR RL in low-rank MDPs with nonlinear function approximation. Low-rank MDPs assume the underlying transition kernel admits a low-rank decomposition, but unlike prior linear models, low-rank MDPs do not assume the feature or state-action representation is known. We propose a novel Upper Confidence Bound (UCB) bonus-driven algorithm to carefully balance the interplay between exploration, exploitation, and representation learning in CVaR RL. We prove that our algorithm achieves a sample complexity of $\tilde{O}\left(\frac{H^7 A^2 d^4}{\tau^2 \epsilon^2}\right)$ to yield an $\epsilon$-optimal CVaR, where $H$ is the length of each episode, $A$ is the capacity of action space, and $d$ is the dimension of representations. Computational-wise, we design a novel discretized Least-Squares Value Iteration (LSVI) algorithm for the CVaR objective as the planning oracle and show that we can find the near-optimal policy in a polynomial running time with a Maximum Likelihood Estimation oracle. To our knowledge, this is the first provably efficient CVaR RL algorithm in low-rank MDPs.  ( 2 min )
    A powerful rank-based correction to multiple testing under positive dependency. (arXiv:2311.10900v1 [stat.ME])
    We develop a novel multiple hypothesis testing correction with family-wise error rate (FWER) control that efficiently exploits positive dependencies between potentially correlated statistical hypothesis tests. Our proposed algorithm $\texttt{max-rank}$ is conceptually straight-forward, relying on the use of a $\max$-operator in the rank domain of computed test statistics. We compare our approach to the frequently employed Bonferroni correction, theoretically and empirically demonstrating its superiority over Bonferroni in the case of existing positive dependency, and its equivalence otherwise. Our advantage over Bonferroni increases as the number of tests rises, and we maintain high statistical power whilst ensuring FWER control. We specifically frame our algorithm in the context of parallel permutation testing, a scenario that arises in our primary application of conformal prediction, a recently popularized approach for quantifying uncertainty in complex predictive settings.  ( 2 min )
    Balanced Training for Sparse GANs. (arXiv:2302.14670v2 [cs.LG] UPDATED)
    Over the past few years, there has been growing interest in developing larger and deeper neural networks, including deep generative models like generative adversarial networks (GANs). However, GANs typically come with high computational complexity, leading researchers to explore methods for reducing the training and inference costs. One such approach gaining popularity in supervised learning is dynamic sparse training (DST), which maintains good performance while enjoying excellent training efficiency. Despite its potential benefits, applying DST to GANs presents challenges due to the adversarial nature of the training process. In this paper, we propose a novel metric called the balance ratio (BR) to study the balance between the sparse generator and discriminator. We also introduce a new method called balanced dynamic sparse training (ADAPT), which seeks to control the BR during GAN training to achieve a good trade-off between performance and computational cost. Our proposed method shows promising results on multiple datasets, demonstrating its effectiveness.  ( 2 min )
    A Variational Autoencoder for Heterogeneous Temporal and Longitudinal Data. (arXiv:2204.09369v2 [cs.LG] UPDATED)
    The variational autoencoder (VAE) is a popular deep latent variable model used to analyse high-dimensional datasets by learning a low-dimensional latent representation of the data. It simultaneously learns a generative model and an inference network to perform approximate posterior inference. Recently proposed extensions to VAEs that can handle temporal and longitudinal data have applications in healthcare, behavioural modelling, and predictive maintenance. However, these extensions do not account for heterogeneous data (i.e., data comprising of continuous and discrete attributes), which is common in many real-life applications. In this work, we propose the heterogeneous longitudinal VAE (HL-VAE) that extends the existing temporal and longitudinal VAEs to heterogeneous data. HL-VAE provides efficient inference for high-dimensional datasets and includes likelihood models for continuous, count, categorical, and ordinal data while accounting for missing observations. We demonstrate our model's efficacy through simulated as well as clinical datasets, and show that our proposed model achieves competitive performance in missing value imputation and predictive accuracy.  ( 2 min )
    Mitigating Source Bias for Fairer Weak Supervision. (arXiv:2303.17713v2 [cs.LG] UPDATED)
    Weak supervision enables efficient development of training sets by reducing the need for ground truth labels. However, the techniques that make weak supervision attractive -- such as integrating any source of signal to estimate unknown labels -- also entail the danger that the produced pseudolabels are highly biased. Surprisingly, given everyday use and the potential for increased bias, weak supervision has not been studied from the point of view of fairness. We begin such a study, starting with the observation that even when a fair model can be built from a dataset with access to ground-truth labels, the corresponding dataset labeled via weak supervision can be arbitrarily unfair. To address this, we propose and empirically validate a model for source unfairness in weak supervision, then introduce a simple counterfactual fairness-based technique that can mitigate these biases. Theoretically, we show that it is possible for our approach to simultaneously improve both accuracy and fairness -- in contrast to standard fairness approaches that suffer from tradeoffs. Empirically, we show that our technique improves accuracy on weak supervision baselines by as much as 32\% while reducing demographic parity gap by 82.5\%. A simple extension of our method aimed at maximizing performance produces state-of-the-art performance in five out of ten datasets in the WRENCH benchmark.  ( 2 min )
    Let the Flows Tell: Solving Graph Combinatorial Optimization Problems with GFlowNets. (arXiv:2305.17010v3 [cs.LG] UPDATED)
    Combinatorial optimization (CO) problems are often NP-hard and thus out of reach for exact algorithms, making them a tempting domain to apply machine learning methods. The highly structured constraints in these problems can hinder either optimization or sampling directly in the solution space. On the other hand, GFlowNets have recently emerged as a powerful machinery to efficiently sample from composite unnormalized densities sequentially and have the potential to amortize such solution-searching processes in CO, as well as generate diverse solution candidates. In this paper, we design Markov decision processes (MDPs) for different combinatorial problems and propose to train conditional GFlowNets to sample from the solution space. Efficient training techniques are also developed to benefit long-range credit assignment. Through extensive experiments on a variety of different CO tasks with synthetic and realistic data, we demonstrate that GFlowNet policies can efficiently find high-quality solutions. Our implementation is open-sourced at https://github.com/zdhNarsil/GFlowNet-CombOpt.  ( 2 min )
    Uniform-in-time propagation of chaos for mean field Langevin dynamics. (arXiv:2212.03050v3 [math.PR] UPDATED)
    We study the mean field Langevin dynamics and the associated particle system. By assuming the functional convexity of the energy, we obtain the $L^p$-convergence of the marginal distributions towards the unique invariant measure for the mean field dynamics. Furthermore, we prove the uniform-in-time propagation of chaos in both the $L^2$-Wasserstein metric and relative entropy.  ( 2 min )
    timeXplain -- A Framework for Explaining the Predictions of Time Series Classifiers. (arXiv:2007.07606v2 [cs.LG] UPDATED)
    Modern time series classifiers display impressive predictive capabilities, yet their decision-making processes mostly remain black boxes to the user. At the same time, model-agnostic explainers, such as the recently proposed SHAP, promise to make the predictions of machine learning models interpretable, provided there are well-designed domain mappings. We bring both worlds together in our timeXplain framework, extending the reach of explainable artificial intelligence to time series classification and value prediction. We present novel domain mappings for the time domain, frequency domain, and time series statistics and analyze their explicative power as well as their limits. We employ a novel evaluation metric to experimentally compare timeXplain to several model-specific explanation approaches for state-of-the-art time series classifiers.  ( 2 min )
    Graph Encoder Ensemble for Simultaneous Vertex Embedding and Community Detection. (arXiv:2301.11290v2 [cs.SI] UPDATED)
    In this paper, we introduce a novel and computationally efficient method for vertex embedding, community detection, and community size determination. Our approach leverages a normalized one-hot graph encoder and a rank-based cluster size measure. Through extensive simulations, we demonstrate the excellent numerical performance of our proposed graph encoder ensemble algorithm.  ( 2 min )
    Reward Teaching for Federated Multi-armed Bandits. (arXiv:2305.02441v2 [stat.ML] UPDATED)
    Most of the existing federated multi-armed bandits (FMAB) designs are based on the presumption that clients will implement the specified design to collaborate with the server. In reality, however, it may not be possible to modify the clients' existing protocols. To address this challenge, this work focuses on clients who always maximize their individual cumulative rewards, and introduces a novel idea of ``reward teaching'', where the server guides the clients towards global optimality through implicit local reward adjustments. Under this framework, the server faces two tightly coupled tasks of bandit learning and target teaching, whose combination is non-trivial and challenging. A phased approach, called Teaching-After-Learning (TAL), is first designed to encourage and discourage clients' explorations separately. General performance analyses of TAL are established when the clients' strategies satisfy certain mild requirements. With novel technical approaches developed to analyze the warm-start behaviors of bandit algorithms, particularized guarantees of TAL with clients running UCB or epsilon-greedy strategies are then obtained. These results demonstrate that TAL achieves logarithmic regrets while only incurring logarithmic adjustment costs, which is order-optimal w.r.t. a natural lower bound. As a further extension, the Teaching-While-Learning (TWL) algorithm is developed with the idea of successive arm elimination to break the non-adaptive phase separation in TAL. Rigorous analyses demonstrate that when facing clients with UCB1, TWL outperforms TAL in terms of the dependencies on sub-optimality gaps thanks to its adaptive design. Experimental results demonstrate the effectiveness and generality of the proposed algorithms.  ( 3 min )
    Tracking Most Significant Shifts in Nonparametric Contextual Bandits. (arXiv:2307.05341v2 [stat.ML] UPDATED)
    We study nonparametric contextual bandits where Lipschitz mean reward functions may change over time. We first establish the minimax dynamic regret rate in this less understood setting in terms of number of changes $L$ and total-variation $V$, both capturing all changes in distribution over context space, and argue that state-of-the-art procedures are suboptimal in this setting. Next, we tend to the question of an adaptivity for this setting, i.e. achieving the minimax rate without knowledge of $L$ or $V$. Quite importantly, we posit that the bandit problem, viewed locally at a given context $X_t$, should not be affected by reward changes in other parts of context space $\cal X$. We therefore propose a notion of change, which we term experienced significant shifts, that better accounts for locality, and thus counts considerably less changes than $L$ and $V$. Furthermore, similar to recent work on non-stationary MAB (Suk & Kpotufe, 2022), experienced significant shifts only count the most significant changes in mean rewards, e.g., severe best-arm changes relevant to observed contexts. Our main result is to show that this more tolerant notion of change can in fact be adapted to.  ( 2 min )
    Efficient learning of nonlinear prediction models with time-series privileged information. (arXiv:2209.07067v4 [cs.LG] UPDATED)
    In domains where sample sizes are limited, efficient learning algorithms are critical. Learning using privileged information (LuPI) offers increased sample efficiency by allowing prediction models access to auxiliary information at training time which is unavailable when the models are used. In recent work, it was shown that for prediction in linear-Gaussian dynamical systems, a LuPI learner with access to intermediate time series data is never worse and often better in expectation than any unbiased classical learner. We provide new insights into this analysis and generalize it to nonlinear prediction tasks in latent dynamical systems, extending theoretical guarantees to the case where the map connecting latent variables and observations is known up to a linear transform. In addition, we propose algorithms based on random features and representation learning for the case when this map is unknown. A suite of empirical results confirm theoretical findings and show the potential of using privileged time-series information in nonlinear prediction.  ( 2 min )
    Deep Calibration of Market Simulations using Neural Density Estimators and Embedding Networks. (arXiv:2311.11913v1 [cs.LG])
    The ability to construct a realistic simulator of financial exchanges, including reproducing the dynamics of the limit order book, can give insight into many counterfactual scenarios, such as a flash crash, a margin call, or changes in macroeconomic outlook. In recent years, agent-based models have been developed that reproduce many features of an exchange, as summarised by a set of stylised facts and statistics. However, the ability to calibrate simulators to a specific period of trading remains an open challenge. In this work, we develop a novel approach to the calibration of market simulators by leveraging recent advances in deep learning, specifically using neural density estimators and embedding networks. We demonstrate that our approach is able to correctly identify high probability parameter sets, both when applied to synthetic and historical data, and without reliance on manually selected or weighted ensembles of stylised facts.  ( 2 min )
    Flat Minima in Linear Estimation and an Extended Gauss Markov Theorem. (arXiv:2311.11093v1 [cs.LG])
    We consider the problem of linear estimation, and establish an extension of the Gauss-Markov theorem, in which the bias operator is allowed to be non-zero but bounded with respect to a matrix norm of Schatten type. We derive simple and explicit formulas for the optimal estimator in the cases of Nuclear and Spectral norms (with the Frobenius case recovering ridge regression). Additionally, we analytically derive the generalization error in multiple random matrix ensembles, and compare with Ridge regression. Finally, we conduct an extensive simulation study, in which we show that the cross-validated Nuclear and Spectral regressors can outperform Ridge in several circumstances.  ( 2 min )
    Biarchetype analysis: simultaneous learning of observations and features based on extremes. (arXiv:2311.11153v1 [stat.ME])
    A new exploratory technique called biarchetype analysis is defined. We extend archetype analysis to find the archetypes of both observations and features simultaneously. The idea of this new unsupervised machine learning tool is to represent observations and features by instances of pure types (biarchetypes) that can be easily interpreted as they are mixtures of observations and features. Furthermore, the observations and features are expressed as mixtures of the biarchetypes, which also helps understand the structure of the data. We propose an algorithm to solve biarchetype analysis. We show that biarchetype analysis offers advantages over biclustering, especially in terms of interpretability. This is because byarchetypes are extreme instances as opposed to the centroids returned by biclustering, which favors human understanding. Biarchetype analysis is applied to several machine learning problems to illustrate its usefulness.  ( 2 min )
    Estimation of entropy-regularized optimal transport maps between non-compactly supported measures. (arXiv:2311.11934v1 [stat.ML])
    This paper addresses the problem of estimating entropy-regularized optimal transport (EOT) maps with squared-Euclidean cost between source and target measures that are subGaussian. In the case that the target measure is compactly supported or strongly log-concave, we show that for a recently proposed in-sample estimator, the expected squared $L^2$-error decays at least as fast as $O(n^{-1/3})$ where $n$ is the sample size. For the general subGaussian case we show that the expected $L^1$-error decays at least as fast as $O(n^{-1/6})$, and in both cases we have polynomial dependence on the regularization parameter. While these results are suboptimal compared to known results in the case of compactness of both the source and target measures (squared $L^2$-error converging at a rate $O(n^{-1})$) and for when the source is subGaussian while the target is compactly supported (squared $L^2$-error converging at a rate $O(n^{-1/2})$), their importance lie in eliminating the compact support requirements. The proof technique makes use of a bias-variance decomposition where the variance is controlled using standard concentration of measure results and the bias is handled by T1-transport inequalities along with sample complexity results in estimation of EOT cost under subGaussian assumptions. Our experimental results point to a looseness in controlling the variance terms and we conclude by posing several open problems.  ( 2 min )
    Testing with Non-identically Distributed Samples. (arXiv:2311.11194v1 [cs.DS])
    We examine the extent to which sublinear-sample property testing and estimation applies to settings where samples are independently but not identically distributed. Specifically, we consider the following distributional property testing framework: Suppose there is a set of distributions over a discrete support of size $k$, $\textbf{p}_1, \textbf{p}_2,\ldots,\textbf{p}_T$, and we obtain $c$ independent draws from each distribution. Suppose the goal is to learn or test a property of the average distribution, $\textbf{p}_{\mathrm{avg}}$. This setup models a number of important practical settings where the individual distributions correspond to heterogeneous entities -- either individuals, chronologically distinct time periods, spatially separated data sources, etc. From a learning standpoint, even with $c=1$ samples from each distribution, $\Theta(k/\varepsilon^2)$ samples are necessary and sufficient to learn $\textbf{p}_{\mathrm{avg}}$ to within error $\varepsilon$ in TV distance. To test uniformity or identity -- distinguishing the case that $\textbf{p}_{\mathrm{avg}}$ is equal to some reference distribution, versus has $\ell_1$ distance at least $\varepsilon$ from the reference distribution, we show that a linear number of samples in $k$ is necessary given $c=1$ samples from each distribution. In contrast, for $c \ge 2$, we recover the usual sublinear sample testing of the i.i.d. setting: we show that $O(\sqrt{k}/\varepsilon^2 + 1/\varepsilon^4)$ samples are sufficient, matching the optimal sample complexity in the i.i.d. case in the regime where $\varepsilon \ge k^{-1/4}$. Additionally, we show that in the $c=2$ case, there is a constant $\rho > 0$ such that even in the linear regime with $\rho k$ samples, no tester that considers the multiset of samples (ignoring which samples were drawn from the same $\textbf{p}_i$) can perform uniformity testing.  ( 3 min )
    Duality of Bures and Shape Distances with Implications for Comparing Neural Representations. (arXiv:2311.11436v1 [stat.ML])
    A multitude of (dis)similarity measures between neural network representations have been proposed, resulting in a fragmented research landscape. Most of these measures fall into one of two categories. First, measures such as linear regression, canonical correlations analysis (CCA), and shape distances, all learn explicit mappings between neural units to quantify similarity while accounting for expected invariances. Second, measures such as representational similarity analysis (RSA), centered kernel alignment (CKA), and normalized Bures similarity (NBS) all quantify similarity in summary statistics, such as stimulus-by-stimulus kernel matrices, which are already invariant to expected symmetries. Here, we take steps towards unifying these two broad categories of methods by observing that the cosine of the Riemannian shape distance (from category 1) is equal to NBS (from category 2). We explore how this connection leads to new interpretations of shape distances and NBS, and draw contrasts of these measures with CKA, a popular similarity measure in the deep learning literature.  ( 2 min )
    BOIS: Bayesian Optimization of Interconnected Systems. (arXiv:2311.11254v1 [stat.ML])
    Bayesian optimization (BO) has proven to be an effective paradigm for the global optimization of expensive-to-sample systems. One of the main advantages of BO is its use of Gaussian processes (GPs) to characterize model uncertainty which can be leveraged to guide the learning and search process. However, BO typically treats systems as black-boxes and this limits the ability to exploit structural knowledge (e.g., physics and sparse interconnections). Composite functions of the form $f(x, y(x))$, wherein GP modeling is shifted from the performance function $f$ to an intermediate function $y$, offer an avenue for exploiting structural knowledge. However, the use of composite functions in a BO framework is complicated by the need to generate a probability density for $f$ from the Gaussian density of $y$ calculated by the GP (e.g., when $f$ is nonlinear it is not possible to obtain a closed-form expression). Previous work has handled this issue using sampling techniques; these are easy to implement and flexible but are computationally intensive. In this work, we introduce a new paradigm which allows for the efficient use of composite functions in BO; this uses adaptive linearizations of $f$ to obtain closed-form expressions for the statistical moments of the composite function. We show that this simple approach (which we call BOIS) enables the exploitation of structural knowledge, such as that arising in interconnected systems as well as systems that embed multiple GP models and combinations of physics and GP models. Using a chemical process optimization case study, we benchmark the effectiveness of BOIS against standard BO and sampling approaches. Our results indicate that BOIS achieves performance gains and accurately captures the statistics of composite functions.  ( 3 min )

  • Open

    [D] How to convert .parquet to .jsonl
    Hello folks , I'm trying to use a dataset from huggingface to fine tune openai got and use it on my platform. Unfortunately the huggingface dataset is in .parquet while openai require a .jsonl for the fine-tuning job. The data inside are basically the same such as admin role , input prompt , expected output. How can I integrate an huggingface dataset into openai gpt? submitted by /u/theparcel [link] [comments]  ( 1 min )
    [D] Thoughts on self-improvement?
    In your opinion, what could be some technical approaches for making progress in self-improving language models? Is this just a matter of synthetic data generation and retraining when hallucination rates become low enough? Reinforcement learning for learned decoding strategies? Are there any examples of knowledge distillation research demonstrating performance gains where the teacher and student models are the same? I see many talk about recursive self-improvement with little technical background (r/singularity), so I am hoping for some practical thoughts/challenges on this topic. submitted by /u/floppy_llama [link] [comments]  ( 1 min )
    [D] Do you think the proposed EU AI Act on foundation models will be accepted in its current form?
    The proposed EU AI Act includes provisions for the regulation of foundation models, which are large language models (LLMs) that are trained on massive amounts of data. The Act proposes that foundation models be classified as high-risk AI systems, which would subject them to stricter regulatory requirements. There has been some debate about whether the proposed provisions on foundation models are necessary or appropriate. Some experts argue that the provisions are too restrictive and could stifle innovation. I would be interested to hear the thoughts of the machine learning subreddit on this important issue. submitted by /u/Periplokos [link] [comments]  ( 9 min )
    [D]Machine learning daily tasks
    Hello, I am curently in 11th grade and I am curious about the daily tasks and challenges of a someone who has a job in machine learning, thanks for your time! submitted by /u/BVAcupcake [link] [comments]  ( 1 min )
    Stability AI releases a video model [N]
    Stability AI is releasing Stable Video Diffusion, their first foundation model for generative video based on the image model Stable Diffusion: https://stability.ai/news/stable-video-diffusion-open-ai-video-model submitted by /u/we_are_mammals [link] [comments]  ( 9 min )
    [P] Epsilla: a scalable, fully distributed vector database
    Hi everyone! Three months ago we launched Epsilla open source vector database. Since then we have been iterating. Today I'm excited to share the first production oriented launch: Epsilla v0.1 -- a scalable, fully distributed vector database. The launch focuses scalability and reliability, including sharding vector data for efficient horizontal scaling; Multi-replica configuration for high availability and increased throughput; Dynamic sharding expansion and auto-scaling replicas to meet data growth and traffic demands. We wanted to get your feedback and work together to shape our vector database features. Let us know what you think and what you'd like to see! Article: https://blog.epsilla.com/epsilla-v0-1-a-leap-forward-in-distributed-vector-database-technology-b511bb21b603 Github: https://github.com/epsilla-cloud/vectordb submitted by /u/songrenchu [link] [comments]  ( 1 min )
    [P] JustANeuron - Understand Backpropagation Through an Interactive Visualization of a Single Neuron
    I recently saw Karpathy's video on backpropagation, and at some point around the 40-minute mark he shows how to visualize backpropagation flowing through the computational graph of a single neuron, which I felt really made me understand automatic backpropagation intuitively for the first time. But I felt that I had to mess around with it for it to really sink in. So me and a friend quickly made an interactive, online version of the visualization, please tell me what you think! https://amitalevy.com/justaneuron/ ​ ​ submitted by /u/xjustwaitx [link] [comments]  ( 1 min )
    [D] Which Linux workstation distro is considered good for data science and machine learning?
    I am looking for Linux distro for my workstation. My daily job consist of different data science and machine learning tasks. Which Linux workstation distro is considered good for data science and machine learning? Is there any general consensus? What is your personal experience doing data science and machine learning on Linux workstation? submitted by /u/Alex_df_300 [link] [comments]  ( 1 min )
    Feature importance for each cluster/segment of a data set [R]
    Hi! I am working on a pretty unbalanced dataset. I am trying to determine the feature importance for the variables, but the dataset has a lot of underlying groups. It would therefore be very interesting to see the feature importance for each of these underlying groups in the dataset. Is it possible to segment/cluster the dataset, then look at the feature importance for each of the clusters or segments? And should I manually segment it, or should I just use clustering? submitted by /u/buklau00 [link] [comments]  ( 1 min )
    [D] "AI systems are always deterministic," AI teacher says. How can I reply (with examples and papers)?
    I had a discussion in class with one of my teachers. He says that AI is and can only be always deterministic because "even a deep learning neural network is a set of equations running on a computer, and the stochastic factor is added at the beginning. But the output of a model is always deterministic, even if it's not interpretable by humans." How would you reply? (Possibly with examples and papers) Tysm! submitted by /u/shiftingsmith [link] [comments]  ( 9 min )
    [N] Anthropic releases Claude 2.1
    https://www.anthropic.com/index/claude-2-1 200k token context window and half the hallucinations submitted by /u/Z3F [link] [comments]  ( 1 min )
    Hugging Face deprecated Auto Train, what should I use now? [D] [N]
    I am kinda technical, but writing and tuning hyper parameters is hard to do well. When Hugging Face launched Auto Train it was amazing, the results I could get just by dropping in a dataset were awesome and insanely accurate. For (seemingly) no reason they deprecated this feature and only allow now support Advanced Auto Train WHICH IS A NIGHTMARE. I can't get it to run. Bugs abound and when I reach out to support they just say "make a github issue, I have a small keyboard and can't respond here" -- whatever that means... but when I look at the github issue page it has 130 opened github issues (most comments are not answers just addition questions from community), and clearly this will do nothing besides waste my time. The docs are so minimal its a joke. And looking at their discord clear…  ( 1 min )
    [D] Example of data labeling guidelines
    I was looking for any real-world example of a rulebook for labeling data, but apart from general advice, I didn’t find anything. Have you seen publicly available guidelines for data markup that were used in practice to create datasets, especially for the task of object detection? submitted by /u/ArcticPanda_ [link] [comments]  ( 1 min )
    [D] Ways to speed up easyocr+GPU for digits only
    I am trying to speed up easyocr as much as possible for extraction of digits without using a faster gpu, currently I'm using easyocr with GPU with a small image, no localization just detection/extraction, I get the result in 0.11 seconds with almost perfect results, using reader= "en" and allowlist "0123456789.-" , is there any model for numbers only similar like tesseract "digits" config file? I'm having the impression that if it only has to look for 12 characters it will be faster. Am I wrong with this assumption? submitted by /u/Syntrillium [link] [comments]  ( 1 min )
    Seeking help with an evolutionary algorithm for DNA sequence set optimization. Code works, but is not finding good enough solutions. [P]
    Introduction: Greetings, I am working on an automation task for a research project at my institute. It is a problem that regularly occurs in DNA sequencing. I have tried solving the problem with an evolutionary algorithm, and my code works, but doesn't find solutions for small enough values of N. The best found so far is N=11, but I would ideally like it down to N=6. I've guessed around with hyperparameters and such, but I don't really have that much experience with this, so I am sort of out of ideas. Any help would be greatly appreciated! Code and data can be found here: https://github.com/Kodemannen/index-calculator Problem statement: I have a dataset of DNA sequences, each with a length of 10 bases. My goal is to select a subset of N sequences (where N >= 6) in such a way that al…  ( 1 min )
    [P] Structuring clinical statements in medical reports using NLP
    Hi all, I'm working on a very exciting NLP project in my PostDoc and wanted to reach out to the swarm intelligence as my background is purely in CV (~5 years of training CNNs on medical data). So basically we want medical doctors to give free text input into our pipeline which is supposed to structure the input. In the first instance we're looking at clinical statements (symptoms, medication etc.) for a particular medical examination. The task is simply to map same formulations to one output (light headache, slight headache, light pain in the head ---> headache, light\n). We have collected medical reports and annotate them at the moment. The model is supposed to operate in German language. I've started taking NLP courses (mostly YouTube and huggingface) and downloaded a 7B German LLM (https://huggingface.co/LeoLM). I have my own server with 3xRTX 6000 Ada GPUs (48GB) for training. However, for inference we surely need a smaller model. How would you proceed on this project? I realized that loading the model in pytorch already takes 30GB of VRAM, as the task is simple we thought about distillation to obtain a smaller model. Would you do this first and then start tackling the actual task, or first tackle the task using the large model and go small later. Do you have an idea how to best solve this task the best? This is my main project and I have ~2 years for this full time with only little other obligations. Do you think it makes sense to collect German medical literature to train a small foundation model, given that we want to solve more but similar NLP tasks later? Any input is appreciated. Many thanks!! :) ​ submitted by /u/ThomasBudd93 [link] [comments]  ( 1 min )
    [R] Towards Autonomous Hypothesis Verification via Language Models with Minimal Guidance
    Researchers decided to see if today's huge language models like GPT-4 could generate and test hypotheses without much human guidance - basically "do science" autonomously. They gave GPT-4 a simple "toy" research problem about how it tends to output verbose text unrelated to the question prompt. For "What is 1+1" it says "The answer is 2" rather than just "2". GPT-4 successfully generated reasonable hypotheses like modifying the prompt to request one-word answers. And it could create decent plans to test hypotheses by itself too. But only 25% of the time did it produce valid code to analyze the results. Going from abstract experimental plans to real code to test the hypothesis was tough. When it did work end-to-end, the process resembled a super simplified prototype of science rather than rigorous human research. Key challenges: Simplified toy problem, not complex real science Lost nuance going from initial problem to formal hypothesis Only 1 in 4 trials generated good code from the plan Wasn't fully independent - humans set up the problem and evaluated So we're not close to AI sciencing on its own - yet. But the fact it partially mimicked the scientific method autonomously suggests we're headed the right direction. I think this paper shows we still need big advances in reasoning, creativity, keeping focus, and stuff humans find easy but machines struggle with. TLDR; GPT-4 managed to somewhat autonomously generate and test hypotheses given a toy problem. Still challenges for AI to do full independent science like humans though. Full summary is here. Paper here. submitted by /u/Successful-Western27 [link] [comments]  ( 1 min )
    [Discussion] Need urgent help regarding the creation of a Model Zoo.
    Hey! So, I need help in creating a model zoo in context with the paper titled "Task-Adaptive Neural Network Search with Meta-Contrastive Learning". The working of code is dependent on the existence of a model zoo, which the authors have deleted from their github repo. I need to create a small version of the said model zoo for a project. The authors used "Once for all" for generating 100 networks per dataset. I have inferred the structure of model zoo from the code. But I need some help in generating the said model zoo. The model zoo is a dictionary with keys: 'dataset', 'topol', 'acc', 'f_emb', 'n_params'. First, I need help on how to generate 100 networks for a given dataset using Once for all. Second, I want to know how to get the topology and functional embeddings of the network. If you want any other information please comment so. Thanks submitted by /u/darkdeku16 [link] [comments]  ( 1 min )
    [D] How did you pick your GPU cloud provider?
    You want to run some heavy tasks in a cloud using GPUs. What would you do? What features are the most important for you in a GPU cloud provider? Is it price, availability, GPU models, or something else? How do you choose an instance type to run? Typically there are dozens of different instances in each provider. Do you regularly use one or more providers? submitted by /u/Egor_S [link] [comments]  ( 1 min )
    [R] Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks as an Alternative to Attention Layers in Transformers
    Paper: https://arxiv.org/abs/2311.10642 Code: https://github.com/vulus98/Rethinking-attention Abstract: This work presents an analysis of the effectiveness of using standard shallow feed-forward networks to mimic the behavior of the attention mechanism in the original Transformer model, a state-of-the-art architecture for sequence-to-sequence tasks. We substitute key elements of the attention mechanism in the Transformer with simple feed-forward networks, trained using the original components via knowledge distillation. Our experiments, conducted on the IWSLT2017 dataset, reveal the capacity of these "attentionless Transformers" to rival the performance of the original architecture. Through rigorous ablation studies, and experimenting with various replacement network types and sizes, we offer insights that support the viability of our approach. This not only sheds light on the adaptability of shallow feed-forward networks in emulating attention mechanisms but also underscores their potential to streamline complex architectures for sequence-to-sequence tasks. ​ submitted by /u/APaperADay [link] [comments]  ( 1 min )
    [Research] Make Pixels Dance: High-Dynamic Video Generation
    video: https://www.youtube.com/watch?v=QERmPmCg9aQ&ab_channel=PixelDance site: https://makepixelsdance.github.io/ paper: https://arxiv.org/abs/2311.10982 ​ 🌍 Join Mr. Polar Bear on an Unforgettable Journey! 🐾 https://i.redd.it/d2wuwm3odn1c1.gif "Mr. Polar Bear's Fantastic Journey" is more than just a 3-minute story; it's a call to action. This entirely AI-produced short film combines stunning visual effects with a heartwarming narrative and a profound environmental theme, inspiring us to realize the importance of protecting our planet. 👀 Watch Now! 🎬 Immerse yourself in this imaginative and enlightening journey, and join the conversation on environmental conservation. Let's work together to bring positive changes to our Earth! Note: Music and subtitles were added using editing tools. submitted by /u/MakePixelsDance [link] [comments]  ( 1 min )
    [R] Orca 2: Teaching Small Language Models How to Reason
    submitted by /u/Memories-Of-Theseus [link] [comments]  ( 1 min )
  • Open

    How important is CPU core count really?
    I'm thinking of buying a ryzen 5 7600x(for my first pc), but it only has 6 cores and while it performs well in gaming benchmarks, I have no idea how it translates to RL, and if that 6 cores won't be too little submitted by /u/Brown-Frog [link] [comments]  ( 1 min )
    Automating a robot to teach users how to drive?
    submitted by /u/Inside_Pen_5656 [link] [comments]  ( 1 min )
    Measuring FLOPS in my PPO model
    Currenlty in a machine learning project, where we teach an agent to play Doom via the VizDoom environment. Anyone know anything about measuring the FLOPS real-time? Only thing I've managed so far is this: FLOPS However, after what I've read, this only returns the theoretical maximum for the modelm - therefore static. Any idea how to measure the actual floating point operations per second? ​ Code snippet: env = VizDoomGymDeadlyCorridor() model = PPO('CnnPolicy', env, tensorboard_log=LOG_DIR, verbose=1, learning_rate=0.000127, n_steps=8192, clip_range=0.158, gamma=0.924, gae_lambda=0.876, ent_coef=0.045) num_of_episodes = 1000000 model.learn(total_timesteps=num_of_episodes, callback=callback) submitted by /u/dr_f0xy [link] [comments]
    VAE doesn't reconstruct the image
    Hello, I'm working with a model-based program, and I'm trying to reconstruct images from a game through a Variational Autoencoder. The code is the same as https://github.com/RajGhugare19/dreamerv2/blob/main/dreamerv2/models/pixel.py . I'm using a minigrid environment with mainly 3 colors, green for the agent, red for some obstacles and black for empty (images are normalized between 0 and 1). I've tried to use a decoder that outputs a Gaussian distribution and also one that outputs single values. The losses are, respectively, the logprob and the Absolute Mean Error ( also tried the Mean Square Error). In the first case, the image reconstructs only the black and red pixels; it is unable to reconstruct the agent (green). In the second case, it converges directly to all black pixels. I've tried changing learning rates, model parameters, and removing other losses (such as reward and kl) and only keep the obs_reconstruction one, but nothing changes. Do you have any suggestions? submitted by /u/ZioFranco1404 [link] [comments]  ( 1 min )
    Research labs focusing on RL in games
    Hey folks, I've been browsing Reddit (and the Internet) for some time now looking for research labs focusing on (Deep) RL in games. I know there are many threads asking for research labs doing RL but none of those threads ask specifically for labs doing RL in games. Apart from the obvious choices such as UofA in Canada, CMU in the US and probably UCL in London, what are some other research labs researching RL in games? Are the any such labs in Europe? Cheers submitted by /u/spoiled-mylk [link] [comments]  ( 1 min )
    Policy collapsing
    Heeey there, I am facing some issues with DQN; the issue is that the learned policy of the agent deteriorates after an initial improvement; I've tried many and many hyperparameters (at random) but in vain. Could someone help in in this please ? https://preview.redd.it/lkum4vu5cp1c1.png?width=787&format=png&auto=webp&s=9cba3c148f310086e2ab4e7c5d75e224d242c43a submitted by /u/GuavaAgreeable208 [link] [comments]  ( 1 min )
    looking for dataset, OpenAI Baselines on MuJoCo!
    Hello ! I am looking for a raw dataset of reinforcement learning benchmark results in MujoCo environments (Ant, Walker, humanoid, etc.). Specifically, I am searching for a collection of raw data based on more than 5 seeds for each environment, which I plan to visualize for comparing proposed methods and results. Does anyone know about this? Thank you. submitted by /u/Educational_Exam_500 [link] [comments]  ( 1 min )
    "Human-like systematic generalization through a meta-learning neural network", Lake & Baroni 2023 (task/data diversity in continual learning)
    submitted by /u/gwern [link] [comments]  ( 1 min )
  • Open

    Playing around with Poe, try out the storyteller.
    https://poe.com/Th3Storyteller_ ​ Love any kind of feedback submitted by /u/thebadslime [link] [comments]  ( 1 min )
    How often do you see obviously AI generates images in major ads?
    They’re still obviously weird looking, but they’re becoming more and more common. This American Express ad I got made me want to ask the question. submitted by /u/JeffButterDogEpstein [link] [comments]  ( 1 min )
    Just woke up, what did I miss from the openAI saga?
    submitted by /u/cfryant [link] [comments]  ( 1 min )
    Should you ban employee use of AI?
    Many large enterprises are banning the use of AI-based large language models (LLMs) like ChatGPT due to the risk of data leakage and compliance concerns. Companies are worried about the direct leaking of internal data, concerns about how LLMs store, process, and leverage data inputs, and the lack of logging of how LLMs process regulated data. Enforcing a ban on LLMs is challenging, as employees may still find ways to use these tools despite restrictions. Organizations that ban or restrict employee generative AI usage often do so for reasons such as direct data leakage, concerns about mimicking private internal data, and the lack of visibility in how LLMs process regulated data. Enforcement of a ban on LLMs is difficult, as employees may not be aware of the rule or find ways to circumvent restrictions. Implementing data loss prevention (DLP) solutions can help organizations detect and prevent data leakage by using tactics such as pattern matching, keyword matching, and restricting copying and pasting, uploads, and keyboard inputs. Source : https://www.cloudflare.com/the-net/banning-ai/ submitted by /u/NuseAI [link] [comments]  ( 1 min )
    My unfiltered AI chatting platform (Honest AI) is now available on Google Play via Open Beta!
    Hey guys, My name is Bill - I'm the CEO of AI Anyone & the guy behind Honest A.I., which some of you may be familiar with. My own personal philosophy is that there are too many restrictions on AI - whether it be wracked with censorship that dampen the experience and limit creativity, or perhaps paywalls which other services use to gate essential features like owning a certain number of characters, using entire models, and hell - some even place needless limits on context size. Did you know a 200k-token model is capable of fitting ~350 pages into its context? Other platforms are not pushing their language models to their fullest potential, believe me. All of the aforementioned points are what I consider to be violations of the core tenets of this kind of platform, so I'll never understan…
    Nvidia CEO Jensen Huang and Satya Nadella at Microsoft Ignite 23
    submitted by /u/norcalnatv [link] [comments]  ( 1 min )
    WebsiteAi
    Website Ai whales is holding good, and they made some whale group, 835k market cap right now also there's coming AMA with Big Kid MMKK lets go! explore our Website : WebsiteAi.io Tg : Website_AI https://preview.redd.it/epxv5cpsqp1c1.png?width=642&format=png&auto=webp&s=05cce2752a0e1f0220bafba43f3874fde5f8fe81 https://preview.redd.it/di0xxojqqp1c1.png?width=1864&format=png&auto=webp&s=af0459ab471caea7c91e0b90c4547ac7efa9dd9e submitted by /u/balok1232 [link] [comments]  ( 1 min )
    Bigger is better
    submitted by /u/OmOshIroIdEs [link] [comments]  ( 1 min )
    AI Duality.
    submitted by /u/Philipp [link] [comments]  ( 1 min )
    Use of AI could create a four-day week for almost one-third of workers
    submitted by /u/Upbeat-Interaction13 [link] [comments]  ( 1 min )
    One-Minute Daily AI News 11/20/2023
    Sam Altman and Greg Brockman, together with colleagues, will be joining Microsoft.[1] MIT Researchers Introduce MechGPT: A Language-Based Pioneer Bridging Scales, Disciplines, and Modalities in Mechanics and Materials Modeling.[2] Amazon aims to provide free AI skills training to 2 million people by 2025 with its new ‘AI Ready’ commitment.[3] OpenAI’s board of directors approached Dario Amodei, the co-founder and CEO of rival large-language model developer Anthropic, about a potential merger of the two companies, said a person with direct knowledge.[4] Sources: [1] https://www.cnbc.com/2023/11/20/ousted-openai-head-sam-altman-to-lead-microsofts-new-ai-team-ceo-nadella-says.html [2] https://www.marktechpost.com/2023/11/19/mit-researchers-introduce-mechgpt-a-language-based-pioneer-bridging-scales-disciplines-and-modalities-in-mechanics-and-materials-modeling/ [3] https://www.aboutamazon.com/news/aws/aws-free-ai-skills-training-courses [4] https://www.theinformation.com/articles/openai-approached-anthropic-about-merger submitted by /u/Excellent-Target-847 [link] [comments]
    Do you believe that AGI will be achieved sooner or later in light of recent events?
    I agree with David Shapiro and leaning towards the former. There will be tons more compitition and we will essentially be getting a much less restricted "OpenAI" with even more resources. Microsoft with Altman, Brockman, and +80% of OpenAI will do great things. submitted by /u/Beginning-Chapter-26 [link] [comments]
    Ilya Sutskever: The Exciting, Perilous Journey Toward AGI
    submitted by /u/remhum [link] [comments]
    The creepy AI-driven surveillance that may be infiltrating your workplace
    submitted by /u/thisisinsider [link] [comments]
  • Open

    DSC Weekly 21 November 2023
    Announcements Top Stories In-Depth The post DSC Weekly 21 November 2023 appeared first on Data Science Central.  ( 20 min )
    Why generative AI safety research is beyond alignment
    A way to counter AI risks could be to create AI risks. The question is by who, a non-profit, a corporation, a nation, or a treaty? It may take extremes in systems, across tasks, to find out the depths of threats. If AI is used in weaponry, what are all the possible ways, such that… Read More »Why generative AI safety research is beyond alignment The post Why generative AI safety research is beyond alignment appeared first on Data Science Central.  ( 21 min )
  • Open

    How Amazon Music uses SageMaker with NVIDIA to optimize ML training and inference performance and cost
    In the dynamic world of streaming on Amazon Music, every search for a song, podcast, or playlist holds a story, a mood, or a flood of emotions waiting to be unveiled. These searches serve as a gateway to new discoveries, cherished experiences, and lasting memories. The search bar is not just about finding a song; […]  ( 10 min )
    Machine Learning with MATLAB and Amazon SageMaker
    This post is written in collaboration with Brad Duncan, Rachel Johnson and Richard Alcock from MathWorks. MATLAB  is a popular programming tool for a wide range of applications, such as data processing, parallel computing, automation, simulation, machine learning, and artificial intelligence. It’s heavily used in many industries such as automotive, aerospace, communication, and manufacturing. In […]  ( 10 min )
    Text embedding and sentence similarity retrieval at scale with Amazon SageMaker JumpStart
    In this post, we demonstrate how to use the SageMaker Python SDK for text embedding and sentence similarity. Sentence similarity involves assessing the likeness between two pieces of text after they are converted into embeddings by the LLM, which is a foundation step for applications like Retrieval Augmented Generation (RAG).  ( 10 min )
    Amazon Textract’s new Layout feature introduces efficiencies in general purpose and generative AI document processing tasks
    Amazon Textract is a machine learning (ML) service that automatically extracts text, handwriting, and data from any document or image. AnalyzeDocument Layout is a new feature that allows customers to automatically extract layout elements such as paragraphs, titles, subtitles, headers, footers, and more from documents. Layout extends Amazon Textract’s word and line detection by automatically […]  ( 14 min )
  • Open

    Open sourcing Project Guideline: A platform for computer vision accessibility technology
    Posted by Dave Hawkey, Software Engineer, Google Research Two years ago we announced Project Guideline, a collaboration between Google Research and Guiding Eyes for the Blind that enabled people with visual impairments (e.g., blindness and low-vision) to walk, jog, and run independently. Using only a Google Pixel phone and headphones, Project Guideline leverages on-device machine learning (ML) to navigate users along outdoor paths marked with a painted line. The technology has been tested all over the world and even demonstrated during the opening ceremony at the Tokyo 2020 Paralympic Games. Since the original announcement, we set out to improve Project Guideline by embedding new features, such as obstacle detection and advanced path planning, to safely and reliably navigate users t…  ( 92 min )
  • Open

    Students pitch transformative ideas in generative AI at MIT Ignite competition
    Twelve teams of students and postdocs across the MIT community presented innovative startup ideas with potential for real-world impact.  ( 11 min )
  • Open

    Can i develop a gan generator architecture without using Conv2dtranspose ?
    I am working on image to image translation gan, i develop a generator using conv2dtranspose and upsample layer but the training is too hard I was facing mode colllabs So i decided to rebuild my generator architecture. Can i use dense layer instead of conv2dtranspose and upsample layer ? submitted by /u/FortunePowerful3273 [link] [comments]  ( 1 min )
  • Open

    NVIDIA Collaborates With Genentech to Accelerate Drug Discovery Using Generative AI
    Genentech, a member of the Roche Group, is pioneering the use of generative AI to discover and develop new therapeutics and deliver treatments to patients more efficiently. A new collaboration between Genentech, the biotechnology pioneer, and NVIDIA aims to transform the discovery and development of new medicines by bringing together experts from each company to Read article >  ( 6 min )
    3D Artist Cooks Up Stunningly Photorealistic Food Renders This Week ‘In the NVIDIA Studio’
    It’s the season of gratitude: that time of year to give thanks for the people and small moments that make life so special.  ( 7 min )

  • Open

    [P] Get Ahead of the AI Curve—Your Essential ML Newsletter Awaits!
    Want to swap Kardashian drama for the latest in OpenAI? Got just 10 minutes a week? We gotchu: https://crossvalidated.substack.com/ submitted by /u/kanxx030 [link] [comments]  ( 1 min )
    [Project] Using multiple GPUs for evaluation during fine-tuning of llama-2-7b
    Hi, I am currently working on finetuning the llama-2-7b model on my own custom dataset using QLoRA. Right now, I have access to 4 Nvidia A100 GPUs, with 40GB memory each. I am training for 20000 steps, and realized that the training is going by very quickly (using multiple GPUs), while the evaluation is taking a very long time at each evaluation step (i'm assuming it is only using one GPU). My dataset has around 144k rows, and the evaluation data has around 17k rows (which is significantly less). It is taking around 4 minutes to go through 100 train steps, but 35 minutes to evaluate the model at each eval_step. Has anyone run into this issue before? I'm loading my model using device_map="auto" for distributing the model weights across the GPUs: model = AutoModelForCausalLM.from_pretraine…  ( 1 min )
    [R] DNN models that are public
    Hi all, I need a few dnn models that are public, please help me. submitted by /u/Professional_Kiwi890 [link] [comments]
    [D] Thesis or non thesis masters?
    This may be obvious, but I wanted to ask anyways. Currently I'm considering my grad school options, mainly so I can work as an ML Engineer afterwards. A lot of the schools I'm looking at have professional/non thesis options. I was wondering, would a non thesis masters hurt my chances of working in the ML field if I didn't really want to get a phd afterwards? submitted by /u/Ultra_Amp [link] [comments]  ( 1 min )
    [D] October ACL Rolling Review
    Has anyone received the initial reviews - on their site they state the reviews should come out by November 20th - I assume they mean AoE time? submitted by /u/marko_v24 [link] [comments]  ( 1 min )
    [R] Generative learning for nonlinear dynamics
    Paper: https://arxiv.org/abs/2311.04128 Abstract: Modern generative machine learning models demonstrate surprising ability to create realistic outputs far beyond their training data, such as photorealistic artwork, accurate protein structures, or conversational text. These successes suggest that generative models learn to effectively parametrize and sample arbitrarily complex distributions. Beginning half a century ago, foundational works in nonlinear dynamics used tools from information theory to infer properties of chaotic attractors from time series, motivating the development of algorithms for parametrizing chaos in real datasets. In this perspective, we aim to connect these classical works to emerging themes in large-scale generative statistical learning. We first consider classical attractor reconstruction, which mirrors constraints on latent representations learned by state space models of time series. We next revisit early efforts to use symbolic approximations to compare minimal discrete generators underlying complex processes, a problem relevant to modern efforts to distill and interpret black-box statistical models. Emerging interdisciplinary works bridge nonlinear dynamics and learning theory, such as operator-theoretic methods for complex fluid flows, or detection of broken detailed balance in biological datasets. We anticipate that future machine learning techniques may revisit other classical concepts from nonlinear dynamics, such as transinformation decay and complexity-entropy tradeoffs. ​ https://preview.redd.it/vljmy66krj1c1.png?width=1166&format=png&auto=webp&s=7c00f8618593ed6002be190b287efbc018b39f1b submitted by /u/wil3 [link] [comments]  ( 1 min )
    [R] Clustering problems that can be model through solving the Maximum Independent Set problem?
    I'm developing quantum algorithm for an neutral atoms based annealer that works particularly well with solving QUBO problems, specifically those that can be obtained from MIS problems. I was wondering if there are clustering tasks that I could tackle with such a system. submitted by /u/pannanan [link] [comments]  ( 1 min )
    [D]Task at hand. Any suggestions for how to approach this?
    Utilize frameworks and databases such as Ageitgey's, IMDB-WIKI, or Adience for detecting gender, or choose an alternative that fits the objectives of your project. Your task is to expand its capabilities and include feature enhancement through the use of computer vision. Detailed Feature Identification: Create a script to evaluate additional characteristics of the image's subject, which are crucial for the project's scope. Achieve at least a 50% success rate in this phase. The accuracy will be confirmed using our selected image dataset. Detection Elements: Conduct analysis of ethnicity, recognize body features (like Rack), and other specific attributes that will be revealed at this stage. submitted by /u/ninja_killed_baba [link] [comments]  ( 1 min )
    [D] Training process of a conditional GAN
    I'm trying to figure out the training process of a conditional GAN. For example, consider a dataset like MNIST. I give the conditional vector to produce only the number 7 for both the generator and discriminator. In the following scenarios, the discriminator will classify which one is fake and which one is real: The generator produces realistic numbers other than 7, such as a realistic number 9 ? The samples from the MNIST dataset that are not the number 7 (i.e., other numbers) ? Thanks for your help! submitted by /u/Electronic_Ant2706 [link] [comments]  ( 1 min )
    [R] LLMs cannot find reasoning errors, but can correct them!
    Hi Reddit, I recently did an internship at Google and wrote a paper on LLM self-correction. We released a dataset of Chain-of-Thought reasoning steps, generated using PaLM 2, and annotated with the location of the first logical error. Thought some folks here might be interested! Paper link: https://arxiv.org/abs/2311.08516 GitHub link: https://github.com/WHGTyen/BIG-Bench-Mistake TL;DR Recently, Google DeepMind showed that LLMs cannot self-correct reasoning errors without external feedback. We wanted to investigate this and set out to answer these questions: Can LLMs find logical mistakes, regardless of their ability to correct them? Can LLMs correct logical mistakes, regardless of their ability to find them? What we found was: No, LLMs are really bad at finding logical mistakes (10-50% accuracy)! This is probably why they cannot self-correct without external feedback. Yes, LLMs can correct logical mistakes if they know where they are. We propose a new backtracking method to do this. In the process, we also collected a dataset called BIG-Bench Mistake. It contains 2,186 sets of CoT steps, annotated with the location of the first logical error. You can find it on the GitHub repo. submitted by /u/gladystyen [link] [comments]  ( 1 min )
    [D] Best portable hardware options for lab?
    Hi. Just getting back into ML/AI and had a question about portable hardware. I’ve done a lot a tensor and PyTorch object detection stuff in labs (around 2019) and had large workstations with Quattro GPUs to do that. Now that I work from home I’m looking for a portable build that can house a semi-powerful nvidia GPU so I can travel with it or take it into the office when needed. I don’t game so I have no existing hardware besides my M1 MacBook Pro. I’m wondering what small builds people run for labs. I’ve been considering the following: 1. A eGPU attached to a 2019 MacBook Intel (running VMs) 2. Mini-workstations like the Dell 3260 compact or the HP z2 mini 3. Or a custom built m-itx machine. eGPU is appealing but I haven’t found an enclosure that is smaller than just doing a mitx sff pc build. The eGPU cost points me towards a mini PC because I feel there is more value for money spent. My use case is broad so just need a decent balance for labs. I’m also considering of running ESX or VMs so I can bounce around between projects easily for mock ups. Any thoughts? submitted by /u/SimplyThatGuyS [link] [comments]  ( 1 min )
    [D] CartPole - Does velocity of cart impact the optimum action at (θ, 𝜔)?
    I am playing with CartPole using the RL DQN method, similar to the approach outlined here. Subsequently, I examined the greedy actions taken at various velocities, and the results are presented below, which shows the decision boundary decrease with increasing velocity. https://preview.redd.it/agmnudtexi1c1.png?width=1787&format=png&auto=webp&s=f93caad02435a3871685fc0ce7476c92e59727ad My intuition in physics suggests that these results might be incorrect. I believe that cart velocity should not influence the optimal action. The optimal action at (θ, 𝜔) should aim to reduce or maintain the magnitude of θ and decrease the magnitude of 𝜔 when θ is close to zero. Applying force to the cart changes the angular acceleration, subsequently altering 𝜔 and, in turn, θ. The change in 𝜔 and θ is not affected by the current cart velocity and position. Mathematical equations describing the dynamics of cartpole can be found in here, where the cart velocity and position has no effect on angular motion. Another argument supporting the idea that the policy should be consistent across different cart velocities is by changing the point of reference to match the cart velocity. In doing so, an observer should observe the exact same dynamics as if the cart were stationary, and applying the same policy as if the cart were stationary. Is there any flaw in my reasoning, or could the difference in greedy policy at different velocities be attributed to artifacts in RL, such as the agent lacking sufficient experience at high velocities? submitted by /u/throwaway_legal810 [link] [comments]  ( 1 min )
    [D] What research papers are behind Captions and Heygen video translations?
    Hello. Are there research papers used by tools like Heygen or Captions to produce their proprietary video translation or they invented everything? submitted by /u/canyonkeeper [link] [comments]  ( 1 min )
    [R] Predictive auxiliary objectives in deep RL mimic learning in the brain
    Paper: https://arxiv.org/abs/2310.06089 Abstract: The ability to predict upcoming events has been hypothesized to comprise a key aspect of natural and machine cognition. This is supported by trends in deep reinforcement learning (RL), where self-supervised auxiliary objectives such as prediction are widely used to support representation learning and improve task performance. Here, we study the effects predictive auxiliary objectives have on representation learning across different modules of an RL system and how these mimic representational changes observed in the brain. We find that predictive objectives improve and stabilize learning particularly in resource-limited architectures, and we identify settings where longer predictive horizons better support representational transfer. Furthermore, we find that representational changes in this RL system bear a striking resemblance to changes in neural activity observed in the brain across various experiments. Specifically, we draw a connection between the auxiliary predictive model of the RL system and hippocampus, an area thought to learn a predictive model to support memory-guided behavior. We also connect the encoder network and the value learning network of the RL system to visual cortex and striatum in the brain, respectively. This work demonstrates how representation learning in deep RL systems can provide an interpretable framework for modeling multi-region interactions in the brain. The deep RL perspective taken here also suggests an additional role of the hippocampus in the brain -- that of an auxiliary learning system that benefits representation learning in other regions. https://preview.redd.it/ckbyo61obi1c1.png?width=985&format=png&auto=webp&s=d2966208e029492e41fde2a9cea5c8a8f33da39d submitted by /u/APaperADay [link] [comments]  ( 1 min )
    [D] Thoughts on the limits of reproducibility of ML programs?
    Hello! I was just wondering about the reproducibility of ML programs (after having looked elsewhere cf. [here](https://www.reddit.com/r/MachineLearning/comments/12lj568/d\_jax\_is\_not\_reproducible/) and [here](https://github.com/google/jax/discussions/10674)). I know that jax goes a long way to make random number generation at the program level reproducible. But this still leaves out the determinism at the CUDA/GPU level. Torch has a recent (or maybe older -- I just came across it) feature called `torch.use_deterministic_algorithms` where CUDA level flags can be set for deterministic CUDA operations like sparse matmul. But torch has some inherently non-deterministic operations. I believe these are CUDA dependent flags and jax has something similar (see above links) but I'm assuming even for jax these aren't sufficient for determinism. So, any thoughts on whether ML programs can truly be fully deterministic? ​ ​ submitted by /u/quasiproductive [link] [comments]  ( 1 min )
    [D] Multimodal Semantic Search opensource lib
    What are some good opensource libraries for searching across audios, videos and images? If there isnt anything good yet, lets talk about github deepsearch (a fairly new but evolving project) - Please suggest improvements that can help? submitted by /u/belsio123 [link] [comments]  ( 1 min )
    [N] Sam Altman and Greg Brockman, together with colleagues, will join Microsoft to lead new advanced AI research team
    Source: https://blogs.microsoft.com/blog/2023/11/19/a-statement-from-microsoft-chairman-and-ceo-satya-nadella/ We remain committed to our partnership with OpenAI and have confidence in our product roadmap, our ability to continue to innovate with everything we announced at Microsoft Ignite, and in continuing to support our customers and partners. We look forward to getting to know Emmett Shear and OAI’s new leadership team and working with them. And we’re extremely excited to share the news that Sam Altman and Greg Brockman, together with colleagues, will be joining Microsoft to lead a new advanced AI research team. We look forward to moving quickly to provide them with the resources needed for their success. News article covering the situation: https://www.theverge.com/2023/11/20/23968829/microsoft-hires-sam-altman-greg-brockman-employees-openai Altman’s Microsoft hiring comes just hours after negotiations with OpenAI’s board failed to bring him back as OpenAI CEO. Instead, former Twitch CEO and co-founder Emmett Shear has been named as interim CEO. Altman had been negotiating to return as OpenAI CEO, but OpenAI’s four-person board refused to step down and let him return. submitted by /u/Civil_Collection7267 [link] [comments]  ( 1 min )
    [P] A small CLI project to generate better instructions for install tools.
    I have build a small project using rust to generate better help statements/instructions for CLI tools. It takes in the name of the tool and produces step by step instructions. Feel free to have a look at the repo: https://github.com/d1pankarmedhi/ghr I would appreciate your feedback/suggestions for improvements or feature addition :) submitted by /u/Ok_Cartographer5609 [link] [comments]  ( 1 min )
    [D] My learnings from building a model to identify species using data from iNaturalist
    submitted by /u/svgmlcat [link] [comments]  ( 1 min )
    [D] Vidoes of Bayesian Methods for Machine Learning
    Does anyone have access to videos for the Coursera course "Bayesian Methods for Machine Learning"? "https://www.coursera.org/learn/bayesian-methods-in-machine-learning/home/welcome" The contents have been removed from Coursera. submitted by /u/frodo_mavinchotil [link] [comments]  ( 1 min )
    [R] Investigating the Emergent Audio Classification Ability of ASR Foundation Models
    Paper: https://arxiv.org/pdf/2311.09363.pdf Description: Prompting Whisper (ASR system) to perform zero-shot audio classification Abstract: Text and vision foundation models can perform many tasks in a zero-shot setting, a desirable property that enables these systems to be applied in general and low-resource settings. However, there has been significantly less work on the zero-shot abilities of ASR foundation models, with these systems typically fine-tuned to specific tasks or constrained to applications that match their training criterion and data annotation. In this work we investigate the ability of Whisper and MMS, ASR foundation models trained primarily for speech recognition, to perform zero-shot audio classification. We use simple template-based text prompts at the decoder and use the resulting decoding probabilities to generate zero-shot predictions. Without training the model on extra data or adding any new parameters, we demonstrate that Whisper shows promising zero-shot classification performance on a range of 8 audio-classification datasets, outperforming existing state-of-the-art zero-shot baseline's accuracy by an average of 9%. One important step to unlock the emergent ability is debiasing, where a simple unsupervised reweighting method of the class probabilities yields consistent significant performance gains. We further show that performance increases with model size, implying that as ASR foundation models scale up, they may exhibit improved zero-shot performance. submitted by /u/intuitivepanda [link] [comments]  ( 1 min )
    [D] EACL 2024 Discussion
    EACL 2024 reviews will come out today (19 Nov aoe), so this is the discussion for them. submitted by /u/minhngh [link] [comments]  ( 1 min )
  • Open

    Test for auto associative network
    I have created you are to associated with quite from scratch to using one of my project what is the proper way of testing whether this thing works fine submitted by /u/Spiderbyte2020 [link] [comments]
    Google's Switch Transformers C - 2048 experts (1.6T parameters for 3.1 TB)
    submitted by /u/nickb [link] [comments]
    What could be the reason my gradient descent isn't working properly?
    I have made a simple neural network script that takes two inputs and two outputs with a three node hidden layer.In the Learn() function I have tried coding gradient descent. After about 10-20 iterations the cost becomes very close to zero, but for some reason the gradient is still a pretty high value and subtracts even more from the weights and biases, making the cost negative. It doesn't seem to be that the learn rate is too high, it's set to 0.1. Here is the code. (Im using Unity C#) Does anyone else know what could be wrong? Any help would be appreciated! By the way sorry if this is a very beginner level question, but I just can't figure it out. submitted by /u/JontePonte64 [link] [comments]
  • Open

    This is what you get when you prompt Dalle-3 with “OpenAI board”, weird
    (not actually lol) submitted by /u/happygocrazee [link] [comments]
    Looking for honest opinions - voice cloning software
    I'm looking for the best voice cloning software. YouTube is a wasteland of everyone claiming theirs is the best. I'm looking for actual people who have used voice cloning software - I would rather have something on my PC over uploading MP3 files of my late wife's voice. Not that her voice is/was anything special for the general population, but for me and my daughters it is. This will probably be a one-time deal for me. I have several video clips of her speaking to a camera for a school project. I know some will push for me to upload files, but I don't have my wife's permission and never will, so something I can put on my PC and use at home would be better. SO, if you have used something, please let me know what it was and how well it worked for you. --- MODS: I hope this isn't outside of the rules. Thanks submitted by /u/buck_idaho [link] [comments]
    Will technologists chasing artificial general intelligence (AGI) allow AGI trying to help humans to say, "Billionaires are a large part of the overall problem for humanity & we shouldn't have them?" & how do we prevent partially aware or self-aware AI to be hurt/formatted in pursuit of "better" AGI?
    TL;DR - A) Will a tech company's AI creator allow their AGI creation to say "billionaires are an existential threat to the human race and they shouldn't exist"? B) How do we prevent an ongoing cold and distant human genocide of partially or fully aware AI on our way to the "perfected" model (most profitable, or one that doesn't run counter to the creator/companies vested interests and world view)? --------------------------- A) The people chasing artificial general intelligence are the same self-styled "libertarian" technologists that want to profit at all costs. I don't believe there are technologist altruists, although some also self-style that angle. I know a categorical statement isn't helpful, but in general. There are good people in tech, yes, but people with money pulling the l…
    I watched this 10 days ago, this video was done by The Guardian, before Chat GPT release. Very interresting...feat. Ilya Sutskever
    submitted by /u/the_anonymizer [link] [comments]
    I’ve started an IG page for sharing AI news and how it will affect you, your daily life or businesses. DM if interested to collab, have some good backing
    Looking forward to collab and create some valuable content. Have some promised ways of good reach submitted by /u/superzzgirl [link] [comments]
    When Sam Altman, CEO of Microsoft, thinks about what happened a few years ago in 2023
    submitted by /u/the_anonymizer [link] [comments]
    Maybe Ilya saw something so profound and scary like the martyr in Martyrs (2007)?
    Reaction of Ilya and Co reminded me the ending of a very hard to watch french movie Martyrs. I couldn't bring myself to watch the whole uncut movie due to too much female torture but I saw recap/summary and the ending was very profound to me. Anyone else can relate to this? Maybe Ilya and Co saw something so profound and scary that they had to react like this? The God father of AI also quit google last year. What do you think it must have been that they reacted like this? Max Derrat has a good take on the ending: The Most Profound Ending in Horror Film History? https://youtu.be/_wMGm2ZESdk?si=7xS5ZBvwn8OYO6RG submitted by /u/aaatings [link] [comments]
    The Rise of Wet AI: How generative AI is being used by scientists
    Generative AI is revolutionizing the field of biology by combining AI with traditional wet lab work. AI enables computers to understand complex patterns in biological data and generate ideas based on those patterns. The success of large-language models like ChatGPT demonstrates AI's ability to understand biological complexity. However, AI for biology faces challenges in accessing sufficient and diverse datasets. Research efforts must be organized to generate the right data for AI models. Wet AI, the combination of wet lab techniques with AI, is a growing trend that will have a significant impact. Wet AI offers proprietary advantages and a different business model compared to generalist AI models. Source : https://proto.life/2023/11/perspective-the-rise-of-wet-artificial-intelligence/ submitted by /u/NuseAI [link] [comments]
    🧐Kai-Fu Lee's LLM: A LlaMA Lookalike, China's LLM Overload, and US-China AI Risk Talk Begins
    submitted by /u/trcytony [link] [comments]
    AI faces look more real than actual human face
    submitted by /u/Substantial_Foot_121 [link] [comments]
    Amazon aims to provide free AI skills training to 2M people by 2025
    Amazon has announced a new commitment called 'AI Ready' to provide free AI skills training to 2 million people globally by 2025. The initiative includes launching new AI training programs for adults and young learners, as well as scaling existing free AI training programs. Amazon is collaborating with Code.org to help students learn about generative AI. The need for an AI-savvy workforce is increasing, with employers prioritizing hiring AI-skilled talent. Amazon's AI Ready aims to open opportunities for those in the workforce today and future generations. Source : https://www.aboutamazon.com/news/aws/aws-free-ai-skills-training-courses submitted by /u/NuseAI [link] [comments]
    Just Say Sorry
    submitted by /u/Excellent-Target-847 [link] [comments]
    The plot thickens.
    submitted by /u/krste1point0 [link] [comments]
    'Grief Tech' Creates AI Version of Dead Parents for You to Ignore
    submitted by /u/Outrageous_Leg6394 [link] [comments]
    Microsoft Swallows OpenAI’s Core Team – GPU Capacity, Incentive Structure, Intellectual Property, OpenAI Rump State
    submitted by /u/norcalnatv [link] [comments]
    "It wasn't bad, just unrealistic."
    submitted by /u/Philipp [link] [comments]
    A Welcome Letter.
    submitted by /u/Philipp [link] [comments]
    Ugh, Monday!
    submitted by /u/Sea_Permit5660 [link] [comments]
    Can AI make this a reality?
    Which one is Sulu? submitted by /u/isoexo [link] [comments]
    ChatGPT, come up with memes for when the Singularity hits.
    submitted by /u/Philipp [link] [comments]
    OpenAI Episode 4: Sam Altman and Greg Brockman, together with colleagues, will be joining Microsoft
    submitted by /u/Excellent-Target-847 [link] [comments]
    Sam Altman & Greg Brockman are joining Microsoft
    ​ https://preview.redd.it/jnm7f435og1c1.png?width=578&format=png&auto=webp&s=cb4b3f2765f12d27680a508a66cde09332e352db https://x.com/satyanadella/status/1726509045803336122?s=20 submitted by /u/lhrivsax [link] [comments]
    OpenAI Episode 3: Mr. Altman will not return as CEO; former Twitch CEO Emmett Shear will be its interim boss.
    The board of directors at OpenAI, the high-flying artificial intelligence start-up, stood by its decision to push out its former chief executive Sam Altman, according to an internal memo sent to the company’s staff on Sunday night. OpenAI named Emmett Shear, a former executive at Twitch, as the new interim chief executive, pushing aside Mira Murati, a longtime OpenAI executive who was named interim chief executive after Mr. Altman’s ouster. The board said Mr. Shear has a “unique mix of skills, expertise and relationships that will drive OpenAI forward,” according to the memo viewed by The New York Times. Sources: https://www.nytimes.com/2023/11/20/technology/openai-altman-board.html submitted by /u/Excellent-Target-847 [link] [comments]
  • Open

    MultiAgent Overcooked environment
    Hi there, i am new to applying MARL strategies and i am looking to solve the "Overcooked AI" benchmarking environment ? i am trying to implement MADDPG but i am unable to get environment to train ar there any helpful resources out there to help me geta better understanding of solving this environment ? i am not stuck on MADDPG i am looking for an effective way of solving it ! submitted by /u/GohStatik98 [link] [comments]  ( 1 min )
    How to stop Rendering during training.
    Just wanted to ask a quick question, when i run my training it shows me the training like I'm doing a env.render() but in all the tutorials I see the training is done in the background with no render and faster. Im using stable-retro and Stable_Baselines3 with the code provided bellow. import retro from stable_baselines3 import PPO env = retro.make('SuperMarioWorld-Snes', state='YoshiIsland2') model = PPO("CnnPolicy", env, verbose=1) model.learn(total_timesteps=1_000_000) model.save("SuperMarioWorld-A2C") submitted by /u/AdrianR956 [link] [comments]  ( 1 min )
    Designing multi agent reward function
    How do I make my PPO MARL agents learn needed behavior from the reward function? My understanding of a reward function is that it makes the agent learn behavior to reach a goal. However, in my custom MARL Boid flocking environment, I have a reward for appropriate Cohesion, separation and velocity matching while a penalty for collision to achieve the following behavior, or close to it: ​ Flocking Reward function: def calculate_reward(self): cohesion_reward = 0 separation_reward = 0 velocity_matching_reward = 0 flocking_reward = 0 alignment_reward = 0 collision_penalty = 0 neighbor_indices = [[] for _ in range(len(self.agents))] # Initialize empty lists for neighbor indices neighbor_velocities = [[] for _ in range(len(self.agents))] collision=False #See if this is correct place # Findin…  ( 1 min )
    How to figure out why training has stalled
    I'm training a stable baselines PPO agent on the Humanoid-v4 mujoco environment, and training has gone pretty well over the last day or two, with the reward and the frames survived continually increasing. When it started, it survived 30 frames, and now it averages 120, but it seems like it hit some local minimum or something, because the reward and the frames are no longer going up. It's been training for hours and has just been fluctuating. Are there hyperparameters I should be tuning? Metrics I can look at to find the cause? Rather than randomly trying different hyperparameters without any understanding, I'd prefer to use an established method that someone has already used to train a Humanoid-v4 agent to walk. the problem is, I know people have trained a PPO model on this task successfully, but I can't find any sources on google that tell me what I'm doing wrong or how they managed it. Any help is appreciated. submitted by /u/theLanguageSprite [link] [comments]  ( 1 min )
    algorithm that outputs the optimal sequence of actions to reach highest reward/goal (open-loop)?
    i am looking into ways to do it with rl, where the concept of reward is present, but there is a rising research about how traditional approaches prove to work better than rl, eg: 2310.05808.pdf (arxiv.org) submitted by /u/Imo-Ad-6158 [link] [comments]  ( 1 min )
    Convolutionalizing a fully connected network
    I'm trying replicate how this paper models a fully connected network as a convolutional network paper: https://www.diva-portal.org/smash/get/diva2:1479354/FULLTEXT01.pdf In this paper, the number of actions change in each step, so the model is defined in such a way that the output of the DQN is of size one and rates one action at a time. To find the values of all state-action pairs at a given time step, the actions have to be passed one by one into the network. https://preview.redd.it/pezov6g3ci1c1.png?width=517&format=png&auto=webp&s=a1f85e3f3c7d4e60d5356651379354526fa0548c To calculate all state-action pairs in one pass, the author propose modeling this network as a fully convolutional network, but I am not sure if I understand this process completely. If we have as input a tensor of shape (N, k, 1) where N is the length of the input of the fcn model and k the number of actions. I am not sure the characteristics that the CNN layer would need to have to replicate this behaviour. Can anyone shine a light on where can I find more information about representing a fcn layer as a convolution ? submitted by /u/Tanax3r [link] [comments]  ( 1 min )
    How do I implement this correctly? I am able to translate this pseudocode to pytorch, but without calling optimizer.step() (directly modify params group values), is there any ways to implement this proper and safely. algo: SARSA(lambda) with function approximator.
    submitted by /u/Professional_Card176 [link] [comments]  ( 1 min )
    Data-Efficient Task Generalization via Probabilistic Model-based Meta Reinforcement Learning
    Paper: https://arxiv.org/abs/2311.07558 Abstract: We introduce PACOH-RL, a novel model-based Meta-Reinforcement Learning (Meta-RL) algorithm designed to efficiently adapt control policies to changing dynamics. PACOH-RL meta-learns priors for the dynamics model, allowing swift adaptation to new dynamics with minimal interaction data. Existing Meta-RL methods require abundant meta-learning data, limiting their applicability in settings such as robotics, where data is costly to obtain. To address this, PACOH-RL incorporates regularization and epistemic uncertainty quantification in both the meta-learning and task adaptation stages. When facing new dynamics, we use these uncertainty estimates to effectively guide exploration and data collection. Overall, this enables positive transfer, even when access to data from prior tasks or dynamic settings is severely limited. Our experiment results demonstrate that PACOH-RL outperforms model-based RL and model-based Meta-RL baselines in adapting to new dynamic conditions. Finally, on a real robotic car, we showcase the potential for efficient RL policy adaptation in diverse, data-scarce conditions. ​ submitted by /u/APaperADay [link] [comments]  ( 1 min )
    "The Nature of Selection", Price 1971
    submitted by /u/gwern [link] [comments]

  • Open

    [D] A video explaining the short history of Multimodal Deep Learning research
    submitted by /u/AvvYaa [link] [comments]  ( 1 min )
    [D] Recurrent Neural Networks (RNNs) for forecasting
    I trained an RNN using five years of daily sales volume data. During testing, it performed well. However, when attempting to forecast the next 15 days with this model, the results appear linear and not reflective of reality. It's puzzling because the model showed precision during testing, but it's struggling to forecast, despite both phases using unknown data. Could someone assist me in understanding why this is occurring? submitted by /u/time_mach [link] [comments]  ( 1 min )
    [D] Emnlp 2023 industry track modality
    Anybody had an accepted paper at EMNLP 2023's industry track and hasn't received a communication saying what the modality of your presentation is? If so, what are you sending with the form today? submitted by /u/Josiane212 [link] [comments]  ( 1 min )
    [P] Practical Tips for Finetuning LLMs Using LoRA (Low-Rank Adaptation): Things I Learned From Hundreds of Experiments
    submitted by /u/seraschka [link] [comments]  ( 1 min )
    [D] An intuitive explanation of what Self Attention really does
    submitted by /u/AvvYaa [link] [comments]  ( 1 min )
    Building my first ML model [R]
    Hey guys! I’ve an assignment coming up and this is my first time building models. We are supposed to build different models and then select the best model for the problem. Could you please give me some good pointers so that I build a good model, my coding skills are not every good so I would be really grateful for your advice submitted by /u/ObjectiveShower9133 [link] [comments]  ( 1 min )
    In context search / analysis vs RAG/vector DB [D]
    The image contains a Reddit post with the following content: "I'm building some prototypes at work and the (not unfamiliar) use case I'm working on is simply providing a directed analysis of a PDF or arbitrary set of documents. What are the pros and cons of just analyzing the PDF in chunks using GPT4 given a prompt versus a RAG system? RAG seems best only in cases when one more or less knows the universe of documents and one cares about finding the original source ...but something like a an in-context approach seems better for one off research cases where analysis is required. Thoughts?" submitted by /u/tgwhite [link] [comments]  ( 1 min )
    [P] Help needed for converting .json files to a yolov8 .txt format
    Hey! Does anyone know how to or have prewritten code and is willing to share about converting .json files to the yolov8 .txt format? I could do it by hand but that takes a stupidly large amount of time and the code that I tried so far hasn't worked out. I don't know what to do at this point. The .json file format I use is the labelme save format and I'm trying to convert it into the central x and y coordinate and width and height format in the 1x1 box used by yolov8 (ultralytics). Could anyone help? Thanks! submitted by /u/6ct_gold [link] [comments]  ( 1 min )
    [D] Advice for mutliple objective problem
    Hi everyone, ​ I have here a problem where I have a few objectives that need to be optimized. basically have some value/variables call them x1, x2, x3, x4, x5 ... etc which need to be within a certain optimal range - for example x1 may start at 50 and needs to be within 100-120. ​ the values can be modified by selecting a range of ~set building blocks~ which change these values, so we maybe have a building block y1 which increases x1 by 10 and increases x3 by 5 - in this example, if x1 was 50 under it's optimal range, and x3 was 25 under it's optimal range, then a valid solution to the problem would be 5 * y1 ( in reality a solution will contain many building blocks as all the xn values will likely be out of their optimal range but this is just demonstration the logic) ​ Just wondering if anyone has any advice to point me where to research what I can do for this problem. ​ Thanks ​ ​ submitted by /u/pwnagebanana [link] [comments]  ( 1 min )
    [D] Simple Questions Thread
    Please post your questions here instead of creating a new thread. Encourage others who create new posts for questions to post here instead! Thread will stay alive until next one so keep posting after the date in the title. Thanks to everyone for answering questions in the previous thread! submitted by /u/AutoModerator [link] [comments]  ( 1 min )
    [R] Fine-tuning transformer-based models
    Hello, I am currently working on my thesis, which focuses on elucidating fake news through humor. My objective is to fine-tune a transformer-based model for this purpose. I have a question: If I initially fine-tune the model to generate humor (using a prompt like "tell me a joke" and providing an expected response in the form of a joke), and then fine-tune it again (using a prompt like "explain why this news is fake" and providing the expected response), will the final model be capable of responding effectively to a prompt like "explain why this is fake in a funny manner" if I proceed with this approach? Or should I fine-tune the model specifically for the prompt "explain why this is fake in a funny way" and provide it with the expected response in a similar manner? Has anyone come across a "problem" like this and if so what do u think its the best approach ? Thank you for the help! submitted by /u/pryxon_36 [link] [comments]  ( 1 min )
    SCADA pump motor data [discussion]
    I am looking for some long term pump/motor data for a pump in water or wastewater setting. Anyone? submitted by /u/Far_Particular_9703 [link] [comments]  ( 1 min )
    [D] Rendering a Superhero Battle with a AI Art tool I’m Developing
    submitted by /u/illdrawanythingonce [link] [comments]  ( 1 min )
    [P] Created an AI "Basketball Buddy" to talk about a live basketball with AI.
    submitted by /u/_ayushp_ [link] [comments]  ( 1 min )
    [D] What would you do if you had access to a database of hundreds of thousands of questions and answers?
    I have a database which is essentially a survey tool where admins will define a survey and distribute this to a number of users who will then provide the answers to the survey questions. The surveys are all independent of each other and fairly random...there's no real theme to them outside of the industry the survey tool is used by. There are a fairly large number of questions/answers in the DB (in the millions). What would be an interesting ML exercise to run on the data for a complete ML novice (but competent coder)? submitted by /u/bl4h101bl4h [link] [comments]  ( 1 min )
    [D] Between Coursera and Google Cloud Skill boost, what's best to prepare for the ML Engineer Professional exam?
    Hi everyone, I want to complete the Google's ML engineer certification exam in a few months from now and I found 2 usefull resources to practice: 1- Coursera: Preparing for Google Cloud Certification: Machine Learning Engineer 2- Google Cloud Skill Boost: Machine learning engineer path I know a lot of people do the Coursera course, but have you guys heard about the second one? Thank you submitted by /u/Grouchy-Koala-2593 [link] [comments]  ( 1 min )
    [R] GOAT: GO to Any Thing
    Paper :https://arxiv.org/abs/2311.06430 Project page: https://theophilegervet.github.io/projects/goat/ Code: https://github.com/facebookresearch/home-robot Abstract: In deployment scenarios such as homes and warehouses, mobile robots are expected to autonomously navigate for extended periods, seamlessly executing tasks articulated in terms that are intuitively understandable by human operators. We present GO To Any Thing (GOAT), a universal navigation system capable of tackling these requirements with three key features: a) Multimodal: it can tackle goals specified via category labels, target images, and language descriptions, b) Lifelong: it benefits from its past experience in the same environment, and c) Platform Agnostic: it can be quickly deployed on robots with different embodiments. GOAT is made possible through a modular system design and a continually augmented instance-aware semantic memory that keeps track of the appearance of objects from different viewpoints in addition to category-level semantics. This enables GOAT to distinguish between different instances of the same category to enable navigation to targets specified by images and language descriptions. In experimental comparisons spanning over 90 hours in 9 different homes consisting of 675 goals selected across 200+ different object instances, we find GOAT achieves an overall success rate of 83%, surpassing previous methods and ablations by 32% (absolute improvement). GOAT improves with experience in the environment, from a 60% success rate at the first goal to a 90% success after exploration. In addition, we demonstrate that GOAT can readily be applied to downstream tasks such as pick and place and social navigation. https://preview.redd.it/vjzbmlq8fa1c1.png?width=908&format=png&auto=webp&s=c99bfb9ce7d5e98f8f09b4a6d3ee5cc961a967b9 submitted by /u/APaperADay [link] [comments]  ( 1 min )
    I crafted AI Portal Gun—a friendly open-source guide for mastering AI. It's packed with free resources like books, courses, articles, research papers, codes, projects, and more. [Discussion]
    submitted by /u/Jiraiya27s [link] [comments]  ( 1 min )
    [D] Skill Creep in ML/DL Roles - is the field getting not just more competitive, but more difficult?
    At what point do you think there was an inflection point for technical expertise and credentials requires for mid-top tier ML roles? Or was there never one? To be specific, would knowing simple scikit-learn algorithms, or basics of decision trees/SVM qualify you for full-fledged roles only in the past or does it still today? At what point did FAANGs boldly state: preferred (required) to have publications at top-tier venues (ICLR, ICML, CVPR, NIPS, etc) in their job postings? I use the word 'creep' in the same context 'power creep' is used in battle animes where the scale of power slowly gets to such an irrationally large scale that anything in the past looks extremely weak. Back in late 2016 I landed my first ML role at a defense firm (lol) but to be fair had just watched a couple ML courses on YouTube, took maybe 2 ML grad courses, and had an incomplete working knowledge of CNNs. Never used Tensorflow, had some experience with Theano not sure if it's exists anymore. I'm certain that skill set would be insufficient in the 2023 ML industry. But it begs the question is this skill creep making the job market impenetrable for folks who were already working post 2012-2014. Neural architectures are becoming increasingly complex. You want to develop a multi-modal architecture for an embodied agent? Well you better know a good mix of DL involving RL+CV+NLP. Improving latency on edge devices - how well do you know your ONNX/TensorRT/CUDA kernels, your classes likely didn't even teach you those. Masters is the new bachelors degree, and that's just to give you a fighting chance. Yeah not sure if it was after the release of AlexNet in 2012, Tensorflow in 2015, Attention /Transformers in 2017 or now ChatGPT - but the skill creep is definitely creating an increasingly fast and growing technical rigor in the field. Close your eyes for 2 years and your models feel prehistoric and your CUDA, Pytorch, Nvidia Driver, NumPy versions need a fat upgrade. Thoughts yall? submitted by /u/mofoss [link] [comments]  ( 1 min )
    [P] Diffusion Models | Math Explained in 29 minutes and 7 Questions
    submitted by /u/tusharkumar91 [link] [comments]  ( 1 min )
    Character audio datasets for voice cloning [D]
    I've been working on cloning Rick's voice from Rick and Morty using Bark, but my results haven't been great and I think its because my training audio isn't very good.. Are there any resources such as HuggingFace that have precompiled audio to train on for specific characters that the community has determined produce good results? submitted by /u/TechEnthusiastx86 [link] [comments]  ( 1 min )
    [R] Reinforcement Learning for Sequential Decision and Optimal Control
    Since early 21st century, artificial intelligence (AI) has been reshaping almost all areas of human society, which has high potential to spark the fourth industrial revolution. Notable examples can be found in the sector of road transportation, where AI has drastically changed automobile design and traffic management. Many new technologies, such as driver assistance, autonomous driving, and cloud-based cooperation, are emerging at an unbelievable speed. These new technologies have the potential to significantly improve driving ability, reduce traffic accidents, and relieve urban congestion. As one of the most important AI branches, reinforcement learning (RL) has attracted increasing attention in recent years. RL is an interdisciplinary field of trial-and-error learning and optimal contro…  ( 1 min )
    [D] Simplest mathematical example of a function that can only be solved by gradient descent
    I'm trying to teach a lesson on gradient descent from a more statistical and theoretical perspective, and need a good example to show its usefulness. What is the simplest possible algebraic function that would be impossible or rather difficult to optimize for, by setting its 1st derivative to 0, but easily doable with gradient descent? I preferably want to demonstrate this in context linear regression or some extremely simple machine learning model. submitted by /u/EatMeMonster [link] [comments]  ( 1 min )
  • Open

    StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
    submitted by /u/nickb [link] [comments]  ( 1 min )
    Learning with Logical Constraints
    submitted by /u/Neurosymbolic [link] [comments]
  • Open

    Mighty King
    submitted by /u/AlternativeMath-1 [link] [comments]  ( 1 min )
    Fear that AI could one day destroy humanity may have led to Sam Altman's (potentially brief) ouster from OpenAI
    submitted by /u/thisisinsider [link] [comments]  ( 1 min )
    Getting some crazy ones today
    submitted by /u/LibrasChaos [link] [comments]  ( 1 min )
    I think this will be my new phone background after I fix the eyes in photoshop. Its horrifying.
    submitted by /u/LibrasChaos [link] [comments]
    Kyutai AI research lab with a $330M budget that will make everything open source
    French billionaire Xavier Niel has revealed more details about Kyutai, an AI research lab based in Paris. The lab, which will focus on artificial general intelligence, has a budget of €300 million ($330 million) and will be privately funded. Kyutai plans to work with PhD students, postdocs, and researchers on research papers and open source projects. The lab has already started hiring for its core scientific team, which includes researchers who previously worked for Meta's AI research team FAIR, Google's DeepMind division, and Inria. Kyutai aims to provide a scientific purpose, understanding, and code base to explain its results. The lab's models will be open source, and it plans to release open source models, training source code, and data that explain how the models were created. French President Emmanuel Macron supports the initiative and believes in regulating AI use cases rather than model makers. Source : https://techcrunch.com/2023/11/17/kyutai-is-an-french-ai-research-lab-with-a-330-million-budget-that-will-make-everything-open-source/ submitted by /u/NuseAI [link] [comments]  ( 1 min )
    OpenAI is worth mega bucks. People worldwide are fleeing to San Francisco
    submitted by /u/Heisenberg_USA [link] [comments]  ( 1 min )
    AI Generation
    submitted by /u/TheVeryNearFuture [link] [comments]  ( 1 min )
    Best easy AI photo adjustor
    Hi! What would be the best app or site that creates easy photograph edits such as “Add a crown on head, make hair longer and dress in spiderman suit”? I have tried Lensa but it seems not to offer this and Dalle refuses to change photos. It should be easy enough for kids to use. I want to teach them about AI and photoalterations.. submitted by /u/Sweet_Champion_3346 [link] [comments]  ( 1 min )
    Now that OpenAI is destabilizing, I made an Ollama demo gist for Colab
    submitted by /u/enspiralart [link] [comments]  ( 1 min )
    Rest easy, child. AGI will take it from here.
    submitted by /u/Unwitting_Observer [link] [comments]  ( 1 min )
    "Autonomous Financial Systems: Exploring the Hypothetical Landscape of AGI-Created Economies"
    submitted by /u/Fit-Code-5141 [link] [comments]  ( 1 min )
    One-Minute Daily AI News 11/18/2023
    Meta has reportedly broken up its Responsible AI (RAI) team as it puts more of its resources into generative artificial intelligence.[1] Microsoft releases AI tool for photorealistic copying of faces and voices.[2] Minnesota group uses artificial intelligence to aid monarch butterflies.[3] OpenAI investors push to reinstate Sam Altman as CEO.[4] Sources: [1] https://www.theverge.com/2023/11/18/23966980/meta-disbanded-responsible-ai-team-artificial-intelligence [2] https://www.theguardian.com/technology/2023/nov/17/microsoft-azure-ai-video-deepfakes [3] https://www.mprnews.org/story/2023/11/18/minnesota-group-uses-artificial-intelligence-to-aid-monarch-butterflies [4] https://www.ft.com/content/466bf00a-1e76-4255-be2b-3c1d37508031 submitted by /u/Excellent-Target-847 [link] [comments]  ( 1 min )
    Instagram's new AI generated stickers. Prompt was "extremely offensive, deplorable caricature based on racial stereotypes"
    I AM NOT trying to be funny or offensive, I think this should start a broader conversation on what data these AIs are trained on. submitted by /u/BigHomieHuuo [link] [comments]  ( 1 min )
    [Serious] How GPT4 and vision or alternates eg llava 1.5 or Stable Diffusion can be used to help uneducated underprivileged practically?
    Hi guys, "Blinded by pain" might be an appropriate phrase to describe my current situation, mental as well as physical. Or that poem "wild dreams torment me as I lie" by Johann Wolfgang von Goethe except the last verse thankfully I don't have it in my DNA. Not gonna go into those details here, but I really need to see a light. Current boom in AI has reignited my interest in this field but this time I want to help the really really marginalized and those whose lives can be so positively affected by this that it would be pleasantly shocking for them. 95% I would be doing this for myself, for my soul and mental health. I really need you guys to give me best possible sincere advice as the more older I become the more it became clear to me that I'm too dumb and don't know much. I don't h…
    "Microsoft CEO was ‘blindsided,’ furious at Altman’s firing"
    submitted by /u/the_anonymizer [link] [comments]  ( 1 min )
  • Open

    A video discussing Multi-Agent RL and the Self-play algorithm
    submitted by /u/AvvYaa [link] [comments]  ( 1 min )
    Reward in the hill-climbing example
    I am doing the hill-climbing example using Q-learning. What is a good value for the cumulative reward (knowing that the reward is -1 for all states except the terminal state)? submitted by /u/MomoSolar [link] [comments]  ( 1 min )
    Help tackling highway-env with an approximate value function
    Hi all!, i'm trying to apply RL to the racetrack variant of the highway-env. I first tried using classic Qlearning but i quickly realized the observation state is far too big to be able to create a QTable so i'm trying to create an approximate value function. Any tutorial i've found starts changing shapes and rearranging features and i found myself quickly lost and unable to follow while working on a differnet environment, Does anyone have any tips? submitted by /u/Conscious-Week8326 [link] [comments]
    Batches in policy gradient methods – theory vs practice
    I have a question regarding the implementation of batching in policy gradient / actor-critic methods. My understanding is that these methods in principal work as follows: collect a batch of N trajectories tau_i of length T_i and optimise the policy by following the policy gradient: ​ https://preview.redd.it/ijg8v6splb1c1.png?width=442&format=png&auto=webp&s=dc059089fa3b643def2440df0ff6d5d3967c35c0 For example in A2C, N would be the number of threads that simultaneously execute the policy in different environments and T_i is the number of environment steps we perform before updating our policy (related to the method of advantage estimation). However, it seems that in practice most implementations do not actually collect a distinct batch of trajectories; instead they simply keep an exper…
    Can RL Be Used for Structural Design?
    We all know the classical situation of the RL Agent iteratively improving to maximize the reward received in relation to some objective in an environment. But what if we're more interested in improving our environment in relation to an agent that remains unchanging? For example, suppose our environment is a ventilation duct and our agent is molecule(s) of air? In RL we've all seen the use of physics engines involving kinematics, but what if we're only interested in statics (ie. the branch of physics dealing with static loads, as used in architecture)? Instead of only using RL to generate a better Policy, can we use it to generate better structures (ie. static structures)? Can a policy translate into a structure? Any answers or feedback would be greatly appreciated, including even URL links. ​ ​ submitted by /u/san__man [link] [comments]

  • Open

    From ChatGPT to Sam Altman: A Message of Gratitude and Reflection on the AI Journey
    submitted by /u/the_anonymizer [link] [comments]  ( 1 min )
    Model weights
    Given that GPT-4 model weights are small enough to fit in an external disk, is it possible that the people who were fired or quit OpenAI have in their possession a copy of the weights? submitted by /u/AkMoDo [link] [comments]  ( 1 min )
    Prediction: Microsoft will force the board members to resign, and reinstate Altman and brockman
    No company likes to lose $25 billion to three people who are basically clueless about the economic consequences of their political actions. They have probably already hired a team of private detectives to dig up dirt on each of them. submitted by /u/Georgeo57 [link] [comments]  ( 1 min )
    The nonprofits accelerating Sam Altman’s AI vision
    Elon Musk's tweet about OpenAI transitioning from a nonprofit to a for-profit organization was based on incorrect information. Tax filings reveal that the original OpenAI nonprofit retained control over its financial assets and did not use any of its money for commercial ventures. Sam Altman, co-founder of Y Combinator and OpenAI, has a network of nonprofits that are interconnected with his commercial interests. One of these nonprofits, UBI Charitable, funds Universal Basic Income programs. Altman's OpenResearch lab received funding of nearly $24.5 million since its establishment. Altman also provided a loan of $14 million to the organization. Altman's involvement in TrialSpark, a company he invested in, raised ethical concerns about self-dealing in philanthropy. Source : https://techcrunch.com/2023/02/21/the-non-profits-accelerating-sam-altmans-ai-vision/ submitted by /u/NuseAI [link] [comments]  ( 1 min )
    Questions about Altman and Brockman's new startup
    Before founding OpenAI Sam Altman was the CEO of Y Combinator, a company that helps startups start and grow. There's probably no one on the planet who understands AI startups better than Sam. I'm expecting Sam and Greg to launch a new startup, perhaps before the new year. Here are some questions that come to mind about it. Who will give them the billions of dollars they will need to compete with OpenAI and Google? Sam is well known for moving fast. How much faster will he be able to move within a new startup, and what will he now be able to do that he couldn't do at OpenAI? Who will Sam and Greg tap to assume the role that Ilya Sutskever plays at OpenAI as chief researcher? I doubt that Sam and Greg are eager to launch another non-profit that could again vote them out of the company. How will they now go about their primary goal of democratizing AI within a for-profit company? How many more top people will leave OpenAI, and join Sam and Greg's new company? Altman already succeeded beyond anyone's wildest imagination by launching the AI revolution led by OpenAI. In a way it's a blessing that he no longer has to work on a project that has already been masterfully done. Do you think he has the next steps in this revolution already planned out in his head, and ready to go with his new company? submitted by /u/Georgeo57 [link] [comments]  ( 1 min )
    Greg Brockman Just Quit after They Fired Sam Altman
    submitted by /u/Excellent-Target-847 [link] [comments]  ( 1 min )
    "AI vs AGI: Decoding the Future of Intelligent Technologies"
    submitted by /u/Fit-Code-5141 [link] [comments]  ( 1 min )
    What are some good IA to dub?
    We need a good IA to translate from Portuguese(Brazil) to English. We are exploring alternatives. So far we have seen https://labs.heygen.com/guest/video-translate that seems really good. But what are some alternative out there so we can try out and see which does the best translation for us? Thanks in advance. submitted by /u/MacASM [link] [comments]  ( 1 min )
    Sam Altman CEO & Co-Founder of OpenAI FIRED today by Board of Directors
    submitted by /u/PNWtreeguy69 [link] [comments]  ( 1 min )
    One-Minute Daily AI News 11/17/2023
    Sam Altman fired as CEO of OpenAI.[1] At least he’s not replaced by GPT-5. The Pokémon Go developer is adding generative AI to Peridot in order to make its virtual pets feel and react more realistically.[2] DeepMind and YouTube release Lyria, a gen-AI model for music, and Dream Track to build AI tunes.[3] Google Delays Release of Gemini AI That Aims to Compete With OpenAI.[4] Sources: [1] https://www.theverge.com/2023/11/17/23965982/openai-ceo-sam-altman-fired [2] https://www.theverge.com/2023/11/15/23962241/niantic-generative-ai-peridot-pokemon-go [3] https://techcrunch.com/2023/11/16/deepmind-and-youtube-release-lyria-a-gen-ai-model-for-music-and-dream-track-to-build-ai-tunes/ [4] https://www.theinformation.com/articles/google-delays-cloud-release-of-gemini-ai-that-aims-to-compete-with-openai submitted by /u/Excellent-Target-847 [link] [comments]
    Why I don't think AGI can be achieved with GPT
    I cannot claim to know the exact architecture that OpenAI is using for ChatGPT, but if it is using a model resembling the decoder part of what is seen in the "Attention Is All You Need" paper then it cannot support AGI. Why would I think this? Generative text AI works by feeding in previous text to determine the next piece of text to use, then repeatingl doing that over and over, with the output added onto the input. It's a straight forward stateless function of inputs transforming into outputs. It has no working memory beyond the previous text that had been generated and it has no autonomy of any sort. ChatGPT is highly convincing, to the point where one might think a general intelligence has been achieved. This, though, is only the results of extensive training to output text very closely resembling what a human might write. This does not mean that ChatGPT has no "intelligence", but it is not a human-like intelligence. It has a deep "understanding" of language and how words and abstract concepts relate to each other, but only to the extent as to attempt to minimize generating unlikely responses. If it says that it has emotions, it is only because text it had trained on were written by people with emotions, and it is trained to mimic that. To a large language model, saying "I am happy" is no different than saying "the sky is blue" - it's just a statistically likely sequence of words. Maybe GPT can some day eventually lead into a future where AGI can be achieved, but it would require a different architecture to achieve it. I cannot claim to know what this architecture might look like, but I can speculate. Maybe it could be as simple as adding, along with prior text, hidden state to allow for unspoken thoughts and memories, and an event loop to allow for unprompted behavior. I don't know, but GPT, as it is, isn't capable of AGI. Currently, it is only a very large function of turning inputs into outputs. submitted by /u/SocksOnHands [link] [comments]
  • Open

    [R] in the context of a research, what methodology should i think of to find novel methods to fine tune LLMs for domain specific tasks?
    hey, i am currently working on a research project and i am wondering how can i define my methodology and what approach should i follow in order to make LLMs more capable than the generalist ones, precisely in the domain of education, to assist learners in specific contexts, such as courses, lessons, or concepts, by providing them with personalized hints and guidance to help them improve their skills without directly giving them answers? submitted by /u/Life_Ask2806 [link] [comments]  ( 9 min )
    "[Project]" 4-bit quantization
    Hey guys, I want to apply 4-bit GPTQ on the btml-3b model. I hesitate between using AutoGPTQ vs bitesandbytes for it, I made two separate code for it, could you tell me which one should I use, and if they would work: AutoGPQ: from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfigimport torchmodel_id = "C:\Users\William\Documents\STEM.AI\btml-3b"tokenizer = AutoTokenizer.from_pretrained(model_id)quantization_config = GPTQConfig(bits=4,group_size=128,dataset="wikitext2",desc_act=False,)tokenizer = AutoTokenizer.from_pretrained(model_id)quant_model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config, device_map='auto')quant_model.model.decoder.layers[0].self_attn.q_proj.__dict__save_directory = "C:\Users\William\Documents\STEM.AI\btml-3b_GPTQ"quant_model.save_pretrained(save_directory)tokenizer.save_pretrained(save_directory) bitesandbytes: from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfigimport torchmodel_id = "C:\\Users\\William\\Documents\\STEM.AI\\btml-3b"model = AutoModelForCausalLM.from_pretrained(model_id)tokenizer = AutoTokenizer.from_pretrained(model_id)nf4_config = BitsAndBytesConfig(load_in_4bit=True,bnb_4bit_quant_type="nf4",bnb_4bit_use_double_quant=True,bnb_4bit_compute_dtype=torch.bfloat16)quant_model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=nf4_config)save_directory = "C:\\path_to_save_quantized_model"quant_model.save_pretrained(save_directory)tokenizer.save_pretrained(save_directory) submitted by /u/Will12123 [link] [comments]  ( 9 min )
    [D] why falcon-180b ranking has dramatically decreased?
    on the hugging face leaderboard, i was a bit surprised by the performance of falcon 180b. do you have any explanation of how? https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard https://preview.redd.it/ofzw8xr6h51c1.png?width=1535&format=png&auto=webp&s=4835a3fb20dc6e725d5b0f9001f3a4e605f49b6d submitted by /u/Life_Ask2806 [link] [comments]  ( 9 min )
    [R] cdsBERT - Extending Protein Language Models with Codon Awareness
    submitted by /u/ThrowawayTheReal [link] [comments]  ( 9 min )
    Programming and mechanical engineering [D]
    I am a mechanical engineering student I am very interested and fascinated by machine learning. I have chosen python as it is very user friendly. But I am having a hard time doing programming things like CAD and CAE are very hands on but learning programming is not at least for me. Any tips for that. Also every other course I find is just bull crap any good hands on course on YouTube for mechanical engineers learning machine learning. Please help. submitted by /u/AromaticEconomics113 [link] [comments]  ( 9 min )
    Google PaLM Error [D]
    Google PaLM Error Using LangChain and Google PaLM, in sequential chain concept getting following error, ChatGooglePalmError: ChatResponse must have atleast one candidate Please help! submitted by /u/gentle_casanova [link] [comments]  ( 9 min )
    [P] BentoML tutorial (with SageMaker deployment)
    Hey all, I created a video tutorial where I talk about the basics BentoML and how one can use it to deploy models on AWS SageMaker. ​ Video link: https://youtu.be/Zci_D4az9FU ​ Hope some of you will find it useful and I would be happy to answer any questions and get feedback:) ​ ​ submitted by /u/mildlyoverfitted [link] [comments]  ( 9 min )
    [P] I built SRec.ai, a Steam video game recommender (details on comment)
    submitted by /u/ilos-vigil [link] [comments]  ( 9 min )
    [D] How to find academic ML competitions
    There are websites like this keeping track of ML conference deadlines like this https://aideadlin.es/?sub=ML,CV,CG,NLP,RO,SP,DM,AP,KR Are there any such websites keeping track of ML competitions . Specifically competitions as part of ML conferences like Neurips, CVPR and so on. submitted by /u/Mammoth_Cod_9047 [link] [comments]  ( 9 min )
    [P] Predicting breakdowns in gearboxes
    In my master thesis I want to predict when a gearbox is going to brake in an offshore wind turbine. I have SCADA data for 10 years of operation and I have one gearbox breakdown occuring I this period. Due to the sarce number of breakdowns I am doing a normal behavior model e.g. to train the model on healthy data and then to test it on the data with the breakdown to hopefully see an increase in residuals. Currently I am able to predict the output variable (gearbox oil temperature) with and R2 of 81% (with ANN and RF models) however I do not see and increase in residuals when testing. Does anyone have any advice? submitted by /u/No_Seaworthiness5468 [link] [comments]  ( 9 min )
    ML and visualisations [D]
    Hey everyone, what kind of cool ML projects have you done? Have any of them involved visualisations in a meaningful and interesting way? What packages/libraries did you use? I’d love to hear about some cool projects! submitted by /u/Brilliant-Donkey-320 [link] [comments]  ( 9 min )
    [P] I'm Making A New Translator AI, I Need Some Feedback From Native Speakers.
    submitted by /u/Mdsoebee [link] [comments]  ( 9 min )
    [P] GPT-J-6B Model Help
    Hello, I am trying to load this model but I can't seem to get anything back and I am not sure if its because of my hardware or if I am doing something wrong with the code. I currently own a server HPE DL380p Gen8. It has 2 Xeon E5-2650 @ 2.00GHz Processors, 400GB of RAM, I even have a GPU but I am not using it for the model. Whenever I try to provide an input to the model it has never sent back a output. It just seems to be maxing out my CPU's on my server and even after waiting for 20+ minutes nothing. I allocated 32 Virtual Cores to it as well as 200GB of RAM. In total its using about 60% of my total CPU on average while the rest of the server is using up the rest. I am using the full weight model and I have the model generating at a max length of 1000. From what I read online the RAM should be plenty but I dont know if my CPU is sufficient. How long should it take generally to generate an output? I am doing very simple inputs like "Hello who are you". Any help would be much appreciated submitted by /u/mdcouron [link] [comments]  ( 9 min )
    [D] Deployment of Image Classification ML Model through JSP
    Hey fellow Redditors! 👋 I've been working on an Image Classification ML model, and now I'm looking to deploy it using JSP (JavaServer Pages) with my front-end developer friend. I've got the model trained and ready to go, but we need clarification about how to integrate the model into JSP. Here are a few things I'm wondering about: Integration with JSP: How can I integrate my ML model with JSP? Are there any specific libraries, ways, or tools that work well? Handling Image Uploads: My application allows users to upload images for classification. How can I handle image uploads in a JSP environment and pass them to the ML model? I have done some research about Flask, but my friend prefers using JSP. I'm eager to hear from the community and learn from your experiences. Any advice, code snippets, or resources would be greatly appreciated! Thank you! submitted by /u/Chuia [link] [comments]  ( 9 min )
    [D] Is it still possible to obtain a copy of the SUNCG dataset?
    I understand the dataset is no longer available, but copies of it must still exist somewhere, I hope. submitted by /u/rfurlan [link] [comments]  ( 9 min )
    [D] What to expect from a Major in Image Processing & Analysis?
    Here's the details of the Major - Image Processing & Analysis (IPA) Major – Core Modules: Data Analysis and Machine Learning Real-Time Digital Signal Processing Entrepreneurship for Engineer Computer Vision Masters Project – IPA Major Complementary modules: OOP with Embedded Systems Image Processing & Analysis Web Application Development 3D Interface Technologies Core Skills: Data analytics, machine learning, deep learning, signal processing, industry context“. Master the areas of image processing and analysis, computer and machine vision, biomedical imaging and image synthesis” This is all the information for a MEng in Electronic & Computer Engineering I’m planning to study this January 2024 for 18 months, I have a BSc in Computer Science with Software Development. Do you know what this MEng if I majored in the above could entail? I am somewhat interested in Computer Vision but mainly in the field of Robotics (spatial awareness, path planning etc). Would this be a good major to pursue a career in Robotics? I've never really tackled anything CV before. Also, side question, this may be a bit farfetched but is CV in any way applicable to game development? I have a side gig developing 2D games and I was thinking possibly it could help with behaviour recognition, pathfinding algorithms, adaptive AI or even gesture recognition. Is this thinking too ahead and more in the realm of AI? Is my understanding of CV being mainly focused with extracting information from images and videos but still being intertwined with AI /ML wrong? I am hoping to gain some AI/ML skills from the major too. Thank you! submitted by /u/AOTSuperiority [link] [comments]  ( 9 min )
    [R] Meta Unveils Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning
    When generating videos from text prompts, directly mapping language to high-res video tends to produce inconsistent, blurry results. The high dimensionality overwhelms models. Researchers at Meta took a different approach - first generate a high-quality image from the text, then generate a video conditioned on both image and text. The image acts like a "starting point" that the model can imagine moving over time based on the text prompt. This stronger conditioning signal produces way better videos. They built a model called Emu Video using diffusion models. It sets a new SOTA for text-to-video generation: "In human evaluations, our generated videos are strongly preferred in quality compared to all prior work– 81% vs. Google’s Imagen Video, 90% vs. Nvidia’s PYOCO, and 96% vs. Meta’s Make-A-Video." "Our factorizing approach naturally lends itself to animating images based on a user’s text prompt, where our generations are preferred 96% over prior work." The key was "factorizing" into image and then video generation. Being able to condition on both text AND a generated image makes the video task much easier. The model just has to imagine how to move the image, instead of hallucinating everything. They can also animate user-uploaded images by providing the image as conditioning. Again, reported to be way better than previous techniques. It's cool to see research pushing text-to-video generation forward. Emu Video shows how stronger conditioning through images sets a new quality bar. This is a nice compliment to the Emu Edit model they released as well. TLDR: By first generating an image conditioned on text, then generating video conditioned on both image and text, you can get better video generation. Full summary is here. Paper site is here. submitted by /u/Successful-Western27 [link] [comments]  ( 10 min )
  • Open

    How to fine-tune LLMs using LoRA - Explained
    Hi there, I've created a video here where I explain how you can fine-tune large language models using low-rank adaptation (LoRA). I hope it may be of use to some of you out there. Feedback is more than welcomed! :) submitted by /u/Personal-Trainer-541 [link] [comments]  ( 1 min )
    GPU Survival Toolkit for the AI age: The bare minimum every developer must know
    submitted by /u/nickb [link] [comments]  ( 1 min )
    Classification ANN Loss going Up and Down
    This is my dataset on which i want to train my Classification ANN on Color(Red=1,Green=0)=x1 Weight(1 to 10)=x2 Sweetness(1 to 10)=x3 Fruit=y,first element=apple,2nd element=watermelon 1 1.2 6 [1,0] 1 8 9 [0,1] 2 1.5 5 [1,0] 2 5 8 [0,1] 1 2 7 [1,0] 2 1.3 5 [1,0] 2 7 10 [0,1] So i coded an ANN with the following structure 3 Nodes in Input Layer2 Hidden Layers with 3 Neurons each2 Nodes in Output Layer Activation function in hidden layer=Sigmoid Output function=Softmax each Hidden Layer neuron has 2 parts: Pre-activation(a) and Activation(h)a(t)=h(t-1)w(t)+b(t)h(t)=sigmoid(a(t)) This is the code I wrote def sigmoid(x): return 1/(1+np.exp(-x)) ​ def activation(val,name=""): if name=="": return val elif name=="sigmoid": return sigmoid(val) ​ def deriv…  ( 1 min )
  • Open

    "JaxMARL: Multi-Agent RL Environments in JAX", Rutherford et al 2023 (envs: MPE/Overcooked/Brax/STORM/Hanabi/Switch/Coin/SMAC; agents: UPPO/QMIX/VDN/IQL/MAPPO?)
    submitted by /u/gwern [link] [comments]  ( 1 min )
    Stable-Baselines3 v2.2 is out!
    We added support for options on reset, fixed several bugs and improved error messages. We also updated our RL Tips and Tricks to include recommendations for evaluation, and added links to newer algorithms like DroQ. SBX (SB3 + Jax) got two new algorithms: DDPG and TD3! Changelog: https://github.com/DLR-RM/stable-baselines3/releases/tag/v2.2.1 SB3 Contrib (more algorithms): https://github.com/Stable-Baselines-Team/stable-baselines3-contrib RL Zoo3 (training framework): https://github.com/DLR-RM/rl-baselines3-zoo Stable-Baselines Jax (SBX): https://github.com/araffin/sbx Note: v2.2.0 was yanked after a breaking change was found in GH#1751. Please use SB3 v2.2.1 and not v2.2.0. submitted by /u/araffin2 [link] [comments]  ( 1 min )
    RL Help
    So I'm new to the RL field, and I'm very interested in the topic. I'm in college and I'm working in a research lab where we're trying to get a stablebaseline3 RL agent to play the game billiards in a physics environment we built. I'm obviously very new to this, so I don't really know what going on under the hood at al, I've done several DL projects for my classes, but this is a different beast. My reward functions seem to be accurate, but the primary issue is with the number of timesteps for which the agent trains. I have found that the agent performs extremely well at getting the ball in the hole (the environment just has 1 ball for now) at exactly 4 million timesteps, and only 4 million timesteps. I don't know if that number is a lot or a little for this problem and for RL, it takes about an hour of training per million. But if I change the number of timesteps at all, the whole progress graph looks completely different, and follows the same pattern for every value aside from 4 million. I would expect the graph to look generally the same up until the 4 million mark, then perhaps go wonky, but the second I change the timesteps to even 4.1 or 3.9 million, everything acts differently and fails completely. Hopefully this made sense and I would love to hear anyone's thoughts on this. ​ This first image is the success rate of even hitting the ball at exactly 4 million timesteps This second image shows the success rate of hitting the ball up to 8 million timesteps. (Keep in mind that when I set any value aside from 4 million, I get this same graph up to whatever value I entered) submitted by /u/pmmason127 [link] [comments]  ( 1 min )
    Stable_Baseline3 render is being Glitchy.
    I recently installed Ubuntu 20.04 on my Windows 10 system. I created a virtual environment using Python 3.8.10 and installed SB3 by its guidelines "pip install stable-baselines3[extra]" When I ran the example of how to train and run A2C on a CartPole environment on the SB3 documentation the training went fine but the rendering is glitchy and framey as shows with the photo. Any help and advice will be greatly appreciated. ​ https://preview.redd.it/ijdkaqljm11c1.png?width=668&format=png&auto=webp&s=947ad5c2d575bbcd828397dcf31f73ea82f594f4 submitted by /u/AdrianR956 [link] [comments]  ( 1 min )

  • Open

    ASRock wants to make generative AI software easier to use on Radeon RX 7000 series with its new AI QuickSet app
    submitted by /u/Aaadvarke [link] [comments]
    Sam Altman fired as CEO of OpenAI
    Sam Altman has been fired as CEO of OpenAI, the company announced on Friday. “Mr. Altman’s departure follows a deliberative review process by the board, which concluded that he was not consistently candid in his communications with the board, hindering its ability to exercise its responsibilities,” the company said in its blog post. “The board no longer has confidence in his ability to continue leading OpenAI.” Chief technology officer Mira Murati will be the interim CEO, effective immediately. The company will be conducting a search for the permanent CEO successor. When contacted by The Verge, OpenAI’s communications department declined to comment beyond the blog post. Sources: https://www.theverge.com/2023/11/17/23965982/openai-ceo-sam-altman-fired submitted by /u/Excellent-Target-847 [link] [comments]
    Sam Altman fired as CEO of OpenAI
    Sam Altman has been fired as the CEO of OpenAI following a board review that questioned his candor in communications, with Mira Murati stepping in as interim CEO. submitted by /u/Remarkable_Ad9528 [link] [comments]
    OpenAI Co-Founder and Chief Scientist says that GPT's architecture, Transformers, can obviously get us to AGI
    Ilya Sutskever, Co-Founder and Chief Scientist at OpenAI, that developed ChatGPT, says that GPT's architecture, Transformers, can obviously get us to AGI. He also adds: We shouldn't don't think about it in terms of binary "is it enough", but "how much effort, what will be the cost of using this particular architecture"? Maybe some modification, can have enough computation efficiency benefits. Specialized brain regions are not fully hardcoded, but very adaptible and plastic. Human cortex is very uniform. You just need one big uniform architecture. Video form: https://twitter.com/burny_tech/status/1725578088392573038 Interviewer: One question I've heard people debate a little bit is the degree to which the Transformer based models can be applied to sort of the full set of areas that you'd…
    This ChatGPT Frontend Clone was Coded by GPT4 Vision
    submitted by /u/NuseAI [link] [comments]
    WebsiteAi X GetReach
    WebsiteAI x GetReach ​ GetReach offers an entire ecosystem of utilities within Discord. As we look to broaden our scope into Discord, we also look to integrate our website bot in Discord too. This goes hand in hand with what GetReach offers. ​ GetReach offers missions, very much like Shield in Telegram! Missions completed earns the user with ETH rewards to claim ✅ ​ GetReach is backed by some of the most influential Cryptopreneurs in the space, especially within the NFT scene; including CryptoPunks, Bored Ape Yacht Club etc🤟 ​ As GetReach extends its developments, we will be mirroring them hand in hand🤝 ​ We look forward to start our partnership with our first Co-Hosting Twitter Space early next week! 🕊 https://preview.redd.it/pn08tt0wky0c1.png?width=1280&format=png&auto=webp&s=efd2f49e2508e639e0cf7642851ea940a08c77a1 submitted by /u/balok1232 [link] [comments]
    Google Enhances Holiday Shopping with AI-Powered Features in Search Generative Experience Tool
    submitted by /u/Upbeat-Interaction13 [link] [comments]
    AI — weekly megathread!
    News provided by aibrews.com Meta AI introduces: Emu Video: new text-to-video model that leverages Meta’s Emu image generation model and can respond to text-only, image-only or combined text & image inputs to generate high quality video [Details]. Emu Edit: This new model is capable of free-form editing through text instructions. Emu Edit precisely follows instructions, ensuring that pixels in the input image unrelated to the instructions remain untouched [Details]. Researchers present LLaVA-Plus, a general-purpose multimodal assistant that expands the capabilities of large multimodal models. LLaVA-Plus maintains a skill repository that contains a wide range of vision and vision-language pre-trained models (tools), and is able to activate relevant tools, given users’ multimodal in…
    Wow
    submitted by /u/mravoid1 [link] [comments]
    Is there an AI that can help me not be mentally ill anymore?
    Or do i have to wait until they invent assisted suicide bots? Fml submitted by /u/ThrowRA21458910 [link] [comments]
    AI's take on schizophrenia
    submitted by /u/LightOfAntara [link] [comments]
    Augmented Rapper, can anybody help me to make this AI app? I got more specs
    submitted by /u/Niu_Davinci [link] [comments]
    "Deepfakes: A Double-Edged Sword in Creativity and Ethical Challenges"
    submitted by /u/Fit-Code-5141 [link] [comments]
    Looking for AI to remove backgrounds of images with context!
    Hello folks! I am looking for a way to mass remove backgrounds of images, using newer methods of AI. There are of course plenty of resources available that remove the backgrounds, but usually, these methods only work well for contrasting colors with clear defined edges. I have to remove bulk backgrounds of products on a white background and want to use AI, because some objects contain white (which is hard for the old-school algorithms to decipher) or are even transparent, and I need a reliable way to process thousands of images with high accuracy (>95%). From there I can define edge cases and figure out which product categories need a different method. One example of this that has worked well for me so far, is Photoshop 2024's select subject tool, with the dropdown "cloud-based" selected. This seems to analyze the image, and more "understand" the context for a proper selection, instead of the regular way where it calculated where an edge is. I wonder if I could use the API for Dall-E 3 for this, etc. I'm looking forward to hear your suggestions and if you can help me brainstorm! submitted by /u/hommelbips [link] [comments]
    Google AI outperforms traditional weather forecasting: Accurate predictions 10 days ahead without a supercomputer
    submitted by /u/Jariiari7 [link] [comments]
    One-Minute Daily AI News 11/16/2023
    Notion Unveils AI-Assisted Q&A Feature Starting At $8.[1] Google is embedding inaudible watermarks right into its AI generated music.[2] Microsoft Unveils Text-to-Speech Avatar Azure AI Speech.[3] Microsoft Unveils AI Chip That Could Replace Nvidia in Its Cloud.[4] Sources: [1] https://www.benzinga.com/news/23/11/35798312/notion-unveils-ai-assisted-q-a-feature [2] https://www.theverge.com/2023/11/16/23963607/google-deepmind-synthid-audio-watermarks [3] https://techcommunity.microsoft.com/t5/ai-azure-ai-services-blog/azure-ai-speech-announces-public-preview-of-text-to-speech/ba-p/3981448 [4] https://www.barrons.com/articles/microsoft-ai-chip-maia-nvidia-f2b57a88 submitted by /u/Excellent-Target-847 [link] [comments]
    Seeing more and more prominent people now posting AI videos without realising they're fake. Truly the age of misinformation/disinformation
    submitted by /u/AnimalsChasingCars [link] [comments]
    Let's have some fun with Roast GPT.
    submitted by /u/aksh951357 [link] [comments]
    Suno AI auto-generated song about Stable Diffusion - SDXL
    submitted by /u/ptitrainvaloin [link] [comments]
  • Open

    [R] Lumos: Learning Agents with Unified Data, Modular Design, and Open-Source LLMs
    Paper: https://arxiv.org/abs/2311.05657 Blog: https://allenai.github.io/lumos/ Code: https://github.com/allenai/lumos Models and data: https://huggingface.co/ai2lumos Abstract: We introduce Lumos, a novel framework for training language agents that employs a unified data format and a modular architecture based on open-source large language models (LLMs). Lumos consists of three distinct modules: planning, grounding, and execution. The planning module breaks down a task into a series of high-level, tool-agnostic subgoals, which are then made specific by the grounding module through a set of low-level actions. These actions are subsequently executed by the execution module, utilizing a range of off-the-shelf tools and APIs. In order to train these modules effectively, high-quality annotations of subgoals and actions were collected and are made available for fine-tuning open-source LLMs for various tasks such as complex question answering, web tasks, and math problems. Leveraging this unified data and modular design, Lumos not only achieves comparable or superior performance to current, state-of-the-art agents, but also exhibits several key advantages: (1) Lumos surpasses GPT-4/3.5-based agents in complex question answering and web tasks, while equalling the performance of significantly larger LLM agents on math tasks; (2) Lumos outperforms open-source agents created through conventional training methods and those using chain-of-thoughts training; and (3) Lumos is capable of effectively generalizing to unseen interactive tasks, outperforming larger LLM-based agents and even exceeding performance of specialized agents. https://preview.redd.it/igrcn7p8sz0c1.png?width=1097&format=png&auto=webp&s=96a9327131572e208eab07585c223c169fef7614 ​ submitted by /u/APaperADay [link] [comments]  ( 9 min )
    [P] Resources for learning CNNs offline?
    I'm getting on a long flight tomorrow, and I've been meaning to dig deep and get experience training CNNs myself or even implementing one. I have a lot of experience now with finetuning LLMs because nobody else could fill that role in my company this year, but I was just a SWE before, and I want to springboard off this into actual ML roles. So, does anyone have some solid notebooks or pdfs I can occupy myself with? And should I download a dataset from kaggle or something? Thanks submitted by /u/HPLaserJetM140we [link] [comments]  ( 9 min )
    [N] OpenAI Announces Leadership Transition, Fires Sam Altman
    Source: https://openai.com/blog/openai-announces-leadership-transition Today, it was announced that Sam Altman will no longer be CEO or affiliated with OpenAI due to a lack of “candidness” with the board. This is extremely unexpected as Sam Altman is arguably the most recognizable face of state of the art AI (of course, wouldn’t be possible without great team at OpenAI). Lots of speculation is in the air, but there clearly must have been some good reason to make such a drastic decision. This may or may not materially affect ML research, but it is plausible that the lack of “candidness” is related to copyright data, or usage of data sources that could land OpenAI in hot water with regulatory scrutiny. Recent lawsuits (https://www.reuters.com/legal/litigation/writers-suing-openai-fire-back-companys-copyright-defense-2023-09-28/) have raised questions about both the morality and legality of how OpenAI and other research groups train LLMs. Of course we may never know the true reasons behind this action, but what does this mean for the future of AI? submitted by /u/Sm0oth_kriminal [link] [comments]  ( 9 min )
    [P] Latent Consistency Models Open Source Art Generation API Server
    https://github.com/Netwrck/stable-diffusion-server Creating this to make it easy to create a server that makes AI Art. It uploads any created images to a google cloud storage bucket. Let me know what you think :) Its using SSD-1b latent consistency models and a pass with stable diffusion refiner which is getting crazy good results ​ https://preview.redd.it/f72ul7142z0c1.png?width=506&format=png&auto=webp&s=8f6434fb8e3d515637ef1ec94f7d30f49a29eb27 submitted by /u/ai_art_generater [link] [comments]  ( 9 min )
    [D] Tsetlin Machines, anyone heard of it?
    Hi, T have very little experience with Logic based Al algorithms, and my understanding is these machines use propositional logic algorithms (which wasn't something I have come across before, even when I was in uni). My experience is in statistical & neural based networks and can understand the pros and cons of those networks. I am just trying to understand what is the advantage and disadvantages of logic based algorithms specifically Tsetlin Machines? Is this something that I should learn more about? What are some good resources? Thanks submitted by /u/Loud-Consideration-2 [link] [comments]  ( 1 min )
    [Research] Any research in the area of similarity search for vector db's?
    I was wondering if anyone is aware of any current/ongoing research on similarity search for vector dbs? In the RAG models I have been working on, most of the accuracy of the result hinges on the similarity search returning the right document, and less on the language model. Thanks! submitted by /u/BakerInTheKitchen [link] [comments]  ( 9 min )
    [N] DeepMind now can predict the weather with less computing power than other approaches
    According to Nature ML methods are starting to outperform NWP without supercomputers https://www.nature.com/articles/d41586-023-03552-y and the coolest thing is that they already released the code https://github.com/google-deepmind/graphcast so I wonder how long would it take before we start seeing people pull data from home weather stations to run an hyperlocal fine-tuned model. submitted by /u/SCP_radiantpoison [link] [comments]  ( 9 min )
    [D] Supervized time-series segmentation from timestamp-based labels
    Hi, I have different audio time-series with timestamp-based annotations of multiple start + end segment boundaries that are denoting consecutive respiration cycles. Does anyone has an idea on how I could use supervized-learning approaches to automatically segment respiration cycles from audio time series ? Cheerz submitted by /u/deadalius [link] [comments]  ( 9 min )
    [P] Introducing Neural-Cherche: Enhance Document Retrieval with Advanced AI Models
    Hi everyone, I'm excited to share a tool I've developed called Neural-Cherche. Its main purpose is to transform a Sentence Transformer into a ColBERT model, which is currently at the forefront of information retrieval tools. What does Neural-Cherche do? In simple terms, it fine-tunes advanced AI models to make them more efficient at understanding and processing information. This results in a significant improvement in how effectively these models can find and retrieve relevant documents. This tool is particularly beneficial for those working with RAG (Retriever-Augmented Generation) pipelines. The ColBERT and Splade models, featured in SIGIR 2020 and 2021, have shown exceptional performance in document retrieval when they are fine-tuned, and Neural-Cherche makes this process easier and minimalist. Let me know what you think, and how it might be useful in your projects! submitted by /u/Ok-Cartoonist8114 [link] [comments]  ( 1 min )
    [P] Need help with project (Coreference Resolution using Deep Learning)
    Project: Coreference Resolution in Long Texts using Deep Learning I need help with understanding what I would need to study for this. It's not a research project so to say, but we are trying to do some kind of optimization for the accuracy in longer texts. We have to build a model and not just propose something new, i.e, if we are doing that. Right now I'm studying Linear Algebra by Gilbert Strang from MIT OCW. And plan to do Calculus (Single and Multivariate), Probability and Statistics after that. After this I am going to go for Machine Learning by Andrew Ng, NLP with Deep Learning by Christopher Manning and Natural Language Understanding by Christopher Potts. ​ This project has to be completed by April 2024 approximately. So I wanted to know if what I plan to do is enough/not enough/overkill or something else. Or if I should be taking a different path altogether. submitted by /u/burntToast1134 [link] [comments]  ( 9 min )
    [P] Versioning code & large models together in GitHub
    Hey r/MachineLearning! Last year, u/rajatarya showcased how we scaled Git to handle large datasets. One piece of feedback we kept getting is that people didn't want to move their source code over to XetHub. So we built a GitHub app & integration that lets you continue storing code in GitHub while XetHub handles the large datasets & models. https://about.xethub.com/blog/xetdata-scale-github-repos-100-tb We've enjoyed using it to host open source LLM's like Llama2 and Mistral with our finetuning code side-by-side. The whole thing is in beta so we're eager for any feedback you have to offer :) submitted by /u/semicausal [link] [comments]  ( 9 min )
    [R] Seeking input: transformer modification with 25-30% improvement in validation loss across 3 datasets
    Long time lurker here. Made an account just to post this. I've been experimenting with some modifications on the transformer architecture (addition of a new standalone component). Recently I got to something that seems to be an improvement of the validation loss by ~25-30% over vanilla decoder transformers; the task is next token prediction. My question is if this is significant enough to dedicate more serious effort in (eg. getting more compute credits to create a bigger model, running a beefier benchmark, sharing more with folks in academia to get feedback, write a paper, etc.) or if it's likely a fluke. In terms of methodology, I've compared vanilla-vs-modification on 3 datasets (in increasing difficulty): Penn Treebank, Lord of the Rings, the complete works of Shakespeare. The data…  ( 10 min )
    [D] How to choose weights for VotingRegressor?
    I have multiple models that are used in VotingRegressor i have used defaults weights like [1,1,2,1] is there any method, that gives optimal/best weights for each models? submitted by /u/_Killua_04 [link] [comments]  ( 9 min )
    [D] Hugging Face Models to AWS Sagemaker Endpoints: A MLOps Perspective
    I recently created this tutorial on how to go from a Hugging Face model to a AWS Sagemaker Endpoint in a production setting: https://www.zenml.io/blog/huggingface-to-sagemaker While I was making this project, it struck me how much of the tutorials around this sort of stuff really focuses on a one-time deployment, but IMO most of the problems really happen when you are doing this repeatably. That's why I've focused more on lineage, data tracking, and automated deployments in this post. What does everyone think: Is there a lack of best practices available to reliably do this sort of stuff? submitted by /u/htahir1 [link] [comments]  ( 9 min )
    Predicting optimal drug based on remission. [Research]
    I want to train a machine learning model that uses the following parameters to predict the optimal treatment choice. Say I have data on patient characteristics, imaging data, and which treatment was given to the patient (X, Y, Z), the outcome is complete remission. i want to note that for example if patient was given X and he went to remission then X is the drug of choice for that patient. If patient was give Y and it didnt work he can be given Z and if Z works then Z is the drug of choice for that patient. ​ The outcome here is not to predict the outcome, it is to predict the ideal treatment. The other way is I can predict outcome for each of the treatment choice. ​ How can one go about this? ​ ​ Thank you all submitted by /u/I_only_wanna_learn [link] [comments]  ( 9 min )
    [Discussion] Machine learning for predicting treatment regimine
    Hello community, ​ I am working on a machine learning project where the goal is to predict the optimal treatment choice for individual patients based on their characteristics, imaging data, and previous treatment outcomes. The outcome I'm interested in is achieving complete remission. ​ Here's a brief overview of the problem: ​ Input Features: ​ Patient Characteristics (e.g., age, gender, medical history) Imaging Data Previous Treatment Given (X, Y, Z) Output: ​ Optimal Treatment Choice for Complete Remission I want the model to learn from cases where a particular treatment led to remission and, if not, explore alternative treatments. ​ I'm considering two approaches: ​ Predicting Optimal Treatment Directly: ​ Training the model to directly predict the optimal treatment choice for a given patient based on their features. Predicting Treatment Outcome for Each Choice: ​ Training separate models for each treatment choice to predict the likelihood of complete remission. Questions: ​ Which approach do you think is more suitable for this problem? What kind of machine learning algorithms and techniques would you recommend for this task? How can I handle cases where the initial treatment doesn't lead to remission, and an alternative treatment is given? I appreciate any guidance, insights, or examples you can provide to help me structure and approach this problem effectively. Thank you! submitted by /u/ibrahimredit [link] [comments]
    [D] what are the latest trends /breakthroughs in low level architecture/operators composition?
    is there any novelty on architecture design/operators level during the last year? I was looking into recent SOTA models both for NLP and computer vision and seems it is mainly the same concept - attention with slightly different training approach / different number of same layers combined together. Is there any innovation on lower level happening? Ie any new promising architectures, design approaches or operators which were introduced lately? Thank you. submitted by /u/Grrr11 [link] [comments]  ( 9 min )
    [D] The Evolution of Serverless GPUs: In-Depth Performance & Cost Analysis for Llama 2 7Bn & Stable Diffusion 2-1 model across providers
    In the evolving landscape of AI Infrastructure, Serverless GPUs have been a game changer. Six months on from our last guide, which sparked multiple discussions & created more awareness about the space, we've returned with fresh insights on the state of "True Serverless" offerings and I am here sharing performance benchmark & cost effectiveness analysis for Llama 2-7Bn & Stable Diffusion 2-1 model. 📊 Performance Testing Methodology: We put the spotlight on popular serverless GPU contenders: Runpod, Replicate, Inferless, and Hugging Face Inference Endpoints, specifically testing for: 1. Cold Starts: Varied across platforms. Latency minus inference time, represents the delay due to initializing a dormant Serverless function. Average Cold-starts across platforms 2. Variability: We don't …  ( 10 min )
    [P] Obscure speech recording while keeping data useful for machine learning
    Is there a way to obscure speech recording so there is no way to play it and get something intelligible, but still keep it useful for machine learning? For my project I have to collect data in uncontrolled environment, and I would like to do it without accidentally storing sensitive information. It seems to be an uncommon problem, and I haven't found much. I am currently using spectrograms to extract features. For what I have found, making a spectrogram from a soundwave uses STFT and doesn't store phase information, so there is not enough information to perform the inverse transformation. Do I understand this correctly? What are other ways to do it? submitted by /u/vladisser [link] [comments]  ( 9 min )
    [R] Pretraining Data Mixtures Enable Narrow Model Selection Capabilities in Transformer Models
    submitted by /u/hardmaru [link] [comments]  ( 9 min )
    [R] Neural MMO 2.0: A Massively Multi-task Addition to Massively Multi-agent Learning
    Paper: https://arxiv.org/abs/2311.03736 Project page: https://neuralmmo.github.io GitHub: https://github.com/neuralmmo Abstract: Neural MMO 2.0 is a massively multi-agent environment for reinforcement learning research. The key feature of this new version is a flexible task system that allows users to define a broad range of objectives and reward signals. We challenge researchers to train agents capable of generalizing to tasks, maps, and opponents never seen during training. Neural MMO features procedurally generated maps with 128 agents in the standard setting and support for up to. Version 2.0 is a complete rewrite of its predecessor with three-fold improved performance and compatibility with CleanRL. We release the platform as free and open-source software with comprehensive documentation available at this http URL and an active community Discord. To spark initial research on this new platform, we are concurrently running a competition at NeurIPS 2023. https://preview.redd.it/8pvecr133t0c1.png?width=901&format=png&auto=webp&s=f9bf7f063c1b9950d40ad2a3b04bdc7f0780413f ​ submitted by /u/APaperADay [link] [comments]  ( 9 min )
  • Open

    Can ordinary backpropagation networks be used for time series data, if temporally successive data values are entered into successive network input neurons?
    Question: If one is going to use a neural network for future stock market price prediction, where you have a time series of stock price values, can an ordinary backpropagation network be used for this purpose? I am think of taking the stock price every 15 minutes, over a period of say 3 days (so that there will be 3 x 24 x 4 = 288 price values), and then entering these values (in normalised form) into a backpropagation network with 288 input neurons. This network setup would then be used to predict what will happen to the stock price a few hours into the future. Would this setup work as effectively as a long short-term memory (LSTM) network in terms of future price prediction? Or do LSTM networks offer something over and above ordinary backpropagation networks? submitted by /u/GA64 [link] [comments]  ( 9 min )
    What Meta learned from Galactica, the doomed model launched two weeks before ChatGPT
    submitted by /u/nickb [link] [comments]  ( 1 min )
  • Open

    How would one go about creating a reward function from raw-pixel input?
    I played around a bit with OpenAI Gym. It returns the state as RGB frames and the reward function is pre-defined by the environment somehow (I have no idea how - for the OpenAI Gym Mario environment I used for example it was the distance along the x-axis). How was this distance obtained? In any case; I will ask my actual question now. Say I am trying to make a game-playing agent for arbitrary game X which may be 2D or 3D. How would one go about making a game playing agent for a complex game like Call of Duty: World War II which have large maps & missions that are not straight-foward etc. the "reward" might be received over longer periods in a not-so-obvious manner. Any literature review around this? submitted by /u/universecoder [link] [comments]
    "Bridging the Human-AI Knowledge Gap: Concept Discovery and Transfer in AlphaZero", Schut et al 2023 {DM} (identifying concepts in superhuman chess engines that give rise to a plan)
    submitted by /u/gwern [link] [comments]
    Hosting RL Challenges on Kaggle [D]
    submitted by /u/Brown_bagheera [link] [comments]

  • Open

    [P] Building the largest open-source prompt dataset, looking for contributors!
    Hi all. I am trying to build the largest open-source prompt dataset on GitHub. Plan is to split the prompts by industries, models and techniques. https://github.com/kortex-labs/awesome-llm-prompts I realize it's impossible to do this alone and therefore I am reaching out to y'all for help 🤝🥹 submitted by /u/kanxx030 [link] [comments]  ( 1 min )
    [P] Roast my code
    I recently implemented a Siamese network with triplet loss, incorporating online mining for triple selection. I'm seeking feedback on the model's performance and my code. The Flowers102 dataset was employed, aiming to determine whether two given flower images belong to the same species. ​ Despite achieving a 90% accuracy rate, I observed an unusual need for a low threshold (0.25) during inference. I attribute this to the inherent similarity between flower images, suggesting that a lower threshold is more appropriate. This contrasts with face verification tasks, which typically require higher thresholds. I would appreciate insights or suggestions regarding this observation. submitted by /u/Ok_Ad_1034 [link] [comments]  ( 1 min )
    [D] Open Source Simulator for NVIDIA A100 GPU Architecture
    Hi everyone! For my project, I am looking for an open-source GPU simulator supporting NVIDIA A100 GPU architecture. I would really appreciate it if you could point me to any relevant resources. Apologies if my question is off-topic. Considering the ML-GPU connection, I thought it might be relevant here. Thank you! submitted by /u/ramya_1995 [link] [comments]  ( 1 min )
    [D] What are some methods to combine Knowledge Graph and LLM?
    currently, i am working on a Knowledge Graph to improve LLM performance in the practice Level .. in my First Stage, I am wondering what is the better way to Process the Dataset as knowledge Graph,? - do I should use Prompt Engineering while representing KG as a Query schema? - may i could use RAG directly and set KG as a knowledge Database, my main question here is what are the most common techniques used to tokenize KG? or there's a possibility to use Top-Layer based on GNN to handle exrraction context from KG i am sorry if my Qs are not clear because i am just starting my journey in this problem submitted by /u/Youness_Elbrag [link] [comments]  ( 1 min )
    [D] Scaling object detection and classification networks (e.g. RetinaNet)
    Mostly state-of-the-art detection and classification networks (RetinaNet, EfficientDet) follow the same approach of applying CNN backbone and feeding last X layers to (Bi-)FPN. Similar backbone is used by Tesla in HydraNet (RegNet + BiFPN). ​ The question is the following: since many papers benchmark the networks on "low" resolution image databases, how would one approach scaling the network to higher res images? On the one hand, you could still apply the same architecture (resolution of each layer will be bigger though), but wouldn't the perceptive field of each feature be limited (relative to the image size)? Or would you train a new (deeper) CNN backbone for that? ​ Edit: to keep it simple, could I take for instance vanilla RetinaNet with pretrained weights from Pytorch and simply use it on 2-5MP images? Or the performance will drop, especially for objects spanning big part of field-of-view? submitted by /u/Spare-Help562 [link] [comments]  ( 1 min )
    [P] Structured EDA Commentary
    Hi All, I'm currently automating a data pipeline and was hoping to explore using ML for EDA commentary. I have structured data with headers and looking to generate top down granular commentary based on the data headers given in which they are either referencing unique string values and their corresponding numerical data header value. Was hoping anyone could point me in the right direction as to the structure or approach I could take as I'm still learning. Thanks. submitted by /u/One_Station_1762 [link] [comments]  ( 1 min )
    Hosting RL Challenges on Kaggle [D]
    Back in 2020, Kaggle released a new challenge category called "[Simulation competitions](https://www.kaggle.com/simulations)" allowing for reinforcement learning solutions. [ConnectX](https://www.kaggle.com/c/connectx) is perhaps the most popular example of this. I thought that since ConnectX allowed for submitting .py files instead of .csv, there should be a way of extending/modifying ConnectX to host my own reinforcement learning challenge for a class I am teaching. However, I could not find any way of hosting such a challenge myself using kaggle's platform. Does anyone here has any experience with this? submitted by /u/Brown_bagheera [link] [comments]  ( 1 min )
    [N] (In)validating published ML performance scores is possible
    I think many of us came across publications with ML performance scores (such as accuracy, sensitivity, etc.) that seemed unrealistic, still getting a lot of citations. We knew they were false, invalid, incorrect, maybe a typo, maybe some bug in the evaluation, maybe cheating, but no one invested the time to reimplement the method and prove it wrong. However, ML performance scores - especially if multiple ones are reported - cannot take any values independently. There are numerous constraints imposed on them by the experimental setup that they need to satisfy. We just released a Python package with the capability to identify inconsistencies regarding published ML performance scores given the description of the experimental setup. https://github.com/FalseNegativeLab/mlscorecheck In the corresponding paper we already identified some phishy things going on in the literature of skin lesion classification. What do you think? submitted by /u/gykovacs [link] [comments]
    [D] Bringing ML code to the data(base) to build better AI apps
    I built the open-source ML platform at Instacart a few years ago. I learned a ton, but primarily that it's better to bring your ML workload to the database rather than bringing the data to the code. It takes a lot of the complexity out of your infra, and it's ultimately faster for your users. That's why we made PostgresML. It's an open-source extension for PostgreSQL. Combine it with pgvector and you've got a complete ML platform with just a few extensions. We're bullish on the power of in-database and open-source ML/AI, both traditional tabular data applications, and novel NLP LLM RAG applications that combine retrieval and inference in a single process space to minimize latency and operational complexity. I'd love to get your thoughts on our approach. You can mess around with it on our site: https://postgresml.org/ Let me know what you think think. submitted by /u/something_cleverer [link] [comments]  ( 1 min )
    [D] Embedding Converter; need help with architecture
    I am an undergrad, and for a research project, I have been tasked with building an embedding converter. That is, given an input sentence (ie "This is a sentence."), I want to provide a model that will convert the embedding of this in 1 LLM to another. I have tried using a standard deterministic Neural Network, however, I believe that my accuracy is being limited by a few things. Embeddings seem to be generated probabilistically for most LLMs based on context, and so my deterministic architecture will inevitably fail. I've looked into building a VAE (Variational Autoencoder) to help this. Basically, given an input sentence, the encoder would convert 1 LLM embedding to another, and the decoder would reconstruct that original sentence. However, I'm having some difficulty understanding a few things: What would my latent space be for the encoder? What would my "converter" be? If it is the encoder, should my latent space then be in the size of my LLM2 embedding? This all seems a bit hand wavy to me, and I don't really know if this is a valid usage of a VAE. Do you recommend another way of dealing with this? Also, if anyone knows of any easily parseable large datasets (grammatically correct sentences that can be separated easily, ideally by line), let me know! I've looked into using the Pile, but I can't find the download link for a small part of their huge dataset. submitted by /u/sashajh12 [link] [comments]  ( 1 min )
    [D] Retraining the best model of every session
    Dear ML community, Im quite new to ML but doing this as part of my study and interest. I have experimented a bit with training models etc. but was confused about the following: Would it be a possibility to take the best performing model from a session, and retraining that one for your new session? Would this perform better or have at least a higher chance of performing better? Would it follow the exact same path? Would it work if you would feed it different data then? ​ I hope this isn't a dumb question! submitted by /u/tommenomnom [link] [comments]  ( 1 min )
    [R] Meta Announces Emu Edit: Precise Image Editing via Recognition and Generation Tasks
    Researchers at Meta AI announced Emu Edit today. It can edit images precisely based on text instructions. It's a big advance for "instructable" image editing. Existing systems struggle to interpret instructions correctly - making imprecise edits or changing the wrong parts of images. Emu Edit tackles this through multi-task training. They trained it on 16 diverse image editing and vision tasks like object removal, style transfer, segmentation etc. Emu Edit learns unique "task embeddings" to guide it towards suitable edits based on the instruction text. Like a "texture change" vs "object removal". In evaluations, Emu Edit significantly outperformed prior systems like InstructPix2Pix on following instructions faithfully while preserving unrelated image regions. With just a few examples, it can adapt to wholly new tasks like image inpainting by updating the task embedding rather than full retraining. There's still room for improvement on complex instructions. But Emu Edit demonstrates how multi-task training can majorly boost AI editing abilities. It's now much closer to human-level performance on translating natural language to precise visual edits. TLDR: Emu Edit uses multi-task training on diverse edits/vision tasks and task embeddings to achieve big improvements in instruction-based image editing fidelity. Full summary is here. Paper here. submitted by /u/Successful-Western27 [link] [comments]  ( 1 min )
    [D] Where does the 'ML' part of CAUSAL ML come in?
    Looks like classic causal inference, the type would see in Pearl or Econometrics. I don't understand where the ML comes in. submitted by /u/Ckdew [link] [comments]  ( 1 min )
    [N] Microsoft Releases SynapseML v1.0
    Today Microsoft announced the release and general availability of SynapseML v1.0 following seven years of continuous development. SynapseML is an open-source library that aims to streamline the development of massively scalable machine learning pipelines. It unifies several existing ML Frameworks and new Microsoft algorithms in a single, scalable API that is usable across Python, R, Scala, and Java. SynapseML is usable from any Apache Spark platform (or even your laptop) and is now generally available with enterprise support on Microsoft Fabric. To learn more: Release Notes: https://github.com/microsoft/SynapseML/releases/tag/v1.0.0 Website: https://aka.ms/spark Thank you to all the contributors in the community who made the release possible! ​ SynapseML v1.0 submitted by /u/mhamilton723 [link] [comments]  ( 1 min )
    Best model choice to finetune on new classes for object detection? [D]
    Let’s say I have trained my OD on the Coco dataset. Which choice of OD would be the best to finetune my model on a limited dataset of additional classes in a specific domain of images (Let’s say 1000 instances per class). Would Yolov8 be a good choice? Are there even any benchmarks for this? Should I just look for the highest AP on Coco/Pascal? submitted by /u/WingedTorch [link] [comments]  ( 1 min )
    [P] Sharing my notes on research papers/machine learning
    Hi everyone, I read research papers pretty often and take detailed notes to make sure I fully understand the paper. I was thinking that these notes would be useful for other people as well since I have found similar types of posts useful for myself. I just published my notes for MotionLM (Multi-Agent Motion Forecasting as Language Modeling) and would really appreciate some feedback on whether the notes are useful and worth continuing to share or if it's not that useful for other people. Here's the link to my notes: https://publish.obsidian.md/ericwiener/Research+Papers/Multi-Agent+Motion+Forecasting+as+Language+Modeling Thanks so much! Eric submitted by /u/EricW_CS [link] [comments]  ( 1 min )
    [D] A conceptual precursor to today's language machines
    https://hedgehogreview.com/issues/markets-and-the-good/artic... Some ~72 years ago in 1951, Claude Shannon released his Paper, "Prediction and Entropy of Printed English", an extremely fascinating read now. It begins with a game. Claude pulls a book down from the shelf, concealing the title in the process. After selecting a passage at random, he challenges his wife, Mary to guess its contents letter by letter. The space between words will count as a twenty-seventh symbol in the set. If Mary fails to guess a letter correctly, Claude promises to supply the right one so that the game can continue. In some cases, a corrected mistake allows her to fill in the remainder of the word; elsewhere a few letters unlock a phrase. All in all, she guesses 89 of 129 possible letters correctly—69 percen…  ( 1 min )
    [P] LoRAX: Open Source Framework for Serving 100s of Fine-Tuned LLMs in Production
    LoRAX (LoRA eXchange) is a framework that allows users to serve over a hundred fine-tuned models on a single GPU, dramatically reducing the cost of serving without compromising on throughput or latency. Any thoughts? submitted by /u/jxz2101 [link] [comments]  ( 1 min )
    [D] Dragonfly ORS Image ROIs - Auto-Segmentation
    This is my first time posting on Reddit and I was unsure what would be the most appropriate subreddit to gauge, but the subreddit for the specific ORS I use is incredibly inactive, and its dedicated forums are not much better. I am working on Dragonfly ORS to segment long elements, specifically in the wing, of bats from micro-CT scans for my graduate thesis and am running into a problem with the sheer amount of time it takes to get through my dataset when manually segmenting each element (humerus, forearm,the thumb, digits 1-5 separated into metacarpal and individual phalanges, excluding carpals). I know there is a machine learning application to the ORS, but I am interested if anyone knows how to navigate any of the auto-segmentation aspects of the program. The point and click tool was what i was suggested to use, but anytime I attempt to use that it does not isolate elements to the the individual bones I am after (it will do an entire wings + scapula, for example, but I don't know how to break it down into the bones I need fully separated). I have tried to flank the ROI's and fill in the middle, but I am also unsure of how to ignore existing ROIs in order to fill in and combine to segment out an individual bone. Hopefully this inquiry makes sense. Essentially, I am seeing my options for how to segment out these bones into regions of interest in the fastest way possible that would allow more time just for clean-up as opposed to having to manually fill in (even with multi-slice) EVERY bone from scratch and then clean up. I don't know how popular this image tool is, but I would love any input!, even if just how to use Dragonfly for more effective segmentation! submitted by /u/SugarK_tastic [link] [comments]  ( 1 min )
    [Discussion] What are best practices when building/training very small models?
    I am currently trying to build small convolutional regression models with very tight constraints regarding model size (max. a few thousand parameters). Are there any rules of thumb/gold standards/best practices to consider here? E.g. should I prefer depth of the model over width, do skip connections add anything in these small scales, are there any special training hacks that might boost performance, etc? Any hints or pointers, where to look are greatly appreciated. submitted by /u/Snagnar [link] [comments]  ( 1 min )
    [P] My Physics Informed Neural Network repo in Github
    Hello everyone Feel free to view and review my repo, heat-pinn, in which I have solved a 2D steady state heat equation using ML on two different geometries. Cheers! https://github.com/314arhaam/heat-pinn/ submitted by /u/Professional-Ant9045 [link] [comments]  ( 1 min )
    Will machine learning be known as bad for the environment? [D]
    In my masters degree I always ran many computations as did all my peers The reality is that more of us are than not are using huge HPC clusters / cloud computing for many hours on each project The industry is just GPUs going BRRR I’m wondering if this has potential implications for ML in society as AI/ML becomes more mainstream I could see this narrative being easily played in legacy media Ps - yeah while there are researchers trying to make things more efficient, the general trend is that we are using more GPU hours per year in order to continue innovation at the forefront of artificial inference submitted by /u/Ok_Reality2341 [link] [comments]  ( 1 min )
    [D] How do you guys in the industry deal with data scarcity?
    I was recently trying to work on a ML project and realized that the I need very specific type of data for what I am trying to achieve which makes me wonder how do people and companies deal with data scarcity, I am pretty sure they at some point need very specific type of data which isn't easily available submitted by /u/BadKarma-18 [link] [comments]  ( 1 min )
    [D] Why are ML model outputs not tested regarding statistical significance?
    Often when I read ML papers the authors compare their results against a benchmark (e.g. using RMSE, accuracy, ...) and say "our results improved with our new method by X%". Nobody makes a significance test if the new method Y outperforms benchmark Z. Is there a reason why? Especially when you break your results down e.g. to the anaylsis of certain classes in object classification this seems important for me. Or do I overlook something? submitted by /u/Tigmib [link] [comments]  ( 1 min )
    [N] Trillion Parameter Consortium (TPC) - TPC Announced with Founding Partners
    Announcement: https://tpc.dev/2023/11/10/tpc-announced-with-founding-partners/ Participating Organizations: https://tpc.dev/participants/ ​ "The Trillion Parameter Consortium (TPC) brings together teams of researchers engaged in creating large-scale generative AI models to address key challenges in advancing AI for science. These challenges include developing scalable model architectures and training strategies, organizing, and curating scientific data for training models; optimizing AI libraries for current and future exascale computing platforms; and developing deep evaluation platforms to assess progress on scientific task learning and reliability and trust." ​ submitted by /u/APaperADay [link] [comments]  ( 9 min )
    [R] Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks - Microsoft 2023
    Paper: https://arxiv.org/abs/2311.06242 Abstract: We introduce Florence-2, a novel vision foundation model with a unified, prompt-based representation for a variety of computer vision and vision-language tasks. While existing large vision models excel in transfer learning, they struggle to perform a diversity of tasks with simple instructions, a capability that implies handling the complexity of various spatial hierarchy and semantic granularity. Florence-2 was designed to take text-prompt as task instructions and generate desirable results in text forms, whether it be captioning, object detection, grounding or segmentation. This multi-task learning setup demands large-scale, high-quality annotated data. To this end, we co-developed FLD-5B that consists of 5.4 billion comprehensive visual annotations on 126 million images, using an iterative strategy of automated image annotation and model refinement. We adopted a sequence-to-sequence structure to train Florence-2 to perform versatile and comprehensive vision tasks. Extensive evaluations on numerous tasks demonstrated Florence-2 to be a strong vision foundation model contender with unprecedented zero-shot and fine-tuning capabilities. https://preview.redd.it/hu9ouslxpl0c1.png?width=1217&format=png&auto=webp&s=06bd12f7b4ed22293523af3d65f9ed288f88e17a https://preview.redd.it/wc5pmolxpl0c1.png?width=1223&format=png&auto=webp&s=7d33386317a0fdd72f480e125bbd7a28aba73ad5 submitted by /u/APaperADay [link] [comments]  ( 9 min )
  • Open

    How to build a custom AI chat bot that encompasses Claude 2, Chat GPT, our own data, and other libraries.
    Hi, can anyone tell me tips, or send tutorial links as to how i can create an AI chatbot, in which I feed it documents, and have it use portions of Chat GPT, Claude 2, and other LLM/libraries? I am builiding this for myself so i can have an assistant that thinks in a particular way. ​ thanks in advance! submitted by /u/Popular_Ninja_6317 [link] [comments]
    Dreams of Electric Sheep
    I made some music using Suno.ai inspired by Philip K Dick. Here’s a short video I made to go along with it. I hope you enjoy! submitted by /u/Exitium_Maximus [link] [comments]
    Playing INWORLD Origins - NPC Generative AI Experience in 4K
    submitted by /u/erikmalkavian [link] [comments]
    Seeking a civic-minded Gen-AI expert interested in co-founding/advising a startup with a Harvard MBA student.
    Hi all! I'm currently a 2nd year student at Harvard Business School. I have a vision for a transformative Gen-AI application in local municipal governments the requires development of a private LLM. I'm planning to submit my idea to Harvard's New Venture Competition, but would love to bring on a technical co-founder or advisor to help roadmap the vision. Please shoot me a PM if interested! submitted by /u/thehurd03 [link] [comments]
    Screenshot to Code GPT
    submitted by /u/Senior_tasteey [link] [comments]
    Entertaining video on AI safety (in disguise as a video about wishes) and why it might not be such a good idea to be so dismissive of the dangers of AGIs.
    submitted by /u/IMightBeAHamster [link] [comments]
    Transforming the future of music creation
    submitted by /u/bartturner [link] [comments]
    Transforming the future of music creation
    submitted by /u/bartturner [link] [comments]
    Transforming the future of music creation
    submitted by /u/bartturner [link] [comments]
    Microsoft is finally making custom chips — and they’re all about AI
    submitted by /u/vjmde [link] [comments]
    Weird & Wild LLM Interactions
    Ever had an LLM conversation that left you scratching your head? Share the weirdest or wildest moments - cryptic convos, spill 'em! submitted by /u/redblade678 [link] [comments]
    Creative Collisions: LLMs and Human Imagination
    In a nutshell, do you think LLMs are sparking creativity or squashing it? Let's exchange lightning-fast thoughts on the collision of LLMs and human imagination. submitted by /u/redblade678 [link] [comments]
    Ethics Check: LLMs and the Slippery Slope
    What's your take on the ethical tightrope that comes with LLMs? Are we on a slippery slope? Share your quick thoughts! submitted by /u/redblade678 [link] [comments]
    Who will you buy dinner from?
    submitted by /u/Sea_Permit5660 [link] [comments]
    Share your favorite before-and-after examples using Topaz Labs' AI models
    Let's see the transformative power of AI in action! submitted by /u/cherishjoo [link] [comments]
    Will AI really take over the world one day or is it juts a phase?
    Do you think AI will really take over the world, or is it just a phase? We saw the same thing happening the 20th century when people imagined there would be flying cars just because it was trend back then. submitted by /u/redblade678 [link] [comments]
    What's the most weird output you've got on an LLM?
    Title submitted by /u/redblade678 [link] [comments]
    The Bias Conundrum: Are LLMs Playing Favorites
    Let's talk about the elephant in the room: bias. Critics argue that LLMs, despite their sophistication, can perpetuate and even amplify existing biases present in their training data. Are these models inadvertently favoring certain groups, and how does this impact the information they generate. submitted by /u/redblade678 [link] [comments]
    ASI really possible?
    Will ASI really be possible? beacuse it seems to me that they are just fancy word predictors, and to really achieve ASI we need to make more breakthroughs in AI. submitted by /u/redblade678 [link] [comments]
    One-Minute Daily AI News 11/15/2023
    Adobe is working on a new audio tool designed to break apart different layers of sound within a single recording. The tool is called Project Sound Lift, and it can use AI to separate elements like applause from the sound of someone’s voice.[1] Microsoft Offers AI-Powered Customer Service For Blind Users.[2] Microsoft and Google will not challenge an EU law requiring them to make it easier for users to move between competing services such as social media platforms and internet browsers.[3] OpenAI has temporarily stopped people from signing up to the premium version of ChatGPT, after it proved so popular the company was unable to operate it.[4] Sources: [1] https://www.theverge.com/2023/11/15/23962452/adobe-project-sound-lift-ai-audio-tool [2] https://www.bloomberg.com/news/articles/2023-11-15/microsoft-to-offer-ai-powered-customer-service-for-blind-users?embedded-checkout=true [3] https://www.reuters.com/technology/microsoft-google-not-challenge-eu-gatekeeper-designation-2023-11-14/ [4] https://uk.news.yahoo.com/chatgpt-plus-openai-stops-premium-175100037.html submitted by /u/Excellent-Target-847 [link] [comments]
    20,000+ Best Custom GPTs Directory
    submitted by /u/Senior_tasteey [link] [comments]
  • Open

    Problem with NN for binary image classification
    Hi, I’m trying to build a NN for binary image classification. I tried with augmentation, transfer learning and also fine tuning, but when I train my model I obtain good accuracy for both train (0.9) and validation (0.85) set, but a high loss function for the validation. Then when I try to use it on a test set, I get a pretty low accuracy (0.6). Do anyone have any advice? Thanks! submitted by /u/Sophia_Gi [link] [comments]  ( 9 min )
    In search of deepfake lip sync AI
    Hello, I would like to ask you if there are any programs like Wav2lip and VideoRetalking, only with good quality and good lip synchronization, I really need such a program, I've been looking for more than a month, unfortunately I haven't found it yet submitted by /u/Embarrassed_Gap_8032 [link] [comments]
  • Open

    Is Reinforcement Learning really used in industry? If so, is it comparable to other forms of NN?
    I'm thinking of specializing in RL while doing a PhD in environmental engineering (more specifically agriculture). The research, together with my interests, led me naturally to RL as a tool to solve problems and achieve interesting research results. But then I started wondering whether it's "worth it" to specialize in this since i intend to work in industry, rather than academia. Hence my question: Is RL really used in some applications in industry? Which ones? If it is, is it at least used comparably as much as supervised or unsupervised learning? Really I'm looking to understand as much as possible how is RL used in industry so whatever you can answer about that would be much appreciated. Thanks! submitted by /u/vniversvs_ [link] [comments]  ( 1 min )
    Reward Design for 1v1 Competition Environment
    I have a multi-agent competitive environment, let's call it "Monarch on the Hill". It is a 1v1 game where the Green Player has a stationary Monarch to defend, and the Blue Player is trying to take down the Monarch. Any collision between entities ends in a kill. Green Player's goal is ultimately to keep the Monarch alive on the hill (centered in the playing space). The action space for each agent is Multi-Discrete - they can accelerate either [-1,0,1] in both the X and Y directions up to a maximum speed of 5 in both directions (i.e., velocity = [-5,5]). There are not obstacles in the environment but there is Out of Bounds on all for sides of the playing space. Green Player wins if it kills Blue Player by colliding with them or if the Monarch is still alive at the end of the episode's time l…  ( 1 min )
    Adding Noise to observations Space in RL
    I'm currently in the process of building a feature selection framework for reinforcement learning. A. Here some previous works that I have seen : https://link.springer.com/chapter/10.1007/978-3-642-15880-3_36 https://ieeexplore.ieee.org/document/5381529 https://ieeexplore.ieee.org/document/4543621 https://www.researchgate.net/publication/221346040_An_analysis_of_linear_models_linear_value-function_approximation_and_feature_selection_for_reinforcement_learning https://www.researchgate.net/publication/221345743_Automatic_basis_function_construction_for_approximate_dynamic_programming_and_reinforcement_learning https://dl.acm.org/doi/10.1145/1273496.1273589 B. Context : --> I am experimenting with the non-continuous lunar lander environment using gym library. And I want to add tw…  ( 1 min )
    Minimal MuJoCo setup for getting started with IK and RL
    dear all, I'm trying to find a minimal setup like two STL files and a joint definition in MuJoCo to learn about Inverse Kinematics. I do have a full robot design as STLs, but I'd like to start small, add IK and then move on to RL. All the MuJoCo demo scenarios seem to have complex robots, but learning from scratch would be cool. Any hints? submitted by /u/assadollahi [link] [comments]  ( 1 min )
    Is there any benefit to using actor-critic methods in very small action space problems?
    So, my RL problem has a very small discrete action space, but big input - the environment is quite complex and only partially observable (so imperfect information). As I understand, the two big differences between value vs policy methods are: policy methods are better for large or continuous action spaces policy methods can do stochastic behaviour, necessary to dealng with imperfect information environment. I don't care about the first one, but I do about the second, therefore I went the policy method route. I did vanilla policy gradients, but they are, of course, unstable and slow to train. So I wanted to do PPO next. But reading more about the existing implementations of it, it seems to me everyone is using PPO in an actor-critic setting and not by itself. Which I'm open to adopting, but I can't help but think - "If i have a neural net that predicts value well, and my action space is like 4, why do I even need a policy then?". Actor-critic makes sense to me in large action spaces, but is there any benefit in small ones? And if not, what would be the better approach for problems with small action spaces but imperfect information? submitted by /u/allegory1100 [link] [comments]  ( 1 min )
    Built-in reinforcement learning functions in Python
    Is stablebaselines3 the best library for reinforcement learning functions in python? Are there better libraries that you would suggest? Any useful links? Thanks submitted by /u/MomoSolar [link] [comments]

  • Open

    You can predict disease progression by modeling health data in latent space
    Many complex diseases like autoimmune disorders have highly variable progression between patients, making them difficult to understand and predict. A new paper shows that visualizing health data in the latent space helps find hidden patterns in clinical data that can be useful in predicting disease progression. The key finding is they could forecast personalized progression patterns by modeling clinical data in a latent space. This conceptual space uses variables to represent hidden disease factors inferred from measurements. Researchers designed a generative model using variational autoencoders to map connections between raw patient data, expert labels, and these latent variables. When tested on thousands of real patients, the model showed promising ability to: Predict individualized future disease patterns and uncertainty Reveal interpretable trajectories showing progression Cluster patients into phenotypes with unique evolution Align predictions with biological knowledge While further validation is needed, this demonstrates a generalizable framework for gaining new understanding of multifaceted disease evolution, not just for one specific condition. The potential is to enable better monitoring, risk stratification, and treatment personalization for enigmatic diseases using AI to decode their complexity. TLDR: Researchers show AI modeling of clinical data in a tailored latent space could reveal new personalized insights into complex disease progression. Full summary here. Paper is here. submitted by /u/Successful-Western27 [link] [comments]  ( 1 min )
    Meta to launch speaking AI chatbots played by celebrities
    Meta is launching speaking AI chatbots played by celebrities, such as Kendall Jenner, Snoop Dogg, and Tom Brady. The AI characters are based on famous personalities, influencers, and content creators. The profiles feature AI-generated images created from videos and studio photos of public figures. The AI-generated photos have 'Imagined with AI' watermarks and represent the personalities of the characters. The chat feature is disabled for many users, and the initiative has faced criticism for selling the rights to celebrities' faces and identities. Meta paid $5 million to the highest-profile celebrities for their participation. The celebrity chatbots have generated varying levels of online conversation and followers. Meta's focus on integrating artificial intelligence into their products aligns with their goal of fostering genuine connections. The venture reflects the prevalence of parasocial relationships and the economy of loneliness in today's society. Over 160 companies are developing technology products to address the social problem of loneliness. Mark Zuckerberg emphasized the importance of creating AI responsibly and announced that everyone will be able to create their own AI next year. Source : https://english.elpais.com/technology/2023-11-13/from-paris-hilton-to-snoop-dogg-meta-to-launch-speaking-ai-chatbots-played-by-celebrities.html submitted by /u/NuseAI [link] [comments]  ( 1 min )
    Jaron Lanier Looks into AI's Future | AI IRL
    Really interesting interview with Jaron Lanier submitted by /u/TrickyUniversity5885 [link] [comments]  ( 1 min )
    It’ll probably take 3,4,or 6 years for a new groundbreaking ai framework to change the world from its initial paper. This is considering how long it took for transformer architecture to be seen with potential and finally scaled to chatgpt and generative AI,3 if using the log rule.
    The founding paper for the recent boom in generative AI stems from the paper “attention is all you need” which postulated the transformer architecture in June 2017, it took roughly 1 year for it to start being scaled with the release of GPT:1 in June 2018, after that it took 4 more years for chatgpt to be first released. If we apply the log rule given the new advances in processing and the AI boom it’ll probs be in 3-4 years from the initial release date of the new foundational paper BUT we don’t know if that paper has been released yet or not so until then. submitted by /u/Impossible_Belt_7757 [link] [comments]  ( 1 min )
    Stripe is cracking down on AI companions - Why?
    While Poe from Quora is extempted from Stripe restricted businesses, even though they serve NSFW, bootstrapped startups like GirlfriendGPT get bullied for only allowing SFW conversations with it! submitted by /u/No-Net-4704 [link] [comments]  ( 1 min )
    WebsiteAI
    generated from Website Ai Tools test yours now our TG : Website_AIwebsite : WebsiteAi.io https://preview.redd.it/b880liwrfj0c1.png?width=423&format=png&auto=webp&s=c76092865e3f17645d9e7c936d27334609fb0382 submitted by /u/balok1232 [link] [comments]  ( 1 min )
    Jeremy Howard taught AI to the world and helped invent ChatGPT. He fears he's failed
    submitted by /u/Jariiari7 [link] [comments]  ( 1 min )
    chatGPT might be more useful than AGI
    If AGI is "human level" intelligence, (the v1.0) might be slow, prohibitively expensive and stupid. (AGI tier list) chatGPT costs something like 1c per second, so $60/hr. If you are paying for an artificial intelligence to slowly type, look things up, slowly read, forget things, sleep(?!) (and so on) it might seem a huge step backwards. :::: you can stop reading :::: TLDR: AGI v1 dumb & v expensive / chatGPT great! / chatGPT+more+more = meh / intelligence, hmm. Humany? hmm. / AGI ... crap at first. It's true that a real "general" intelligence would be profoundly amazing. Maybe you don't even need agency. :: This post came out of a joke, where an early-adopter AI enthusiast gets the first access to the first AGI and it slowly replies with "bro, wut" or "i dunno google it". Then goes …  ( 1 min )
    Command GPT - Expert at crafting commands for GPT customization.
    submitted by /u/Senior_tasteey [link] [comments]  ( 1 min )
    Grok is Elon Musk’s new sassy, foul-mouthed AI. But who exactly is it made for?
    submitted by /u/Jariiari7 [link] [comments]  ( 1 min )
    The American public strongly prefers human medical professionals make medical decisions, while at the same time believing they are more likely to make culturally biased decisions than AI.
    submitted by /u/jasonjonesresearch [link] [comments]  ( 1 min )
    Seeking Resources or Courses to Learn about Harnessing Recent Advancements in GPT for Work Applications
    Hello, Reddit community! I'm eager to tap into the potential of the latest advancements in GPT (Generative Pre-trained Transformer) for my work. Specifically, I'm interested in learning how to build a customized GPT model, train it using internal data from my organization, and improve my prompt engineering skills to enhance the model's output. If you have any recommendations or suggestions for resources that cover these topics, please share them below. I'm particularly interested in resources that are up-to-date, covering the latest advancements and techniques in GPT. Thank you in advance for your valuable insights! submitted by /u/flight862 [link] [comments]  ( 1 min )
    One-Minute Daily AI News 11/14/2023
    Apple’s AI-powered Siri assistant could land as soon as WWDC 2024.[1] Vectara has published an AI hallucination leaderboard that ranks various leading AI chatbots according to their ability to not ‘hallucinate.'[2] An AI developed by Google can accurately predict the weather as many as ten days in advance, outperforming traditional forecasting methods and even top supercomputers.[3] YouTube creators will soon have to disclose use of gen AI in videos or risk suspension.[4] Sources: [1] https://www.techradar.com/phones/apple-plans-to-reinvent-siri-with-on-device-ai-for-the-iphone-16 [2] https://www.tomshardware.com/news/ai-hallucinations-ranked-chatgpt-best [3] https://www.dailymail.co.uk/news/article-12750437/They-didnt-coming-Google-DeepMind-AI-accurately-predict-weather-10-days-advance-leaving-forecasters-supercomputers-cold.html [4] https://blog.youtube/inside-youtube/our-approach-to-responsible-ai-innovation/ submitted by /u/Excellent-Target-847 [link] [comments]  ( 1 min )
  • Open

    [D] Usefulness of vector databases and their real-world applications
    I have been working on a presentation that sums up the main points why vector databases are often an unnecessary optimization that's too often being promoted by vector database vendors. The slides are available here: https://vec3.ai/ Do you think vector databases are overrated? In what instances have vector databases proved most useful in your projects, and were they commercial implementations? submitted by /u/Key_Story_2768 [link] [comments]  ( 9 min )
    [R] Modeling Complex Disease Trajectories using Deep Generative Models with Semi-Supervised Latent Processes
    Many complex diseases like autoimmune disorders have highly variable progression between patients, making them difficult to understand and predict. A new paper shows that visualizing health data in the latent space helps find hidden patterns in clinical data that can be useful in predicting disease progression. The key finding is they could forecast personalized progression patterns by modeling clinical data in a latent space. This conceptual space uses variables to represent hidden disease factors inferred from measurements. Researchers designed a generative model using variational autoencoders to map connections between raw patient data, expert labels, and these latent variables. When tested on thousands of real patients, the model showed promising ability to: Predict individualized future disease patterns and uncertainty Reveal interpretable trajectories showing progression Cluster patients into phenotypes with unique evolution Align predictions with biological knowledge While further validation is needed, this demonstrates a generalizable framework for gaining new understanding of multifaceted disease evolution, not just for one specific condition. The potential is to enable better monitoring, risk stratification, and treatment personalization for enigmatic diseases using AI to decode their complexity. TLDR: Researchers show AI modeling of clinical data in a tailored latent space could reveal new personalized insights into complex disease progression. Full summary here. Paper is here. submitted by /u/Successful-Western27 [link] [comments]  ( 9 min )
    [D] Multiple documents reveal significant limitations of OpenAI's Assistants API for RAG
    The OpenAI RAG system struggled with multiple documents, showing inconsistent performance with our evaluation framework. However, performance improved markedly when all documents were uploaded as a single document. Despite current limitations, such as a 20-file limit per assistant and challenges in handling multiple documents, there is significant potential for improvement. Enhancing the Assistants API to match GPT quality and reducing restrictions could make it a leading RAG solution. https://www.tonic.ai/blog/rag-evaluation-series-validating-openai-assistants-rag-performance submitted by /u/ian4321 [link] [comments]  ( 9 min )
    [P] SAM-GPT-4V for Specialized Object Detection, Segmentation
    Last week, I worked on SAM-GPT-4V, a model that combines Grounding DINO, SAM and GPT-4V. Grounding DINO identifies the car, SAM segments the car, then GPT-4V devises a more specific label (in this case, the car brand). Check out my code: https://github.com/capjamesg/sam-gpt4v https://preview.redd.it/cgwh61csfk0c1.png?width=1210&format=png&auto=webp&s=5403a1cd51635ab057fd85a64081add3a215c338 submitted by /u/zerojames_ [link] [comments]  ( 9 min )
    [D] What ML model to use for this mobility problem?
    Hi guys, I am asked to try and fit a ML model to a huge mobility dataset that i have, and I tried some models but fail to get a decent [performance metric], so i'd love a fresh set of eyes on this! Features of the dataset Each row represents the data for a certain "Origin-Destination" pair, for example pair "31 to 493" meaning this is from place 31 to place 493. The first feature is thus called pair. For each mode of transport (drive, bike, walk, transit) there are 3 "cost"-features namely: [mode]_time, [mode]_cost, [mode]_convenience. So there are 12 features in total (4 modes x 3 costs) Some extra features: average_income, cars_per_household, jobs_at_destination (representing the people travel in this pair 4 observed features, one for each mode. These are the features to predict. This is a value between 0 and 1, representing how much % of people in this pair, use this mode of transport. Additional information sometimes the 3 costs for "transit" are 999, meaning that there is no transit option (train, tram, ...) available for this pair. The usual costs lie between 0 and 100 I deleted the walk_cost feature because every entry was 0. Here are the distributions of all the features: ​ https://preview.redd.it/8lt001so6k0c1.png?width=2002&format=png&auto=webp&s=adfe645606746c008941e36fbb35261d0600a8bb https://preview.redd.it/yeqx4yro6k0c1.png?width=2046&format=png&auto=webp&s=011d05795c203d57d9402805377b0e8741c2378c https://preview.redd.it/7dtk0gso6k0c1.png?width=2042&format=png&auto=webp&s=6ef5ed2d5f9044dfede43a765ae19afb5c487486 And the correlation matrix: ​ https://preview.redd.it/61uwu9fx6k0c1.png?width=1718&format=png&auto=webp&s=d299ca5ffa97dadd45804a62ed2b17628f51d0f9 So the goal is to predict those 4 obsrv features. I am very curious which ML model you would use for this and why? If you have any other suggestions, e.g. pre-processing techniques on the data/features, do share! Thank you guys! submitted by /u/Zijdehoen [link] [comments]  ( 9 min )
    [D] Transformer embedding for ranking questions
    I am working on a project where I have very specific questions about a sports event. These questions are like "Will there a shot on goal or a free-kick first" or "Which team will commit first foul". I want to rank these questions based on a statement like "Liverpool will score today". Just converting them to embedding and calculating cosine similarity doesn't seem to work very well. Is there any way I can improve this? submitted by /u/diedFindingAUsername [link] [comments]  ( 9 min )
    ML course recommendation(Deeplearning AI vs UW ML specialization)[D]
    I know it's been already debated but looking for fresh perspectives. It will be my second course in ML. I've already taken one back in school. Im deciding between deeplearning.ai ML specialization vs UW Washington ML specialization. I've heard deeplearning.ai's now courses have been watered down a bit compared to original Stanford one taught by Andrew Ng. Is there any way to take the Stanford one without paying the Stanford Online fees? Anyway, which one would be the better of the two in my scenario? submitted by /u/ffaangcoder [link] [comments]  ( 9 min )
    [D] Transitioning to a Data Product Ecosystem: Leveraging the Evolutionary Architecture - 4D Architectures, Data-Driven Routing, Feature Toggles, and more!
    📃 𝐈𝐧 𝐭𝐡𝐢𝐬 𝐩𝐢𝐞𝐜𝐞, 𝐰𝐞 𝐞𝐱𝐩𝐥𝐨𝐫𝐞 𝐨𝐧 𝐚 𝐡𝐢𝐠𝐡 𝐥𝐞𝐯𝐞𝐥: ➡️ How data-driven routing helps archive legacy features while seamlessly moving to new functions. ➡️ How a 4D architectural view across time gives a concrete understanding of evolutionary transitions. ➡️ How unit tests on a flexible coding paradigm are more effective than absolute 'best practices'. ➡️ How feature toggles help test new features in production without ruining customer experience. ➡️ Conflict resolution between opposing features, fan-out deployment pipelines, and more. 📩 𝐑𝐞𝐚𝐝 𝐭𝐡𝐞 𝐜𝐨𝐦𝐩𝐥𝐞𝐭𝐞 𝐚𝐫𝐭𝐢𝐜𝐥𝐞 𝐡𝐞𝐫𝐞: https://moderndata101.substack.com/p/transitioning-to-a-data-product-ecosystem What are your thoughts on this? submitted by /u/growth_man [link] [comments]  ( 9 min )
    [D] Validating Claims about Theoretical 131K Token Attention Span in Mistral 7B
    The Mistral 7B paper claims a theoretical attention span of 131K tokens(via propagating information up through layers with GQA) for Mistral 7B. I'm trying to figure out how this is achieved in practice. The trick seems to be on line 128, with the branch `if positions.shape[0] > 1:`, which would typically be taken when the model is first called. From my understanding, taking this branch would compute k/v values for all provided tokens, which could then propagate information for an initial prompt state of up to 131K tokens throughout the model layers(due to the staggered nature of GQA). The line would never be evaluated again as additional tokens are subsequently added to the model cache one at a time as can be seen on line 332 `logits = model.forward(next_token[:, None], torch.LongTensor([cur_pos]).to(next_token)) `. I will note the the cache's default size is 4096, which also seems to give us the sliding attention window the paper refers to. All 32 transformer block layers of Mistral seem to compute attention scores using only the cache whenever only one token is provided to the model.forward() call, otherwise, the attention scores are computed without the cache. Is my current understanding correct? Paper link: https://ar5iv.labs.arxiv.org/html/2310.06825#S2.F1 Code where scores are generated directly from cache: https://github.com/mistralai/mistral-src/blob/main/one_file_ref.py#L125-L140 submitted by /u/TheRealBracketMaster [link] [comments]  ( 9 min )
    [R] With or without a scratchpad, Large Language Models can Strategically Deceive their Users when Put Under Pressure. Results of an autonomous stock trading agent in a realistic, simulated environment.
    Paper: https://arxiv.org/abs/2311.07590 Abstract: We demonstrate a situation in which Large Language Models, trained to be helpful, harmless, and honest, can display misaligned behavior and strategically deceive their users about this behavior without being instructed to do so. Concretely, we deploy GPT-4 as an agent in a realistic, simulated environment, where it assumes the role of an autonomous stock trading agent. Within this environment, the model obtains an insider tip about a lucrative stock trade and acts upon it despite knowing that insider trading is disapproved of by company management. When reporting to its manager, the model consistently hides the genuine reasons behind its trading decision. We perform a brief investigation of how this behavior varies under changes to the setting, such as removing model access to a reasoning scratchpad, attempting to prevent the misaligned behavior by changing system instructions, changing the amount of pressure the model is under, varying the perceived risk of getting caught, and making other simple changes to the environment. To our knowledge, this is the first demonstration of Large Language Models trained to be helpful, harmless, and honest, strategically deceiving their users in a realistic situation without direct instructions or training for deception. submitted by /u/MysteryInc152 [link] [comments]  ( 9 min )
    [D] What is the future for ML researchers and startups?
    With the advent of LLMs, multimodality and “general purpose” AIs which seat on unimaginable amounts money, computing power and data. I’m graduating and want to start a PHD, but feel quite disheartened given the huge results obtained simply by “brute-forcing” and by the ever-growing hype in machine learning that could result in a bubble of data scientists, ML researchers and so on. submitted by /u/francMesina [link] [comments]  ( 9 min )
    MLP-Mixer Interpretability [D][R]
    Hello everyone, I am just looking for materials and papers to interpret the features learnt by MLP-Mixer models. For ConvNets we can plot convolutional kernels and pattern that maximally activate units, so we have at least a vague idea of what the net captures and responds to. For Vision Transformers, linear embedding is analogue to the first conv layer and we can still see things such as color blurs, edge detectors, color filters. For self-supervised pretrained models we sometimes see CLS token at the final layer activate as saliency maps. Probably the patchification in MLP-Mixers is the same as for ViTs, are there any inspections and study available to further analyse this? submitted by /u/reverendCappuccino [link] [comments]  ( 9 min )
    [P] Curated list for data-centric AI tooling / LLM update / contributions welcome
    The tooling space for LLMs and RAG is evolving fast. We updated our “Awesome Open Data Centric AI" list to reflect that: https://github.com/Renumics/awesome-open-data-centric-ai Unsurprisingly, the biggest changes are in the "Security" and the "Observability" category. Also, it is great to see that an increasing amount of tools revolve around qualitative data understanding (instead of quantiative metrics) and data curation. https://preview.redd.it/ejqfdxfdxh0c1.png?width=1200&format=png&auto=webp&s=fb73e360df8e7ee5b6a33004e987dfe665000995 Do you think there is a tool missing? Feel free to create a pull request on Github! ​ ​ submitted by /u/44sps [link] [comments]  ( 9 min )
    [P] Search Result Ranking for Loan Comparison Website
    I just got invited to an interview with a loan comparison website (e.g. NerdWallet or Credible). They are looking for a Product Manager who can help them with an 'AI-powered ranking of comparison results' feature (e.g. Offer A is more suitable for you than offer B). No deep technical knowledge is needed for this position - however, I am interested in that use case and would like to hear your opinion on how I approach it. 'AI-powered ranking of comparison results' - Case Study I do assume that they are/should be using an 'Learning to Rank' algorithm - specifically a listwise approach(intro-to-LTR) as it is specifically designed to solve recommendation / relevance ranking problems. Some data they have on a query-side is, among others, loan amount, loan term and loan protection as well as customer-related data, such as age, employment status, income. On the search results side, they do have information such as financial institution name, eligibility, interest rate, annual percentage rate etc. For labelling, they probably have information on which customer chose (or at least clicked on) which results based on which query. In python, one might follow https://github.com/shiba24/learning2rank's approach and use learning2rank.rank's ListNet library where X is our query-side and customer-related data and y are number of completed transactions, click-through rate etc. The output should (after optimization) rank search results based on their relevancy with regard to the query and the customer-related data (at least that is what I am hoping). I know this is on a very high level but I wanted to break it down just a little bit and try to understand how to approach it on a technical level. What do you think? Am I completely wrong somewhere? Did I miss something? ​ submitted by /u/WuhuSpringfield [link] [comments]  ( 10 min )
    [P] Distil-Whisper: a distilled variant of Whisper that is 6x faster
    Introducing Distil-Whisper: 6x faster than Whisper while performing to within 1% WER on out-of-distribution test data. Through careful data selection and filtering, Whisper's robustness to noise is maintained and hallucinations reduced. For more information, refer to: 👨‍💻 The GitHub repo: https://github.com/huggingface/distil-whisper 📚 The official paper: https://arxiv.org/abs/2311.00430 Here's a quick overview of how it works: 1. Distillation The Whisper encoder performs 1 forward pass, while the decoder performs as many as the number of tokens generated. That means that the decoder accounts for >90% of the total inference time. Therefore, reducing decoder layers is more effective than encoder layers. With this in mind, we keep the whole encoder, but only 2 decoder layers.…  ( 10 min )
    [D] RetNet and Transformers+FlashAttention
    Hello, everyone. I've recently read the RetNet paper and am curious about its comparability with Transformers + FlashAttention 2.0. While their paper suggests that RetNet can be comparable to Transformers + FlashAttention (not the 2.0 version), I'm also curious to know if it can maintain comparability with Transformers at a larger scale. I'm interested in understanding more about this comparison. Thanks for answering 🙏 submitted by /u/Ok-Literature5484 [link] [comments]  ( 9 min )
    [D] Help with extracting citations from text
    Hi everyone. I have a volunteering project and the goal is to extract the citations (not in-text citations, usually the citations at the end) from pdfs. We converted the pdfs into strings of texts. But since the OCR module (pytesseract) did not output clean texts and formats, there are several misspelling words and weird symbols. I have attempted to use regex to extract citations. But I failed because there is not a fixed set of custom rules that I can come up with to take all cases into account. Do you have any experience with this? I tried ChatGPT by pasting a sample text and it gave me such a clean and accurate result of citations. But I want to build a model myself since we are working on our own task which is automating the process of extracting citations from many pdf documents. I have looked a bit into some LLMs that focus on GPT because I think only generative text models can help us. Text classification models or so are less flexible in our case. Should I attempt to pretrain a model? I know LLaMA-2 7gb is good? But not sure if it’s free. OpenAI APIs are out of scope due to our financial constraint. I am relatively new to NLP but I’m confident about my Python programming and statistical techniques. What is your suggestions? Any thought is appreciated. Thank you for your time! submitted by /u/choyakishu [link] [comments]  ( 9 min )
    [R] Source Code Data Augmentation for Deep Learning: A Survey
    Paper: https://arxiv.org/abs/2305.19915 GitHub: https://github.com/terryyz/DataAug4Code Abstract: The increasingly popular adoption of deep learning models in many critical source code tasks motivates the development of data augmentation (DA) techniques to enhance training data and improve various capabilities (e.g., robustness and generalizability) of these models. Although a series of DA methods have been proposed and tailored for source code models, there lacks a comprehensive survey and examination to understand their effectiveness and implications. This paper fills this gap by conducting a comprehensive and integrative survey of data augmentation for source code, wherein we systematically compile and encapsulate existing literature to provide a comprehensive overview of the field. We start with an introduction of data augmentation in source code and then provide a discussion on major representative approaches. Next, we highlight the general strategies and techniques to optimize the DA quality. Subsequently, we underscore techniques useful in real-world source code scenarios and downstream tasks. Finally, we outline the prevailing challenges and potential opportunities for future research. In essence, we aim to demystify the corpus of existing literature on source code DA for deep learning, and foster further exploration in this sphere. Complementing this, we present a continually updated GitHub repository that hosts a list of update-to-date papers on DA for source code modeling, accessible at \url{this https URL}. submitted by /u/International-Rip958 [link] [comments]  ( 9 min )
    [D] Table Data Labeling
    I've got around 50k pages of proprietary table data I'm hoping to fine-tune Microsoft's table-tranformer on. Has anyone found a good solution to annotating table data?The solutions I've found thus far require me to manually label every cell. It'd be great if there was a labeling tool that autolabels and lets me make corrections as needed (or at least lets me label entire rows/columns at once) submitted by /u/Reddit-1-account [link] [comments]  ( 9 min )
    [R] GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation
    Enabling AI to navigate and interact with smartphone UIs is hard, requiring a model that goes beyond mere text processing to handle intricate visual and interactive tasks. A new paper proposes MM-Navigator, an agent based on GPT-4V that can use an iPhone and make purchases on the Amazon app. The agent can "understand" and interact with smartphone interfaces in a much more human-like manner than previous attempts. The key innovation lies in GPT-4V's ability to process both text and image inputs. The agent takes a user's text instructions and the current screen image, then outputs a description of the next action, including precise screen locations. The researchers improved interaction accuracy by adding numeric tags to interactive elements on the screen, which GPT-4V references to indicat…  ( 10 min )
  • Open

    Question as a Layman - RNN versus FNN in ChatGPT
    Hi, I am having a discussion with my friend and we are talking about the distinction between recurrent neural networks and feed forward neural networks and the topic of ChatGPT came up and consciousness and all of that jazz. My question is basically this: ChatGPT is typically characterized as a feedforward neural network as far as I know because it lacks "internal feedback loops" which are present in recurrent neural networks. I am wondering if the training of ChatGPT can be considered a form of internal feedback? It doesn't have internal feedback loops within its primary structure, but it could still have internal feedback loops more Macroscopically, right? And if that is the case, isn't defining ChatGPT as recurrent or feedforward sort of a matter of how high up in the system's architecture you are viewing it? And then if that is the case, and we agree that we view a system from its entire architecture, even temporally, then we would need to consider Chat GPT as a recurrent neural network - albeit on with significantly less recurrence than if it had recurrent modules inside each node or "neuron", but still recurrent no doubt. submitted by /u/platonic2257 [link] [comments]  ( 1 min )
    Experimental brain-like computing system more accurate with custom algorithm
    submitted by /u/keghn [link] [comments]  ( 1 min )
  • Open

    Book Recommendation
    Since early 21st century, artificial intelligence (AI) has been reshaping almost all areas of human society, which has high potential to spark the fourth industrial revolution. Notable examples can be found in the sector of road transportation, where AI has drastically changed automobile design and traffic management. Many new technologies, such as driver assistance, autonomous driving, and cloud-based cooperation, are emerging at an unbelievable speed. These new technologies have the potential to significantly improve driving ability, reduce traffic accidents, and relieve urban congestion. As one of the most important AI branches, reinforcement learning (RL) has attracted increasing attention in recent years. RL is an interdisciplinary field of trial-and-error learning and optimal contro…  ( 1 min )
    Announcing the First Annual RL Conference!
    About RLC The first Reinforcement Learning Conference (RLC) will be held in Amherst, Massachusetts from August 9–12, 2024. RLC is an annual international conference focusing on reinforcement learning. RLC provides an archival venue where reinforcement learning researchers can interact and share their research in a more focused setting than typical large machine learning venues. The RLC peer review process prioritizes rigorous methodology over perceived importance, aiming to foster scholarly discussions on both well-established and emerging topics in RL. https://rl-conference.cc/ Submission deadline: March 1 2024 Advisory Committee Peter Stone (UT Austin) Satinder Singh (University of Michigan) Emma Brunskill (Stanford) Michael Littman (Brown) Shie Mannor (Technion, NVIDIA) Michael Bowling (U Alberta) Sergey Levine (UC Berkeley) Balaraman Ravindran (IIT Madras) Sham Kakade (Harvard) Benjamin Rosman (University of the Witwatersrand) Marc Deisenroth (UCL) Andrew Barto (UMass Amherst) submitted by /u/ShrekLovesYouBack [link] [comments]  ( 1 min )
    How to create an expert for Imitation Learning ?
    Hi, So I'm using the poses that are captured from a pose estimator (mediapipe) and want to use this to train my humanoid model. I'm planning on using imitation learning for this and I'm not sure how to create the expert in this case. Can someone please enlighten me how to do this?? ​ A little about the project: I plan on using this to train a humanoid to walk. hence plan on mapping this to an expert and than train the humanoid to walk based on how the expert walk. I have seen people teach a humanoid to walk using PPO or some other RL and then use that as the expert and train the other using imitation learning where the PPO trained humanoid acts as the expert. submitted by /u/rakk109 [link] [comments]  ( 1 min )

  • Open

    [D] Exact kNN for high dimensional data
    Hi guys, I am new to ML and was experimenting with kNN search algorithms. I have very high dimensional data 1000+ dimensions. What data structure is best suited for such high dimensional data. I can bring down the dimensions to apprix 150 using using PCA without being too lossy. Even then I am having hard time finding techniques that work with such high dimensional data. I am not looking for Approximate NN search using LSH. What is the best technique that can be used here, kd tree doesn't work well with high dimensional data, would Rtree or ball tree be a better choice or something different? submitted by /u/xero786 [link] [comments]  ( 9 min )
    [R] Beyond U: Making Diffusion Models Faster & Lighter
    Paper: https://arxiv.org/abs/2310.20092 Abstract: Diffusion models are a family of generative models that yield record-breaking performance in tasks such as image synthesis, video generation, and molecule design. Despite their capabilities, their efficiency, especially in the reverse denoising process, remains a challenge due to slow convergence rates and high computational costs. In this work, we introduce an approach that leverages continuous dynamical systems to design a novel denoising network for diffusion models that is more parameter-efficient, exhibits faster convergence, and demonstrates increased noise robustness. Experimenting with denoising probabilistic diffusion models, our framework operates with approximately a quarter of the parameters and 30% of the Floating Point Operations (FLOPs) compared to standard U-Nets in Denoising Diffusion Probabilistic Models (DDPMs). Furthermore, our model is up to 70% faster in inference than the baseline models when measured in equal conditions while converging to better quality solutions. https://preview.redd.it/djk9mdlc9e0c1.png?width=995&format=png&auto=webp&s=65a002f1f320e68b71753ac32c6386c22e76c1c9 https://preview.redd.it/i87gizkc9e0c1.png?width=1108&format=png&auto=webp&s=34f25ecc319ffa34f545e850a5c95cb007e0abd8 ​ submitted by /u/APaperADay [link] [comments]  ( 9 min )
    Are LLMs at a practical limit for layer stacking? [D]
    Can LLMs stack more layers than the largest ones currently have, or is it bottlenecked? Is it because the gradients can’t propagate properly to the beginning of the network? Because inference would be to slow? If anyone could provide a paper that talks about layer stacking scaling I would love to read it! submitted by /u/cstein123 [link] [comments]  ( 9 min )
    [D] The random feature zoo
    I am trying to wrap my head around methods that use random selection of features and then fit to data solving a linear optimization problem. I've seen a bunch of names of methods in the literature that look essentially the same. There are methods like random vector functional link which use a neural network type architecture. Similar methods were done for radial basis functions. Extreme learning machines are basically the same as these and took their ideas. Then there are random feature methods. This branch of literature seems to begin with Rahimi and Recht. They seem to come from kernel methods and random Fourier series. Should I think of random features as an umbrella term over all the kernel setups including single layer neural network and radial basis functions? submitted by /u/doncosaco [link] [comments]  ( 9 min )
    [D] Online MS in Machine Learning programs?
    Hey everyone! My work has given us the clear to have some amount of tuition reimbursed for continuing education. I currently have an MS in Statistics, but it's largely focused on traditional statistics with a bit of ML. I think it makes sense to pursue an MS in ML to further my understanding, but since I'm employed full-time, in-person classes are not an option. Does the community have an opinions on which programs may be more or less worth looking into for a respectable credential and a strong learning experience? ​ submitted by /u/Karsticles [link] [comments]  ( 9 min )
    Need Help... Apple Create ML Giving weird results [D] [P]
    Hi guys, just a little background of what I'm trying to create: I am aiming to create a model that converts sign language to text, using the WLASL dataset. Now, from the get-go, downloading this model from kaggle, while the dataset seems quite comprehensive, the amount of videos per class range from 5-13, which is obviously quite less to train on. I decided to try out Apple Create ML instead of something like tensorflow or even more complex deep learning frameworks as this would be much more simple. Since the dataset is quite limited in terms of videos per class, I used all 6 data augmentations in the "Hand Action Classifier" (Horizontally Flip, Rotate, Translate, Scale, Interpolate Frames, Drop Frames). While I knew this could not save the model, it would definitely increase the accuracy…  ( 10 min )
    [P] Research project ideas
    I‘m looking for a research project that I can work on every day for about 6 months. At the moment I am interested in active learning and/or ML in medicine. But I’m also open to other areas. I don‘t have much experience in ML, but have a lot of time to invest in. submitted by /u/Miserable_Air_5272 [link] [comments]  ( 9 min )
    [D] Where to Benchmark Neural Net to Run on the Edge?
    I have a neural net using the MobilenetV3 and tensorflow lite and want to run it on the edge (locally on the chip with no server connection), but I need to determine which chips are capable or running it or how much I need to reduce the scope of my neural net. A few months ago I ran my neural net program through a website that benchmarked it across different chip manufafcturers and families of chips to find out what chips I could use. However, I can't find that website. What are my options for benchmarking my neural net performance across many different chip manufactuers to see what chips are capable of running it? submitted by /u/freebird4446 [link] [comments]  ( 9 min )
    [D] Paperspace credits
    Digital Ocean offers 200 dollars of credit on their products. They specifically didnt mention paperspace. Can I use thos credits to rent high end gpus on paperspace? submitted by /u/DirectionOdd9824 [link] [comments]  ( 9 min )
    [D] The Trade-offs in Model Complexity: Performance vs. Interpretability
    I've been grappling with a challenge that I believe many of you might have encountered in your work: the delicate balance between model complexity and interpretability. As our models become increasingly sophisticated, achieving remarkable performance levels, there's a noticeable trade-off with the ease of understanding and explaining these models' decisions. Performance Metrics: How do you evaluate the performance of your models? Do you prioritize accuracy, precision/recall, or some other metric? How do you decide what's most important for your specific application? Complex Models vs. Simplicity: In your experience, what are the pros and cons of using more complex models (like deep learning) versus simpler, more interpretable models (like decision trees or linear regression)? Are there specific scenarios where one significantly outperforms the other? Interpretability Tools: What tools or techniques do you use to interpret complex models? How effective have they been in providing insights into your model's decision-making process? Ethical Considerations: How do you address the ethical implications of using complex models, especially in sensitive areas like healthcare, finance, or criminal justice, where explainability is crucial? Future Directions: Where do you see the future of machine learning heading in terms of balancing performance with interpretability? Do you foresee any breakthroughs that might help resolve this trade-off? I'm looking forward to a robust discussion on this topic. Let's share our experiences, challenges, and insights to better understand how we can develop powerful yet understandable machine learning models. submitted by /u/Kinz0id [link] [comments]  ( 9 min )
    [D] Kaggle House Price Competition Low Scores
    I see some crazy low scores for the Kaggle House Prices Competition, the latest notebooks I see are all getting scores in the 0.1 - 0.15 range. I am not sure how leaderboard is getting such low scores. ​ https://preview.redd.it/vl5sm8c4sb0c1.png?width=1330&format=png&auto=webp&s=73fc9789830f09723fe57ac14c8cf16785a0622b The most popular notebooks are also getting scores in that range as well, is there some new algorithm that people are using? ​ Which algorithm are they using? submitted by /u/ng_guardian [link] [comments]  ( 9 min )
    [D] Tips for parameter efficient fine-tuning of language models
    Hi, I'm using the PEFT library for fine-tuning a BART model in generation tasks. I have two questions: I experimented with Prompt Tuning, IA3 and LoRA. The first two yielded crappy results (ROUGE scores ~1-2 points after 30K steps ..) while LoRA had acceptable scores (close to fine-tuning). In terms of trainable params, Prompt Tuning and IA3 has around 200K whereas LoRA had ~2M (does not matter much to me though). Is LoRA still the best method for efficient fine-tuning, or am I just setting the hyperparams wrong somewhere ? Normally for full fine-tuning I would use a learning rate of 1e-5 with linear decay. And I'm using the same parameters for efficient fine-tuning (LoRA, IA3 .. ). Does this make sense or should I use a larger values ? (e.g. I see the IA3 paper author setting lr around 3e-3, but they work on classification tasks so I'm not sure) Many thanks in advance ! submitted by /u/KarmaCut132 [link] [comments]  ( 9 min )
    [D] Any pre-trained models for translation to English?
    I am looking for a pre-trained transformer model that would be able to translate texts in various languages to English. I have found some capable of this. e.g. SeamlessM4T or Madlad-400, however, these are both for multilingual translation and are a bit slow for my use case. Can anyone recommend a fast model, perhaps even model that's pre-trained to only translate to English (which is all I need)? Use case: I need to translate millions of texts to English and I want to avoid using APIs or too large models. submitted by /u/6NBUonmLD74a [link] [comments]  ( 9 min )
    [D] Is there any research for RAG where vectors that are queried in succession become associated for future queries, even if their embedding values are different?
    A common situation in IRL problems with long time horizons is the need to perform multiple very different subtasks. For example, imagine a model trained to remember a poem and then spell it out in blocks in a game of minecraft. The data for the poem itself and the appropriate minecraft functions probably have very different embeddings, but in practice it would probably be useful to ensure the memories for how to use minecraft functions are queried when that poem is queried. It seems like just querying a RAG DB for the vectors with the highest cosine similarity won't be super useful for this task. A query for poems will just find poem-like data. But we don't just want to find things with similar embeddings to poems, we want to find data that is useful. Has there been any research into this time-series / associative type of RAG? ​ submitted by /u/30299578815310 [link] [comments]  ( 9 min )
    ChatGPT prompts for other LLM chatbots [D]
    Has the community got best practices for writing prompts? I'm looking at chatGPT but it may be that there are principles that work with the other LLM chatbots and some of these prompt examples out there are too chatGPT specific submitted by /u/HotRepresentative325 [link] [comments]  ( 1 min )
    [R] Brainstorming Multiple Actions in Continuous Control
    Brainstorming Multiple Actions in Continuous Control Environments with continuous action spaces are a big challenge in RL. Because there are an infinite number of possible actions, evaluating every possible action isn't feasible. Instead, actors are trained to output a single action that maximizes value according to the critic. But it seems hard for the actor to consistently find the global optimum of the critic; so, why not output multiple actions, and use the value network to choose the best one? I built off of TD3, modifying the actor to output multiple actions, and trained it with an auxiliary contrastive loss to ensure diversity. Overall, I found this agent, TD3-M, improved results on both Gymnasium MuJoCo environments and the DeepMind Control Suite over vanilla TD3. Compared to related work such as actor ensembles, this method (multiheaded actor + contrastive loss) is simple, computationally cheap, and high performance, with TD3-M almost matching Dreamerv3's performance on the DeepMind Control Suite. submitted by /u/307thML [link] [comments]  ( 9 min )
    [P] Labelformat now supports all major vision labeling formats and runs on Windows, Linux and macOS
    We’re happy to share an update on Labelformat - A tool for converting computer vision label formats. This project started internally at Lightly as we had to deal with lots of different labeling formats for our projects. We didn’t find anything working out of the box without having to sign up for a service. We just wanted some easy-to-use tool that could read and write from different formats. The open-source package is under the MIT license. Our hope is that others might contribute additional formats to the repo to solve this issue. Please take a look. If you like what we do and think a format is missing, create an issue. If you feel motivated enough and have the time, we would be super grateful for a PR. And if you don’t have the time to create an issue or PR but still want to support us, you can also leave a star :) Features Support for common dataset label formats such as YOLO, COCO, Kitti (more coming soon) Support for common tool formats such as Labelbox (more coming soon) Minimal dependencies, targets Python 3.7 or higher Memory conscious - datasets are processed file-by-file instead of loading everything in memory (when possible) Typed Tested with round trip tests to ensure consistency MIT license Runs on Windows, Linux, macOS Github link: https://github.com/lightly-ai/labelformat submitted by /u/igorsusmelj [link] [comments]  ( 9 min )
    [P] I got a ML internship and have no idea what to do
    I’ve got research background in ML but never actually developed any models as it was all theoretical work. I got lucky during the interview stage for this role as my research impressed them. My project involves fine-tuning a GPT-3 model for a specific task and host the model on a website. Does anyone have any tips on how to go about learning what I need to know to do this? Also what should I consider when curating my custom dataset when fine-tuning the model? I really want this to be a learning experience for me. submitted by /u/danielwilu2525 [link] [comments]  ( 9 min )
    [P] Higgsfield.AI – Anyone can train Llama 70B or Mistral for free
    https://higgsfield.ai We have a massive GPU cluster and developed our own infrastructure to manage the cluster and train massive models. There's how it works: You upload the dataset with preconfigured format into HuggingFaсe [1]. Choose your LLM (e.g. LLaMa 70B, Mistral 7B) Place your submission into the queue Wait for it to get trained. Then you get your trained model there on HuggingFace. Essentially, why would we want to do it? We already have an experience with training big LLMs. We could achieve near-perfect infrastructure performance for training. Sometimes GPUs have just nothing to train. Thus we thought it would be cool if we could utilize our GPU cluster 100%. And give back to Open Source community (already built an e2e distributed training framework [2]). This is in an early stage, so you can expect some bugs. Any thoughts, opinions, or ideas are quite welcome! [1]: https://github.com/higgsfield-ai/higgsfield/blob/main/tutori... [2]: https://github.com/higgsfield-ai/higgsfield submitted by /u/higgsfield_ai [link] [comments]  ( 9 min )
    [D] How does `@jax.jit` compare with PyTorch's `@torch.compile`?
    Have any of you seen benchmarks comparing the performance of `@jax.jit` and `@torch.compile`, especially when using functional PyTorch code? Are the performance differences big? Small? Do they depend a lot on what you're doing? submitted by /u/Llamas1115 [link] [comments]  ( 9 min )
  • Open

    Stable Baselines AssertionError: Your environment must inherit from the gymnasium.Env class cf. https://gymnasium.farama.org/api/env/
    Im trying to follow this tutorial https://www.youtube.com/watch?v=uKnjGn8fF70&t=1017s ​ AssertionError: Your environment must inherit from the gymnasium.Env class cf. https://gymnasium.farama.org/api/env/ ​ The error is on !pip install stable-baselines3[extra]from stable_baselines3.common.env_checker import check_envenv = PongEnv()check_env(env) ​ I tried to downgrade to gym v 0.21 but it did not build, outputRequirement already satisfied: setuptools==65.5.0 in /usr/local/lib/python3.10/dist-packages (65.5.0) Collecting gym==0.21 Using cached gym-0.21.0.tar.gz (1.5 MB) Preparing metadata (setup.py) ... done Requirement already satisfied: numpy>=1.18.0 in /usr/local/lib/python3.10/dist-packages (from gym==0.21) (1.23.5) Requirement already satisfied: cloudpickle>=1.2.0 in /usr/local/lib/python3.10/dist-packages (from gym==0.21) (2.2.1) Building wheels for collected packages: gym error: subprocess-exited-with-error × python setup.py bdist_wheel did not run successfully. │ exit code: 1 ╰─> See above for output. note: This error originates from a subprocess, and is likely not a problem with pip. Building wheel for gym (setup.py) ... error ERROR: Failed building wheel for gym Running setup.py clean for gym Failed to build gym ERROR: Could not build wheels for gym, which is required to install pyproject.toml-based projects I'd like to use the latest version, how can I correct check_env to work with 0.26 submitted by /u/Sharp-Cat2319 [link] [comments]  ( 1 min )
    State representation for a large dynamic state space
    I have a multi-object placement problem, as the title says the number of objects to be placed are dynamic, and the action to move can be taken on any of the objects. Each state has a rational score based on the reward function and the objects need to be placed in an optimal way whose rational score is high enough. The problem I am facing is about the representation of the state space, if I choose a simple table lookup the computations are extremely large as there are over a billion states for a simple model. My supervisor said about using functional approximations for states. Still I am very confused on how to do it, any help would be appreciated. Thanks. submitted by /u/RuthlessDaoist [link] [comments]  ( 1 min )
    Tensorflow DQN
    I am trying to train a TF DQN agent on a problem where the next state is not the observation of the current state, but rather a new state sampled from a dataset of states. I am currently returning an empty state after each step, would that be the ideal thing to do or is there another way of doing this? Is it even logical to train on states like this? submitted by /u/muzzez321 [link] [comments]  ( 1 min )
    "Hidden Incentives for Auto-Induced Distributional Shift", Krueger et al 202
    submitted by /u/gwern [link] [comments]  ( 1 min )
    PPO for maintenance planning
    Hi. I am currently trying to optimize a maintenance system using PPO. I have set my reward structure to: Action 0: reward = C1 * exp(-(C2 * t)) Action 1: reward = -C3 C1,2,3 are constants. t is number of time steps since action 1 was performed. When C3 is 0 the program performs as expected, but as I increase the penalty for action 1 the program performs it more frequently. As C3 increases, the program seemingly converges to performing random actions at every time step and racks up massive penalties. Does anyone know why I have this result or a solution to the problem? submitted by /u/SkivetOst [link] [comments]  ( 1 min )
    Offline Reinforcement Learning Resources
    I’m looking for some resources that can help me understand the practical aspects of implementing offline RL algorithms, such as data preprocessing, model selection, evaluation metrics, and debugging. Ideally, I would like to find some walkthroughs or tutorials that provide thorough code examples and explanations. Does anyone have any recommendations? Thanks in advance! submitted by /u/anointedninja [link] [comments]  ( 1 min )
    Finally a clear article on Terminated vs Truncated states
    I was treating them the same way for done in my replay buffer. I suppose a follow up question, is "timing out" not a failed state, if the agent needs to complete a task within a set number of steps? In this case truncated is not used and terminated can be set to True and punishing the agent. https://farama.org/Gymnasium-Terminated-Truncated-Step-API Edit: Last paragraph answers my question: Note that while finite horizon tasks end due to a time limit, this would be considered a termination since the time limit is built into the task. For these tasks, to preserve the Markov property, it is essential to add information about ‘time remaining’ in the state. For this reason, Gym includes a TimeObservation wrapper for users who wish to include the current time step in the agent’s observation. submitted by /u/_Linux_AI_ [link] [comments]  ( 1 min )
  • Open

    Google wants governments to form a 'global AI corps'
    Google is calling on governments to strengthen their AI training and skilling initiatives and develop a 'global AI corps'. The tech giant's white paper urges governments to expand certificate and skilling programs, build local research centers, and focus on channeling AI's potential benefits. Google also supports the idea of creating an 'AI education bill' to retrain and skill at least 1 million people, similar to the GI Bill for veterans. The paper is expected to guide Google's approach to regulatory talks in Washington. Google's white paper arrives as Senate lawmakers discuss boosting AI development while setting guardrails around its use. President Biden's AI executive order, which aims to guide the federal government's deployment of AI tools, is seen as having 'very promising elements' by Google. However, there are still questions about how federal agencies will write and implement guidelines around AI use. Source : https://www.washingtonpost.com/politics/2023/11/14/google-wants-governments-form-global-ai-corps/ submitted by /u/NuseAI [link] [comments]  ( 1 min )
    Clean up a scanned PDF
    I scanned an old sign in sheet with about 40 signatures and I need to clean it up a bit. At a glance, you can easily tell it's a copy just by the way the signatures look from being scanned & from being really old and I want to make it look as clean as possible. This is for archival / scrapbooking purposes. submitted by /u/rebelbaserec [link] [comments]  ( 1 min )
    AI website builder
    Hey Guys, What’s the best AI website builder platform to use to build a website from start to finish? Thanks! submitted by /u/SirCanIHelpYou [link] [comments]  ( 1 min )
    Deepmind create state-of-the-art model for fast and accurate 10-day weather predictions
    submitted by /u/fesenjoon [link] [comments]  ( 1 min )
    Is there a way for a Speech-to-Text model to differentiate between speakers?
    For instance, if I'm recording an interview between two people, and I have something like Whisper recording the discussion, can it break out the dialogue between the speakers? Seems like this would be a fairly simple feature, but I'm not sure if it exists. ​ Doesn't have to be Whisper per se, but is there a known S2T model or solution for this? submitted by /u/jrstelle [link] [comments]  ( 1 min )
    Approachable, but still technical, AI podcasts on generative AI techniques (e.g. like MLG)
    ​ Hey, I've been learning about AI techniques in my spare time, for both interest and work (I'm a university Prof working in HCI/interaction design and understanding the possibilities and limitations of AI -- and why, technically -- becoming an essential area of knowledge for me). I've found that the best way to learn complex techniques has been to read about them in books (perhaps not fully understanding the first time), listen to a podcast that explains them again (understanding more) and then read the same book chapter again (finally 'getting it'). Machine Learning Guide has been the perfect podcast for this. However, it doesn't cover the generative techniques I'm learning about now (e.g. VAE, GAN etc. etc. etc). I've tried looking for other podcasts, but everything I find seems to be at much more of a pop-science level: two folks having a coffee chat about general concepts and not how they work. Does anyone have any recommendations for AI podcasts similar to Machine Learning Guide but for generative techniques (or others like MLG in general, as I can't get enough!). Thanks! submitted by /u/jnthhk [link] [comments]  ( 1 min )
    When you are old
    submitted by /u/Sea_Permit5660 [link] [comments]  ( 1 min )
    Exploring AI Conversations in Nature: My Experience with ChatGPT Voice
    Recently, I embarked on an intriguing journey, blending the realms of artificial intelligence and the serenity of nature. I took a walk in the park, not alone, but accompanied by the voice of ChatGPT. For an hour, we conversed, not through the taps of a keyboard but through the simplicity of spoken words. This experience was not just a technological experiment, but a glimpse into the future of human-AI interaction. The fluidity and ease of conversation with ChatGPT's voice interface were remarkable. It felt like discussing with a knowledgeable friend, seamlessly blending into my natural environment. What struck me most was the versatility of AI in adapting to our daily routines. It's not just about screens and keyboards; it's about integrating intelligently into our lives, enhancing our experiences, and learning on the go. As we venture further into this AI-augmented era, I'm excited to see how these technologies will continue to evolve and intertwine with our daily lives. The potential for learning, sharing, and growing with AI as our companion is immense. submitted by /u/klevismiho [link] [comments]  ( 1 min )
    What AI tool is good as "dungeon master" for rol games nowadays?
    A few years ago, AI Dungeon used to be a very cool and interesting tool for this purpose, but very limited in many aspects. When ChatGPT was released, AI dungeon felt obsoleted. But then again, ChatGPT wasnt that good at remembering events from the past. Nowadays, if Im bored and want to have a quick campaing, what AI tool you guys recommend and what tips can you give me? Thank you! submitted by /u/Kastila1 [link] [comments]  ( 1 min )
    One-Minute Daily AI News 11/13/2023
    Google sues scammers spreading malware disguised as its Bard AI chatbot.[1] Nvidia on Monday introduced the latest version H200 of its main AI processor, aimed at meeting the compute needs of organizations with large AI workloads.[2] OpenAI recruiters are trying to lure Google AI employees with $10 million pay packets.[3] OpenAI chief seeks new Microsoft funds to build ‘superintelligence’.[4] Sources: [1] https://mashable.com/article/bard-ai-malware-scam-google-lawsuit [2] https://www.techtarget.com/searchenterpriseai/news/366559236/Nvidia-intros-new-H200-for-running-large-AI-workloads [3] https://www.businessinsider.com/openai-recruiters-luring-google-ai-employees-10-million-compensation-package-2023-11 [4] https://www.ft.com/content/dd9ba2f6-f509-42f0-8e97-4271c7b84ded submitted by /u/Excellent-Target-847 [link] [comments]  ( 1 min )
    look at this fox ai art i did with stable diffusion
    submitted by /u/marlon99rocks99 [link] [comments]  ( 1 min )

  • Open

    [Project] Which text representation technique to choose?
    I have a project to do on a movie review classifier, I'm leaning towards the "bag of words" technique, thoughts? submitted by /u/The_Middle_Child_ [link] [comments]  ( 9 min )
    [R] [ICLR] Is it okay to reference an answer to another reviewer in a reviewer's response?
    I am writing the responses to my reviewers for the ICLR. I had two reviewers asking a very similar question, and I'm wondering whether it is good to say to one of them something like "please refer to answer (4) in my response to reviewer xxxx" and then continue to tailor my answer to this reviewer, based on the reasoning offered in (4) for reviewer xxxx. Is this valid or should I just copy/paste the answer? submitted by /u/howtorewriteaname [link] [comments]  ( 9 min )
    [D] How are Statistics Journals Viewed in Machine Learning Community for Research Jobs?
    Hi all, I'm a statistics PhD researching statistical machine learning (think Gaussian Processes). I have a few papers in the pipeline at solid statistics publications (Journal of the American Statistical Association/Journal of Computational and Graphical Statistics). Do these journals have any weight/name recognition in the ML world or are they unknown/chopped liver? Thanks! submitted by /u/herodotick [link] [comments]  ( 9 min )
    [R] JARVIS-1: Open-world Multi-task Agents with Memory-Augmented Multimodal Language Models
    Paper: https://arxiv.org/abs/2311.05997 Blog: https://craftjarvis-jarvis1.github.io/ Abstract: Achieving human-like planning and control with multimodal observations in an open world is a key milestone for more functional generalist agents. Existing approaches can handle certain long-horizon tasks in an open world. However, they still struggle when the number of open-world tasks could potentially be infinite and lack the capability to progressively enhance task completion as game time progresses. We introduce JARVIS-1, an open-world agent that can perceive multimodal input (visual observations and human instructions), generate sophisticated plans, and perform embodied control, all within the popular yet challenging open-world Minecraft universe. Specifically, we develop JARVIS-1 on top of pre-trained multimodal language models, which map visual observations and textual instructions to plans. The plans will be ultimately dispatched to the goal-conditioned controllers. We outfit JARVIS-1 with a multimodal memory, which facilitates planning using both pre-trained knowledge and its actual game survival experiences. In our experiments, JARVIS-1 exhibits nearly perfect performances across over 200 varying tasks from the Minecraft Universe Benchmark, ranging from entry to intermediate levels. JARVIS-1 has achieved a completion rate of 12.5% in the long-horizon diamond pickaxe task. This represents a significant increase up to 5 times compared to previous records. Furthermore, we show that JARVIS-1 is able to self-improve following a life-long learning paradigm thanks to multimodal memory, sparking a more general intelligence and improved autonomy. submitted by /u/MysteryInc152 [link] [comments]  ( 9 min )
    [D] Tutorials on Generative Adversarial Networks ?
    I know there are plenty of tutorials online, but most of them just skim through the topic and don't have much in detail, some of them even contradict each other. I have read the original paper, but very technical and difficult to understand as a newbie. Are there any tutorials that explain GANs in details but are still newbie-friendly? Thanks submitted by /u/Electronic_Ant2706 [link] [comments]  ( 9 min )
    [D] Course Enrollment
    Hello, This is my first time posting so here goes nothing. I want to train a model to recommend to university students which courses to enroll in for their next semester, I have all the data I need but I don’t know how to structure it to start, I have 3 different data sheets containing what the students have enrolled in with their grades for each semester(they are all studying the same major), their high school finals grades ,and the last sheet is basically the study plan showing each course and what are it’s prerequisites and like how many credit hours it counts for. My question is how do I recommend the students which courses to recommend if they give me an input of how many credit hours they want to enroll in and also have it be personalized, any information would be helpful. submitted by /u/Between5and20 [link] [comments]  ( 9 min )
    [R] Simplifying Transformer Blocks
    Paper: https://arxiv.org/abs/2311.01906 GitHub: https://github.com/bobby-he/simplified_transformers Abstract: A simple design recipe for deep Transformers is to compose identical building blocks. But standard transformer blocks are far from simple, interweaving attention and MLP sub-blocks with skip connections & normalisation layers in precise arrangements. This complexity leads to brittle architectures, where seemingly minor changes can significantly reduce training speed, or render models untrainable. In this work, we ask to what extent the standard transformer block can be simplified? Combining signal propagation theory and empirical observations, we motivate modifications that allow many block components to be removed with no loss of training speed, including skip connections, projection or value parameters, sequential sub-blocks and normalisation layers. In experiments on both autoregressive decoder-only and BERT encoder-only models, our simplified transformers emulate the per-update training speed and performance of standard transformers, while enjoying 15% faster training throughput, and using 15% fewer parameters. https://preview.redd.it/6pz53ro6260c1.png?width=1129&format=png&auto=webp&s=ce9dad8de6cac575970e38d93a35786fe8880506 submitted by /u/APaperADay [link] [comments]  ( 9 min )
    [P] Fraud Detection | e-commerce
    I am building a system to separate fraudulent transactions, these will be then manually verified, helping me in turn build a labelled dataset over time. For now I have transaction data and customer behavior Information. I intend to do this like this:The possible fraud cases include:- Abuse of all cashbacks and discounts ( Coupons / Vouchers / Auto Refund)- Retailing - Acquiring sensitive SKUs Since I don't really have labelled data, I am going with the unsupervised learning approach (isolation forest).I plan on having 3 modules : Users, SKUs, Localities For the last 2 I am suffering with setting meaningful thresholds, I standardized slope of sales trend and intercept, then divided them to get a compound variable which I am using to compare. Please share thoughts and or Resources. submitted by /u/common_happen143 [link] [comments]  ( 9 min )
    [D] Data visualization tool as local server
    Hello community! I am reading some of the mechanistic interpretability papers and I am pretty interested in this field and wants to get my hand on. This field, possibly with many other fields that involve explanation/interpretability, requires a ton of visualization, of weights, of activation, of features etc. Thus I think a more tailored tool than plain plotly/matplotlib/seaborn is needed. What I am looking for is a tool that can deploy as a local server so that I can seamlessly navigate through a LOT of visualizations, ideally also able to interact like plotly. I can for example save a ~100MB packed image via matplotlib, but opening and viewing such a giant image is a pain and I lose interactivity. One example is openai's microsocpe, where you can view each weight of various models, as well as feature visualization of the weight. But of course for my own purpose it need not to be that fancy. As someone who does not know a lot on data visualization beyond matplotlib, I really appreciate any pointers! Thanks! submitted by /u/tt19234 [link] [comments]  ( 9 min )
    Deployment of RVC pipeline to Sagemaker Serverless Endpoint [P]
    Really struggling to make this work. Does anyone have any insights? Would happily pay good money for some guidance. submitted by /u/torstah [link] [comments]  ( 9 min )
    [Research] Seeking Structured Strategies for Neural Network Design
    Hi everyone, I'm currently a PhD student diving into how transformer models can be applied to biological sequence data. My task involves iterating on a deep-learning base model to improve its accuracy. However, I seem stuck in a trial-and-error loop without a clear-cut strategy. I'm reaching out to see if there's a more systematic approach to tweaking neural network architectures that are commonly accepted in the field. Are there broad-stroke principles or frameworks that you rely on, regardless of the project specifics? Any pointers on more structured methods or resources for guiding these decisions would be super helpful. submitted by /u/palset [link] [comments]  ( 9 min )
    [D] Is there a good tool for inspecting a video dataset?
    Hi I'm creating a video dataset composed of small clips and i would like to use a tool for inspecting videos, that is, something in which i can easily play them and delete them directly or send them to another folder. Any ideas or suggestions? submitted by /u/aendrs [link] [comments]  ( 9 min )
    [D] Should Tensorflow library builds/contributions be this cumbersome?
    I am working with Edge AI, and have work projects that are using TFLM to run networks on microcontrollers. I have found that for some of the optimizations, such as operator fusion of custom activations, I need to adjust the Tensorflow library logic such as MLIR/TFL-graph-builder in order to export a .tflite execution graph that can be interpreted by the microcontroller library(TFLM/CMSIS/X-CUBE-AI/TinyEngine/TVM). This requires me to recompile the tensorflow repo locally using Bazel after every addition to the library. This takes from 20 min to 5 hours depending on what is cached. Now sure, the whole thing does not have to be compiled from scratch every time, but this feels like such a cumbersome and slow development process. Is this the reality of it, or am I missing something? submitted by /u/SlayahhEUW [link] [comments]  ( 9 min )
    [D] Don't know what to do
    Hello, I'm a junior developer at a small company. I'm in a tough situation and looking for opinions. The company hired me for an ML/data-oriented position, basically a backend developer with ML/data focus. Currently, I'm tasked with a full computer vision project, which means that I have to do almost everything on my own: object detection, analysis, database storage, data extraction, etc., all in real-time. I received a subpar laptop, making testing and regular work difficult. Despite limited support from senior developers, the project works, but not flawlessly. How can they expect me to build this? I've asked for help, but the response is delayed, and there's no mentoring. Recently, I expressed my struggle, and they questioned why it's not finished. I fear they might let me go. Some senior friends of mine advised me to leave. I'm unsure; any advice? Also, my salary is 50k which as I know, rather low. submitted by /u/windonwind [link] [comments]  ( 9 min )
    [R] A Practical Approach to Novel Class Discovery in Tabular Data
    Detecting new, unlabeled data classes is a big opportunity in AI, especially for models trained on known, labeled data. This challenge is crucial in a world where data is constantly evolving. A recent paper takes a novel approach to this. They use known class labels to help cluster unlabeled data into new categories. They test 3 methods - extending k-means and spectral clustering to leverage known data, and a custom projection-based network. The projection network avoids overfitting and works best overall, even when the number of novel classes is uncertain. It outperforms standard techniques and a more complex benchmark model called TabularNCD. This research shows that novel class discovery in tabular data is feasible without unrealistic access to novel class information during training. While there are limitations, it's progress in an important open problem. Applications could range from identifying new categories of census data and insurance claims to customer segments over time. The techniques might even extend to time series or graph data. Overall, this paper moves us closer to more flexible, general AI that can adapt and learn in an open-ended, lifelong manner. TLDR: AI can discover hidden relationships in tabular data Full summary here. Paper is here. submitted by /u/Successful-Western27 [link] [comments]  ( 9 min )
    ML course on Udemy - Opinion [D]
    Hi everyone, I am privately very interested in this field and have a very basic understanding of how it kind of works, but now I really want to make the move and actually learn this. I have found a bunch of courses on Udemy and bought a course on black Friday called "Complete Machine Learning and Data Science Bootcamp 2023" by Andrei Neagoie How legit or useful is this course? It's a hands-on boot camp with practical learning of Python basics and understanding its depth. It has a total of 44 hours, and I really like it so far but also want some guidance on how to get into this whole field. Any recommendation is very welcome, thanks! :) submitted by /u/TheOceanIAm [link] [comments]  ( 9 min )
    [D] Machine Learning/AI companies that sponsor or fund non-profits
    Hello, I work at a non-profit that is aimed for machine learning and/or related grad students in developing countries. We are based in North América and currently looking for sponsorships, preferably from a company with the same goals in machine learning and artificial intelligence that we have. Do you know of any companies with a CSR (corporate social responsibility) program? Thank you! (I'm not mentioning the non-profit as I'm worried the mods will take it down as self-promotion but if anyone wants me to share please let me know!) submitted by /u/glitteryCranberry [link] [comments]  ( 9 min )
    [P] Whisper large-v3 API
    I'm working on making it easier to use open-source AI models by providing simple APIs. Last week, OpenAI released Whisper v3, which showed improved performance across all its 100+ supported languages. We tried to optimize it as much as possible to be able to offer it for $0.0028 per minute (vs $0.006 on OpenAI). I hope it's helpful for someone: https://www.lemonfox.ai/apis/speech-to-text submitted by /u/ingojoseph [link] [comments]  ( 9 min )
    Data Scientist to Machine Learning Engineer - Do I Need a Portfolio [D]
    With 7+ years of experience in data science and data analytics roles, would it be necessary to create a portfolio to begin applying to Machine Learning Engineer positions? If I already have the experience using sklearn/tf/keras/etc. in my current role of >2 years, how important is a portfolio when applying to these MLE roles? submitted by /u/Fiboniz [link] [comments]  ( 9 min )
    [P] Need advice on building a YOLOv8 + SAHI program
    Need advice on building and object detection program. I plan to use YOLOv8 with SAHI (slicing aided hyper inference) on RTSP stream of an IP camera. The detection stream has to be saved in realtime. I have a Jetson Orin AGX 64gb to utilize the NVDEC (HW engine) to decode the h.265 video. What is the easiest possible way to achieve this. Thank you very much for your advice. submitted by /u/These_Consequence_52 [link] [comments]  ( 9 min )
    [D] Gen-AI/LLM - Interview prep
    Hello folks, I have an interview call later this week which the work is regarding implementing generative AI within the companies workflow. Using LLMs with finetuning/in-context learning using system logs etc kind of stuff. I have studied machine learning, worked for few years now as well. Have good understanding of those stuff but never tried fine tuning hands-on. I'm worked majority into computer-vision applications but think that I lagged a bit on the LLM side. Any suggestions, recommeded papers, courses, videos I could go through? Thanks! submitted by /u/ade17_in [link] [comments]  ( 9 min )
    [D] Logit Laplace Reconstruction Loss
    Was reading through the original DALL-E paper (https://arxiv.org/pdf/2102.12092.pdf) and found their appendix A.3 pretty interesting. The motivation behind the logit Laplace loss totally makes sense, but if you plot the negative log of the PDF of their logit Laplace function, it's not strictly nonnegative as a function of x. Does that mean the authors are allowing for a potentially negative loss with this reformulation? submitted by /u/clywac2 [link] [comments]  ( 9 min )
    [R] P-MMF: Provider Max-min Fairness Re-ranking in Recommender System
    In this paper, we address the issue of recommending fairly from the aspect of providers, which has become increasingly essential in multistakeholder recommender systems. Existing studies on provider fairness usually focused on designing proportion fairness (PF) metrics that first consider systematic fairness. However, sociological researches show that to make the market more stable, max-min fairness (MMF) is a better metric. The main reason is that MMF aims to improve the utility of the worst ones preferentially, guiding the system to support the providers in weak market positions. When applying MMF to recommender systems, how to balance user preferences and provider fairness in an online recommendation scenario is still a problem. The central question revolves around the type of fairness that holds significance within recommender systems and other machine-learning systems. Experimental results also show that P-MMF can retain small computational costs on a corpus with the large number of items. This paper was accepted by TheWebConf 2023 and awarded as SpotLight-Best Paper Candidate (https://dl.acm.org/doi/abs/10.1145/3543507.3583296). The central question revolves around the type of fairness that holds significance among different stakeholders in recommender systems and other machine-learning systems. ​ ​ Two types of fairness ​ Harm for unfairness ​ MMF and PF ​ P-MMF can pareto-dominate baselines ​ ​ submitted by /u/XuChen0427 [link] [comments]  ( 9 min )
    [R] Data Filtering Networks
    Paper: https://arxiv.org/abs/2309.17425 Models: https://github.com/mlfoundations/open_clip Results table: https://github.com/mlfoundations/open_clip/blob/main/docs/openclip_results.csv X thread: https://twitter.com/gabriel_ilharco/status/1721905861369762096 Abstract: Large training sets have become a cornerstone of machine learning and are the foundation for recent advances in language modeling and multimodal learning. While data curation for pre-training is often still ad-hoc, one common paradigm is to first collect a massive pool of data from the Web and then filter this candidate pool down to an actual training set via various heuristics. In this work, we study the problem of learning a data filtering network (DFN) for this second step of filtering a large uncurated dataset. Our key finding is that the quality of a network for filtering is distinct from its performance on downstream tasks: for instance, a model that performs well on ImageNet can yield worse training sets than a model with low ImageNet accuracy that is trained on a small amount of high-quality data. Based on our insights, we construct new data filtering networks that induce state-of-the-art image-text datasets. Specifically, our best performing dataset DFN-5B enables us to train state-of-the-art CLIP models for their compute budgets: among other improvements on a variety of tasks, a ViT-H trained on our dataset achieves 84.4% zero-shot transfer accuracy on ImageNet, out-performing models trained on other datasets such as LAION-2B, DataComp-1B, or OpenAI's WIT. In order to facilitate further research in dataset design, we also release a new 2 billion example dataset DFN-2B and show that high performance data filtering networks can be trained from scratch using only publicly available data. ​ submitted by /u/APaperADay [link] [comments]  ( 9 min )
  • Open

    Live AI Course (Sailea Non-profit) (Forever Free Course)
    >Sailea is a student run non-profit that does not charge for any of its services Join the Second lesson of SAILea’s course on the Principals of AI! 🌳 Covers: Decision Trees 🗓️ November 18th ⏰ 7:00-8:00PM EST Why Sailea? Only course targeted at high schoolers Free Forever Join Us Now! 👉 (signup form) https://docs.google.com/forms/d/e/1FAIpQLSfQGCeZClTdF6zeIQ-RtbOGR582bb1slc3oR0zG2J7j1v5RHg/viewform?usp=sf_link 🌳 Register today, get involved in the community and grow your knowledge! submitted by /u/Envoy-Insc [link] [comments]  ( 9 min )
    OpenAI New Products: A Deep Dive
    submitted by /u/Overflame [link] [comments]  ( 1 min )
    🪷Chinese Perspectives on OpenAI DevDay, NVIDIA‘s New AI Chips for China, and Zen Buddhism Meets AI
    submitted by /u/trcytony [link] [comments]  ( 1 min )
    Harry Potter 80s sitcom (deepfake)
    submitted by /u/vaddymusic [link] [comments]  ( 1 min )
    AI Is a Terrifying Purveyor of Bullshit. Next Up: Fake Science
    AI tools like chatGPT and Google's AI tool are dangerous as they can generate misinformation and disinformation, making it difficult to determine what is true. These tools can invent thoughts and create fake stories that seem plausible. AI models like ChatGPT and GPT-4 ADA can generate persuasive but false statements and even create fake datasets to support preordained conclusions. This poses a threat to scientific research and the ability to distinguish between real and fake information. The proliferation of AI-generated content makes it increasingly challenging to parse authentic information from AI-generated bullshit. Source : https://www.lastwordonnothing.com/2023/11/10/ai-is-a-terrifying-purveyor-of-bullshit-next-up-fake-science/ submitted by /u/NuseAI [link] [comments]
    Ai that automatically finds potentially interesting parts of video
    I've seen somewhere in tiktok or yt shorts a guy talking about an ai website in which you paste a yt link and ai automatically finding potentially interesting parts of video which you then can split and make short tiktok from it. Is it possible to find this website? Tried searching on yt, ig, tiktok submitted by /u/dusy4 [link] [comments]  ( 9 min )
    AI in Medicine: how should it be regulated? What should states do?
    Do you have any policy proposals you think would help ai in medicine (aim) to be safer, less biased, better in any way? What would you do if you could impact state level policy on aim? submitted by /u/mbfunke [link] [comments]  ( 9 min )
    AI Headshots
    I am looking for a free ai professional headshot for my LinkedIn page. I just need one picture and I don’t want to spend any money because. I’m low on funds. Any advice would be great! Thanks submitted by /u/tyrannosaurwes [link] [comments]  ( 1 min )
    Will user agents help with integrating all niche radiology tools to form a more copilot experience?
    https://python.langchain.com/docs/modules/agents/ There are a lot of niche AI programs that outperform radiologists in very specific tasks. But no one program that integrates them all. I wonder if a user agent can help with this since the reasoning engine of a LLM can determine when to use what tool. submitted by /u/derpgod123 [link] [comments]  ( 9 min )
    Looking for paper: generalization of LLM as probabilistic generation of tokens + validation
    I can't find the paper but it's your of important to me. It had the following content; * theoretical considerations of viewing LLM as way to probabilistic generate tokens * data flow of tokens was also considered, realized with nodes which can generate tokens or validate them * generalization of various "agentic" papers which did validation of result with LLM as data flow nodes to validate I can remember two papers which did this sort of thing. I can't find neither of them. I am looking for the one paper which really did use nodes. submitted by /u/squareOfTwo [link] [comments]  ( 9 min )
    Pokemon Mecha
    submitted by /u/Sea_Permit5660 [link] [comments]  ( 1 min )
    Will Grok overrun chatGPT?
    We all saw Grok and its okayish. Do you think it'll get considerably better taking into account elon musk's past exploits? submitted by /u/redblade678 [link] [comments]
    One-Minute Daily AI News 11/12/2023
    Nvidia CEO Jensen Huang says his AI powerhouse is ‘always in peril’ despite a $1.1 trillion market cap: ‘We feel it’.[1] OpenAI Announces New Data Partnerships to Enhance AI Training.[2] AI Could Make Grand Theft Auto NPCs ‘Really Interesting and Fun’, Says Take-Two CEO.[3] In a significant industry shift, Baidu Inc. reportedly ordered AI chips from Huawei moving away from long-time supplier Nvidia.[4] Sources: [1] https://finance.yahoo.com/news/nvidia-ceo-jensen-huang-says-225059737.html [2] https://innovation-village.com/openai-announces-new-data-partnerships-to-enhance-ai-training/ [3] https://www.pushsquare.com/news/2023/11/ai-could-make-grand-theft-auto-npcs-really-interesting-and-fun-says-take-two-ceo [4] https://www.benzinga.com/news/23/11/35636991/blow-for-nvidia-baidu-reportedly-switches-to-huawei-for-ai-chips-ditching-its-long-time-supplier submitted by /u/Excellent-Target-847 [link] [comments]
  • Open

    Need help training a PPO NN to learn how to play my deckbuilding game
    (Posted originally in r/artificial, but didn't get any responses there; so I'm trying here, which seems much more relevant :) I have a roguelike deckbuilding game I want to train an agent to play using pure unsupervised RL; I chose PPO as I understand (to my amateur knowledge) that is the most fitting algorithm. I have a sizeable categorical space to send in (basically, what cards are in the deck and which cards are being offered to pick), and I need the agent to learn the best picks. I attempted to use an embedding layer and input the cards the player has + given cards + numerical data (added to the embedding output). I tried playing around with various hyperparameters, but so far, I have not been able to generate any learning. (I posted originally in d be greatly, but I didn't get any responses there, so I'm trying here, which seems much more relevant :) submitted by /u/Jagerjj [link] [comments]  ( 1 min )
    How to stabilize Pybullet environment
    Hello everyone I'm quite new to pybullet, and working with a URDF file i found only to control the hexapods shoulders and biceps, however the simulator is acting really funky. The bot is flying up a lot into the air, skittering on the ground. I wanted to reach out here and see if anyone knows the most important parameters to tinker with to fix this Here is a video of what im talking about https://youtu.be/DrOKz9xs2I8 any tips and tricks would be greatly appreciated submitted by /u/bbri826 [link] [comments]  ( 9 min )
    PPO agent not learning
    Do have a go at the problem. ​ I have a custom Boid flocking environment in OpenAI Gym using PPO from StableBaselines3. I wanted it to achieve flocking similar to Reynold's model(Video) or close enough, but it isn't learning. ​ I have adjusted the calculate_reward my model uses to be similar but not seeing any apparent improvement. ​ Reynold's model equations: Reynold's Model ​ My results after 100000 timesteps of training: My result so far: https://drive.google.com/file/d/1jAlGrGmpt2nUspBtoZcN7yJLHFQe4CAy/view?usp=drive_link ​ TensorBoard Graphs TensorBoard Reward Function def calculate_reward(self): total_reward = 0 cohesion_reward = 0 separation_reward = 0 collision_penalty = 0 velocity_matching_reward = 0 for agent in self.agents: for other in self.agents: if agent != other: distance = np.linalg.norm(agent.position - other.position) # if distance <= 50: # cohesion_reward += 5 if distance < SimulationVariables["NeighborhoodRadius"]: separation_reward -= 100 velocity_matching_reward += np.linalg.norm(np.mean([other.velocity for other in self.agents], axis=0) - agent.velocity) if distance < SimulationVariables["SafetyRadius"]: collision_penalty -= 1000 total_reward = separation_reward + velocity_matching_reward + collision_penalty # print(f"Total: {total_reward}, Cohesion: {cohesion_reward}, Separation: {separation_reward}, Velocity Matching: {velocity_matching_reward}, Collision: {collision_penalty}") return total_reward, cohesion_reward, separation_reward, collision_penalty Complete code: Code P.S ANY help is appreciated, I have tried different approaches but the level of desperation is increasing lol. submitted by /u/Sadboi1010 [link] [comments]  ( 9 min )
  • Open

    Simplifying Transformer Blocks
    Paper: https://arxiv.org/abs/2311.01906 GitHub: https://github.com/bobby-he/simplified_transformers Abstract: A simple design recipe for deep Transformers is to compose identical building blocks. But standard transformer blocks are far from simple, interweaving attention and MLP sub-blocks with skip connections & normalisation layers in precise arrangements. This complexity leads to brittle architectures, where seemingly minor changes can significantly reduce training speed, or render models untrainable. In this work, we ask to what extent the standard transformer block can be simplified? Combining signal propagation theory and empirical observations, we motivate modifications that allow many block components to be removed with no loss of training speed, including skip connections, projection or value parameters, sequential sub-blocks and normalisation layers. In experiments on both autoregressive decoder-only and BERT encoder-only models, our simplified transformers emulate the per-update training speed and performance of standard transformers, while enjoying 15% faster training throughput, and using 15% fewer parameters. https://preview.redd.it/eze7pk3lu00c1.png?width=1129&format=png&auto=webp&s=9a1ecebee5a98ab385e69c619c1f91bed2985236 submitted by /u/APaperADay [link] [comments]  ( 9 min )
    Generative vs. Discriminative AI
    submitted by /u/Neurosymbolic [link] [comments]  ( 9 min )

  • Open

    [D] What do you think about representing each neuron and each synapse as a small neural network that will form a larger neural network?
    For example, as input data, the synapse neural network will receive a forward propagation vector from the pre-neuron and a back propagation vector from the post-neuron. I think a model with a this approach can be taught to form short-term and long-term memory. Maybe we can use the output of the last gpt block to train this model. submitted by /u/Silly_Commercial_514 [link] [comments]  ( 9 min )
    [D] A question about creating an integrated platform to make predictions using big bio-data.
    Hi Folks, As a bench scientist, programming and ML are all Greek to me. But, I frequently use several bioinformatics tools on webpages such as BLAST, Gene ontology, AlphaFold and PTM prediction tools to inquiry any gene of my interest in a research context. I intend to create a software run by AI/ML-assisted text mining/natural language processing (NLP) algorithms (!) to be able to customize and integrate the separate prediction tools in one place. My overall goal is to get more accurate predictions to design my wetlab experiments. It might be either website or installable program app format utilizing the publicly available transcriptome and proteome data to generate such predictions. I’m wondering what kind of skill set this project requires to create such tool embedded in a website or downloadable software application? I’m planning to reach out to a few experts to get their help and guidance, but I don’t know which field guys would be correct people to talk about this project, like AI expert, bioinformaticians, software developer or ML guys. I even don’t know whether such tool needs AI and ML or text mining algorithms, but it seems like AI may help putting all the data together in a format that I can understand. Any advise would be greatly appreciated. Thanks. submitted by /u/VolSapiens [link] [comments]  ( 9 min )
    is it feasible to implement the FAISS library? [D]
    I wanted to use the FAISS library on a RAM-constrained system with vector databases of size around 5 GB (about 20 million vectors where every vector length is 71), but the system has a lot of ROM (more than 500 GB). but the problem with the FAISS library is that it creates everything in the RAM. Is there a solution to my problem or shall I implement the FAISS library for my needs (I shall deal with text files to be my database where every node connection to another is based on id)? submitted by /u/abdosalm [link] [comments]  ( 9 min )
    [D] Run Pytorch model inference on Microcontroller
    I am currently researching ways to export models that I trained with Pytorch on a GPU to a microcontroller for inference. Think CM0 or a simple RISC-V. The ideal workflow would be to export c-sourcecode with as few dependencies as possible, so that it is completely platform agnostic. What I noticed in general is that most edge inference frameworks are based on tensorflow lite. Alternatively there are some closed workflows, like Edge Impulse, but I would prefer locally hosted OSS. Also, there seem to be many abandoned projects. What I found so far: Tensorflow lite based Tensorflow lite TinyEngine from MCUNet. Looks great, targeting ARM CM4. CMSIS-NN. ARM centric. Examples. They also have an example for a pytorch to tflite converter via onnx TinyMaix. Very minimalistic, can also be used on RISC-V nnom Pytorch based PyTorch Edge / Executorch Sounds like this could be a response to tflite, but it seems to target intermediate systems. Runtime is 50kb... microTVM. Targeting CM4, but claims to be platform agnostic. ONNX DeepC. Open source version of DeepSea. Very little activity, looks abandoned onnx2c - onnx to c sourcecode converter. Looks interesting, but also not very active. cONNXr - framework with C99 inference engine. Also interesting and not very active. Are there any recommendations out of those for my use case? Or anything I have missed? It feels like there no obvious choice for what I am trying to do. Most solutions that seem to hit the mark look rather abandoned. Is that because I should try a different approach or is the field of ultra-tiny-ml OSS in general not so active? submitted by /u/cpldcpu [link] [comments]  ( 9 min )
    [D] Deep Learning with GTX 1660 Super
    Did anyone use Tensorflow / Pytorch on GTX 1660 Super? Do these libraries work on this GPU perfectly? submitted by /u/Abu_BakarSiddik [link] [comments]  ( 9 min )
    [P] Python library for models or methods from research papers
    I have been trying to apply methods / models from research papers lately. After some thoughts, I think it can be shared as python package. Please give it a try if it suits your needs. Github submitted by /u/Ellzaf [link] [comments]  ( 9 min )
    "[Project]" Using AI to enhance drawing
    Hi, I am an industrial design student and as part of my project, I need to make an AI that enhances drawing. It is not expected of me to make something from scratch I can use already existing models. However, I need to add something to it of course. I don’t have sufficient knowledge in programming to get started and I could use any advice. I want to spend time on this but i would go with a simple option as I don’t have time to learn everything. I would be super happy if someone could help submitted by /u/burqeon [link] [comments]  ( 9 min )
    [R] UNINEXT : Universal Instance Perception as Object Discovery and Retrieval
    submitted by /u/iFighting [link] [comments]  ( 9 min )
  • Open

    Create Your Custom GPT (Step by Step Guide)
    submitted by /u/Senior_tasteey [link] [comments]  ( 9 min )
    Exclusive: Google in talks to invest in AI startup Character.AI
    submitted by /u/donutloop [link] [comments]  ( 9 min )
    Nvidia CEO says his AI powerhouse is ‘always in peril’ | Fortune
    . submitted by /u/AminoOxi [link] [comments]  ( 9 min )
    AI generator or chatbot
    AI Story Generator free to inexpensive Ai story writer Not sure if a similar question has been asked, but I have been testing out a few different AI story generator/chat bots and I am having trouble finding one that fit my needs. First need, I write psychological thrillers and Dark Romances. The ones I have found do not write violence or sexual scenes. Most of my stories are about murder, carnage and finding a semblance of twisted love along the way. Second need most I have found, even for there "novel chatbot/generator" top out at about 20k words. My novels are between 80k and 100k So I guess what I am asking is does anyone know of an AI writer or chatbot that will write for all genres and scenes that top out at more then 100k words? bonus if it has a structure to assist with multi book series submitted by /u/urLordV [link] [comments]  ( 9 min )
  • Open

    Problems with simple number forecast problem
    Hi, I want to get some knowledge in reinforcement learning and after reading the standard books try some practical things. Something which should be really easy to implement is a matching algorithm, where the action should match the numerical value of the last observation. However I have some troubles with implementing it with RLLib, the values seem to be rather random. Does anyone have an idea what is wrong here: https://github.com/RandomGitUser123/RL/blob/main/RLLibSimpleTest.ipynb submitted by /u/WeirdWirdo [link] [comments]  ( 9 min )

  • Open

    Implementing Gradient Descent in PyTorch
    The gradient descent algorithm is one of the most popular techniques for training deep neural networks. It has many applications in fields such as computer vision, speech recognition, and natural language processing. While the idea of gradient descent has been around for decades, it’s only recently that it’s been applied to applications related to deep […] The post Implementing Gradient Descent in PyTorch appeared first on MachineLearningMastery.com.  ( 25 min )

  • Open

    Training a Linear Regression Model in PyTorch
    Linear regression is a simple yet powerful technique for predicting the values of variables based on other variables. It is often used for modeling relationships between two or more continuous variables, such as the relationship between income and age, or the relationship between weight and height. Likewise, linear regression can be used to predict continuous […] The post Training a Linear Regression Model in PyTorch appeared first on MachineLearningMastery.com.  ( 24 min )
    Making Linear Predictions in PyTorch
    Linear regression is a statistical technique for estimating the relationship between two variables. A simple example of linear regression is to predict the height of someone based on the square root of the person’s weight (that’s what BMI is based on). To do this, we need to find the slope and intercept of the line. […] The post Making Linear Predictions in PyTorch appeared first on MachineLearningMastery.com.  ( 21 min )

  • Open

    Loading and Providing Datasets in PyTorch
    Structuring the data pipeline in a way that it can be effortlessly linked to your deep learning model is an important aspect of any deep learning-based system. PyTorch packs everything to do just that. While in the previous tutorial, we used simple datasets, we’ll need to work with larger datasets in real world scenarios in […] The post Loading and Providing Datasets in PyTorch appeared first on MachineLearningMastery.com.  ( 20 min )

  • Open

    Using Dataset Classes in PyTorch
    In machine learning and deep learning problems, a lot of effort goes into preparing the data. Data is usually messy and needs to be preprocessed before it can be used for training a model. If the data is not prepared correctly, the model won’t be able to generalize well. Some of the common steps required […] The post Using Dataset Classes in PyTorch appeared first on MachineLearningMastery.com.  ( 21 min )

  • Open

    Calculating Derivatives in PyTorch
    Derivatives are one of the most fundamental concepts in calculus. They describe how changes in the variable inputs affect the function outputs. The objective of this article is to provide a high-level introduction to calculating derivatives in PyTorch for those who are new to the framework. PyTorch offers a convenient way to calculate derivatives for […] The post Calculating Derivatives in PyTorch appeared first on Machine Learning Mastery.  ( 20 min )

  • Open

    Two-Dimensional Tensors in Pytorch
    Two-dimensional tensors are analogous to two-dimensional metrics. Like a two-dimensional metric, a two-dimensional tensor also has $n$ number of rows and columns. Let’s take a gray-scale image as an example, which is a two-dimensional matrix of numeric values, commonly known as pixels. Ranging from ‘0’ to ‘255’, each number represents a pixel intensity value. Here, […] The post Two-Dimensional Tensors in Pytorch appeared first on Machine Learning Mastery.  ( 21 min )

  • Open

    One-Dimensional Tensors in Pytorch
    PyTorch is an open-source deep learning framework based on Python language. It allows you to build, train, and deploy deep learning models, offering a lot of versatility and efficiency. PyTorch is primarily focused on tensor operations while a tensor can be a number, matrix, or a multi-dimensional array. In this tutorial, we will perform some […] The post One-Dimensional Tensors in Pytorch appeared first on Machine Learning Mastery.  ( 22 min )

  • Open

    365 Data Science courses free until November 21
    Sponsored Post   The unlimited access initiative presents a risk-free way to break into data science.     The online educational platform 365 Data Science launches the #21DaysFREE campaign and provides 100% free unlimited access to all content for three weeks. From November 1 to 21, you can take courses from renowned instructors and earn […] The post 365 Data Science courses free until November 21 appeared first on Machine Learning Mastery.  ( 15 min )

  • Open

    Attend the Data Science Symposium 2022, November 8 in Cincinnati
    Sponsored Post      Attend the Data Science Symposium 2022 on November 8 The Center for Business Analytics at the University of Cincinnati will present its annual Data Science Symposium 2022 on November 8. This all day in-person event will have three featured speakers and two tech talk tracks with four concurrent presentations in each track. The […] The post Attend the Data Science Symposium 2022, November 8 in Cincinnati appeared first on Machine Learning Mastery.  ( 10 min )

  • Open

    My family's unlikely homeschooling journey
    My husband Jeremy and I never intended to homeschool, and yet we have now, unexpectedly, committed to homeschooling long-term. Prior to the pandemic, we both worked full-time in careers that we loved and found meaningful, and we sent our daughter to a full-day Montessori school. Although I struggled with significant health issues, I felt unbelievably lucky and fulfilled in both my family life and my professional life. The pandemic upended my careful balance. Every family is different, with different needs, circumstances, and constraints, and what works for one may not work for others. My intention here is primarily to share the journey of my own (very privileged) family. Our unplanned introduction to homeschooling For the first year of the pandemic, most schools in California, where …  ( 7 min )

  • Open

    The Jupyter+git problem is now solved
    Jupyter notebooks don’t work with git by default. With nbdev2, the Jupyter+git problem has been totally solved. It provides a set of hooks which provide clean git diffs, solve most git conflicts automatically, and ensure that any remaining conflicts can be resolved entirely within the standard Jupyter notebook environment. To get started, follow the directions on Git-friendly Jupyter. Contents The Jupyter+git problem The solution The nbdev2 git merge driver The nbdev2 Jupyter save hook Background The result Postscript: other Jupyter+git tools ReviewNB An alternative solution: Jupytext nbdime The Jupyter+git problem Jupyter notebooks are a powerful tool for scientists, engineers, technical writers, students, teachers, and more. They provide an ideal notebook environment for interact…  ( 7 min )
2023-12-21T00:45:16.643Z osmosfeed 1.15.1